Runbooks

Runbooks are the reviewed operator path for repeated maintenance, recovery, and high-signal proof work. They should stay short enough to follow under pressure, but explicit about approval gates, mutation scope, and evidence.

The CLI command surface is documented in the Wavemap CLI Command Reference. This page explains when and how the deployed-dev operations are used. Use the Deploy Dev and Smoke command reference sections for command ownership, wrapper paths, compatibility scripts, flags, and default mutation posture.

Deployed-Dev Boundary

These runbooks apply only to the shared deployed dev environment:

Boundary	Current Value
Public app URL	`https://dev.wavemap.app`
Pulumi stack	`aws-dev`
Runtime region	`us-east-2`
Runtime target	Sleepable EC2 host running Docker Compose
Runtime config store	AWS SSM Parameter Store under `/wavemap/dev/runtime/*`
Last-good receipt	`/opt/wavemap/deployment/last-successful-runtime-release.json`

Durable examples use the public CLI form:

pnpm wavemap -- <command> [flags...]

Root scripts such as pnpm deploy:dev:runtime and pnpm smoke:dev remain compatibility aliases. Use them when a local note or workflow already does, but prefer public wavemap routes in new runbooks.

Examples that need live Pulumi stack outputs use this placeholder path:

/tmp/wavemap-aws-dev-outputs.json

Capture or supply that file through the approved deploy workflow, cloud-plan job, or local operator process before running commands that need live cloud facts. Do not paste secret values into the file or into runbook notes.
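
Before a command consumes the outputs file, a small guard can confirm that the capture exists and is not empty. The following is a minimal sketch rather than part of the Wavemap CLI; the function name require_outputs_file is hypothetical, and the default path is the placeholder used on this page.

```shell
# Hypothetical helper: fail early when the Pulumi outputs capture is missing or empty.
# The default path is the placeholder path used throughout this page.
require_outputs_file() {
  outputs_file="${1:-/tmp/wavemap-aws-dev-outputs.json}"
  if [ ! -s "$outputs_file" ]; then
    echo "missing or empty outputs file: $outputs_file" >&2
    return 1
  fi
  echo "outputs file ready: $outputs_file"
}
```

The check only covers presence and size. It does not validate the JSON shape or confirm that the capture is recent.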

Shared Operator Gates

Before running a live command, confirm:

The target is deployed dev, not local Docker, staging, or production.
The selected checkout/ref is the one intended for deployment or verification.
Every live command has been dry-run first, wherever the command supports a dry run.
The command’s mutation gate is explicit, usually --execute.
Secrets are not printed, copied into workflow summaries, or stored in GitHub when the runtime host should read them from SSM.
The expected evidence is known before starting: SSM command ID, GitHub job summary, smoke result, discrepancy counts, or failure artifact.
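
The dry-run-first and explicit-gate items above can be sketched as a small wrapper. This is an illustration under stated assumptions: gated_run and the OPERATOR_APPROVED variable are hypothetical names, and the wrapper only fits commands whose mutation gate is a trailing --execute flag.

```shell
# Hypothetical wrapper: always run the dry-run form first, then add --execute
# only when the operator has set OPERATOR_APPROVED=yes for this invocation.
gated_run() {
  "$@" || return $?                # dry-run posture: the command without --execute
  if [ "${OPERATOR_APPROVED:-no}" = "yes" ]; then
    "$@" --execute                 # live mutation, explicitly approved
  else
    echo "dry-run only; set OPERATOR_APPROVED=yes to execute" >&2
  fi
}
```

For example, `gated_run pnpm wavemap -- deploy dev runtime deploy --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json` would stop after the dry run unless approval was set for that call.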

Operator Store Quick Reference

Use this table before deciding which credential, artifact, or workflow path to reach for. The durable store and access boundary is explained in Infrastructure Change Policy.

Task	Normal Input	Actor	Gate
Routine app/API deploy	Current deployment contract plus app deploy role.	`.github/workflows/deploy-dev.yml`.	App deploy contract-read grant and refreshed contract must be live. No routine Pulumi token.
Routine docs deploy	Current deployment contract plus docs deploy role.	`.github/workflows/deploy-docs.yml`.	Docs role reads the private contract store. No routine Pulumi token.
Infrastructure mutation	Pulumi backend, Pulumi config, and infra operator credentials.	Local infra operator, or future explicitly approved infra workflow.	Preview summary, human approval, and runbook-specific post-apply evidence.
Manual infra-topology ingest	S3 Pulumi backend URL, backend region, infra-topology OIDC role, passphrase.	`.github/workflows/infra-topology-ingest.yml` or local operator capture.	Self-managed backend migration and narrow backend-read role must be proven before retiring Pulumi Cloud.
Runtime secret value change	SSM Parameter Store path under `/wavemap/dev/runtime/*`.	Runtime config population command or approved operator process.	GitHub should carry references or bootstrap values, not decrypted runtime secrets.
Deployment contract publication	Reviewed Pulumi outputs projected into `wavemap.deployment-contract` v1.	Approved contract publisher.	Future writer permission still needs least-privilege review before receiving artifact-store write access.
Public topology figure publication	Reviewed sanitized topology projection and figure ledger decision.	Human docs edit after private capture/projection review.	Raw captures and private generated candidates stay outside public docs.

Pulumi State Backend Migration

Use this runbook to move the aws-dev Pulumi stack from Pulumi Cloud to the self-managed private S3 backend. This is an infra-operator operation, not routine app/API or docs CD.

Target backend:

s3://wavemap-dev-pulumi-state-959516292206-us-east-2/pulumi-state

Before approval:

Confirm no pulumi up, docs/app deploy, runtime replacement, media bucket replacement, or topology capture is in progress.
Confirm the checked-out repo includes the modeled provider.aws.pulumiStateBackend bucket resource.
Confirm the infra operator AWS identity can create and later read/write the state bucket and objects under pulumi-state/.pulumi/*.
Create or retrieve the aws-dev Pulumi passphrase in the operator secret store.
Decide where the plaintext migration export will live temporarily and where an encrypted backup will be stored after import.

Apply the backend bucket only after explicit approval:

pnpm -C infra/pulumi run preview
pnpm -C infra/pulumi run up -- --yes

Point the current shell at the passphrase file:

export PULUMI_CONFIG_PASSPHRASE_FILE="$HOME/.config/wavemap/pulumi/aws-dev.passphrase"

Move stack secrets to the passphrase provider while still on the old backend:

pnpm -C infra/pulumi exec pulumi stack change-secrets-provider passphrase --stack aws-dev

Review the Pulumi.aws-dev.yaml diff before continuing.

Create a strict-permission migration export:

umask 077
export WAVEMAP_PULUMI_MIGRATION_BACKUP_DIR="/private/tmp/wavemap-pulumi-state-migration-$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$WAVEMAP_PULUMI_MIGRATION_BACKUP_DIR"
pnpm -C infra/pulumi exec pulumi stack export \
  --stack aws-dev \
  --show-secrets \
  --file "$WAVEMAP_PULUMI_MIGRATION_BACKUP_DIR/aws-dev.stack.json"

Treat this export as secret-bearing plaintext. Do not upload it to GitHub artifacts, docs, chat, or shared storage.
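
Because the export is plaintext, it can help to confirm that the umask actually took effect before continuing. A minimal sketch, assuming a hypothetical helper named is_owner_only; it reads the mode string from ls so that it behaves the same on macOS and Linux.

```shell
# Hypothetical check: succeed only when a path grants no group or other permissions.
# Parses the mode string from `ls -ld` instead of relying on platform-specific stat flags.
is_owner_only() {
  mode="$(ls -ld -- "$1" | cut -c5-10)"   # characters 5-10 hold the group and other bits
  [ "$mode" = "------" ]
}
```

Usage might look like `is_owner_only "$WAVEMAP_PULUMI_MIGRATION_BACKUP_DIR/aws-dev.stack.json" || echo "export is readable by others" >&2`, run against both the directory and the export file.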

Import into the self-managed backend:

export WAVEMAP_PULUMI_STATE_BACKEND_URL="s3://wavemap-dev-pulumi-state-959516292206-us-east-2/pulumi-state"
pnpm -C infra/pulumi exec pulumi logout
pnpm -C infra/pulumi exec pulumi login "$WAVEMAP_PULUMI_STATE_BACKEND_URL"
pnpm -C infra/pulumi exec pulumi stack init aws-dev --secrets-provider passphrase
pnpm -C infra/pulumi exec pulumi stack import \
  --stack aws-dev \
  --file "$WAVEMAP_PULUMI_MIGRATION_BACKUP_DIR/aws-dev.stack.json"

Verify before retiring Pulumi Cloud assumptions:

pnpm -C infra/pulumi exec pulumi whoami --verbose
pnpm -C infra/pulumi exec pulumi preview --stack aws-dev
pnpm -C infra/pulumi exec pulumi stack output --stack aws-dev --json > /tmp/wavemap-aws-dev-outputs.json
aws s3 ls s3://wavemap-dev-pulumi-state-959516292206-us-east-2/pulumi-state/.pulumi/stacks/

Expected evidence:

Preview/apply evidence for the protected state backend bucket.
A pulumi whoami --verbose result showing the S3 backend URL.
A no-surprise pulumi preview --stack aws-dev from the S3 backend.
A fresh stack output capture from the S3 backend.
A reviewed plan for preserving an encrypted stack backup and deleting the local plaintext migration export.

Recovery posture:

Before S3 import succeeds, the old Pulumi Cloud stack remains the recovery source.
After the passphrase provider change, the passphrase is required even when operating temporarily from the old backend.
After import succeeds, S3 object versioning and Pulumi history help recover from backend object drift, but they do not replace a deliberate encrypted stack-export backup.

After the migrated backend is proven, update the GitHub dev environment for manual infra-topology ingest:

Variable: PULUMI_BACKEND_URL=s3://wavemap-dev-pulumi-state-959516292206-us-east-2/pulumi-state.
Variable: PULUMI_STATE_BACKEND_AWS_REGION=us-east-2.
Secret: INFRA_TOPOLOGY_AWS_ROLE_TO_ASSUME, scoped to the infra-topology state backend role.
Secret: PULUMI_CONFIG_PASSPHRASE, matching the aws-dev passphrase provider.

Then remove PULUMI_ACCESS_TOKEN from the GitHub dev environment and revoke the Pulumi Cloud token if it is no longer needed outside Wavemap. This is a live credential cleanup step and should be done deliberately, after the S3 backend preview and topology ingest path have both passed.

GitHub Workflow Dispatch Recipe Selection

Use this runbook when manually dispatching .github/workflows/deploy-dev.yml for deployed dev.

The dispatch posture is recipe-first. Start with a named run_profile, then add explicit add-ons only when the proof you need is outside that profile. Treat custom and raw stage toggles as deliberate exceptions, not the normal operator interface.

Before dispatch:

Confirm the selected git_ref is the branch or SHA intended for this proof.
Confirm whether the run should be non-mutating, app-runtime mutating, data-destructive, media-mutating, or lifecycle-disruptive.
Confirm a heavier profile is worth its cost, runtime disruption, and artifact noise.
Decide what evidence will make the run useful before starting it.

Profile selection:

Profile	Use When	Boundary
`preflight`	Checking repo-local deploy contracts through GitHub Actions.	No cloud authentication and no live mutation.
`cloud-plan`	Checking the deployment contract, GitHub OIDC, live SSM metadata, and deploy dry-runs.	Cloud-authenticated but non-mutating.
`deploy-endpoint`	Normal deployed-dev app/API deploy proof.	Builds/pushes images, deploys the app runtime, gates endpoint smoke on migration status.
`deploy-seeded-browser`	Proving the seeded route and browser basics after app or data-shape changes.	Includes destructive database reset before seeded and browser smoke.
`deploy-media`	Proving API, S3, public media URL, CloudFront media delivery, browser rendering, and drift counts.	Includes destructive reset, temporary media mutation, browser media proof, and DB/S3 report.
`deploy-lifecycle`	Proving shutdown, cold-start page, wake, and browser reload behavior.	Includes destructive reset and deliberately stops the runtime host for recovery proof.
`custom`	One-off debugging or an intentionally unusual stage combination.	Operator owns the full prerequisite and evidence story.

Automatic develop deploys use the conservative deploy-endpoint profile and do not read manual dispatch inputs. That profile now includes the migration status gate and migration-only repair before endpoint smoke; destructive reset and seeded data remain off.

Optional add-ons:

Add-On	Pair With	Use When
`run_docker_preflight=true`	Any profile	You want slower Docker build verification during preflight.
`validate_cloud_env_contract=true`	Any profile	You want GitHub-owned bootstrap values checked before a cloud job needs them.
`run_db_migrate=false`	`custom` deploy runs	You want the migration status gate to fail instead of applying pending migrations.
`run_wake_smoke=true`	`deploy-endpoint` or any deploy profile	You want the cheap wake-path endpoint proof after ordinary endpoint smoke.
`run_browser_routing_smoke=true`	`deploy-endpoint` or any deploy profile	You want non-destructive Chromium login/routing proof before any database reset.
`run_cold_start_browser_smoke=true`	`deploy-media` or `custom` with seeded prerequisites	You want lifecycle proof in addition to media proof, or a one-off stopped-host recovery check.
`run_media_smoke=true` and `run_browser_media_smoke=true`	`deploy-lifecycle` or `custom` with seeded prerequisites	You want media proof in addition to lifecycle proof.

Keep deploy-media and deploy-lifecycle separate for normal use. A composite proof is allowed through explicit add-ons, but it should stay deliberate because media proof mutates app media and lifecycle proof deliberately stops the runtime host.
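
The recipe-first posture can be sketched as a helper that prints a dispatch command for review instead of running it. Several things here are assumptions: the operator uses the GitHub CLI (gh), run_profile and git_ref are the workflow's dispatch input names, and dispatch_command is a hypothetical name.

```shell
# Hypothetical helper: print (not run) a gh dispatch command for a named profile.
# Rejects unknown profiles and requires an intent note when the profile is custom.
dispatch_command() {
  profile="${1:-}"; git_ref="${2:-}"; intent="${3:-}"
  if [ -z "$git_ref" ]; then
    echo "git_ref is required" >&2
    return 1
  fi
  case "$profile" in
    preflight|cloud-plan|deploy-endpoint|deploy-seeded-browser|deploy-media|deploy-lifecycle) ;;
    custom)
      if [ -z "$intent" ]; then
        echo "custom requires a short operator intent note" >&2
        return 1
      fi ;;
    *)
      echo "unknown run_profile: $profile" >&2
      return 1 ;;
  esac
  echo "gh workflow run deploy-dev.yml -f run_profile=$profile -f git_ref=$git_ref"
}
```

An add-on would be appended to the printed command as another field, for example `-f run_wake_smoke=true`, after the operator has reviewed the base command.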

Expected evidence:

Workflow run number or URL.
Selected ref, resolved commit SHA, selected profile, and selected add-ons.
Resolved stage summary.
SSM command IDs for runtime deploy, database status/migrate, reset, release-record, rollback, or discrepancy-report stages when they run.
Smoke results, elapsed times, and artifact names for any browser, media, or lifecycle failures.
A short note explaining operator intent when custom is used.

Use local commands for focused manual repair, for dry-run planning, or when a workflow job’s summary points at a specific operator action.

Runtime And Data Lifecycle Quick Reference

The deployed-dev runtime is cost-first and disposable, but different operations have different data consequences. Use Deployed Dev Lifecycle for the full lifecycle matrix, teardown gradations, and expected evidence. Use Data Durability And Recovery for the current disposable-data posture, backup learning-drill boundary, and future recovery gates. Use Media Workflow And Validation when choosing between media smoke, browser media smoke, and discrepancy reporting.

Operation	Runtime Behavior	Database Behavior	Media Behavior	Operator Gate
Runtime deploy	Pulls selected backend/frontend images and restarts Compose on the current host.	Preserved unless the deploy also runs an explicit reset or migration path.	Unchanged.	`runtime deploy --execute` after images and runtime config are ready.
Host stop or automatic inactivity shutdown	Stops the EC2 instance; root EBS remains attached.	Preserved across stop/start.	Unchanged.	Shutdown Lambda or approved lifecycle proof.
Runtime rollback	Redeploys the last-good app image pair through the runtime deploy document.	Not rolled back.	Not rolled back.	`runtime rollback --execute`, followed by endpoint smoke.
Database status	Runs a read-only migration ledger check in the deployed API container.	Read-only.	Unchanged.	`database status`, before migration repair or deploy schema gates.
Database migrate	Runs pending Drizzle migrations in the deployed API container.	Schema mutation only; existing rows should be preserved.	Unchanged.	`database migrate --execute`, followed by status and endpoint smoke.
Database reset	Runs migrate, base seed, and deterministic dev-data seed in the deployed API container.	Destructive; non-seed rows can be deleted.	Unchanged; objects can become application-orphaned.	`database reset --execute`, followed by seeded smoke.
Media discrepancy report	Runs a read-only DB/S3 comparison in the deployed API container.	Read-only.	Read-only.	`media discrepancy-report --execute` for live SSM execution.
EC2/runtime replacement	Replaces or destroys the host through infrastructure change.	At risk unless a separate backup, snapshot, or migration runbook is used.	Unchanged unless media infrastructure also changes.	Runtime host replacement after approval.
Media bucket replacement or destroy	Not a runtime-host action.	Rows may reference missing media after replacement/delete.	At risk.	Media bucket replacement after approval.

Host stop is cost control. It is not database cleanup, media cleanup, backup, rollback, or deploy-state mutation.

Host Stop And Wake Recovery

Use this runbook when the deployed-dev runtime host is stopped, may be stopped, or needs a deliberate stopped-host recovery proof.

Host stop is a shared-environment disruption. Prefer the deploy-lifecycle workflow profile when the goal is a normal stopped-host proof, because the workflow captures shutdown, cold-start, browser, and smoke evidence in one place. Use the manual checks below for focused recovery, control-plane debugging, or when an automatic inactivity stop has already occurred.

Before deliberate host stop:

Confirm the target is https://dev.wavemap.app, Pulumi stack aws-dev, and runtime region us-east-2.
Confirm no deploy, reset, media proof, or demo is in progress.
Confirm whether endpoint wake recovery is enough, or whether the browser must return to its original destination.
Confirm the seeded baseline is valid before choosing browser cold-start recovery.
Confirm the stop is only cost-control or lifecycle proof. Do not combine it with database reset, media cleanup, rollback, backup, replacement, or stack teardown without a separate operator decision.

Deliberate host stop is owned by the shutdown Lambda. It has no public app route. This is a live cloud mutation and needs explicit operator approval at run time. If a local operator invokes it outside the deploy-lifecycle workflow, derive the shutdown function name from the approved deployment contract’s resource prefix or another reviewed infra output, then use an approved AWS operator identity:

aws lambda invoke \
  --region us-east-2 \
  --function-name "<shutdownFunctionName>" \
  /tmp/wavemap-dev-shutdown-response.json

After stopping the host, confirm the app URL serves the cold-start page rather than a raw CloudFront, connection, or app error. The cold-start page should appear at the original app URL while the runtime host is stopped or warming up.

For endpoint wake recovery, run:

pnpm wavemap -- smoke dev --wake

This calls the same-origin /__wake path, waits for frontend readiness, and then replays endpoint smoke. Use it when the host may simply be asleep, when validating the wake Lambda and dynamic origin refresh, or when the seeded browser baseline is not known to be valid.

For browser destination recovery from an intentionally stopped host, run:

pnpm wavemap -- smoke dev cold-start-browser

This starts at the seeded Sorsari artist route, expects the cold-start page at that original URL, lets the page call /__wake, waits for readiness, verifies the browser reloads into the original destination, and then replays endpoint and seeded checks.

If the host was stopped automatically by the inactivity monitor, treat recovery the same way:

Use pnpm wavemap -- smoke dev --wake for cheap endpoint recovery.
Use pnpm wavemap -- smoke dev cold-start-browser only when the seeded baseline is still valid and browser recovery is the evidence you need.
Do not add a database reset only to make browser recovery easier; reset remains a separate destructive decision.
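
The choice between the two recovery commands can be written down as a small decision helper. This is a sketch only; recovery_command is a hypothetical name, and the two yes/no inputs stand in for operator judgment.

```shell
# Hypothetical helper: print the recovery command that matches the evidence needed.
# $1: "yes" when the seeded baseline is known to be valid
# $2: "yes" when browser destination recovery is the evidence required
recovery_command() {
  if [ "${1:-no}" = "yes" ] && [ "${2:-no}" = "yes" ]; then
    echo "pnpm wavemap -- smoke dev cold-start-browser"
  else
    echo "pnpm wavemap -- smoke dev --wake"
  fi
}
```

The helper never suggests a database reset, which matches the rule that reset stays a separate destructive decision.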

If runtime deploy starts while the host is stopped, the runtime deploy wrapper should wake the host and retry SSM while the instance comes back online. Use the Runtime Deploy runbook and capture resume telemetry from the deploy output.

Expected evidence:

Shutdown Lambda response or workflow stage summary when a deliberate stop was performed.
Confirmation that the stopped app URL served the cold-start page.
Wake smoke result or cold-start browser smoke result.
Resume telemetry if runtime deploy performed the wake.
Playwright artifacts, shutdown response, cold-start precheck HTML, or workflow summary links when recovery fails.

Failure triage:

Symptom	First Place To Look
Cold-start page does not appear for the stopped app URL.	CloudFront custom error behavior, app-origin connection timeout settings, and static cold-start origin.
Wake call succeeds but readiness times out.	Wake Lambda logs, EC2 instance state, dynamic `app-origin.dev.wavemap.app` DNS update, and app startup logs.
Runtime deploy reports early SSM target errors.	Resume telemetry; retryable SSM registration delay is expected while a stopped host comes back.
Endpoint wake passes but browser recovery fails.	Cold-start page client reload behavior, original destination routing, and Playwright artifacts.

Runtime Host Replacement

Use this runbook when a Pulumi preview proposes replacing or destroying the deployed-dev EC2 runtime host, or when an operator intentionally chooses to recreate the runtime host for AMI, bootstrap, host-size, VPC, subnet, security-group, or instance-profile work.

Runtime host replacement is not host stop, runtime deploy, runtime rollback, or database reset. It is an infrastructure change that can discard the containerized Postgres data path and the host-side last-good runtime release receipt.

Before approval:

Confirm the target is deployed dev, Pulumi stack aws-dev, and region us-east-2.
Confirm no deploy, reset, media proof, or demo is in progress.
Run Pulumi preview and identify every create, replacement, delete, IAM change, DNS change, SSM document change, and runtime output change.
Confirm whether the preview changes only runtime-host resources or also touches media storage, CloudFront, DNS, IAM, or SSM runtime configuration.
Decide the database outcome before applying: accept data loss and reseed, preserve through a separate backup/restore drill, or abort the replacement.
Decide whether the host-side last-good release receipt matters. Replacement can remove it, so plan to record a fresh receipt after a known-good deploy.
Confirm runtime app secrets and config remain owned by SSM Parameter Store and do not need to be copied from the old host.

Preview the infrastructure change:

pnpm -C infra/pulumi run preview

Apply only after explicit approval for the replacement and data outcome:

pnpm -C infra/pulumi run up

After apply, capture fresh stack outputs:

pnpm -C infra/pulumi exec pulumi stack output --stack aws-dev --json > /tmp/wavemap-aws-dev-outputs.json

Validate the captured target and runtime contracts:

pnpm wavemap -- deploy dev cloud-target --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json
pnpm wavemap -- deploy dev runtime-config live --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json
pnpm wavemap -- deploy dev runtime-env --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json

Deploy an intended image pair to the new host:

pnpm wavemap -- deploy dev bundle --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json
pnpm wavemap -- deploy dev runtime deploy --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json --execute --github-output
pnpm wavemap -- smoke dev

If the chosen database outcome was to accept data loss and reseed (the disposable reset path), reseed and prove the seeded baseline:

pnpm wavemap -- deploy dev database reset --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json --execute --github-output
pnpm wavemap -- smoke dev seeded

After endpoint smoke passes, record a fresh last-good receipt:

pnpm wavemap -- deploy dev runtime record-release --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json --execute --github-output
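
The post-apply steps above depend on their order, so a wrapper that stops at the first failure can be useful. The following is a minimal sketch, assuming hypothetical names post_replacement_proof and WAVEMAP_CMD; it runs live mutations when used for real, so the same approvals still apply.

```shell
# Hypothetical sequence runner for the post-apply steps, stopping at the first failure.
# WAVEMAP_CMD can be overridden (for example with "echo") to rehearse the order.
# Relies on POSIX word splitting of $run, so use sh or bash rather than zsh.
post_replacement_proof() {
  outputs="$1"; reset="${2:-no-reset}"
  run="${WAVEMAP_CMD:-pnpm wavemap --}"
  # Validate the captured target and runtime contracts.
  $run deploy dev cloud-target --pulumi-outputs-json "$outputs" || return 1
  $run deploy dev runtime-config live --pulumi-outputs-json "$outputs" || return 1
  $run deploy dev runtime-env --pulumi-outputs-json "$outputs" || return 1
  # Deploy the intended image pair, then prove the endpoint.
  $run deploy dev bundle --pulumi-outputs-json "$outputs" || return 1
  $run deploy dev runtime deploy --pulumi-outputs-json "$outputs" --execute --github-output || return 1
  $run smoke dev || return 1
  # Reseed only when the approved database outcome was a reset.
  if [ "$reset" = "reset" ]; then
    $run deploy dev database reset --pulumi-outputs-json "$outputs" --execute --github-output || return 1
    $run smoke dev seeded || return 1
  fi
  # Record the receipt last, after endpoint smoke has passed.
  $run deploy dev runtime record-release --pulumi-outputs-json "$outputs" --execute --github-output
}
```

Rehearsing with `WAVEMAP_CMD=echo post_replacement_proof /tmp/wavemap-aws-dev-outputs.json reset` prints the commands in order without contacting deployed dev.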

Expected evidence:

Pulumi preview summary naming the runtime-host replacement and any dependent resource changes.
Explicit approval note covering the database outcome.
Refreshed deployment contract after apply.
Cloud-target, runtime-config, and runtime-env readiness results.
Runtime deploy SSM command ID against the new instance.
Endpoint smoke result, and seeded smoke result if reset was selected.
Fresh last-good release receipt after the new host is proven.

Failure triage:

Symptom	First Place To Look
Runtime deploy cannot target the instance.	Refreshed deployment contract, EC2 instance state, SSM managed-instance registration, and instance profile.
SSM command runs but Docker or Compose setup fails.	EC2 bootstrap user data, Docker service status, Compose plugin install, ECR login, and `/opt/wavemap` paths.
App URL does not reach the new host.	Dynamic `app-origin.dev.wavemap.app` record, CloudFront app origin settings, security group ingress, and port 80.
App starts but seeded routes are missing.	Database outcome decision, Postgres data path, migration/reset output, and seeded smoke artifacts.
Rollback cannot find a receipt.	Expected after host replacement unless a fresh receipt has been recorded on the new host.

Media Bucket Replacement Or Destroy

Use this runbook when a Pulumi preview proposes replacing, destroying, or recreating the deployed-dev media S3 bucket or its delivery path. This includes changes that alter the bucket physical name, CloudFront media origin, bucket policy, origin access control, runtime media outputs, or runtime MEDIA_S3_BUCKET_NAME value.

The deployed-dev media bucket is intentionally disposable and currently uses forced cleanup at the infrastructure level. That makes replacement possible, not routine. Bucket replacement can delete objects and can leave database rows pointing at media that no longer exists.

Before approval:

Confirm the target is deployed dev, Pulumi stack aws-dev, and region us-east-2.
Confirm no media smoke, upload test, database reset, or demo is in progress.
Run Pulumi preview and identify every media bucket, bucket policy, CloudFront, IAM, runtime config, and output change.
Decide the object-data outcome before applying: accept deletion, preserve through a separate object-copy plan, or abort.
Decide the database outcome before applying: keep rows as-is, run destructive reset after replacement, or plan a separate row migration for locator fields.
Remember that S3 media rows store storageLocation / thumbnailStorageLocation; copying objects to a new bucket may still require a DB locator migration if old rows should remain deletable and reconcilable.
Confirm GitHub Actions will not decrypt runtime media secrets; the runtime host reads media config through SSM-rendered env files.

If current DB/S3 drift matters, capture a read-only baseline before replacement:

pnpm wavemap -- deploy dev media discrepancy-report --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json --execute --github-output

Preview the infrastructure change:

pnpm -C infra/pulumi run preview

Apply only after explicit approval for object deletion/preservation and database handling:

pnpm -C infra/pulumi run up

After apply, capture fresh stack outputs:

pnpm -C infra/pulumi exec pulumi stack output --stack aws-dev --json > /tmp/wavemap-aws-dev-outputs.json

Refresh runtime media configuration and re-render the host env through runtime deploy:

pnpm wavemap -- deploy dev cloud-target --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json
pnpm wavemap -- deploy dev runtime-config plan --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json
pnpm wavemap -- deploy dev runtime-config populate --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json --execute
pnpm wavemap -- deploy dev runtime-config live --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json
pnpm wavemap -- deploy dev runtime deploy --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json --execute --github-output

Then prove app and media behavior:

pnpm wavemap -- smoke dev
pnpm wavemap -- smoke dev media --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json --execute
pnpm wavemap -- smoke dev browser-media
pnpm wavemap -- deploy dev media discrepancy-report --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json --execute --github-output

If the chosen database outcome was a destructive reset after replacement (the disposable reset path), run the reset and seeded smoke before the media and browser-media proof:

pnpm wavemap -- deploy dev database reset --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json --execute --github-output
pnpm wavemap -- smoke dev seeded

Expected evidence:

Pulumi preview summary naming the bucket replacement/delete and dependent CloudFront/IAM/runtime-config changes.
Explicit approval note covering object handling and database handling.
Optional pre-replacement discrepancy report when preserving or interpreting existing media matters.
Fresh Pulumi outputs capture after apply.
Runtime-config plan/populate/live evidence showing the new media bucket value is ready without printing secrets.
Runtime deploy SSM command ID proving the host env was re-rendered.
Endpoint smoke, media smoke, browser media smoke, and post-replacement discrepancy report.

Failure triage:

Symptom	First Place To Look
Upload fails after replacement.	Runtime `MEDIA_S3_BUCKET_NAME`, runtime host role media policy, bucket existence, and bucket public-access block.
Upload succeeds but public media 403s.	CloudFront media origin, bucket policy for origin access control, path-prefix strip function, and object key.
Existing media rows render missing images.	Object handling decision, DB locator fields, copied object keys, and discrepancy report output.
Delete or cleanup targets the old bucket.	Stored `storageLocation` / `thumbnailStorageLocation` values and any planned locator migration.
Browser-media smoke fails but API smoke passes.	CloudFront `/media/*` routing, image URL returned by the API, browser test artifacts, and cache behavior.

Runtime Config Readiness

Use this runbook when a deploy is blocked by missing runtime configuration, a runtime parameter changed, or an operator needs to verify SSM parameter readiness before runtime deploy.

This is the operational procedure for runtime config readiness. Source ownership lives in Configuration And Secrets, and the deployed-dev environment shape lives in Deployed Dev Environment. There is no separate runtime-config operations page until the operator surface grows beyond this runbook.

Runtime config source ownership:

Pulumi outputs expose parameter names and value kinds, not plaintext secrets.
SSM Parameter Store owns deployed app runtime config and runtime secrets.
GitHub Actions should not decrypt or log SecureString runtime values.
Frontend NEXT_PUBLIC_* values are browser build inputs, not runtime SSM parameters.

For the broader source-ownership and secret-handling convention, see Configuration And Secrets.

Plan the required runtime configuration without cloud calls:

pnpm wavemap -- deploy dev runtime-config plan
pnpm wavemap -- deploy dev runtime-config plan --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json

Plan SSM population for non-secret String parameters:

pnpm wavemap -- deploy dev runtime-config populate --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json

Write only the selected non-secret String parameters after the operator accepts the plan:

pnpm wavemap -- deploy dev runtime-config populate --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json --execute

Check live SSM metadata without reading parameter values:

pnpm wavemap -- deploy dev runtime-config live --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json

Verify the host env rendering contract without contacting the runtime host:

pnpm wavemap -- deploy dev runtime-env --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json

Stop if readiness is blocked. Fix the missing name/type/source problem first, then rerun the readiness check. Secret creation or rotation is a separate operator action; do not use runtime-config populate to create SecureString parameters.

Expected evidence:

Runtime config plan groups, required/optional counts, and redacted secret placeholders.
Live metadata readiness showing required parameters present with expected String / SecureString types.
No decrypted secret values in logs, local files, workflow summaries, or artifacts.

Runtime Deploy

Use this runbook to deploy an already selected app-runtime image pair to deployed dev. Routine branch CD uses the deploy-endpoint profile; local runtime deploy is mainly for focused repair, replaying a deploy after images already exist, or proving the runtime handoff directly.

Preconditions:

Runtime config readiness is green.
Backend and frontend images for the selected sha-<full-git-sha> tag exist in ECR, or the image build/push step will run before runtime deploy.
The deployment facts file points at the intended account, stack, region, runtime instance, and SSM document. In routine CD, this is the sanitized deployment contract read from the private artifact store.
Database reset, media proof, and lifecycle proof are selected only if they are part of the intended recipe.
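
The sha-<full-git-sha> tag form named in the preconditions can be derived and checked locally before looking for images. A minimal sketch, assuming a SHA-1 repository where a full commit SHA is 40 lowercase hex characters; image_tag_for_sha is a hypothetical helper name.

```shell
# Hypothetical helper: build the sha-<full-git-sha> tag and reject abbreviated SHAs.
# Assumes a SHA-1 repository, where a full commit SHA is 40 lowercase hex characters.
image_tag_for_sha() {
  sha="${1:-}"
  case "$sha" in
    ""|*[!0-9a-f]*)
      echo "not a lowercase hex SHA: $sha" >&2
      return 1 ;;
  esac
  if [ "${#sha}" -ne 40 ]; then
    echo "expected 40 characters, got ${#sha}" >&2
    return 1
  fi
  echo "sha-$sha"
}
```

A typical call would be `image_tag_for_sha "$(git rev-parse HEAD)"`, run from the checkout selected for the deploy.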

Plan the runtime bundle:

pnpm wavemap -- deploy dev bundle --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json

Dry-run the runtime deploy command:

pnpm wavemap -- deploy dev runtime deploy --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json

Execute the runtime deploy:

pnpm wavemap -- deploy dev runtime deploy --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json --execute --github-output

Then run endpoint smoke:

pnpm wavemap -- smoke dev

The live runtime deploy sends the Pulumi-modeled SSM document to the runtime host. If the host is stopped, the wrapper uses the same-origin wake path and retries SSM while the instance comes back online. The deploy document renders env files from SSM Parameter Store, logs into ECR on-host, pulls the selected images, and starts Docker Compose detached.

Expected evidence:

Runtime deploy SSM command ID.
Resume telemetry when the host needed to wake.
Successful command completion or a failure log that identifies the SSM send, wait, ECR pull, env render, or Compose stage.
Endpoint smoke success after deploy.

Last-Good Runtime Release Receipt

The last-good receipt is the rollback input for deployed dev. It should be written only after endpoint smoke succeeds for the current image pair.

Branch CD records the receipt automatically after a successful push-triggered deploy-endpoint run. Use the local command only when manually repairing or reestablishing the rollback point after a known-good deploy.

Dry-run the receipt write:

pnpm wavemap -- deploy dev runtime record-release --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json

Write the receipt after smoke has passed:

pnpm wavemap -- deploy dev runtime record-release --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json --execute --github-output

The receipt records the source git SHA, immutable image tag, backend image URI, frontend image URI, deployment version, selected ref, and workflow run metadata. It lives on the runtime host at:

/opt/wavemap/deployment/last-successful-runtime-release.json

Do not write this receipt for a deploy that has not passed endpoint smoke. Doing so would teach rollback to restore an unproven image pair.
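One way to keep that ordering hard to skip during manual repair is to make the receipt write conditional on smoke. The sketch below assumes a hypothetical wrapper name, `record_release_if_smoke_passes`; the two `pnpm wavemap` commands are the ones documented in this runbook.

```shell
# Hypothetical wrapper: writes the last-good receipt only after endpoint
# smoke passes. The function name is an illustrative assumption.
record_release_if_smoke_passes() {
  local outputs_json="${1:-/tmp/wavemap-aws-dev-outputs.json}"

  if ! pnpm wavemap -- smoke dev; then
    echo "endpoint smoke failed; receipt not written" >&2
    return 1
  fi

  pnpm wavemap -- deploy dev runtime record-release \
    --pulumi-outputs-json "$outputs_json" --execute --github-output
}
```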

Expected evidence:

Receipt-write SSM command ID.
Receipt path.
Source SHA, deployment version, and backend/frontend image tags matching the smoke-passing deploy.

Runtime Rollback

Runtime rollback is an app-container rollback only. It reads the host-side last-good receipt and redeploys that recorded backend/frontend image pair through the existing runtime deploy document.

Rollback does not roll back:

Database schema or rows.
SSM runtime parameters.
Media objects.
Pulumi infrastructure.
Destructive reset outcomes.

Use rollback when a runtime deploy or endpoint smoke failure leaves deployed dev on a bad app-runtime image pair and a known-good receipt exists. On push-triggered develop deploys, the workflow attempts this automatically when a selected runtime deploy stage or endpoint smoke stage fails.

Dry-run the rollback target:

pnpm wavemap -- deploy dev runtime rollback --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json

Execute rollback:

pnpm wavemap -- deploy dev runtime rollback --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json --execute --github-output

Then rerun endpoint smoke:

pnpm wavemap -- smoke dev

If rollback smoke fails, keep the failed release visible in the workflow summary and choose the next operator action explicitly: forward fix, manual deploy of a known SHA, data reset, or infrastructure investigation.
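A manual rollback can surface that decision point explicitly instead of ending silently. The following is a sketch under stated assumptions: `rollback_and_verify` and its exit codes are hypothetical, and the operator options in the message are the ones listed above.

```shell
# Hypothetical helper: dry-run, execute, then smoke the rollback.
# Function name and exit codes are illustrative assumptions.
rollback_and_verify() {
  local outputs_json="${1:-/tmp/wavemap-aws-dev-outputs.json}"

  # Dry-run shows the rollback target before anything mutates.
  pnpm wavemap -- deploy dev runtime rollback \
    --pulumi-outputs-json "$outputs_json" || return 1

  pnpm wavemap -- deploy dev runtime rollback \
    --pulumi-outputs-json "$outputs_json" --execute --github-output || return 1

  if ! pnpm wavemap -- smoke dev; then
    echo "rollback smoke failed; choose explicitly: forward fix, manual deploy of a known SHA, data reset, or infrastructure investigation" >&2
    return 2
  fi
  echo "rollback verified by endpoint smoke"
}
```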

Expected evidence:

Rollback SSM command ID.
Restored source SHA, image tag, deployment version, and receipt path.
Endpoint smoke result after rollback.
Clear note if rollback was skipped because no last-good receipt existed.

Endpoint Diagnostics

Use this runbook before repair work when deployed dev is reachable but an application route appears slow, stuck, or generically broken. The command captures public HTTP evidence only; it does not use AWS credentials, SSM, Docker, Pulumi, or database access.

Capture baseline evidence:

pnpm wavemap -- deploy dev diagnostics endpoints

Include seeded artist details when seeded data should exist:

pnpm wavemap -- deploy dev diagnostics endpoints --seeded-artist

Use a stable evidence directory when the transcript needs to be attached to a workflow note or investigation:

pnpm wavemap -- deploy dev diagnostics endpoints --evidence-dir /tmp/wavemap-endpoint-diagnostics

Expected evidence:

Evidence directory path.
Per-endpoint headers, body, curl timing metadata, and curl stderr files.
HTTP status, response content type, CloudFront cache header, and backend x-request-id when present.
Short JSON body previews for failing or diagnostic API routes.
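For intuition about what per-endpoint evidence looks like, the capture can be approximated with plain curl. This is not the implementation of the diagnostics command: the function name `capture_endpoint` and the file naming scheme are assumptions, while the curl options (`-D`, `-o`, `-w`) are standard.

```shell
# Hypothetical sketch of a per-endpoint capture with curl.
# File names and the function name are illustrative assumptions.
capture_endpoint() {
  local base_url="$1" path="$2" dir="$3"
  local name

  # Turn the path into a safe file stem, e.g. /api/v1/health -> _api_v1_health.
  name="$(printf '%s' "$path" | tr -c 'A-Za-z0-9' '_')"
  mkdir -p "$dir"

  # Headers, body, timing metadata, and stderr each go to their own file.
  curl -sS -D "$dir/$name.headers" -o "$dir/$name.body" \
    -w 'http_code=%{http_code}\ncontent_type=%{content_type}\ntime_total=%{time_total}\n' \
    "$base_url$path" > "$dir/$name.timing" 2> "$dir/$name.stderr"
}
```

For example, `capture_endpoint https://dev.wavemap.app /api/v1/health /tmp/wavemap-endpoint-diagnostics` would write four files for that route.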

Interpretation posture:

/en/ping passing means CloudFront can reach the frontend container.
/api/v1/health passing means the frontend rewrite can reach the backend process.
/api/v1/ready passing means the backend can reach Postgres for a shallow SELECT 1 check.
Application route failures after readiness passes usually point at app logic, schema compatibility, data shape, or downstream service behavior rather than a stopped host.
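The layering above can be read as a first-failure lookup. The sketch below is an interpretation aid only: `classify_endpoint_layer` is a hypothetical name, and treating a failing check as pointing at that layer is an inference from the passing-case statements, not a guarantee.

```shell
# Hypothetical triage aid. Arguments are "pass" or "fail" for
# /en/ping, /api/v1/health, and /api/v1/ready, in that order.
classify_endpoint_layer() {
  local ping="$1" health="$2" ready="$3"

  if [ "$ping" != "pass" ]; then
    echo "frontend"      # CloudFront to frontend container is suspect
  elif [ "$health" != "pass" ]; then
    echo "backend"       # frontend rewrite to backend process is suspect
  elif [ "$ready" != "pass" ]; then
    echo "database"      # backend to Postgres is suspect
  else
    echo "application"   # app logic, schema, data shape, or downstream
  fi
}
```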

Database Migration Status And Repair

Routine deploy-endpoint CD runs this same posture automatically after runtime deploy and before endpoint smoke: read-only status first, typed gate decision second, migration-only repair only when the live DB is behind, then status again before smoke. Manual use is still useful for incident repair, educational diagnosis, or custom deploy runs where run_db_migrate=false should turn schema drift into an explicit failed gate.

Use this runbook when deployed dev is reachable but DB-backed routes fail with schema errors, or before a migration-only repair. This path is narrower than reset: it compares migration state and can run pending migrations without seeding or deleting rows.

Capture live stack outputs first when needed:

pnpm -C infra/pulumi exec pulumi stack output --stack aws-dev --json > /tmp/wavemap-aws-dev-outputs.json

Run the read-only migration status check:

pnpm wavemap -- deploy dev database status

Status reads the deployed API image’s Drizzle journal and the live database’s drizzle.__drizzle_migrations ledger, then emits status markers such as:

WAVEMAP_DATABASE_STATUS_STATE=behind
WAVEMAP_DATABASE_STATUS_EXPECTED_COUNT=21
WAVEMAP_DATABASE_STATUS_APPLIED_COUNT=13
WAVEMAP_DATABASE_STATUS_PENDING_COUNT=8
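Markers in this KEY=VALUE shape are easy to extract from captured output. The sketch below assumes each marker starts at the beginning of its own line, and it assumes pending equals expected minus applied, which holds for the sample above (21 - 13 = 8) but may not hold in drift states. Both function names are hypothetical.

```shell
# Hypothetical helpers for captured status output.
# $1 = marker name, $2 = captured status text.
status_marker() {
  printf '%s\n' "$2" | sed -n "s/^$1=//p" | head -n 1
}

# Returns 0 when pending == expected - applied, 1 when it does not,
# and 2 when any of the three count markers is missing.
pending_is_consistent() {
  local out="$1" expected applied pending
  expected="$(status_marker WAVEMAP_DATABASE_STATUS_EXPECTED_COUNT "$out")"
  applied="$(status_marker WAVEMAP_DATABASE_STATUS_APPLIED_COUNT "$out")"
  pending="$(status_marker WAVEMAP_DATABASE_STATUS_PENDING_COUNT "$out")"

  if [ -z "$expected" ] || [ -z "$applied" ] || [ -z "$pending" ]; then
    return 2
  fi
  [ "$((expected - applied))" -eq "$pending" ]
}
```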

Interpretation posture:

up-to-date: The live database matches the deployed API image’s migration journal.
behind: The live database is missing migrations present in the deployed API image. Prefer migration-only repair.
ahead: The database has migration ledger rows newer than this deployed image. Stop and avoid applying older code.
hash-mismatch or drift: Migration history may have been rewritten or skipped. Stop and inspect before repair.
ledger-missing or empty: Treat as a fresh or damaged DB state; do not assume migration-only repair is safe without reviewing the data posture.

CD converts those states into a gate decision before endpoint smoke:

pass: Status is up-to-date; endpoint smoke may run.
migrate: Status is behind and migration repair is enabled; run database migrate, recheck status, then smoke.
block: Status is missing, is behind without migration permission, or is in any drift, ahead, or history-risk state; stop for manual review.
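The three gate outcomes can be sketched as a single mapping from status state to decision. This is an illustration of the rules above, not the CD implementation: `gate_decision` is a hypothetical name, and its second argument stands in for the run_db_migrate setting.

```shell
# Hypothetical sketch of the gate rules.
# $1 = status state, $2 = "true" when migration repair is enabled.
gate_decision() {
  local state="$1" migrate_enabled="${2:-false}"

  case "$state" in
    up-to-date)
      echo "pass"
      ;;
    behind)
      if [ "$migrate_enabled" = "true" ]; then
        echo "migrate"
      else
        echo "block"
      fi
      ;;
    *)
      # Missing status, ahead, hash-mismatch, drift, ledger-missing, empty,
      # or anything unrecognized stops for manual review.
      echo "block"
      ;;
  esac
}
```

Defaulting every unrecognized state to block keeps a new or misspelled state from reaching endpoint smoke unreviewed.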

Dry-run the migration-only command:

pnpm wavemap -- deploy dev database migrate --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json

Apply pending migrations only:

pnpm wavemap -- deploy dev database migrate --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json --execute --github-output

Then verify:

pnpm wavemap -- deploy dev database status
pnpm wavemap -- deploy dev diagnostics endpoints
pnpm wavemap -- smoke dev

Expected evidence:

Before/after database status output.
SSM command ID for any database migrate --execute run.
Endpoint diagnostics or smoke success proving the original failing route now works.

Database Reset

Use this runbook only for deployed dev. The database is disposable, but reset is still an explicit destructive action.

Decision points before reset:

Confirm the target is https://dev.wavemap.app, Pulumi stack aws-dev, and deployment environment dev.
Confirm losing non-seed database rows is acceptable.
Confirm it is acceptable for existing S3 media objects to outlive the reset; reset does not delete the media bucket.
Decide whether this is a normal disposable reset or a backup/restore learning drill.
Prefer deploy-seeded-browser, deploy-media, or deploy-lifecycle over a custom stage combination unless the one-off shape is deliberate.
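The target confirmation in the first decision point can be made mechanical before any `--execute` run. The sketch below uses a hypothetical function, `confirm_reset_target`; how the operator obtains the three values to pass in is left open.

```shell
# Hypothetical guard: succeeds only when all three values match the
# deployed-dev boundary. The function name is an illustrative assumption.
confirm_reset_target() {
  local url="$1" stack="$2" environment="$3"

  [ "$url" = "https://dev.wavemap.app" ] &&
    [ "$stack" = "aws-dev" ] &&
    [ "$environment" = "dev" ]
}
```

A guarded invocation would then read `confirm_reset_target "$url" "$stack" "$env" && pnpm wavemap -- deploy dev database reset --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json --execute --github-output`.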

Dry-run the reset:

pnpm wavemap -- deploy dev database reset --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json

Execute the reset:

pnpm wavemap -- deploy dev database reset --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json --execute --github-output

Then prove the canonical seed baseline:

pnpm wavemap -- smoke dev seeded

The reset runs migrations, base seed, and deterministic dev-data seed inside the deployed API container. Seeded smoke is the minimum proof that the canonical seeded route and API state exist again.

The canonical first seeded app route is:

/en/artist/ef839db3-ae41-4af9-9078-a8d211089962

Known reset warning posture:

Historical live reset proof reported event-series import warnings.
Treat those warnings as seed-data cleanup work, not as a failed reset, when migrations, base seed, deterministic dev-data seed, and seeded smoke all pass.
If the warning shape changes, treat it as fresh evidence.

Expected evidence:

Database reset SSM command ID.
Reset command success or failure log.
Seeded smoke success.
Follow-up media discrepancy report when DB/S3 divergence matters.

Media Discrepancy Report

Use this runbook after destructive reset, media smoke, manual media testing, or any investigation where database rows and S3 objects may have drifted.

Use Media Workflow And Validation for deciding whether a media change needs this report or a different proof lane.

The report is read-only. It should surface:

Media rows whose canonical or thumbnail URL points at an S3 object that no longer exists.
S3 media objects under the deployed-dev media prefix that are no longer referenced by any media row.
Counts and sampled identifiers/keys without printing app secrets or signed credentials.
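Conceptually, the first two findings are the two halves of a set difference between keys referenced by media rows and keys present in S3. The sketch below illustrates that comparison with `sort` and `comm` on two local key lists; it is not the backend report, and the function name, input files, and output labels are all assumptions.

```shell
# Hypothetical illustration of the comparison, not the backend report.
# $1 = file of object keys referenced by media rows, one per line.
# $2 = file of object keys present under the media prefix, one per line.
media_discrepancies() {
  local referenced existing
  referenced="$(mktemp)"
  existing="$(mktemp)"

  # comm needs sorted input; -u also collapses duplicate keys.
  sort -u "$1" > "$referenced"
  sort -u "$2" > "$existing"

  # Keys referenced by a row but absent from S3.
  comm -23 "$referenced" "$existing" | sed 's/^/missing-object /'
  # Keys present in S3 but referenced by no row.
  comm -13 "$referenced" "$existing" | sed 's/^/unreferenced-object /'

  rm -f "$referenced" "$existing"
}
```

Like the report itself, the comparison only reads its inputs; it deletes nothing but its own temporary files.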

Dry-run the report:

pnpm wavemap -- deploy dev media discrepancy-report --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json

Execute the read-only report on the runtime host:

pnpm wavemap -- deploy dev media discrepancy-report --pulumi-outputs-json /tmp/wavemap-aws-dev-outputs.json --execute --github-output

The live command sends an SSM command that runs the backend discrepancy report inside the deployed API container, so DB access and S3 listing use the rendered runtime environment. It does not delete rows or objects.

Non-zero discrepancy counts are telemetry: they do not trigger cleanup and do not by themselves fail a deploy. Cleanup remains a separate approval-gated action that should print the exact DB rows or object keys it will touch before mutation.

Expected evidence:

Media discrepancy SSM command ID.
Report scope.
Inspected row/object counts.
Total discrepancy count and discrepancy-kind summary.
Any sampled rows or object keys needed for follow-up.

Evidence And Failure Triage

Use workflow summaries first. Deployed-dev workflow summaries and job summaries should identify the stable facts an operator needs without turning the docs site into a run archive:

Selected profile and active add-ons.
Selected ref, resolved commit SHA, and deployment version or image tag.
Resolved stages.
Runtime deploy, reset, rollback, release-record, and discrepancy SSM command IDs when those jobs run.
Smoke result and elapsed time.
First job log or artifact to inspect on failure.

Browser-style jobs upload Playwright artifacts on failure. Database reset and media smoke lanes upload failure-only logs when selected. Keep raw logs, command outputs, and live identifiers in private workflow artifacts unless they are intentionally sanitized for docs.

Use this routing rule when deciding where evidence belongs:

Evidence Kind	Durable Home
Stable procedure, expected evidence, mutation boundary, failure triage, or redaction rule.	Curated docs under `apps/wavemap-docs/src/content/docs`.
Unsettled experiment, proof interpretation, temporary timing adjustment, or implementation log.	Working notes under `apps/wavemap-docs/working-notes`.
Change-specific proof that a PR, deploy, or workflow run behaved correctly.	Pull request description, workflow summary, or private workflow/job summary.
Raw command output, Lambda payload, browser trace, screenshot, downloaded artifact, or log.	Private artifacts with appropriate retention, unless intentionally sanitized before publication.
Run identifier, timestamp, commit SHA, SSM command ID, instance ID, public IP, or proof URL.	Private evidence, PR/workflow context, or working note while it is actively useful for follow-up.

A live proof graduates to curated docs only after it has been reduced to the reusable lesson. For example, document that cold-start browser proof should capture the shutdown response, cold-start precheck HTML, Playwright failure artifacts, and final smoke result. Do not publish the specific workflow run number, EC2 instance ID, SSM command UUID, public IP, or timing measurement unless that identifier itself is part of a reviewed operator decision.

When evidence changes an operator path, update the runbook. When evidence only proves a single run, keep it with the run.

Publishing Policy

Publish only reviewed operator paths, expected evidence rubrics, and sanitized examples that teach a stable pattern.

Do not publish:

Raw logs, command output, Lambda responses, browser traces, screenshots, or downloaded artifacts.
One-off proof narratives that do not change future operator behavior.
Live cloud identifiers such as instance IDs, public IPs, SSM command UUIDs, or temporary proof URLs.
Workflow run numbers, commit SHAs, and timestamps unless they are intentionally part of a historical decision record.
Temporary proof acceleration details such as shortened inactivity windows.

Working notes may keep this material while a decision is still moving. Once the decision settles, promote the procedure, evidence expectation, or redaction rule into curated docs and leave the noisy proof trail in PRs, workflow summaries, or private artifacts.
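When a log excerpt is sanitized for promotion, a first pass can mask the identifier shapes named above. The filter below is a rough sketch, not a reviewed redaction tool: the function name and placeholder tokens are hypothetical, the patterns are approximations, and it over-redacts by masking every UUID, including ones that are safe to publish. Review the output by hand.

```shell
# Hypothetical first-pass redaction filter (stdin to stdout).
# Patterns are approximations and do not replace manual review.
redact_identifiers() {
  sed -E \
    -e 's/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/<uuid>/g' \
    -e 's/i-[0-9a-f]{8,17}/<instance-id>/g' \
    -e 's/([0-9]{1,3}\.){3}[0-9]{1,3}/<ip>/g'
}
```

Usage: `redact_identifiers < raw.log > sanitized.log`, where both file names are placeholders.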