Consolidate to a Single Production AWS Stack

Problem

We have staging-named CDK stacks (AccessiblePdfStaging-*) running production workloads. The staging Lambda, SQS, DynamoDB, and email stacks are all configured with manual overrides to point at production resources (production SQS queue, production DynamoDB table, production ECR image). Every CDK redeploy risks reverting these manual env var changes back to staging defaults.

Current state of manual overrides

| Resource | CDK default (staging) | Manual override (production) |
| --- | --- | --- |
| Lambda SQS_QUEUE_URL | accessible-pdf-staging-pipeline | accessible-pdf-production-pipeline |
| Lambda IAM (inline policy) | staging queue only | added ProductionSqsAccess |
| Email Lambda DYNAMODB_TABLE | accessible-pdf-staging-data | accessible-pdf-production-data |
| Email Lambda SQS_QUEUE_URL | staging pipeline | production pipeline |
| Email Lambda FROM_EMAIL | [email protected] | [email protected] |
| Email Lambda IAM | staging SQS + DynamoDB only | added ProductionSqsAccess + ProductionDynamoAccess |
| SES receipt rule | [email protected] | [email protected] (in helpdesk rule set) |
| EC2 workers | Pull from accessible-pdf-production-worker ECR | Correct (production compute stack) |
| Staging Compute stack | Was running | Deleted |
| Staging Monitoring stack | Was running | Deleted |

Risk: Running cdk deploy --all will recreate staging compute/monitoring stacks, reset Lambda env vars to staging defaults, and break the production workflow.
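
One way to catch the revert described above is to compare the values a Lambda is expected to have against what is actually deployed. The following is a minimal sketch, not existing tooling: `findDrift` and both example maps are hypothetical, the values are the short resource names from the table rather than full queue URLs, and fetching the live configuration is left out.

```typescript
// Expected values come from the overrides table; actual values would come
// from the deployed function configuration (fetching them is out of scope here).
type EnvVars = Record<string, string>;

interface Drift {
  key: string;
  expected: string;
  actual: string | undefined;
}

// Return every expected variable whose deployed value is missing or different.
function findDrift(expected: EnvVars, actual: EnvVars): Drift[] {
  return Object.keys(expected)
    .filter((key) => actual[key] !== expected[key])
    .map((key) => ({ key, expected: expected[key], actual: actual[key] }));
}

const expectedEmailLambda: EnvVars = {
  SQS_QUEUE_URL: 'accessible-pdf-production-pipeline',
  DYNAMODB_TABLE: 'accessible-pdf-production-data',
};

// Example: a redeploy reset the queue back to the staging default.
const afterRedeploy: EnvVars = {
  SQS_QUEUE_URL: 'accessible-pdf-staging-pipeline',
  DYNAMODB_TABLE: 'accessible-pdf-production-data',
};

const drift = findDrift(expectedEmailLambda, afterRedeploy);
```

In this example `drift` contains a single entry for SQS_QUEUE_URL. An empty result would mean the deployed values match the expected ones.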

Goal

Deploy a single production CDK stack set that:

  • Uses production resource names and configuration
  • Eliminates all manual env var overrides
  • Is safe to cdk deploy --all without breaking anything
  • Keeps the staging environment definition for future use but does not deploy it by default

Plan

Phase 1: Update CDK to deploy production by default

File: infra/cdk/bin/app.ts

Currently the CDK app deploys staging stacks. Change it to deploy production stacks by default, controlled by an environment variable.

// Current: hardcoded staging
// const config = getEnvConfig('staging');
// Change to: default to production, override with CDK_ENV
const config = getEnvConfig(process.env.CDK_ENV || 'production');

This means cdk deploy deploys production. To deploy staging in the future: CDK_ENV=staging cdk deploy.
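
Because a mistyped CDK_ENV (for example `prod`) would otherwise be passed straight through, it may be worth validating the value before looking up the config. The sketch below is one possible approach under stated assumptions: `resolveEnvName` is a hypothetical helper, and the real signature of `getEnvConfig` is not shown in this plan.

```typescript
// Hypothetical helper: the real getEnvConfig signature is not shown in this plan.
type EnvName = 'staging' | 'production';

// Resolve the raw CDK_ENV value to a known environment name.
// Unset or empty falls back to production; any other unknown value is rejected.
function resolveEnvName(raw: string | undefined): EnvName {
  if (raw === undefined || raw.trim() === '') {
    return 'production';
  }
  const normalized = raw.trim().toLowerCase();
  if (normalized === 'staging' || normalized === 'production') {
    return normalized;
  }
  throw new Error(`Unknown CDK_ENV "${raw}"; expected "staging" or "production"`);
}
```

In bin/app.ts this could be used as `getEnvConfig(resolveEnvName(process.env.CDK_ENV))`, assuming `getEnvConfig` accepts those two names.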

Phase 2: Update env-config.ts production values

File: infra/cdk/lib/env-config.ts

Verify the production config matches what’s actually running. Current production config looks correct:

production: {
  environment: 'production',
  maxWorkerInstances: 4,
  enablePitr: true,
  alertEmail: '[email protected]',
  nodeEnv: 'production',
  frontendUrl: 'https://pdf.theaccessible.org',
  fromEmail: '[email protected]',
  emailRecipient: '[email protected]',
  s3BucketName: 'accessible-pdf-files',
}

No changes needed: these values are already correct for production.
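
As a cheap extra check on the production config, a scan for leftover staging values could look like the sketch below. It is only a lint, not a comparison against what is actually running. `findStagingLeftovers` is a hypothetical helper, and the email fields are left out because their values are not legible in this plan.

```typescript
// Subset of the production config from env-config.ts; the email fields are
// omitted here because their values are not legible in this plan.
type ConfigValue = string | number | boolean;

const productionConfig: Record<string, ConfigValue> = {
  environment: 'production',
  maxWorkerInstances: 4,
  enablePitr: true,
  nodeEnv: 'production',
  frontendUrl: 'https://pdf.theaccessible.org',
  s3BucketName: 'accessible-pdf-files',
};

// Return the keys whose string values mention "staging" (case-insensitive).
function findStagingLeftovers(config: Record<string, ConfigValue>): string[] {
  return Object.keys(config).filter((key) => {
    const value = config[key];
    return typeof value === 'string' && value.toLowerCase().indexOf('staging') !== -1;
  });
}

const leftovers = findStagingLeftovers(productionConfig);
```

For the values listed above, `leftovers` is empty.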

Phase 3: Fix api-stack.ts SQS queue reference

File: infra/cdk/lib/stacks/api-stack.ts

The Lambda env var SQS_QUEUE_URL comes from props.queue.queueUrl. When deploying as production, this will automatically be accessible-pdf-production-pipeline. No code change needed: deploying as production fixes this.

However, the CloudWatch Logs IAM permission that we added manually also needs to be in the CDK code. It should already be there from earlier today; verify it’s present.

Phase 4: Fix email-stack.ts

File: infra/cdk/lib/stacks/email-stack.ts

The email stack gets FROM_EMAIL, FRONTEND_URL, SQS_QUEUE_URL, and DYNAMODB_TABLE from CDK config/props. When deployed as production, all values will be correct automatically.

One change needed: the SES receipt rule is created by CDK in its own rule set (accessible-pdf-production-email), but the active rule set is helpdesk. CDK can’t control which rule set is active (that’s an account-level setting). Two options:

Option A (recommended): Remove the SES receipt rule from CDK. Manage it manually in the helpdesk rule set (where it already lives). Add a comment in email-stack.ts explaining this.

Option B: Have CDK create the rule but document that it must be manually copied to the helpdesk rule set.

Phase 5: Delete staging stacks

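Delete all remaining staging CloudFormation stacks, but only after the production stacks are deployed and verified (this is step 6 in the execution order). The staging compute and monitoring stacks are already deleted. Remaining: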

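# Order matters: dependencies must be deleted last
aws cloudformation delete-stack --stack-name AccessiblePdfStaging-Email --region us-east-1
aws cloudformation wait stack-delete-complete --stack-name AccessiblePdfStaging-Email --region us-east-1
aws cloudformation delete-stack --stack-name AccessiblePdfStaging-Api --region us-east-1
aws cloudformation wait stack-delete-complete --stack-name AccessiblePdfStaging-Api --region us-east-1
aws cloudformation delete-stack --stack-name AccessiblePdfStaging-Queue --region us-east-1
aws cloudformation wait stack-delete-complete --stack-name AccessiblePdfStaging-Queue --region us-east-1
aws cloudformation delete-stack --stack-name AccessiblePdfStaging-Storage --region us-east-1
aws cloudformation wait stack-delete-complete --stack-name AccessiblePdfStaging-Storage --region us-east-1
aws cloudformation delete-stack --stack-name AccessiblePdfStaging-Network --region us-east-1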

Before deleting:

  • Verify the production stacks exist and are healthy: aws cloudformation list-stacks --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE --region us-east-1
  • Verify the production Lambda is serving traffic: curl https://api-pdf.theaccessible.org/health
  • The staging DynamoDB table (accessible-pdf-staging-data) may have data from earlier test runs; check if anything needs to be preserved

After deleting:

  • Purge the staging SQS queue if it still exists
  • Delete the staging ECR repo images if the stack deletion fails on ECR (same issue we hit before)
  • Remove the accessible-pdf-staging-email SES rule set (orphaned, not active)
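
The "dependencies must be deleted last" rule can also be derived rather than maintained by hand. The sketch below computes a valid deletion order from a dependency map with a depth-first topological sort. The `dependsOn` map is a guess made for illustration; the real dependencies are defined in the CDK app and may differ.

```typescript
// Hypothetical dependency map: each stack lists the stacks it depends on.
// The real dependencies live in the CDK app and may differ.
const dependsOn: Record<string, string[]> = {
  'AccessiblePdfStaging-Email': ['AccessiblePdfStaging-Queue', 'AccessiblePdfStaging-Storage'],
  'AccessiblePdfStaging-Api': [
    'AccessiblePdfStaging-Queue',
    'AccessiblePdfStaging-Storage',
    'AccessiblePdfStaging-Network',
  ],
  'AccessiblePdfStaging-Queue': [],
  'AccessiblePdfStaging-Storage': [],
  'AccessiblePdfStaging-Network': [],
};

// Compute an order in which every stack is deleted before anything it depends on.
function deletionOrder(graph: Record<string, string[]>): string[] {
  const visited: Record<string, boolean> = {};
  const inProgress: Record<string, boolean> = {};
  const createOrder: string[] = [];

  function visit(stack: string): void {
    if (visited[stack]) return;
    if (inProgress[stack]) throw new Error(`Dependency cycle at ${stack}`);
    inProgress[stack] = true;
    for (const dep of graph[stack] || []) visit(dep);
    inProgress[stack] = false;
    visited[stack] = true;
    createOrder.push(stack); // dependencies are pushed before their dependents
  }

  Object.keys(graph).forEach((stack) => visit(stack));
  return createOrder.reverse(); // deletion is the reverse of creation
}

const order = deletionOrder(dependsOn);
```

Several orders can satisfy the same constraints, so the result need not match the manual sequence above exactly. What matters is that no stack is deleted while another stack still depends on it.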

Phase 6: Remove inline IAM policies

The manual IAM policies we added (ProductionSqsAccess, ProductionDynamoAccess) will become redundant once the production stacks own the correct resources. After deploying production stacks:

# These were added as workarounds; the production CDK stack grants the correct permissions
ROLE=$(aws lambda get-function --function-name accessible-pdf-production-api --query 'Configuration.Role' --output text --region us-east-1 | sed 's/.*\///')
aws iam delete-role-policy --role-name "$ROLE" --policy-name ProductionSqsAccess
ROLE=$(aws lambda get-function --function-name accessible-pdf-production-email-intake --query 'Configuration.Role' --output text --region us-east-1 | sed 's/.*\///')
aws iam delete-role-policy --role-name "$ROLE" --policy-name ProductionSqsAccess
aws iam delete-role-policy --role-name "$ROLE" --policy-name ProductionDynamoAccess
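
The sed expression in the commands above keeps everything after the last slash of the role ARN, which is the role name that delete-role-policy expects. If this cleanup is ever scripted in TypeScript, the same extraction could be sketched as follows. `roleNameFromArn` is a hypothetical helper, and unlike sed it rejects input that does not look like a role ARN instead of passing it through unchanged.

```typescript
// Extract the role name (the last path segment) from an IAM role ARN,
// mirroring the sed expression s/.*\/// used in the shell commands.
function roleNameFromArn(roleArn: string): string {
  const name = roleArn.slice(roleArn.lastIndexOf('/') + 1);
  if (roleArn.indexOf(':role/') === -1 || name === '') {
    throw new Error(`Not an IAM role ARN: ${roleArn}`);
  }
  return name;
}

// Example with a made-up account ID and role name, including a role path.
const exampleRole = roleNameFromArn('arn:aws:iam::123456789012:role/service-role/example-role');
```

Here `exampleRole` is `example-role`: the path segment `service-role/` is dropped, just as the sed expression would drop it.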

Phase 7: Deploy production stacks

cd infra/cdk
cdk deploy --all --require-approval never

This creates AccessiblePdfProd-* stacks (Network, Storage, Queue, Api, Email). The production compute and monitoring stacks already exist.

After deploy:

  • Verify health: curl https://api-pdf.theaccessible.org/health → should show platform: aws
  • Update Cloudflare LB origin if the API Gateway URL changes
  • Update the SES helpdesk rule set if the email Lambda ARN changes
  • Rebuild and push the worker Docker image to the production ECR repo if needed
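
The health verification in the list above could be automated with a small check on the response body. The sketch assumes the endpoint returns a JSON object with a string `platform` field; the actual payload is not shown in this plan, so treat the shape and the name `isServedByAws` as assumptions.

```typescript
// Assumed response shape: a JSON object with a string "platform" field.
// The real /health payload is not shown in this plan.
function isServedByAws(body: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch (_err) {
    return false; // not JSON at all, for example an HTML error page
  }
  if (typeof parsed !== 'object' || parsed === null) {
    return false;
  }
  return (parsed as { platform?: unknown }).platform === 'aws';
}
```

A caller would pass in the body returned by the health endpoint and treat `false` as a failed check, whether the body is malformed or reports a different platform.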

Phase 8: Update build/deploy scripts

File: workers/api/package.json

Add deploy scripts that make production the default:

{
  "deploy:lambda": "npm run build:lambda && cd ../../infra/cdk && npx cdk deploy AccessiblePdfProd-Api --require-approval never",
  "deploy:email": "cd ../email-intake && npm run build && cd ../../infra/cdk && npx cdk deploy AccessiblePdfProd-Email --require-approval never",
  "deploy:all": "npm run build:lambda && cd ../email-intake && npm run build && cd ../../infra/cdk && npx cdk deploy --all --require-approval never"
}

Phase 9: Update Cloudflare Load Balancer

If the API Gateway URL changes (new production stack = new API Gateway), update the aws-primary pool origin in the Cloudflare Load Balancer.

To check, look at the cdk deploy output, which shows the new ApiUrl.
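
Deciding whether the pool origin needs to change comes down to comparing the hostname in the new ApiUrl with the one currently configured. The sketch below assumes the origin is stored as a bare hostname or a URL, and it compares hostnames only (a changed stage path would not be detected). Both helper names and the example hostnames are made up.

```typescript
// Normalize either a bare hostname or a full URL to a lowercase hostname.
function hostOf(urlOrHost: string): string {
  const withScheme = urlOrHost.indexOf('://') === -1 ? `https://${urlOrHost}` : urlOrHost;
  return new URL(withScheme).hostname.toLowerCase();
}

// True when the load balancer origin no longer matches the deployed API URL.
function originNeedsUpdate(currentOrigin: string, newApiUrl: string): boolean {
  return hostOf(currentOrigin) !== hostOf(newApiUrl);
}

// Made-up example values; real ones come from the LB config and the deploy output.
const unchanged = originNeedsUpdate(
  'abc123.execute-api.us-east-1.amazonaws.com',
  'https://abc123.execute-api.us-east-1.amazonaws.com/prod/',
);
const changed = originNeedsUpdate(
  'abc123.execute-api.us-east-1.amazonaws.com',
  'https://xyz789.execute-api.us-east-1.amazonaws.com/prod/',
);
```

In the example, `unchanged` is false (same host, no update needed) and `changed` is true.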

Execution order

| Step | Action | Risk | Rollback |
| --- | --- | --- | --- |
| 1 | Update bin/app.ts to default to production | None (code change only) | Revert file |
| 2 | Deploy production stacks with cdk deploy --all | Medium: new API Gateway URL may differ | Use old staging stacks until LB updated |
| 3 | Update CF LB origin if API Gateway URL changed | Low | Revert LB config |
| 4 | Update SES helpdesk rule with new email Lambda ARN | Low | Update ARN back |
| 5 | Verify everything works end-to-end | n/a | n/a |
| 6 | Delete staging stacks | Low (staging is unused) | Can’t easily undo, but not needed |
| 7 | Clean up inline IAM policies | Low | Re-add if needed |
| 8 | Commit deploy scripts | None | n/a |

What NOT to change

  • Production compute stack (AccessiblePdfProd-Compute): already running, workers healthy
  • Production monitoring stack (AccessiblePdfProd-Monitoring): already running
  • Cloudflare Load Balancer: only update origin if API Gateway URL changes
  • SES helpdesk rule set: keep as active set, just update Lambda ARN if it changes
  • SSM parameters: shared across staging/production, no changes needed

Effort estimate

| Task | Effort |
| --- | --- |
| Update CDK code (app.ts, email-stack.ts) | 30 min |
| Deploy production stacks | 15 min |
| Update LB + SES rule | 15 min |
| Delete staging stacks | 15 min |
| Verify end-to-end | 30 min |
| Update deploy scripts + commit | 15 min |
| Total | ~2 hours |