AWS Setup Guide

This guide covers everything needed to deploy the Accessible PDF Converter on AWS. The AWS deployment runs alongside the existing Cloudflare Workers deployment; it does not replace it.

Architecture

Frontend (Next.js on CF Pages)
        |
        v
Hono API (Lambda + API Gateway HTTP API)
        |
        +--> DynamoDB (sessions, progress, results)
        +--> S3 (PDF files, converted HTML)
        +--> SQS (pipeline queue)
                |
                v
        EC2 Spot Fleet (ASG, 0-4 instances)
        [Docker: Node.js + Puppeteer + Chrome]
           |             |             |
      S3 (output)    DynamoDB    SES (notifications)

SES (email intake) --> Lambda --> SQS --> EC2 workers

Prerequisites

  • AWS CLI configured with credentials (aws configure)
  • Node.js 20+
  • Docker installed
  • CDK CLI (npm install -g aws-cdk)

Step 1: Bootstrap CDK

One-time setup per AWS account/region:

cdk bootstrap aws://YOUR_ACCOUNT_ID/us-east-1

Step 2: Store API Keys in SSM Parameter Store

All API keys are stored as encrypted SSM parameters under /accessible-pdf/. The batch workers load these at startup.

# Required for claude-vision, unpdf-claude, mistral-claude backends
aws ssm put-parameter \
  --name "/accessible-pdf/ANTHROPIC_API_KEY" \
  --value "sk-ant-..." \
  --type "SecureString" \
  --region us-east-1

# Required for gemini-flash, gemini-pro, mistral-gemini backends
aws ssm put-parameter \
  --name "/accessible-pdf/GEMINI_API_KEY" \
  --value "..." \
  --type "SecureString" \
  --region us-east-1

# Required for mistral-ocr, mistral-gemini, mistral-claude backends
aws ssm put-parameter \
  --name "/accessible-pdf/MISTRAL_API_KEY" \
  --value "..." \
  --type "SecureString" \
  --region us-east-1

# Required for marker-api backend
aws ssm put-parameter \
  --name "/accessible-pdf/MARKER_API_KEY" \
  --value "..." \
  --type "SecureString" \
  --region us-east-1

# Required for mathpix backend (both parameters)
aws ssm put-parameter \
  --name "/accessible-pdf/MATHPIX_APP_ID" \
  --value "..." \
  --type "SecureString" \
  --region us-east-1
aws ssm put-parameter \
  --name "/accessible-pdf/MATHPIX_APP_KEY" \
  --value "..." \
  --type "SecureString" \
  --region us-east-1

# Optional
aws ssm put-parameter \
  --name "/accessible-pdf/GITHUB_TOKEN" \
  --value "ghp_..." \
  --type "SecureString" \
  --region us-east-1
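
The comments in the commands above imply a mapping from each backend to the keys it needs. The sketch below writes that mapping down and checks which parameters are still missing for a given backend. It is an illustration only: the helper names (requiredParameters, missingParameters) are hypothetical, and whether the real batch worker validates keys per backend is an assumption.

```typescript
// Prefix under which all keys are stored (from Step 2 of this guide).
const SSM_PREFIX = "/accessible-pdf/";

// Backend -> required key names, transcribed from the comments in Step 2.
const REQUIRED_KEYS: Record<string, string[]> = {
  "claude-vision": ["ANTHROPIC_API_KEY"],
  "unpdf-claude": ["ANTHROPIC_API_KEY"],
  "gemini-flash": ["GEMINI_API_KEY"],
  "gemini-pro": ["GEMINI_API_KEY"],
  "mistral-ocr": ["MISTRAL_API_KEY"],
  "mistral-gemini": ["MISTRAL_API_KEY", "GEMINI_API_KEY"],
  "mistral-claude": ["MISTRAL_API_KEY", "ANTHROPIC_API_KEY"],
  "marker-api": ["MARKER_API_KEY"],
  "mathpix": ["MATHPIX_APP_ID", "MATHPIX_APP_KEY"],
};

// Hypothetical helper: full SSM parameter names a backend needs.
function requiredParameters(backend: string): string[] {
  const keys = REQUIRED_KEYS[backend];
  if (keys === undefined) {
    throw new Error(`Unknown backend: ${backend}`);
  }
  return keys.map((key) => SSM_PREFIX + key);
}

// Hypothetical helper: required parameters that are not in `present`,
// where `present` is the list of parameter names already stored in SSM.
function missingParameters(backend: string, present: string[]): string[] {
  const have = new Set(present);
  return requiredParameters(backend).filter((name) => !have.has(name));
}
```

The list of stored names could come from `aws ssm get-parameters-by-path --path /accessible-pdf/ --region us-east-1`, which returns every parameter under the prefix.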

Step 3: Deploy CDK Stacks

cd infra/cdk
npm install
cdk deploy --all

This creates 7 stacks:

Stack                      Resources
AccessiblePdfNetwork       VPC, 2 public subnets, security group
AccessiblePdfStorage       S3 bucket (accessible-pdf-files-{accountId}), DynamoDB table (accessible-pdf)
AccessiblePdfQueue         SQS queue (accessible-pdf-pipeline), DLQ (accessible-pdf-pipeline-dlq)
AccessiblePdfCompute       ECR repo, launch template, Auto Scaling Group (0-4 spot instances)
AccessiblePdfApi           Lambda function, HTTP API Gateway
AccessiblePdfEmail         SES receipt rule, email S3 bucket, email Lambda
AccessiblePdfMonitoring    CloudWatch dashboard, alarms, SNS topic

To include email alerts for alarms:

cdk deploy --all --context [email protected]

Step 4: Build and Push Worker Docker Image

# From project root
npm install
npm run build --workspace=packages/shared
npm run build --workspace=workers/batch

# Get ECR URI from stack output
ECR_URI=$(aws cloudformation describe-stacks \
  --stack-name AccessiblePdfCompute \
  --query 'Stacks[0].Outputs[?OutputKey==`EcrRepoUri`].OutputValue' \
  --output text)

# Log in to ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin "$ECR_URI"

# Build and push (quotes guard against an empty or malformed URI)
docker build -f infra/cdk/docker/worker/Dockerfile -t "$ECR_URI:latest" .
docker push "$ECR_URI:latest"

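The worker Docker image is based on ghcr.io/puppeteer/puppeteer:latest (Chrome pre-installed). Workers run on c6g.large, c6a.large, and m6g.large instances, all on spot capacity (100% spot). Note that c6g and m6g are Arm-based (Graviton) while c6a is x86_64, so the pushed image must match the architecture of whichever instance type launches.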

Step 5: Build and Deploy Lambda Functions

API Lambda

cd workers/api
npm run build # outputs to dist-lambda/

The API Lambda is deployed via CDK from workers/api/dist-lambda/. To update after code changes, rebuild, then redeploy the stack from infra/cdk:

cdk deploy AccessiblePdfApi

Email Intake Lambda

cd workers/email-intake
npm run build # outputs to dist/
cd ../../infra/cdk # cdk commands run from the CDK app directory (see Step 3)
cdk deploy AccessiblePdfEmail

Step 6: Configure SES for Email Intake

Verify Domain

aws ses verify-domain-identity --domain pdf.anglin.com --region us-east-1

Add DNS Records

MX record (for receiving at [email protected]):

pdf.anglin.com. MX 10 inbound-smtp.us-east-1.amazonaws.com.

SPF record (for sending reply emails):

pdf.anglin.com. TXT "v=spf1 include:amazonses.com ~all"

DKIM records (3 CNAME records generated by SES after domain verification).

DMARC record (optional):

_dmarc.pdf.anglin.com. TXT "v=DMARC1; p=none;"

Request Production Access

New SES accounts start in sandbox mode, which only allows sending to verified addresses. Request production access in the SES console to send to any address.

Step 7: Point Frontend to AWS API

Update the frontend environment variable:

# In apps/web/.env or deployment config
NEXT_PUBLIC_API_URL=https://YOUR_API_GATEWAY_URL

The API Gateway URL is output by the AccessiblePdfApi stack.

How It Works

Cloudflare (current, unchanged)

POST /api/benchmark runs all pipelines synchronously: the request blocks until every pipeline completes, then returns the full result.

AWS (new)

POST /api/benchmark enqueues one SQS message per pipeline (file, backend, uxOptimizer) and returns immediately with status: "running". The frontend already polls GET /api/benchmark/:sessionId every 3 seconds. EC2 workers pull messages from SQS, execute pipelines, and write results to DynamoDB. When the last pipeline finishes, the session is marked complete.
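
The fan-out and completion logic described above can be sketched as follows. This is a minimal illustration, not the project's code: the type and function names, the message fields, and the treatment of uxOptimizer as a boolean flag are all assumptions.

```typescript
// One pipeline to run. Field names follow the (file, backend, uxOptimizer)
// triple from the text; treating uxOptimizer as a boolean is an assumption.
interface PipelineSpec {
  file: string;
  backend: string;
  uxOptimizer: boolean;
}

// Hypothetical SQS message body: the pipeline plus the session it belongs to.
interface PipelineMessage extends PipelineSpec {
  sessionId: string;
}

// Hypothetical session record, as it might be stored in DynamoDB.
interface SessionProgress {
  sessionId: string;
  total: number;
  finished: number;
  status: "running" | "complete";
}

// POST /api/benchmark: build one message per pipeline and return immediately
// with a session in the "running" state.
function startSession(
  sessionId: string,
  pipelines: PipelineSpec[],
): { messages: PipelineMessage[]; session: SessionProgress } {
  const messages = pipelines.map((p) => ({ sessionId, ...p }));
  return {
    messages,
    session: { sessionId, total: pipelines.length, finished: 0, status: "running" },
  };
}

// Worker side: record one finished pipeline. The session flips to "complete"
// when the last pipeline reports in.
function recordResult(session: SessionProgress): SessionProgress {
  const finished = session.finished + 1;
  return {
    ...session,
    finished,
    status: finished >= session.total ? "complete" : "running",
  };
}
```

Because several workers can finish at the same time, a real implementation would need the counter update to be atomic (for example, a DynamoDB update expression using ADD) rather than a read-modify-write like this in-memory version.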

No Automatic Failover

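The CF and AWS deployments are completely independent, and nothing switches between them automatically. To fail over, change NEXT_PUBLIC_API_URL to the other deployment's URL. Because Next.js inlines NEXT_PUBLIC_ variables at build time, the frontend must be rebuilt and redeployed for the change to take effect.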

Environment Variables Reference

Batch Worker (set by EC2 launch template)

Variable          Description
AWS_REGION        AWS region (default: us-east-1)
S3_BUCKET         S3 bucket name
DYNAMODB_TABLE    DynamoDB table name
SQS_QUEUE_URL     SQS queue URL

API keys are loaded from SSM at startup, not from environment variables.
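
As a sketch of how a worker might read these variables at startup, the function below applies the documented us-east-1 default for AWS_REGION and fails fast when a required variable is missing. The WorkerConfig type and loadWorkerConfig name are hypothetical, and failing fast is an assumption about desirable behavior, not a description of the real worker.

```typescript
// Hypothetical shape for the batch worker's configuration.
interface WorkerConfig {
  region: string;
  bucket: string;
  table: string;
  queueUrl: string;
}

// Read the variables from the table above out of an environment map.
// AWS_REGION falls back to us-east-1; the other three are required.
function loadWorkerConfig(env: Record<string, string | undefined>): WorkerConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (value === undefined || value === "") {
      throw new Error(`Missing required environment variable: ${name}`);
    }
    return value;
  };
  return {
    region: env["AWS_REGION"] || "us-east-1",
    bucket: required("S3_BUCKET"),
    table: required("DYNAMODB_TABLE"),
    queueUrl: required("SQS_QUEUE_URL"),
  };
}

// In a Node.js worker this would be called as: loadWorkerConfig(process.env)
```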

API Lambda (set by CDK)

Variable               Description
S3_BUCKET              S3 bucket name
DYNAMODB_TABLE         DynamoDB table name
SQS_QUEUE_URL          SQS queue URL
FRONTEND_URL           Frontend URL for CORS
SUPABASE_JWT_SECRET    JWT secret for auth

Email Intake Lambda (set by CDK)

Variable          Description
S3_BUCKET         S3 bucket name
DYNAMODB_TABLE    DynamoDB table name
SQS_QUEUE_URL     SQS queue URL
FROM_EMAIL        Reply-from address (default: [email protected])
FRONTEND_URL      Frontend URL for result links

Auto Scaling Behavior

Queue Depth     Target Instances
0 messages      0 (scale to zero)
1+ messages     1
10+ messages    2
50+ messages    4

Cooldown: 5 minutes. Cold start: ~2-3 minutes (instance launch + Docker pull).
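
The step thresholds in the table can be expressed as a small lookup. In the deployment the scaling decision is made by the Auto Scaling Group, not by application code, so this function (a hypothetical name) only illustrates the mapping from queue depth to target instance count.

```typescript
// Thresholds from the table above, highest first so the first match wins.
const SCALING_STEPS: Array<{ minMessages: number; instances: number }> = [
  { minMessages: 50, instances: 4 },
  { minMessages: 10, instances: 2 },
  { minMessages: 1, instances: 1 },
];

// Hypothetical helper: target instance count for a given queue depth.
function targetInstances(queueDepth: number): number {
  for (const step of SCALING_STEPS) {
    if (queueDepth >= step.minMessages) {
      return step.instances;
    }
  }
  return 0; // empty queue: scale to zero
}
```

For example, a depth of 9 still maps to one instance, while a depth of 10 maps to two.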

Spot interruption handling: workers monitor the EC2 instance metadata endpoint. When the 2-minute interruption warning appears, they stop pulling new messages and let in-progress work finish. Any SQS message whose work did not finish becomes visible again after the 15-minute visibility timeout, so another worker can retry it. S3 conversion caching prevents duplicate work on retry.
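
A minimal sketch of the metadata check, assuming IMDSv2 (a session token is fetched first). The spot/instance-action path returns 404 until an interruption is scheduled and 200 once one is. The HttpFn type and function names are hypothetical, and the HTTP function is passed in so the sketch has no dependencies; how the real worker polls is not stated in this guide.

```typescript
// Minimal HTTP shape used by this sketch. Node.js 20's built-in fetch
// should fit it, but that is an assumption to verify in your setup.
interface HttpResponse {
  status: number;
  text(): Promise<string>;
}
type HttpFn = (
  url: string,
  init: { method: string; headers: Record<string, string> },
) => Promise<HttpResponse>;

const IMDS_BASE = "http://169.254.169.254/latest";

// IMDS answers 404 while no interruption is scheduled, 200 once one is.
function isInterruptionNotice(status: number): boolean {
  return status === 200;
}

// Ask the instance metadata service whether a spot interruption is pending.
async function spotInterruptionPending(http: HttpFn): Promise<boolean> {
  // IMDSv2: obtain a short-lived session token first.
  const tokenResponse = await http(`${IMDS_BASE}/api/token`, {
    method: "PUT",
    headers: { "X-aws-ec2-metadata-token-ttl-seconds": "60" },
  });
  const token = await tokenResponse.text();

  const response = await http(`${IMDS_BASE}/meta-data/spot/instance-action`, {
    method: "GET",
    headers: { "X-aws-ec2-metadata-token": token },
  });
  return isInterruptionNotice(response.status);
}
```

A worker loop could call spotInterruptionPending every few seconds and, once it returns true, stop receiving from SQS while letting the current pipeline finish.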

Monitoring

The AccessiblePdfMonitoring stack creates:

  • CloudWatch Dashboard (accessible-pdf-converter): queue depth, DLQ messages, instance count, Lambda metrics
  • Alarms: DLQ has messages, queue backlog > 50, API 5xx errors

Cost Estimate

Volume             Monthly Cost
5K docs/month      ~$34
10K docs/month     ~$66

Workers scale to zero when idle. The main cost is EC2 spot compute (~$0.02-0.03/hr per instance).
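
For a rough per-document figure, the monthly estimates can be divided by volume. The helper below is only a worked calculation on the two rows of the table; its name is hypothetical and the inputs are the approximate totals quoted above.

```typescript
// Approximate cost per document, from a monthly total and a monthly volume.
function costPerDocument(monthlyCostUsd: number, docsPerMonth: number): number {
  if (docsPerMonth <= 0) {
    throw new Error("docsPerMonth must be positive");
  }
  return monthlyCostUsd / docsPerMonth;
}

// Figures from the table: ~$34 for 5K docs, ~$66 for 10K docs.
console.log(costPerDocument(34, 5000).toFixed(4));  // prints "0.0068"
console.log(costPerDocument(66, 10000).toFixed(4)); // prints "0.0066"
```

Both volumes work out to roughly two-thirds of a cent per document, so the estimate scales close to linearly across this range.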