
Cloud Run Failover

How It Works

The CF Worker (api-pdf.theaccessible.org) proxies heavy processing requests (conversions, exports, remediation) to backend Node.js servers. Backends are configured as an ordered, comma-separated list in NODE_API_URLS. Each backend gets its own circuit breaker. The Worker routes to the first healthy backend.

User Request
    │
    ▼
Cloudflare Worker (api-pdf.theaccessible.org)
    │
    ├── 1. https://api.theaccessible.org (self-hosted, primary)
    │      Circuit breaker: 2 failures → open for 30s
    │
    ├── 2. https://pdf-api-763162717299.us-central1.run.app (Cloud Run, failover)
    │      Circuit breaker: same logic, independent state
    │
    └── 3. All backends down → 503 "PDF processing server temporarily unavailable"
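
The routing order in the diagram can be sketched roughly as follows. This is a minimal illustration, not the code in node-proxy.ts: `parseBackends` and `pickBackend` are hypothetical names, and the `isAvailable` predicate stands in for whatever per-backend circuit breaker and health check state the Worker actually keeps.

```typescript
// Hypothetical helpers; only NODE_API_URLS and the 503 message come from the text above.

/** Split the comma-separated NODE_API_URLS value into an ordered backend list. */
function parseBackends(raw: string | undefined): string[] {
  return (raw ?? "")
    .split(",")
    .map((url) => url.trim().replace(/\/+$/, "")) // strip whitespace and trailing slashes
    .filter((url) => url.length > 0);
}

/**
 * Return the first backend that the predicate reports as available,
 * or null when every backend is down (the caller would answer 503).
 */
function pickBackend(
  backends: string[],
  isAvailable: (url: string) => boolean,
): string | null {
  for (const url of backends) {
    if (isAvailable(url)) return url;
  }
  return null;
}

const backends = parseBackends(
  "https://api.theaccessible.org, https://pdf-api-763162717299.us-central1.run.app",
);
// Pretend the primary's circuit is open: the Cloud Run URL is selected.
const chosen = pickBackend(backends, (url) => url !== "https://api.theaccessible.org");
console.log(chosen ?? '503 "PDF processing server temporarily unavailable"');
```

Because the list is ordered, putting the self-hosted URL first is what makes it the primary; the sketch never load-balances between backends.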

Proxied Routes

Only POST requests matching these patterns are proxied:

  • /api/convert/:fileId - PDF conversion
  • /api/convert/:fileId/preview - Preview conversion
  • /api/remediate/html - HTML remediation
  • /api/remediate/batch - Batch remediation
  • /api/remediate/url - URL remediation
  • /api/gateway/convert - Full pipeline conversion

All other requests are handled directly by the CF Worker.
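
As an illustration of the rule above, the check could look something like this. It is a sketch under assumptions: the real Worker uses Hono routing, so `shouldProxy` and the regular expressions are hypothetical, and details such as trailing-slash handling may differ.

```typescript
// Patterns transcribed from the list above; ":fileId" is treated as one path segment.
const PROXIED_PATTERNS: RegExp[] = [
  /^\/api\/convert\/[^\/]+$/,
  /^\/api\/convert\/[^\/]+\/preview$/,
  /^\/api\/remediate\/html$/,
  /^\/api\/remediate\/batch$/,
  /^\/api\/remediate\/url$/,
  /^\/api\/gateway\/convert$/,
];

/** True when the request should be forwarded to a Node backend. */
function shouldProxy(method: string, pathname: string): boolean {
  if (method.toUpperCase() !== "POST") return false; // only POST is proxied
  return PROXIED_PATTERNS.some((pattern) => pattern.test(pathname));
}

console.log(shouldProxy("POST", "/api/convert/abc123")); // proxied
console.log(shouldProxy("GET", "/api/convert/abc123")); // handled by the Worker
```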

Circuit Breaker

Each backend has independent circuit breaker state:

Setting | Value
Health check interval | 30s (cached between checks)
Health check timeout | 3s
Failure threshold | 2 consecutive failures to open
Cooldown | 30s before retrying an open circuit
Proxy timeout | 5 minutes (for long conversions)

The health check hits GET /health on each backend.
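
The threshold and cooldown settings can be modelled with a small state machine. The sketch below is an assumption about how such a breaker is typically written, not a copy of node-proxy.ts: the class and method names are hypothetical, the clock is passed in so the behaviour is easy to follow, and the 30s health check cache is left out.

```typescript
// Values from the settings table; everything else is illustrative.
const FAILURE_THRESHOLD = 2;
const COOLDOWN_MS = 30 * 1000;

class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  /** True when a request (or health check) may be attempted against this backend. */
  canAttempt(now: number): boolean {
    if (this.openedAt === null) return true; // circuit closed
    return now - this.openedAt >= COOLDOWN_MS; // open: allow a retry after the cooldown
  }

  /** A healthy response closes the circuit and resets the failure count. */
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  /** Count a failure; reaching the threshold opens (or re-opens) the circuit. */
  recordFailure(now: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= FAILURE_THRESHOLD) {
      this.openedAt = now;
    }
  }
}

const primary = new CircuitBreaker();
primary.recordFailure(0);
primary.recordFailure(1000); // second consecutive failure opens the circuit
console.log(primary.canAttempt(2000)); // still inside the 30s cooldown
console.log(primary.canAttempt(31000)); // cooldown elapsed, retry allowed
```

One breaker instance per backend URL gives the independent state described above: the primary can be open while the Cloud Run breaker stays closed.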

Authentication

Cloud Run is publicly accessible (IAM allows allUsers) but protected by a shared secret:

  • The CF Worker sends an X-Proxy-Secret header on all proxied requests
  • The Node server middleware (server.ts) validates this header when the PROXY_SHARED_SECRET env var is set
  • Health checks (/health, /) are exempt so Cloud Run can probe the container
  • Local servers don’t set PROXY_SHARED_SECRET, so the check is skipped
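
The gate described by these rules reduces to a small decision function. This is a hedged sketch rather than the middleware in server.ts: `isAuthorized` and `EXEMPT_PATHS` are hypothetical names, and a production version may prefer a constant-time comparison for the secret.

```typescript
// Paths exempt from the check so Cloud Run can probe the container.
const EXEMPT_PATHS = ["/health", "/"];

/**
 * Decide whether a request passes the shared secret gate.
 * headerValue is the X-Proxy-Secret header; configuredSecret is PROXY_SHARED_SECRET.
 */
function isAuthorized(
  pathname: string,
  headerValue: string | undefined,
  configuredSecret: string | undefined,
): boolean {
  if (!configuredSecret) return true; // local servers: secret unset, check skipped
  if (EXEMPT_PATHS.indexOf(pathname) !== -1) return true; // health probes
  return headerValue === configuredSecret;
}

// With a secret configured, a request without the header is rejected (401 upstream).
console.log(isAuthorized("/api/files", undefined, "example-secret"));
```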

Cold Start

Cloud Run is configured with min-instances=0 (scales to zero when idle).

  • Cold start latency: 5-15 seconds
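  • Impact: On the first failover request, the circuit breaker health check (3s timeout) may time out against a cold Cloud Run instance, because a cold start takes longer than the timeout. Once the failure threshold is reached, the circuit opens for 30s. The failed probes have already triggered the instance start, so on the next attempt after the cooldown, Cloud Run is warm and responds.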
  • To eliminate cold starts: Set --min-instances 1 (~$15/mo)
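
To see why a 5-15 second cold start fails a 3s health check, here is a sketch of a timed probe. It assumes the probe is a plain fetch of /health aborted by a timer; `checkHealth` and the injectable `Fetcher` type are hypothetical, and the real Worker may implement the timeout differently.

```typescript
// Minimal shape of fetch needed here; the global fetch is assumed to satisfy it.
type Fetcher = (url: string, init: { signal: AbortSignal }) => Promise<{ ok: boolean }>;

/** Probe GET /health and report false on a non-2xx status, network error, or timeout. */
async function checkHealth(
  baseUrl: string,
  fetcher: Fetcher,
  timeoutMs: number = 3000,
): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetcher(`${baseUrl}/health`, { signal: controller.signal });
    return res.ok;
  } catch {
    return false; // aborted by the timer or failed at the network level
  } finally {
    clearTimeout(timer);
  }
}
```

In a Worker or a recent Node.js runtime this would be called as `checkHealth(url, fetch)`. A backend that needs 5 seconds to start answering is aborted at the 3 second mark and counted as a failure, even though the container is already starting in the background.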

Statelessness

Both backends share the same external services, so no data sync is needed:

  • R2 bucket: accessible-pdf-files (via S3-compatible API)
  • KV namespaces: KV_SESSIONS and KV_RATE_LIMIT (via CF REST API)
  • Supabase: Same project (vuvwmfxssjosfphzpzim)
  • AI APIs: Same keys (Anthropic, Gemini, Mathpix)

Key Files

File | Role
workers/api/src/services/node-proxy.ts | Multi-backend proxy with per-URL circuit breakers
workers/api/src/middleware/node-proxy.ts | Hono middleware: reads NODE_API_URLS, passes to proxy
workers/api/src/server.ts | Node server: shared secret gate middleware
workers/api/src/types/env.ts | NODE_API_URLS, PROXY_SHARED_SECRET type defs
workers/api/wrangler.toml | Production NODE_API_URLS config

Infrastructure

GCP Project

  • Project ID: pdf-theaccessible-org
  • Region: us-central1
  • Artifact Registry: us-central1-docker.pkg.dev/pdf-theaccessible-org/pdf-api/node-server

Cloud Run Service

  • Service: pdf-api
  • URL: https://pdf-api-763162717299.us-central1.run.app
  • CPU: 2 vCPUs
  • Memory: 2 GiB
  • Min instances: 0 (scales to zero)
  • Max instances: 3
  • Timeout: 300s (5 min)
  • Port: 8790

Secrets (Google Secret Manager)

All sensitive values are stored in Secret Manager and bound to Cloud Run:

R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, CF_API_TOKEN, SUPABASE_SERVICE_ROLE_KEY, SUPABASE_JWT_SECRET, ANTHROPIC_API_KEY, GEMINI_API_KEY, MATHPIX_APP_ID, MATHPIX_APP_KEY, RESEND_API_KEY, PROXY_SHARED_SECRET

Non-sensitive env vars are set directly: ENVIRONMENT, FRONTEND_URL, R2_ACCOUNT_ID, R2_BUCKET_NAME, CF_ACCOUNT_ID, KV_SESSIONS_NAMESPACE_ID, SUPABASE_URL, ALERT_EMAIL

CF Worker Secrets

  • NODE_API_URLS - comma-separated backend list
  • PROXY_SHARED_SECRET - shared secret for Cloud Run auth

Cost

Scenario | Cost
Idle (no failover events) | $0/mo
1-hour outage, moderate load | ~$0.05-0.50
1-day outage, moderate load | ~$1-5
Always-warm (min-instances=1) | ~$15/mo

Deployment

When the Node server code changes, rebuild and redeploy Cloud Run:

# On the Node server (10.1.1.4)
ssh -i ~/.ssh/nightly-audit [email protected]
cd ~/accessible-pdf-converter
git pull
# Build and push
docker build -t us-central1-docker.pkg.dev/pdf-theaccessible-org/pdf-api/node-server:latest \
  -f workers/api/Dockerfile .
docker push us-central1-docker.pkg.dev/pdf-theaccessible-org/pdf-api/node-server:latest
# Redeploy Cloud Run
gcloud run deploy pdf-api \
  --image us-central1-docker.pkg.dev/pdf-theaccessible-org/pdf-api/node-server:latest \
  --region us-central1

Updating Secrets

# Update a secret value
echo -n "NEW_VALUE" | gcloud secrets versions add SECRET_NAME --data-file=-
# Cloud Run picks up `:latest` on next deploy
gcloud run deploy pdf-api \
  --image us-central1-docker.pkg.dev/pdf-theaccessible-org/pdf-api/node-server:latest \
  --region us-central1

Testing the Failover

1. Verify Cloud Run is healthy

curl -s https://pdf-api-763162717299.us-central1.run.app/health
# Should return: {"success":true,"data":{"status":"healthy",...}}

2. Verify shared secret blocks unauthenticated requests

curl -s -o /dev/null -w "%{http_code}" https://pdf-api-763162717299.us-central1.run.app/api/files
# Should return: 401

3. Verify normal traffic goes to local servers

With local servers running, submit a conversion via the UI or API. Check local server logs:

ssh -i ~/.ssh/nightly-audit [email protected]
docker compose logs -f api-node-1 api-node-2

You should see the request logged on the local server, not Cloud Run.

4. Simulate an outage

Stop the local servers:

ssh -i ~/.ssh/nightly-audit [email protected]
cd ~/accessible-pdf-converter
docker compose stop api-node-1 api-node-2

5. Trigger failover

Submit a conversion. The CF Worker will:

  1. Health-check api.theaccessible.org, which fails (servers stopped)
  2. Record the failure and check again; the second failure opens the circuit breaker
  3. Try the next URL, pdf-api-763162717299.us-central1.run.app, which is healthy
  4. Proxy the request to Cloud Run

Check Cloud Run logs:

gcloud run services logs read pdf-api --region us-central1 --limit=20

6. Restore local servers

ssh -i ~/.ssh/nightly-audit [email protected]
cd ~/accessible-pdf-converter
docker compose up -d api-node-1 api-node-2

After 30s (circuit breaker cooldown), the next request will health-check the local server again. Once healthy, traffic returns to local.

7. Verify recovery

Submit another conversion. Check local server logs to confirm traffic is back on the primary.