Conversion Pipeline: Upload to Final HTML

This document describes the complete flow from when a file is uploaded until the final accessible HTML is saved, including how the system decides which processing path to use for each file or page, and every AI service involved.

1. High-Level Architecture
2. Upload and Ingestion
3. Pre-Conversion Checks
4. Content Classification
5. Routing Decision Tree
6. Processing Pipelines
7. Post-Processing Pipeline
8. AI Services Reference
9. Key Files Reference

1. High-Level Architecture

The system runs on two platforms:

Cloudflare Worker (workers/api/): handles auth, credential validation, preflight analysis, routing decisions, and lightweight conversions (Mathpix, Marker). Has a 10-minute timeout.
Node.js Server: handles Puppeteer rendering, the chunk scheduler, SSE streaming, axe-core audits, and all AI-heavy conversions. Two instances (api-node-1, api-node-2) run behind Traefik. No time limit.

The CF Worker proxies heavy-computation requests to Node.js via middleware/node-proxy.ts when browser-based operations (screenshot rendering, axe audit) are needed.

Storage:

System	Purpose
Cloudflare R2 / S3	PDF originals, chunk HTML fragments, final HTML output
Supabase PostgreSQL	`files` (file metadata), `large_conversion_jobs`, `chunk_jobs`, `profiles`, `credits`, cost ledger
Cloudflare KV (`KV_SESSIONS`)	Session caching, rate limits, file settings, share tokens, tenant config
Supabase Auth	Session tokens and user authentication

2. Upload and Ingestion

Source: workers/api/src/routes/files.ts

Entry Points

Endpoint	Purpose
`POST /api/files/upload`	Allocates a file ID and metadata record
`PUT /api/files/:fileId/upload-data`	Receives the actual file bytes
`POST /api/files/from-url`	Fetches a remote PDF by URL (SSRF-protected, PDF only)

Accepted File Types

PDF: application/pdf
Images: image/png, image/jpeg, image/webp, image/gif, image/tiff
DOCX: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Size limits: 10 MB standard, 200 MB for async large-PDF pipeline

What Happens

POST /api/files/upload — validates MIME type, generates a UUID fileId, creates an UploadedFile metadata record with status: 'uploading', upserts it to the Supabase files table.
PUT /api/files/:fileId/upload-data — reads raw bytes, writes to R2 at users/{userId}/uploads/{fileId}/original/{filename}, sets status: 'uploaded'.
POST /api/files/from-url — validates URL with validateFetchUrl() (SSRF protection), fetches with 30-second timeout, validates Content-Type is PDF, stores identically to direct upload.
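
The upload rules above can be sketched as three small helpers. This is a minimal illustration, not the code in files.ts: the names `isAcceptedMimeType`, `maxBytesFor`, and `buildUploadKey` are hypothetical, and treating 1 MB as 1024 × 1024 bytes is an assumption. Only the MIME list, the size limits, and the key layout come from this section.

```typescript
// Hypothetical helpers; only the MIME types, limits, and key layout come from this document.
const ACCEPTED_MIME_TYPES: string[] = [
  "application/pdf",
  "image/png",
  "image/jpeg",
  "image/webp",
  "image/gif",
  "image/tiff",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
];

// Assumption: 1 MB = 1024 * 1024 bytes.
const BYTES_PER_MB = 1024 * 1024;

function isAcceptedMimeType(mimeType: string): boolean {
  return ACCEPTED_MIME_TYPES.indexOf(mimeType.toLowerCase()) !== -1;
}

// 10 MB standard, 200 MB when the async large-PDF pipeline is used.
function maxBytesFor(asyncLargePdf: boolean): number {
  return (asyncLargePdf ? 200 : 10) * BYTES_PER_MB;
}

// Mirrors users/{userId}/uploads/{fileId}/original/{filename}.
function buildUploadKey(userId: string, fileId: string, filename: string): string {
  return `users/${userId}/uploads/${fileId}/original/${filename}`;
}
```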

3. Pre-Conversion Checks

Source: workers/api/src/routes/convert.ts (lines 238–444)

Entry point: POST /api/convert/:fileId

Before any conversion begins, the system runs these checks in order:

Check	What It Does	Failure Response
Credential validation	Verifies required API keys exist for the requested parser	HTTP 500
Anthropic key pre-flight	Makes a lightweight `count_tokens` API call to verify the key is actually valid (not just present)	HTTP 401 with clear message
PDF page counting	Loads PDF from R2, counts pages to estimate credits needed	—
Credit check	Verifies user has enough credits (1 per page) via `checkCredits()`	HTTP 402
Spend limit check	Enforces daily/monthly page limits via `checkSpendLimits()`	HTTP 429
Dollar budget check	Estimates cost at ~$0.01/page and checks against dollar budget via `checkDollarBudget()`	HTTP 429
PDF pre-flight	Inspects PDF structure for blockers via `runPreflight()`	HTTP 422 with blocker list
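
The ordering in the table matters because the first failing check decides the HTTP status. A minimal sketch of that short-circuit, assuming the results of each check have already been gathered into one object: the `PreCheckInput` shape, the check labels, and `firstFailedCheck` are hypothetical, while the order, the statuses, and the ~$0.01/page estimate come from the table.

```typescript
// Hypothetical shapes; the check order and HTTP statuses follow the table above.
interface PreCheckInput {
  hasRequiredKeys: boolean;    // credential validation
  anthropicKeyValid: boolean;  // outcome of the count_tokens pre-flight call
  pageCount: number;           // from PDF page counting
  creditsAvailable: number;    // 1 credit per page
  withinSpendLimits: boolean;  // daily/monthly page limits
  dollarBudgetUsd: number;     // remaining dollar budget
  preflightBlockers: string[]; // blockers reported by runPreflight()
}

interface PreCheckFailure {
  check: string;
  status: number;
}

const EST_COST_PER_PAGE_USD = 0.01;

// Returns the first failing check, or null when conversion may start.
function firstFailedCheck(input: PreCheckInput): PreCheckFailure | null {
  if (!input.hasRequiredKeys) return { check: "credentials", status: 500 };
  if (!input.anthropicKeyValid) return { check: "anthropic-key", status: 401 };
  if (input.creditsAvailable < input.pageCount) return { check: "credits", status: 402 };
  if (!input.withinSpendLimits) return { check: "spend-limit", status: 429 };
  if (input.pageCount * EST_COST_PER_PAGE_USD > input.dollarBudgetUsd) {
    return { check: "dollar-budget", status: 429 };
  }
  if (input.preflightBlockers.length > 0) return { check: "preflight", status: 422 };
  return null;
}
```

With this shape, a request that is both short on credits and blocked by pre-flight is reported as HTTP 402, because the credit check runs first.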

Pre-flight Blockers

runPreflight() (services/pdf-preflight.ts) catches:

Encrypted/password-protected PDFs
Image-only pages (>50% of pages are pure images → not remediable)
Embedded audio/video
Corrupt/unparseable files
Embedded JavaScript

4. Content Classification

4a. Document-Level Fast Scan

Source: services/pdf-complexity-detector.ts — detectComplexContent()

Before entering the chunked pipeline, the system performs a zero-cost structural scan of the PDF using the raw operator stream (via unpdf/pdfjs). No AI calls are made.

What it detects:

Feature	How It’s Detected
Images	`paintImageXObject`, `paintInlineImageXObject`, `paintImageMaskXObject` PDF operators
Tables/figures	≥10 path-draw operations (`rectangle`, `constructPath`, `moveTo`, `lineTo`) combined with path-paint operations
Math fonts	Font names containing: `cmsy`, `cmmi`, `cmex`, `symbol`, `mathematicalpi`, `stix`, `cambria math`, `asana math`, `xits math`, `latin modern math`
Math (image-mask heuristic)	≥15 `paintImageMaskXObject` ops/page + ≥4 distinct rendered text heights (indicates sub/superscripts)

Derived flags:

isPureText — no images, no tables, no math fonts
hasMathFonts — math fonts detected on any page
isComplex — images, tables, or math present
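
As an illustration of how the derived flags could follow from per-page findings, here is a small sketch. It assumes case-insensitive substring matching on font names and leaves out the image-mask math heuristic; `isMathFont`, `PageFeatures`, and `deriveFlags` are hypothetical names, not the API of pdf-complexity-detector.ts.

```typescript
// Font-name fragments listed in the detection table above.
const MATH_FONT_MARKERS: string[] = [
  "cmsy", "cmmi", "cmex", "symbol", "mathematicalpi", "stix",
  "cambria math", "asana math", "xits math", "latin modern math",
];

// Assumption: matching is a case-insensitive substring test.
function isMathFont(fontName: string): boolean {
  const lower = fontName.toLowerCase();
  return MATH_FONT_MARKERS.some((marker) => lower.indexOf(marker) !== -1);
}

// Hypothetical per-page summary of what the operator scan found.
interface PageFeatures {
  hasImages: boolean;
  hasTables: boolean;
  hasMathFonts: boolean;
}

interface DocumentFlags {
  isPureText: boolean;
  hasMathFonts: boolean;
  isComplex: boolean;
}

function deriveFlags(pages: PageFeatures[]): DocumentFlags {
  const hasImages = pages.some((p) => p.hasImages);
  const hasTables = pages.some((p) => p.hasTables);
  const hasMathFonts = pages.some((p) => p.hasMathFonts);
  return {
    isPureText: !hasImages && !hasTables && !hasMathFonts,
    hasMathFonts: hasMathFonts,
    isComplex: hasImages || hasTables || hasMathFonts,
  };
}
```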

4b. Per-Page Classification

Source: services/pdf-complexity-detector.ts — detectComplexContentPerPage()

When the smart cascade or inline path needs per-page routing, each page is classified individually:

Content Type	Criteria	Recommended Backend
`text`	No images, no tables, no math fonts	`marker`
`math`	Math fonts present, no images	`marker+temml`
`image`	Images present, no math, no tables	`gemini-flash`
`table`	Tables/figures present, no images, no math	`marker`
`mixed`	Multiple features (e.g., images + math, images + tables)	`claude-vision`
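
The table can be read as a small decision function. The sketch below is one interpretation, not the body of detectComplexContentPerPage(): `classifyPage` is a hypothetical name, and treating a page with math fonts and tables but no images as `math` is an assumption drawn from the "no images" criterion.

```typescript
type ContentType = "text" | "math" | "image" | "table" | "mixed";
type Backend = "marker" | "marker+temml" | "gemini-flash" | "claude-vision";

// Recommended backend per content type, copied from the table above.
const RECOMMENDED_BACKEND: Record<ContentType, Backend> = {
  text: "marker",
  math: "marker+temml",
  image: "gemini-flash",
  table: "marker",
  mixed: "claude-vision",
};

interface PageSignals {
  hasImages: boolean;
  hasTables: boolean;
  hasMathFonts: boolean;
}

function classifyPage(page: PageSignals): ContentType {
  if (page.hasImages) {
    // Images combined with any other feature make the page "mixed".
    return page.hasMathFonts || page.hasTables ? "mixed" : "image";
  }
  // Assumption: math fonts without images classify as "math", even with tables present.
  if (page.hasMathFonts) return "math";
  if (page.hasTables) return "table";
  return "text";
}
```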

5. Routing Decision Tree

Primary Decision Flow (Auto Mode, PDF with Anthropic Key)

POST /api/convert/:fileId (parser: 'auto')
│
├─ fileType === 'docx'
│  └─ mammoth.js local conversion → post-processing
│
├─ fileType === 'image'
│  └─ Image passthrough + AI alt text → post-processing
│
└─ fileType === 'pdf' + ANTHROPIC_API_KEY available
   │
   ├─ runPreflight() ─── blockers found? → HTTP 422 PREFLIGHT_BLOCKED
   │
   └─ detectComplexContent() (fast document-level scan)
      │
      ├─ Math fonts detected + Mathpix credentials available
      │  └─ ★ ROUTE 1: Mathpix Pipeline
      │
      ├─ isPureText + Marker key available + !highFidelity
      │  └─ ★ ROUTE 2: Marker Fast Path
      │
      └─ Otherwise (complex content, or no fast-path match)
         └─ ★ ROUTE 3: Async Chunked Pipeline (primary production path)

Route 1: Mathpix Pipeline

Trigger: PDF contains math fonts AND Mathpix API credentials are configured.

Why: Mathpix natively understands LaTeX/MathML notation and produces correct mathematical markup that vision models often get wrong.

Route 2: Marker Fast Path

Trigger: PDF is pure text (no images, tables, or math) AND a Marker API key is available AND highFidelity is not requested.

Why: For text-only PDFs, Marker’s OCR engine is faster and cheaper than vision models, and produces accurate text extraction.

Route 3: Async Chunked Pipeline

Trigger: Everything else — complex PDFs, mixed content, high-fidelity requests.

Why: Vision models can handle any content type. Chunking enables parallel processing of large documents.
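
The three routes reduce to an ordered set of conditions. A minimal sketch of the primary flow for a PDF in auto mode with an Anthropic key, after pre-flight has passed: the type names, route labels, and `chooseRoute` are hypothetical, and only the conditions and their order come from the tree above.

```typescript
type Route = "mathpix" | "marker-fast-path" | "async-chunked";

// Subset of the document-level scan result that routing needs.
interface ScanResult {
  isPureText: boolean;
  hasMathFonts: boolean;
}

interface RoutingContext {
  hasMathpixCredentials: boolean;
  hasMarkerKey: boolean;
  highFidelity: boolean;
}

function chooseRoute(scan: ScanResult, ctx: RoutingContext): Route {
  // Route 1: math fonts take priority when Mathpix is configured.
  if (scan.hasMathFonts && ctx.hasMathpixCredentials) return "mathpix";
  // Route 2: pure text goes to Marker unless high fidelity was requested.
  if (scan.isPureText && ctx.hasMarkerKey && !ctx.highFidelity) return "marker-fast-path";
  // Route 3: everything else.
  return "async-chunked";
}
```

Because the math check comes first, a math PDF without Mathpix credentials falls through to the chunked pipeline rather than to Marker, since it is not pure text.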

Fallback Decision Flow (No Anthropic Key)

├─ !isComplex + Marker key → Marker
├─ isComplex + !highFidelity → Smart Cascade
├─ isComplex + highFidelity → error (needs Anthropic key)
└─ no keys → error

Inline Path (Small Documents, Explicit Parser Selection)

parser === 'cascade'
└─ Smart Cascade: Marker → Mathpix → Agentic Vision (tiered per page)

parser === 'auto' (inline, not async)
├─ budget tier → Smart Cascade (budget mode, Marker only)
├─ !isComplex + Marker → Marker
├─ isComplex + !highFidelity → Smart Cascade
├─ isComplex + highFidelity + >10 pages → Chunked Agentic Vision
├─ isComplex + highFidelity + ≤10 pages → Agentic Vision (whole document)
├─ !isComplex + highFidelity → Agentic Vision
├─ !isComplex + no Marker → Claude single-pass
└─ no Anthropic → Mathpix fallback → Marker fallback

6. Processing Pipelines

6.1 Mathpix Pipeline

Source: services/mathpix-pdf.ts, routes/convert.ts (lines 1904–2025)

Best for: PDFs with mathematical equations, scientific notation, LaTeX content.

Flow:

Split PDF into individual pages.
Submit each page to https://api.mathpix.com/v3/pdf (concurrency limit: 3).
Poll GET /v3/pdf/{pdfId} every 3 seconds, up to 5 minutes.
Download HTML + images via /v3/pdf/{pdfId}.html.zip.
Embed extracted images as data URIs in page HTML.
Wrap each page in <section class="pdf-page" role="region">.
Continue to post-processing.
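
The poll step (every 3 seconds, up to 5 minutes) can be sketched as a generic helper. This is an illustration only: `pollUntilDone`, the status names, and the injected `checkStatus` and `sleep` callbacks are hypothetical, and no real Mathpix call is made here.

```typescript
// Hypothetical status values; the real API's status strings may differ.
type PollStatus = "processing" | "completed" | "error";

async function pollUntilDone(
  checkStatus: () => Promise<PollStatus>,
  sleep: (ms: number) => Promise<void>,
  intervalMs: number = 3000,
  timeoutMs: number = 5 * 60 * 1000,
): Promise<PollStatus | "timeout"> {
  // With the defaults this allows 100 status checks.
  const maxAttempts = Math.floor(timeoutMs / intervalMs);
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await checkStatus();
    if (status !== "processing") return status;
    await sleep(intervalMs);
  }
  return "timeout";
}
```

Injecting `sleep` keeps the helper testable without real delays; production code would pass a timer-based implementation.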

AI involved: Mathpix proprietary ML models (math recognition, OCR). Cost: ~$0.005/page.

6.2 Marker Fast Path

Source: services/marker-converter.ts

Best for: Text-only PDFs, simple tables without images.

Flow:

POST https://api.datalab.to/api/v1/marker with output_format: html, paginate_output: true.
Poll GET /api/v1/marker/{request_id} every 3 seconds, up to 5 minutes.
Receive HTML + extracted images (base64 from response).
Fallback: if the response contains no HTML, convert the Markdown output to basic HTML.
Continue to post-processing.

AI involved: Datalab Surya OCR (deep learning model). No Claude/Gemini calls. Cost: ~$0.006/page.

6.3 Async Chunked Pipeline (Primary Production Path)

Source: services/chunk-boundary-detector.ts, scheduler/chunk-scheduler.ts, services/chunk-processor.ts, services/chunk-assembler.ts

Best for: Complex PDFs of any size — the default for production conversions.

Step 1 — Chunk Boundary Detection

Source: services/chunk-boundary-detector.ts — detectChunkBoundaries()

Read PDF outline (bookmarks) up to 2 levels deep — bookmark pages become natural break points.
Fallback: per-page heading detection — checks first text item on each page against regex patterns (numbered sections, “Chapter”/“Part”/“Section” keywords, Roman numerals, ALL-CAPS lines).
Greedy chunking: target 20 pages per chunk (TARGET_CHUNK_SIZE_PAGES), snap to nearest natural break within lookahead, hard-split at 30 pages (MAX_CHUNK_SIZE_PAGES).
Create one chunk_jobs row per boundary in Supabase, plus a large_conversion_jobs parent record.
Return immediately to client: { jobId, asyncMode: true, totalChunks }.
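
One way to picture the greedy chunking step is the sketch below. It rests on several assumptions that are not confirmed by this document: the lookahead window is taken to be 10 pages on either side of the target (the hypothetical `LOOKAHEAD_PAGES`), a chunk with no natural break in that window is cut at the 20-page target, and `planChunks` is a hypothetical name. Only the target of 20 and the maximum of 30 come from the text.

```typescript
const TARGET_CHUNK_SIZE_PAGES = 20;
const MAX_CHUNK_SIZE_PAGES = 30;
// Assumption: breaks are considered within 10 pages of the target, so no chunk exceeds 30 pages.
const LOOKAHEAD_PAGES = MAX_CHUNK_SIZE_PAGES - TARGET_CHUNK_SIZE_PAGES;

// Page numbers are 1-based and inclusive.
interface ChunkRange {
  startPage: number;
  endPage: number;
}

// breakPages: pages where a new section starts (from the outline or heading detection).
function planChunks(totalPages: number, breakPages: number[]): ChunkRange[] {
  const chunks: ChunkRange[] = [];
  let start = 1;
  while (start <= totalPages) {
    const remaining = totalPages - start + 1;
    if (remaining <= TARGET_CHUNK_SIZE_PAGES) {
      chunks.push({ startPage: start, endPage: totalPages });
      break;
    }
    const ideal = start + TARGET_CHUNK_SIZE_PAGES; // first page of the next chunk
    let nextStart = ideal; // assumption: cut at the target when no break is nearby
    let bestDistance = LOOKAHEAD_PAGES + 1;
    for (const breakPage of breakPages) {
      const distance = Math.abs(breakPage - ideal);
      if (breakPage <= totalPages && distance < bestDistance) {
        nextStart = breakPage;
        bestDistance = distance;
      }
    }
    chunks.push({ startPage: start, endPage: nextStart - 1 });
    start = nextStart;
  }
  return chunks;
}
```

For a 50-page PDF with section starts on pages 18 and 45, this sketch yields chunks 1–17, 18–44, and 45–50; with no breaks at all it yields 1–20, 21–40, and 41–50.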

Step 2 — Chunk Scheduling

Source: scheduler/chunk-scheduler.ts — ChunkScheduler

Runs continuously on the Node.js server. Every 3 seconds:

Query Supabase for pending chunks.
Claim up to 8 chunks (MAX_CONCURRENT_CHUNKS) using optimistic locking: UPDATE chunk_jobs SET status='processing' WHERE id=? AND status='pending'. If another node already claimed it, the update affects 0 rows → skip.
Process each claimed chunk via processChunk().
Every 5 cycles: reclaim stale chunks (processing > 15 minutes).
Every 10 cycles: recover orphaned jobs (all chunks done but counter mismatch).
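
The optimistic-locking claim can be illustrated with an in-memory stand-in for the `chunk_jobs` table. This is a simulation under stated assumptions, not the scheduler's code: `tryClaim`, `claimBatch`, `maintenanceFor`, the `claimedBy` field, and the `done` status are hypothetical, and a real deployment relies on the database to make the conditional update atomic.

```typescript
type ChunkStatus = "pending" | "processing" | "done";

interface ChunkJob {
  id: string;
  status: ChunkStatus;
  claimedBy?: string; // hypothetical bookkeeping field
}

const MAX_CONCURRENT_CHUNKS = 8;

// Stand-in for: UPDATE chunk_jobs SET status='processing' WHERE id=? AND status='pending'.
// Returns the number of affected rows (0 or 1).
function tryClaim(table: ChunkJob[], id: string, nodeId: string): number {
  for (const job of table) {
    if (job.id === id && job.status === "pending") {
      job.status = "processing";
      job.claimedBy = nodeId;
      return 1;
    }
  }
  return 0; // another node already claimed it
}

function claimBatch(table: ChunkJob[], nodeId: string): string[] {
  const pendingIds = table.filter((j) => j.status === "pending").map((j) => j.id);
  const claimed: string[] = [];
  for (const id of pendingIds) {
    if (claimed.length >= MAX_CONCURRENT_CHUNKS) break;
    if (tryClaim(table, id, nodeId) === 1) claimed.push(id);
  }
  return claimed;
}

// Maintenance cadence from the list above, assuming cycles are counted from 1.
function maintenanceFor(cycle: number): string[] {
  const tasks: string[] = [];
  if (cycle % 5 === 0) tasks.push("reclaim-stale");
  if (cycle % 10 === 0) tasks.push("recover-orphans");
  return tasks;
}
```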

Step 3 — Per-Chunk Processing

Source: services/chunk-processor.ts — processChunk()

Extract the chunk’s page range from the PDF via extractPageRange().
Run convertWithChunkedAgenticVision() — Gemini Flash as primary, Claude Sonnet as fallback.
Maximum 4 iterations per page (maxIterationsPerPage).
Provide precedingContextHtml (last 2,500 chars from previous chunk) for structural continuity across chunk boundaries.
Wrap result in <section class="pdf-chunk" data-chunk-index="N" data-start-page="X" data-end-page="Y">.
Store chunk HTML to R2 at users/{userId}/jobs/{jobId}/chunks/{chunkIndex}.html.
Store contextTail back to chunk_jobs.context_tail for the next chunk.
Atomically increment large_conversion_jobs.done_chunks via Supabase RPC.
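
The string handling in these steps is simple enough to sketch directly. The helper names `contextTail`, `wrapChunk`, and `chunkKey` are hypothetical; the 2,500-character tail, the section attributes, and the R2 key layout come from the list above.

```typescript
const CONTEXT_TAIL_CHARS = 2500;

// Last 2,500 characters of a chunk, passed to the next chunk as precedingContextHtml.
function contextTail(chunkHtml: string): string {
  return chunkHtml.slice(-CONTEXT_TAIL_CHARS);
}

function wrapChunk(html: string, chunkIndex: number, startPage: number, endPage: number): string {
  return (
    `<section class="pdf-chunk" data-chunk-index="${chunkIndex}" ` +
    `data-start-page="${startPage}" data-end-page="${endPage}">${html}</section>`
  );
}

// Mirrors users/{userId}/jobs/{jobId}/chunks/{chunkIndex}.html.
function chunkKey(userId: string, jobId: string, chunkIndex: number): string {
  return `users/${userId}/jobs/${jobId}/chunks/${chunkIndex}.html`;
}
```

Note that a plain character slice can cut through the middle of a tag; whether the real implementation trims to a tag boundary is not stated here.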

Per-page model routing within a chunk:

Page Content Type	Primary Model	Escalation (if score stalls)
`text` or `table`	Claude Haiku	Claude Sonnet
`image`, `math`, or `mixed`	Claude Sonnet	—

Step 4 — Assembly

Source: services/chunk-assembler.ts — assembleChunks()

Triggered when done_chunks reaches total_chunks:

Load all chunk HTMLs from R2 in parallel.
Concatenate and repair malformed HTML (auto-close unclosed tags at chunk boundaries).
Extract embedded PDF images and insert them into the <img> tags emitted by the vision model.
Run the full post-processing pipeline.
Store final HTML to R2.
Deduct credits, send completion email and web push notification.

Progress streaming: Clients subscribe to GET /api/convert/:fileId/stream (SSE) for real-time progress, chunk, complete, and error events.
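
For readers unfamiliar with SSE, each event is sent as an `event:` line, a `data:` line, and a blank line. The sketch below shows that standard framing for the four event names listed above; `formatSseEvent` and the payload fields `doneChunks` and `totalChunks` are hypothetical and may not match what convert-stream.ts sends.

```typescript
type StreamEventName = "progress" | "chunk" | "complete" | "error";

// Serializes one server-sent event. JSON.stringify never emits raw newlines,
// so the payload always fits on a single data line.
function formatSseEvent(event: StreamEventName, data: Record<string, unknown>): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Example with a hypothetical payload shape.
const progressFrame = formatSseEvent("progress", { doneChunks: 2, totalChunks: 5 });
```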

6.4 Agentic Vision Converter (Core AI Engine)

Source: services/agentic-vision-converter.ts

This is the central AI engine used by the chunked pipeline, smart cascade, and high-fidelity modes. It implements an iterative visual feedback loop.

How It Works

Initial pass: Send the PDF (base64) to the vision model with a detailed prompt specifying semantic HTML rules, MathML requirements, heading hierarchy, figure/image treatment. Receive raw HTML.
Screenshot refinement loop (up to maxIterations):
1. Render current HTML in Puppeteer at 1280×1600 viewport.
2. Take a full-page PNG screenshot.
3. If layout scorer is configured (Gemini), score the screenshot against the original PDF.
  - Score ≥ 90 (layoutScoreThreshold) → stop early, quality is sufficient.
  - Score delta < 3 for 2+ consecutive passes → stalling → escalate to more expensive model if fallbackStrategy is configured.
4. Send original PDF + screenshot + current HTML to the model with a refinement prompt (“fix visual differences”).
5. If model responds NO_CHANGES_NEEDED → stop (converged).
6. Update HTML with refined version.
Return final HTML + token usage + models used.
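
The stopping and escalation rules of the loop can be isolated into a pure function. This is a sketch of the control logic only, with no model or Puppeteer calls: `nextAction` and the constant names other than the threshold are hypothetical, and "score delta" is read here as the signed improvement over the previous pass.

```typescript
const LAYOUT_SCORE_THRESHOLD = 90; // layoutScoreThreshold
const STALL_DELTA = 3;             // improvement below this counts as a stalled pass
const STALL_PASSES = 2;            // consecutive stalled passes before escalating

type LoopAction = "stop" | "escalate" | "continue";

// scores: layout scores of the passes so far, oldest first.
function nextAction(scores: number[], hasFallbackStrategy: boolean): LoopAction {
  if (scores.length === 0) return "continue";
  const latest = scores[scores.length - 1];
  if (latest >= LAYOUT_SCORE_THRESHOLD) return "stop";

  // Count trailing passes whose improvement over the previous pass was too small.
  let stalled = 0;
  for (let i = scores.length - 1; i > 0; i--) {
    if (scores[i] - scores[i - 1] < STALL_DELTA) stalled++;
    else break;
  }
  if (stalled >= STALL_PASSES && hasFallbackStrategy) return "escalate";
  return "continue";
}
```

The loop would additionally stop when the model answers NO_CHANGES_NEEDED or when maxIterations is reached; those conditions are outside this function.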

Model Strategies

Strategy	SDK	Notes
`ClaudeVisionStrategy`	Anthropic SDK	Sends PDF as `document` block with `cache_control: ephemeral` for prompt caching
`GeminiVisionStrategy`	Google Generative AI SDK	Sends PDF as `inlineData` with `mimeType: application/pdf`

6.5 Smart Cascade Converter

Source: services/smart-cascade-converter.ts

Best for: When cost optimization matters — uses cheap tools first and only escalates to expensive vision models when quality is insufficient.

Per-Page Tiered Escalation

Page Type	Tier 1 (cheapest)	Tier 2 (if quality < 80)	Tier 3 (if still < 80)
`text` or `table`	Marker API	Gemini Flash vision	Claude Sonnet agentic
`math`	Marker + temml	Mathpix per-page image API	Gemini Flash vision
`image`	Gemini Flash vision	Claude Sonnet agentic	—
`mixed`	Gemini Flash vision	Claude Sonnet agentic	—
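
The escalation table can be expressed as data plus a short loop. In the sketch below the tier labels, `runCascade`, and the `scoreTier` callback are hypothetical; keeping the best-scoring result when no tier reaches 80 is an assumption, since the behavior after the last tier is not described here.

```typescript
type PageType = "text" | "table" | "math" | "image" | "mixed";

const QUALITY_THRESHOLD = 80; // qualityThreshold

// Tier order per page type, cheapest first, following the table above.
const TIERS: Record<PageType, string[]> = {
  text: ["marker", "gemini-flash-vision", "claude-sonnet-agentic"],
  table: ["marker", "gemini-flash-vision", "claude-sonnet-agentic"],
  math: ["marker+temml", "mathpix-page-image", "gemini-flash-vision"],
  image: ["gemini-flash-vision", "claude-sonnet-agentic"],
  mixed: ["gemini-flash-vision", "claude-sonnet-agentic"],
};

interface TierResult {
  tier: string;
  score: number;
}

// scoreTier is a hypothetical callback that runs one tier and returns its quality score.
function runCascade(pageType: PageType, scoreTier: (tier: string) => number): TierResult {
  let best: TierResult | null = null;
  for (const tier of TIERS[pageType]) {
    const score = scoreTier(tier);
    if (best === null || score > best.score) best = { tier: tier, score: score };
    if (score >= QUALITY_THRESHOLD) break; // good enough, stop escalating
  }
  if (best === null) throw new Error("no tiers configured");
  return best;
}
```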

Quality Scoring

Each tier’s output is scored before deciding whether to escalate:

WCAG validation violations: up to −40 penalty
Semantic HTML ratio: up to +30 bonus (measures proportion of semantic elements vs raw <div>/<span>)
Structure bonuses: lang attribute, <title>, valid heading hierarchy
Threshold: Score must reach 80 (qualityThreshold) to accept a tier’s result

Budget Mode

When budgetMode: true:

Marker-only, no escalation to vision models.
Hard cost cap enforced per page (maxCostUsd).

Concurrency: Up to 8 pages in parallel (maxPagesParallel). Uses a pool where each slot refills as a page finishes — fast text pages don’t block slow vision pages.
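
A refill-as-you-go pool like the one described can be written in a few lines. This is a generic sketch, not the converter's implementation: `mapWithPool` and its parameters are hypothetical, and the limit of 8 would be passed in from maxPagesParallel.

```typescript
// Runs worker over items with at most `limit` calls in flight.
// Each slot pulls the next unprocessed index as soon as its current item finishes.
async function mapWithPool<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  async function runSlot(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const slots: Promise<void>[] = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) slots.push(runSlot());
  await Promise.all(slots);
  return results; // same order as items
}
```

Compared with processing pages in fixed batches of 8, this shape avoids waiting for the slowest page in a batch before starting new work.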

6.6 Other Parsers

Parser	Source	Trigger	AI Involved
`claude-vision` (explicit)	`claude-converter.ts`	User explicitly selects “Claude Vision”	Claude Sonnet single-pass (no iteration)
`segmented`	`convert.ts`	User selects; requires Mathpix	Mathpix (structure) + vision models (images)
`vision-tables`	`convert.ts`	User selects table extraction	`detectComplexContentPerPage()` to find table pages, then Claude vision per page
DOCX	`convert.ts`	`.docx` file uploaded	mammoth.js (local, no AI)
Image passthrough	`convert.ts`	Image file uploaded	Vision model for alt text only

7. Post-Processing Pipeline

Every converter's output passes through the same post-processing steps, in the order below, regardless of which route the file took.

Source: routes/convert.ts (lines 1497–1833), services/chunk-assembler.ts

Step-by-Step

Step	Function	What It Does	AI?
1	`enhanceImagesInHtml()`	For each extracted image, calls a vision model to generate descriptive alt text. Uses `isAltTextAcceptable()` blocklist to reject generic captions like “diagram”.	Yes — Gemini Flash, Claude, or GPT-4o-mini
2	`storeAndEmbedImages()`	Stores images to R2 and embeds as data URIs in HTML	No
3	Image extraction fallback	If converter produced no images but classification shows `image`/`mixed` pages, extracts embedded PDF image objects via `extractImagesFromPdfPages()`. If that fails, falls back to full-page Puppeteer screenshots via `renderPdfPagesAsDataUris()`.	No
4	`structurePages()`	Adds page header/footer banners to each `<section class="pdf-page">`, wraps in page-numbered sections	No
5	`optimizeDeterministic()`	Pure HTML transforms (no AI): adds `<thead>`/`<tbody>` to tables, promotes first row to `<th>`, adds `scope` attributes, converts `<br>` sequences to `<p>`, adds `aria-label`/`role="img"` to SVGs, cleans unnecessary wrapper `<div>`s, converts LaTeX to MathML via temml	No
6	`enhanceAccessibility()`	Adds `lang` attribute, `<title>`, viewport meta, skip-link, source document banner, ensures DOCTYPE	No
7	`validateAndFix()`	Custom WCAG rule checker that auto-fixes: missing alt text, empty links/buttons, missing table headers, duplicate IDs, empty headings, invalid `scope` attributes. Up to 3 fix passes.	No
8	`runAxeAudit()`	Full browser-based accessibility audit using axe-core in Puppeteer. Non-blocking; skipped if browser unavailable.	No
9	`runAxeFixLoop()`	If fixable violations remain after step 7, runs automated DOM manipulation fixes.	No
10	`wrapInDocument()`	Wraps in `<!DOCTYPE html>` skeleton with responsive CSS. High-fidelity mode adds serif fonts, table borders, figure styling.	No
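
To make the alt-text blocklist in step 1 concrete, here is a minimal sketch. It is not the real isAltTextAcceptable(): the function name `isAltTextAcceptableSketch` is hypothetical, and every blocklist entry other than "diagram" is an assumed example.

```typescript
// Only "diagram" is named in this document; the other entries are assumed examples.
const GENERIC_ALT_TEXT: string[] = ["diagram", "image", "figure", "picture", "photo"];

function isAltTextAcceptableSketch(altText: string): boolean {
  // Normalize case, surrounding whitespace, and trailing punctuation before comparing.
  const normalized = altText.trim().toLowerCase().replace(/[.!]+$/, "");
  if (normalized.length === 0) return false;
  return GENERIC_ALT_TEXT.indexOf(normalized) === -1;
}
```

A caption that fails this check would be regenerated or replaced rather than written into the final HTML.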

Final Steps

Step	What Happens
Storage	Final HTML → R2 at `users/{userId}/output/{fileId}/index.html`
Credit deduction	1 credit per page via Supabase RPC `deduct_credits`
WCAG failure alert	If `wcagStatus.passed === false`, email alert sent to `ALERT_EMAIL` (rate-limited: once per fileId per 24 hours)
Notification	Completion email + web push notification to user
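
The alert rate limit (once per fileId per 24 hours) can be sketched with a timestamp lookup. The in-memory record below is a stand-in for whatever store the real system uses, and `shouldSendWcagAlert` is a hypothetical name.

```typescript
const ALERT_WINDOW_MS = 24 * 60 * 60 * 1000;

// lastSentAt maps fileId to the epoch-millisecond time of the last alert
// (a hypothetical in-memory store used for illustration).
function shouldSendWcagAlert(
  passed: boolean,
  fileId: string,
  nowMs: number,
  lastSentAt: Record<string, number>,
): boolean {
  if (passed) return false; // alerts are only sent for WCAG failures
  const last = lastSentAt[fileId];
  if (last !== undefined && nowMs - last < ALERT_WINDOW_MS) return false;
  lastSentAt[fileId] = nowMs; // record this alert so repeats are suppressed
  return true;
}
```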

8. AI Services Reference

Service	Model ID	Provider	Role	When Used	Approx. Cost
Claude Sonnet	`claude-sonnet-4-6`	Anthropic	Primary vision converter for complex/mixed/image pages; iterative refinement with screenshot feedback	Route 3 (complex pages), Smart Cascade Tier 3, high-fidelity mode	~$3/MTok in, $15/MTok out
Claude Haiku	`claude-haiku-4-5-20251001`	Anthropic	Cheaper vision for text/table pages; API key pre-flight validation	Route 3 (simple pages), alt text generation fallback	~$0.80/MTok in, $4/MTok out
Gemini Flash	`gemini-2.5-flash`	Google	First-pass converter in chunk pipeline; layout quality scoring; image pages in cascade	Route 3 primary strategy, Smart Cascade Tier 2, layout scorer	Variable
Mathpix	Mathpix API	Mathpix	Native math/equation extraction — LaTeX, MathML output	Route 1 (math-detected PDFs), Smart Cascade Tier 2 for math pages	~$0.005/page
Marker / Surya	Datalab API	Datalab	Text extraction OCR for text-only PDFs	Route 2 (pure text), Smart Cascade Tier 1	~$0.006/page
Gemini Flash (images)	`gemini-flash`	Google	Alt text generation for extracted images	Post-processing step 1	~$0.0003/image
GPT-4o-mini	`gpt-4o-mini`	OpenAI	Optional alt text generation (if configured as `imageModel`)	Post-processing step 1 (optional)	~$0.0004/image
Claude (images)	Haiku or Sonnet	Anthropic	Alt text generation when Gemini unavailable	Post-processing step 1 (fallback)	$0.002–$0.01/image
temml	—	Local library	LaTeX → MathML rendering (no external calls)	Post-processing step 5, Marker+temml path	Free

9. Key Files Reference

File	Role
`workers/api/src/routes/convert.ts`	Main entry: pre-checks, routing decisions, inline pipeline orchestration
`workers/api/src/routes/files.ts`	File upload, URL ingestion, file management
`workers/api/src/utils/file-list.ts`	File metadata CRUD — Supabase `files` table with camelCase↔snake_case mapping
`workers/api/src/routes/convert-stream.ts`	SSE endpoint for real-time chunk progress
`workers/api/src/services/pdf-complexity-detector.ts`	Zero-cost PDF structure analysis: images, tables, math fonts per page
`workers/api/src/services/pdf-preflight.ts`	Pre-flight checks: encryption, image-only, corruption, JavaScript
`workers/api/src/services/chunk-boundary-detector.ts`	Section break detection using PDF outline and heading patterns
`workers/api/src/scheduler/chunk-scheduler.ts`	Background job runner: claims, processes, assembles chunks
`workers/api/src/services/chunk-processor.ts`	Single chunk processing using Gemini-first/Claude-fallback per-page
`workers/api/src/services/chunk-assembler.ts`	Stitches chunk fragments into final WCAG-compliant HTML document
`workers/api/src/services/agentic-vision-converter.ts`	Core iterative AI converter: initial pass + screenshot feedback loop
`workers/api/src/services/smart-cascade-converter.ts`	Per-page tiered routing: Marker → Gemini → Claude
`workers/api/src/services/marker-converter.ts`	Datalab Marker/Surya API client
`workers/api/src/services/mathpix-pdf.ts`	Mathpix API client for math-heavy PDFs
`workers/api/src/services/wcag-validator.ts`	WCAG rule checker + auto-fixer (no AI)
`workers/api/src/services/image-enhancer.ts`	AI-powered alt text generation
`workers/api/src/server.ts`	Node.js server entry with ChunkScheduler startup
`packages/shared/src/types.ts`	`UploadedFile`, `ParserOptions`, `FileStatus`, `QualityTier` types
`packages/shared/src/constants.ts`	`TARGET_CHUNK_SIZE_PAGES` (20), `MAX_CHUNK_SIZE_PAGES` (30), `CONTEXT_TAIL_CHARS` (2500)