Conversion Pipeline: Upload to Final HTML

This document describes the complete flow from when a file is uploaded until the final accessible HTML is saved, including how the system decides which processing path to use for each file or page, and every AI service involved.


Table of Contents

  1. High-Level Architecture
  2. Upload and Ingestion
  3. Pre-Conversion Checks
  4. Content Classification
  5. Routing Decision Tree
  6. Processing Pipelines
  7. Post-Processing Pipeline
  8. AI Services Reference
  9. Key Files Reference

1. High-Level Architecture

The system runs on two platforms:

  • Cloudflare Worker (workers/api/): handles auth, credential validation, preflight analysis, routing decisions, and lightweight conversions (Mathpix, Marker). Has a 10-minute timeout.
  • Node.js Server: handles Puppeteer rendering, the chunk scheduler, SSE streaming, axe-core audits, and all AI-heavy conversions. Two instances (api-node-1, api-node-2) behind Traefik. No time limit.

The CF Worker proxies heavy-computation requests to Node.js via middleware/node-proxy.ts when browser-based operations (screenshot rendering, axe audit) are needed.

Storage:

| System | Purpose |
| --- | --- |
| Cloudflare R2 / S3 | PDF originals, chunk HTML fragments, final HTML output |
| Supabase PostgreSQL | files (file metadata), large_conversion_jobs, chunk_jobs, profiles, credits, cost ledger |
| Cloudflare KV (KV_SESSIONS) | Session caching, rate limits, file settings, share tokens, tenant config |
| Supabase Auth | Session tokens and user authentication |

2. Upload and Ingestion

Source: workers/api/src/routes/files.ts

Entry Points

| Endpoint | Purpose |
| --- | --- |
| POST /api/files/upload | Allocates a file ID and metadata record |
| PUT /api/files/:fileId/upload-data | Receives the actual file bytes |
| POST /api/files/from-url | Fetches a remote PDF by URL (SSRF-protected, PDF only) |

Accepted File Types

  • PDF: application/pdf
  • Images: image/png, image/jpeg, image/webp, image/gif, image/tiff
  • DOCX: application/vnd.openxmlformats-officedocument.wordprocessingml.document
  • Size limits: 10 MB standard, 200 MB for async large-PDF pipeline

What Happens

  1. POST /api/files/upload: validates the MIME type, generates a UUID fileId, creates an UploadedFile metadata record with status: 'uploading', and upserts it to the Supabase files table.

  2. PUT /api/files/:fileId/upload-data: reads the raw bytes, writes them to R2 at users/{userId}/uploads/{fileId}/original/{filename}, and sets status: 'uploaded'.

  3. POST /api/files/from-url: validates the URL with validateFetchUrl() (SSRF protection), fetches it with a 30-second timeout, validates that the Content-Type is PDF, and stores the file identically to a direct upload.
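
As an illustration of the first two steps, the MIME check and the R2 key layout could be sketched as follows. The helper names classifyMimeType and originalObjectKey are hypothetical and are not taken from files.ts; only the MIME list and the key pattern come from this document.

```typescript
type FileKind = "pdf" | "image" | "docx";

// Accepted MIME types, copied from the "Accepted File Types" list.
const ACCEPTED_MIME_TYPES: { [mime: string]: FileKind } = {
  "application/pdf": "pdf",
  "image/png": "image",
  "image/jpeg": "image",
  "image/webp": "image",
  "image/gif": "image",
  "image/tiff": "image",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
};

// Hypothetical helper: returns the file kind, or null when the type is rejected.
function classifyMimeType(mime: string): FileKind | null {
  const key = mime.toLowerCase();
  // hasOwnProperty avoids matching inherited keys such as "constructor".
  return Object.prototype.hasOwnProperty.call(ACCEPTED_MIME_TYPES, key)
    ? ACCEPTED_MIME_TYPES[key]
    : null;
}

// Hypothetical helper: builds the R2 object key described in step 2.
function originalObjectKey(userId: string, fileId: string, filename: string): string {
  return `users/${userId}/uploads/${fileId}/original/${filename}`;
}
```

A real implementation would presumably also sanitize the filename before using it in a key; that detail is not described here.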


3. Pre-Conversion Checks

Source: workers/api/src/routes/convert.ts (lines 238–444)

Entry point: POST /api/convert/:fileId

Before any conversion begins, the system runs these checks in order:

| Check | What It Does | Failure Response |
| --- | --- | --- |
| Credential validation | Verifies required API keys exist for the requested parser | HTTP 500 |
| Anthropic key pre-flight | Makes a lightweight count_tokens API call to verify the key is actually valid (not just present) | HTTP 401 with clear message |
| PDF page counting | Loads PDF from R2, counts pages to estimate credits needed | n/a |
| Credit check | Verifies user has enough credits (1 per page) via checkCredits() | HTTP 402 |
| Spend limit check | Enforces daily/monthly page limits via checkSpendLimits() | HTTP 429 |
| Dollar budget check | Estimates cost at ~$0.01/page and checks against dollar budget via checkDollarBudget() | HTTP 429 |
| PDF pre-flight | Inspects PDF structure for blockers via runPreflight() | HTTP 422 with blocker list |
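
The "first failure wins" ordering of these checks can be sketched as below. This is a simplified, synchronous illustration: the types, runPreChecks, creditCheck, and dollarBudgetCheck are hypothetical stand-ins, and the real checkCredits() and checkDollarBudget() are likely asynchronous and backed by Supabase.

```typescript
// Outcome of a single pre-conversion check.
type CheckResult = { ok: true } | { ok: false; status: number; message: string };

// A named check (hypothetical shape; real checks would be async).
interface PreCheck {
  name: string;
  run: () => CheckResult;
}

// Runs checks in order and returns the first failure, or { ok: true }.
function runPreChecks(checks: PreCheck[]): CheckResult {
  for (const check of checks) {
    const result = check.run();
    if (!result.ok) return result;
  }
  return { ok: true };
}

// Stand-in for checkCredits(): one credit per page, as stated in the table.
function creditCheck(pageCount: number, creditsAvailable: number): CheckResult {
  return creditsAvailable >= pageCount
    ? { ok: true }
    : { ok: false, status: 402, message: "Insufficient credits" };
}

// Stand-in for checkDollarBudget(): uses the ~$0.01/page estimate from the table.
function dollarBudgetCheck(pageCount: number, budgetRemainingUsd: number): CheckResult {
  const estimatedUsd = pageCount * 0.01;
  return estimatedUsd <= budgetRemainingUsd
    ? { ok: true }
    : { ok: false, status: 429, message: "Dollar budget exceeded" };
}
```

With 12 pages and 10 credits, the credit check fails first and the budget check is never evaluated, which matches the ordered behaviour described above.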

Pre-flight Blockers

runPreflight() (services/pdf-preflight.ts) catches:

  • Encrypted/password-protected PDFs
  • Image-only pages (>50% of pages are pure images → not remediable)
  • Embedded audio/video
  • Corrupt/unparseable files
  • Embedded JavaScript

4. Content Classification

4a. Document-Level Fast Scan

Source: services/pdf-complexity-detector.ts - detectComplexContent()

Before entering the chunked pipeline, the system performs a zero-cost structural scan of the PDF using the raw operator stream (via unpdf/pdfjs). No AI calls are made.

What it detects:

| Feature | How It’s Detected |
| --- | --- |
| Images | paintImageXObject, paintInlineImageXObject, paintImageMaskXObject PDF operators |
| Tables/figures | ≥10 path-draw operations (rectangle, constructPath, moveTo, lineTo) combined with path-paint operations |
| Math fonts | Font names containing: cmsy, cmmi, cmex, symbol, mathematicalpi, stix, cambria math, asana math, xits math, latin modern math |
| Math (image-mask heuristic) | ≥15 paintImageMaskXObject ops/page + ≥4 distinct rendered text heights (indicates sub/superscripts) |

Derived flags:

  • isPureText: no images, no tables, no math fonts
  • hasMathFonts: math fonts detected on any page
  • isComplex: images, tables, or math present
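
A minimal sketch of the math-font match and the derived flags follows. The ScanFeatures shape and the function names are hypothetical, case-insensitive substring matching is an assumption, and the image-mask math heuristic is left out for brevity.

```typescript
// Substrings that mark a math font, copied from the detection table.
const MATH_FONT_MARKERS = [
  "cmsy", "cmmi", "cmex", "symbol", "mathematicalpi", "stix",
  "cambria math", "asana math", "xits math", "latin modern math",
];

// True if the font name contains any marker (case-insensitive; an assumption).
function isMathFont(fontName: string): boolean {
  const lower = fontName.toLowerCase();
  return MATH_FONT_MARKERS.some((marker) => lower.indexOf(marker) !== -1);
}

// Hypothetical shape for the raw scan result.
interface ScanFeatures {
  hasImages: boolean;
  hasTables: boolean;
  fontNames: string[];
}

interface DerivedFlags {
  isPureText: boolean;
  hasMathFonts: boolean;
  isComplex: boolean;
}

// Derives the three flags; math detection here relies on font names only.
function deriveFlags(scan: ScanFeatures): DerivedFlags {
  const hasMathFonts = scan.fontNames.some(isMathFont);
  const isComplex = scan.hasImages || scan.hasTables || hasMathFonts;
  return { isPureText: !isComplex, hasMathFonts, isComplex };
}
```

Under this simplification isPureText is simply the negation of isComplex; the real detector may differ where the image-mask heuristic fires without a math font.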

4b. Per-Page Classification

Source: services/pdf-complexity-detector.ts - detectComplexContentPerPage()

When the smart cascade or inline path needs per-page routing, each page is classified individually:

| Content Type | Criteria | Recommended Backend |
| --- | --- | --- |
| text | No images, no tables, no math fonts | marker |
| math | Math fonts present, no images | marker+temml |
| image | Images present, no math, no tables | gemini-flash |
| table | Tables/figures present, no images, no math | marker |
| mixed | Multiple features (e.g., images + math, images + tables) | claude-vision |
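
The classification table above can be read as a small decision function. The sketch below is an interpretation, not the code in pdf-complexity-detector.ts: the type and function names are hypothetical, and treating a page with math fonts and tables but no images as math follows the literal "Math fonts present, no images" criterion.

```typescript
type PageContentType = "text" | "math" | "image" | "table" | "mixed";

// Hypothetical per-page feature flags.
interface PageFeatures {
  hasImages: boolean;
  hasTables: boolean;
  hasMathFonts: boolean;
}

// Recommended backend per content type, copied from the table.
const RECOMMENDED_BACKEND: { [K in PageContentType]: string } = {
  text: "marker",
  math: "marker+temml",
  image: "gemini-flash",
  table: "marker",
  mixed: "claude-vision",
};

function classifyPage(f: PageFeatures): PageContentType {
  if (!f.hasImages && !f.hasTables && !f.hasMathFonts) return "text";
  // Images combined with any other feature count as mixed.
  if (f.hasImages) return f.hasTables || f.hasMathFonts ? "mixed" : "image";
  // No images from here on. Assumption: math fonts win over tables.
  if (f.hasMathFonts) return "math";
  return "table";
}
```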

5. Routing Decision Tree

Primary Decision Flow (Auto Mode, PDF with Anthropic Key)

POST /api/convert/:fileId (parser: 'auto')
│
├─ fileType === 'docx'
│   └─ mammoth.js local conversion → post-processing
│
├─ fileType === 'image'
│   └─ Image passthrough + AI alt text → post-processing
│
└─ fileType === 'pdf' + ANTHROPIC_API_KEY available
    │
    ├─ runPreflight() ─── blockers found? → HTTP 422 PREFLIGHT_BLOCKED
    │
    └─ detectComplexContent() (fast document-level scan)
        │
        ├─ Math fonts detected + Mathpix credentials available
        │   └─ ★ ROUTE 1: Mathpix Pipeline
        │
        ├─ isPureText + Marker key available + !highFidelity
        │   └─ ★ ROUTE 2: Marker Fast Path
        │
        └─ Otherwise (complex content, or no fast-path match)
            └─ ★ ROUTE 3: Async Chunked Pipeline (primary production path)

Route 1: Mathpix Pipeline

Trigger: PDF contains math fonts AND Mathpix API credentials are configured.

Why: Mathpix natively understands LaTeX/MathML notation and produces correct mathematical markup that vision models often get wrong.

Route 2: Marker Fast Path

Trigger: PDF is pure text (no images, tables, or math) AND Marker API key available AND highFidelity is not requested.

Why: For text-only PDFs, Marker’s OCR engine is faster and cheaper than vision models, and produces accurate text extraction.

Route 3: Async Chunked Pipeline

Trigger: Everything else (complex PDFs, mixed content, high-fidelity requests).

Why: Vision models can handle any content type. Chunking enables parallel processing of large documents.
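
The three route triggers reduce to a short ordered check. This sketch covers only the auto-mode PDF path with an Anthropic key available; the RoutingInput shape, the route labels, and selectRoute are hypothetical names used for illustration.

```typescript
type Route = "mathpix" | "marker-fast-path" | "async-chunked";

// Hypothetical input: scan flags plus credential and option state.
interface RoutingInput {
  hasMathFonts: boolean;
  isPureText: boolean;
  hasMathpixCredentials: boolean;
  hasMarkerKey: boolean;
  highFidelity: boolean;
}

// Order matters: the Mathpix check runs before the Marker fast path.
function selectRoute(input: RoutingInput): Route {
  if (input.hasMathFonts && input.hasMathpixCredentials) return "mathpix";
  if (input.isPureText && input.hasMarkerKey && !input.highFidelity) return "marker-fast-path";
  return "async-chunked";
}
```

Note that a math PDF without Mathpix credentials, or a pure-text PDF with highFidelity requested, both fall through to the async chunked pipeline.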

Fallback Decision Flow (No Anthropic Key)

├─ !isComplex + Marker key → Marker
├─ isComplex + !highFidelity → Smart Cascade
├─ isComplex + highFidelity → error (needs Anthropic key)
└─ no keys → error

Inline Path (Small Documents, Explicit Parser Selection)

parser === 'cascade'
└─ Smart Cascade: Marker → Mathpix → Agentic Vision (tiered per page)
parser === 'auto' (inline, not async)
├─ budget tier → Smart Cascade (budget mode, Marker only)
├─ !isComplex + Marker → Marker
├─ isComplex + !highFidelity → Smart Cascade
├─ isComplex + highFidelity + >10 pages → Chunked Agentic Vision
├─ isComplex + highFidelity + ≤10 pages → Agentic Vision (whole document)
├─ !isComplex + highFidelity → Agentic Vision
├─ !isComplex + no Marker → Claude single-pass
└─ no Anthropic → Mathpix fallback → Marker fallback

6. Processing Pipelines

6.1 Mathpix Pipeline

Source: services/mathpix-pdf.ts, routes/convert.ts (lines 1904–2025)

Best for: PDFs with mathematical equations, scientific notation, LaTeX content.

Flow:

  1. Split PDF into individual pages.
  2. Submit each page to https://api.mathpix.com/v3/pdf (concurrency limit: 3).
  3. Poll GET /v3/pdf/{pdfId} every 3 seconds, up to 5 minutes.
  4. Download HTML + images via /v3/pdf/{pdfId}.html.zip.
  5. Embed extracted images as data URIs in page HTML.
  6. Wrap each page in <section class="pdf-page" role="region">.
  7. Continue to post-processing.

AI involved: Mathpix proprietary ML models (math recognition, OCR). Cost: ~$0.005/page.


6.2 Marker Fast Path

Source: services/marker-converter.ts

Best for: Text-only PDFs, simple tables without images.

Flow:

  1. POST https://api.datalab.to/api/v1/marker with output_format: html, paginate_output: true.
  2. Poll GET /api/v1/marker/{request_id} every 3 seconds, up to 5 minutes.
  3. Receive HTML + extracted images (base64 from response).
  4. Fallback: if HTML not provided, convert Markdown output to basic HTML.
  5. Continue to post-processing.

AI involved: Datalab Surya OCR (deep learning model). No Claude/Gemini calls. Cost: ~$0.006/page.


6.3 Async Chunked Pipeline (Primary Production Path)

Source: services/chunk-boundary-detector.ts, scheduler/chunk-scheduler.ts, services/chunk-processor.ts, services/chunk-assembler.ts

Best for: Complex PDFs of any size; this is the default for production conversions.

Step 1 β€” Chunk Boundary Detection

Source: services/chunk-boundary-detector.ts - detectChunkBoundaries()

  1. Read PDF outline (bookmarks) up to 2 levels deep; bookmark pages become natural break points.
  2. Fallback: per-page heading detection, which checks the first text item on each page against regex patterns (numbered sections, “Chapter”/“Part”/“Section” keywords, Roman numerals, ALL-CAPS lines).
  3. Greedy chunking: target 20 pages per chunk (TARGET_CHUNK_SIZE_PAGES), snap to nearest natural break within lookahead, hard-split at 30 pages (MAX_CHUNK_SIZE_PAGES).
  4. Create one chunk_jobs row per boundary in Supabase, plus a large_conversion_jobs parent record.
  5. Return immediately to client: { jobId, asyncMode: true, totalChunks }.
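
The greedy chunking in step 3 might look roughly like the sketch below. The two size constants come from this document; LOOKAHEAD_PAGES, the function name, and the choice to split exactly at the target when no break is in range are assumptions, since the lookahead window and that fallback are not specified here.

```typescript
const TARGET_CHUNK_SIZE_PAGES = 20;
const MAX_CHUNK_SIZE_PAGES = 30;
const LOOKAHEAD_PAGES = 10; // assumption: the real lookahead window is not documented

interface ChunkBoundary {
  startPage: number; // 1-based, inclusive
  endPage: number; // 1-based, inclusive
}

// breakPages: pages on which a new section starts (from bookmarks or headings).
function chunkPages(totalPages: number, breakPages: number[]): ChunkBoundary[] {
  const chunks: ChunkBoundary[] = [];
  let start = 1;
  while (start <= totalPages) {
    const idealNext = start + TARGET_CHUNK_SIZE_PAGES; // first page of the next chunk
    let next = idealNext;
    if (idealNext <= totalPages) {
      // Snap to the closest section start within the lookahead window,
      // never letting the chunk exceed the hard maximum.
      let bestDistance = LOOKAHEAD_PAGES + 1;
      for (const page of breakPages) {
        const size = page - start;
        const distance = Math.abs(page - idealNext);
        if (size >= 1 && size <= MAX_CHUNK_SIZE_PAGES && page <= totalPages && distance < bestDistance) {
          bestDistance = distance;
          next = page;
        }
      }
    }
    const end = Math.min(next - 1, totalPages);
    chunks.push({ startPage: start, endPage: end });
    start = end + 1;
  }
  return chunks;
}
```

For a 50-page PDF with section starts on pages 18 and 41, this yields chunks 1-17, 18-40, and 41-50: each split lands on a section start, and the 23-page middle chunk stays under the 30-page cap.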

Step 2 β€” Chunk Scheduling

Source: scheduler/chunk-scheduler.ts - ChunkScheduler

Runs continuously on the Node.js server. Every 3 seconds:

  1. Query Supabase for pending chunks.
  2. Claim up to 8 chunks (MAX_CONCURRENT_CHUNKS) using optimistic locking: UPDATE chunk_jobs SET status='processing' WHERE id=? AND status='pending'. If another node already claimed it, the update affects 0 rows → skip.
  3. Process each claimed chunk via processChunk().
  4. Every 5 cycles: reclaim stale chunks (processing > 15 minutes).
  5. Every 10 cycles: recover orphaned jobs (all chunks done but counter mismatch).
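
The optimistic-locking claim in step 2 can be illustrated with an in-memory stand-in for the chunk_jobs table. ChunkTable and claimChunks are hypothetical names; the real scheduler issues the conditional UPDATE against Supabase, where the database guarantees that only one node's update matches a pending row.

```typescript
type ChunkStatus = "pending" | "processing" | "done";

interface ChunkJob {
  id: string;
  status: ChunkStatus;
}

const MAX_CONCURRENT_CHUNKS = 8;

// In-memory stand-in for the chunk_jobs table (hypothetical).
class ChunkTable {
  private rows: ChunkJob[];

  constructor(rows: ChunkJob[]) {
    this.rows = rows;
  }

  pendingIds(): string[] {
    return this.rows.filter((row) => row.status === "pending").map((row) => row.id);
  }

  // Mirrors: UPDATE chunk_jobs SET status='processing' WHERE id=? AND status='pending'.
  // Returns the number of affected rows (0 or 1).
  claim(id: string): number {
    for (const row of this.rows) {
      if (row.id === id && row.status === "pending") {
        row.status = "processing";
        return 1;
      }
    }
    return 0;
  }
}

// One scheduler cycle: try to claim up to `limit` of the candidate chunks.
function claimChunks(table: ChunkTable, candidateIds: string[], limit: number = MAX_CONCURRENT_CHUNKS): string[] {
  const claimed: string[] = [];
  for (const id of candidateIds) {
    if (claimed.length >= limit) break;
    // 0 affected rows means another node claimed the chunk first: skip it.
    if (table.claim(id) === 1) claimed.push(id);
  }
  return claimed;
}
```

If two nodes read the same list of pending IDs, the first to run claimChunks gets the chunks and the second gets an empty list, so no chunk is processed twice.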

Step 3 β€” Per-Chunk Processing

Source: services/chunk-processor.ts - processChunk()

  1. Extract the chunk’s page range from the PDF via extractPageRange().
  2. Run convertWithChunkedAgenticVision(), with Gemini Flash as primary and Claude Sonnet as fallback.
  3. Maximum 4 iterations per page (maxIterationsPerPage).
  4. Provide precedingContextHtml (last 2,500 chars from previous chunk) for structural continuity across chunk boundaries.
  5. Wrap result in <section class="pdf-chunk" data-chunk-index="N" data-start-page="X" data-end-page="Y">.
  6. Store chunk HTML to R2 at users/{userId}/jobs/{jobId}/chunks/{chunkIndex}.html.
  7. Store contextTail back to chunk_jobs.context_tail for the next chunk.
  8. Atomically increment large_conversion_jobs.done_chunks via Supabase RPC.
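
Steps 4 to 7 involve three small string operations, sketched below. The function names are hypothetical; the wrapper markup, key pattern, and 2,500-character tail come from this document. Whether the tail is taken before or after wrapping is not stated, and a plain slice can cut through a tag, so the real code may trim more carefully.

```typescript
const CONTEXT_TAIL_CHARS = 2500;

// Last CONTEXT_TAIL_CHARS characters of a chunk, passed to the next chunk
// as precedingContextHtml. Shorter inputs are returned unchanged.
function contextTail(chunkHtml: string): string {
  return chunkHtml.slice(-CONTEXT_TAIL_CHARS);
}

// Wraps converted HTML in the per-chunk section element from step 5.
function wrapChunk(html: string, chunkIndex: number, startPage: number, endPage: number): string {
  return (
    `<section class="pdf-chunk" data-chunk-index="${chunkIndex}" ` +
    `data-start-page="${startPage}" data-end-page="${endPage}">` +
    html +
    `</section>`
  );
}

// R2 object key for a stored chunk, following the pattern in step 6.
function chunkObjectKey(userId: string, jobId: string, chunkIndex: number): string {
  return `users/${userId}/jobs/${jobId}/chunks/${chunkIndex}.html`;
}
```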

Per-page model routing within a chunk:

| Page Content Type | Primary Model | Escalation (if score stalls) |
| --- | --- | --- |
| text or table | Claude Haiku | Claude Sonnet |
| image, math, or mixed | Claude Sonnet | n/a |

Step 4 β€” Assembly

Source: services/chunk-assembler.ts - assembleChunks()

Triggered when done_chunks reaches total_chunks:

  1. Load all chunk HTMLs from R2 in parallel.
  2. Concatenate and repair malformed HTML (auto-close unclosed tags at chunk boundaries).
  3. Extract embedded PDF images and insert them into the <img> tags emitted by the vision model.
  4. Run the full post-processing pipeline.
  5. Store final HTML to R2.
  6. Deduct credits, send completion email and web push notification.

Progress streaming: Clients subscribe to GET /api/convert/:fileId/stream (SSE) for real-time progress, chunk, complete, and error events.


6.4 Agentic Vision Converter (Core AI Engine)

Source: services/agentic-vision-converter.ts

This is the central AI engine used by the chunked pipeline, smart cascade, and high-fidelity modes. It implements an iterative visual feedback loop.

How It Works

  1. Initial pass: Send the PDF (base64) to the vision model with a detailed prompt specifying semantic HTML rules, MathML requirements, heading hierarchy, figure/image treatment. Receive raw HTML.

  2. Screenshot refinement loop (up to maxIterations):

    1. Render current HTML in Puppeteer at 1280×1600 viewport.
    2. Take a full-page PNG screenshot.
    3. If layout scorer is configured (Gemini), score the screenshot against the original PDF.
      • Score ≥ 90 (layoutScoreThreshold) → stop early, quality is sufficient.
      • Score delta < 3 for 2+ consecutive passes → stalling → escalate to a more expensive model if fallbackStrategy is configured.
    4. Send original PDF + screenshot + current HTML to the model with a refinement prompt (“fix visual differences”).
    5. If model responds NO_CHANGES_NEEDED → stop (converged).
    6. Update HTML with refined version.
    6. Update HTML with refined version.
  3. Return final HTML + token usage + models used.
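
The stop and escalate rules of the refinement loop can be expressed as a pure decision over the score history. This is a sketch under assumptions: decideNextStep and the constant names are hypothetical, "delta" is read as the signed improvement between consecutive passes, and the NO_CHANGES_NEEDED and maxIterations exits are handled elsewhere in the loop.

```typescript
const LAYOUT_SCORE_THRESHOLD = 90;
const STALL_DELTA = 3;
const STALL_PASSES = 2;

type LoopDecision = "stop" | "escalate" | "continue";

// scores: layout scores from each pass so far, oldest first.
function decideNextStep(scores: number[], hasFallbackStrategy: boolean): LoopDecision {
  if (scores.length === 0) return "continue";
  const latest = scores[scores.length - 1];
  if (latest >= LAYOUT_SCORE_THRESHOLD) return "stop";

  // Count consecutive trailing passes whose improvement was below STALL_DELTA.
  let stalled = 0;
  for (let i = scores.length - 1; i > 0; i--) {
    if (scores[i] - scores[i - 1] < STALL_DELTA) stalled++;
    else break;
  }
  if (stalled >= STALL_PASSES && hasFallbackStrategy) return "escalate";
  return "continue";
}
```

For example, a history of 70, 71, 72 shows two passes that each gained less than 3 points, so the loop escalates when a fallback strategy exists and otherwise keeps iterating until maxIterations.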

Model Strategies

| Strategy | SDK | Notes |
| --- | --- | --- |
| ClaudeVisionStrategy | Anthropic SDK | Sends PDF as document block with cache_control: ephemeral for prompt caching |
| GeminiVisionStrategy | Google Generative AI SDK | Sends PDF as inlineData with mimeType: application/pdf |

6.5 Smart Cascade Converter

Source: services/smart-cascade-converter.ts

Best for: Cost-sensitive conversions. It uses cheap tools first and escalates to expensive vision models only when quality is insufficient.

Per-Page Tiered Escalation

| Page Type | Tier 1 (cheapest) | Tier 2 (if quality < 80) | Tier 3 (if still < 80) |
| --- | --- | --- | --- |
| text or table | Marker API | Gemini Flash vision | Claude Sonnet agentic |
| math | Marker + temml | Mathpix per-page image API | Gemini Flash vision |
| image | Gemini Flash vision | Claude Sonnet agentic | n/a |
| mixed | Gemini Flash vision | Claude Sonnet agentic | n/a |

Quality Scoring

Each tier’s output is scored before deciding whether to escalate:

  • WCAG validation violations: up to −40 penalty
  • Semantic HTML ratio: up to +30 bonus (measures proportion of semantic elements vs raw <div>/<span>)
  • Structure bonuses: lang attribute, <title>, valid heading hierarchy
  • Threshold: Score must reach 80 (qualityThreshold) to accept a tier’s result
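
The escalation rule (accept the first tier whose score reaches the threshold) can be sketched as follows. The Tier and TierResult shapes and runTiers are hypothetical, the scorer is injected because the exact scoring formula is not given here, and returning the best-scoring attempt when no tier reaches 80 is an assumption.

```typescript
const QUALITY_THRESHOLD = 80;

interface Tier {
  name: string;
  convert: () => string; // returns HTML; real tiers would be async API calls
}

interface TierResult {
  tier: string;
  html: string;
  score: number;
}

// Tries tiers from cheapest to most expensive. Returns the first result at or
// above the threshold, else the best-scoring attempt (an assumption), else null.
function runTiers(
  tiers: Tier[],
  scoreHtml: (html: string) => number,
  threshold: number = QUALITY_THRESHOLD,
): TierResult | null {
  let best: TierResult | null = null;
  for (const tier of tiers) {
    const html = tier.convert();
    const result: TierResult = { tier: tier.name, html, score: scoreHtml(html) };
    if (result.score >= threshold) return result; // later tiers are never invoked
    if (best === null || result.score > best.score) best = result;
  }
  return best;
}
```

Because the loop returns as soon as a tier passes, the more expensive tiers are never called for pages that a cheap tier handles well, which is where the cost saving comes from.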

Budget Mode

When budgetMode: true:

  • Marker-only, no escalation to vision models.
  • Hard cost cap enforced per page (maxCostUsd).

Concurrency: Up to 8 pages in parallel (maxPagesParallel). Uses a pool where each slot refills as a page finishes, so slow vision pages do not hold up fast text pages.


6.6 Other Parsers

| Parser | Source | Trigger | AI Involved |
| --- | --- | --- | --- |
| claude-vision (explicit) | claude-converter.ts | User explicitly selects “Claude Vision” | Claude Sonnet single-pass (no iteration) |
| segmented | convert.ts | User selects; requires Mathpix | Mathpix (structure) + vision models (images) |
| vision-tables | convert.ts | User selects table extraction | detectComplexContentPerPage() to find table pages, then Claude vision per page |
| DOCX | convert.ts | .docx file uploaded | mammoth.js (local, no AI) |
| Image passthrough | convert.ts | Image file uploaded | Vision model for alt text only |

7. Post-Processing Pipeline

Applied after every converter, in this order. No matter which route a file took, it goes through the same post-processing.

Source: routes/convert.ts (lines 1497–1833), services/chunk-assembler.ts

Step-by-Step

| Step | Function | What It Does | AI? |
| --- | --- | --- | --- |
| 1 | enhanceImagesInHtml() | For each extracted image, calls a vision model to generate descriptive alt text. Uses isAltTextAcceptable() blocklist to reject generic captions like “diagram”. | Yes: Gemini Flash, Claude, or GPT-4o-mini |
| 2 | storeAndEmbedImages() | Stores images to R2 and embeds as data URIs in HTML | No |
| 3 | Image extraction fallback | If converter produced no images but classification shows image/mixed pages, extracts embedded PDF image objects via extractImagesFromPdfPages(). If that fails, falls back to full-page Puppeteer screenshots via renderPdfPagesAsDataUris(). | No |
| 4 | structurePages() | Adds page header/footer banners to each <section class="pdf-page">, wraps in page-numbered sections | No |
| 5 | optimizeDeterministic() | Pure HTML transforms (no AI): adds <thead>/<tbody> to tables, promotes first row to <th>, adds scope attributes, converts <br> sequences to <p>, adds aria-label/role="img" to SVGs, cleans unnecessary wrapper <div>s, converts LaTeX to MathML via temml | No |
| 6 | enhanceAccessibility() | Adds lang attribute, <title>, viewport meta, skip-link, source document banner, ensures DOCTYPE | No |
| 7 | validateAndFix() | Custom WCAG rule checker that auto-fixes: missing alt text, empty links/buttons, missing table headers, duplicate IDs, empty headings, invalid scope attributes. Up to 3 fix passes. | No |
| 8 | runAxeAudit() | Full browser-based accessibility audit using axe-core in Puppeteer. Non-blocking; skipped if browser unavailable. | No |
| 9 | runAxeFixLoop() | If fixable violations remain after step 7, runs automated DOM manipulation fixes. | No |
| 10 | wrapInDocument() | Wraps in <!DOCTYPE html> skeleton with responsive CSS. High-fidelity mode adds serif fonts, table borders, figure styling. | No |

Final Steps

| Step | What Happens |
| --- | --- |
| Storage | Final HTML → R2 at users/{userId}/output/{fileId}/index.html |
| Credit deduction | 1 credit per page via Supabase RPC deduct_credits |
| WCAG failure alert | If wcagStatus.passed === false, email alert sent to ALERT_EMAIL (rate-limited: once per fileId per 24 hours) |
| Notification | Completion email + web push notification to user |

8. AI Services Reference

| Service | Model ID | Provider | Role | When Used | Approx. Cost |
| --- | --- | --- | --- | --- | --- |
| Claude Sonnet | claude-sonnet-4-6 | Anthropic | Primary vision converter for complex/mixed/image pages; iterative refinement with screenshot feedback | Route 3 (complex pages), Smart Cascade Tier 3, high-fidelity mode | ~$3/MTok in, $15/MTok out |
| Claude Haiku | claude-haiku-4-5-20251001 | Anthropic | Cheaper vision for text/table pages; API key pre-flight validation | Route 3 (simple pages), alt text generation fallback | ~$0.80/MTok in, $4/MTok out |
| Gemini Flash | gemini-2.5-flash | Google | First-pass converter in chunk pipeline; layout quality scoring; image pages in cascade | Route 3 primary strategy, Smart Cascade Tier 2, layout scorer | Variable |
| Mathpix | Mathpix API | Mathpix | Native math/equation extraction (LaTeX, MathML output) | Route 1 (math-detected PDFs), Smart Cascade Tier 2 for math pages | ~$0.005/page |
| Marker / Surya | Datalab API | Datalab | Text extraction OCR for text-only PDFs | Route 2 (pure text), Smart Cascade Tier 1 | ~$0.006/page |
| Gemini Flash (images) | gemini-flash | Google | Alt text generation for extracted images | Post-processing step 1 | ~$0.0003/image |
| GPT-4o-mini | gpt-4o-mini | OpenAI | Optional alt text generation (if configured as imageModel) | Post-processing step 1 (optional) | ~$0.0004/image |
| Claude (images) | Haiku or Sonnet | Anthropic | Alt text generation when Gemini unavailable | Post-processing step 1 (fallback) | $0.002–$0.01/image |
| temml | n/a | Local library | LaTeX → MathML rendering (no external calls) | Post-processing step 5, Marker+temml path | Free |

9. Key Files Reference

| File | Role |
| --- | --- |
| workers/api/src/routes/convert.ts | Main entry: pre-checks, routing decisions, inline pipeline orchestration |
| workers/api/src/routes/files.ts | File upload, URL ingestion, file management |
| workers/api/src/utils/file-list.ts | File metadata CRUD: Supabase files table with camelCase↔snake_case mapping |
| workers/api/src/routes/convert-stream.ts | SSE endpoint for real-time chunk progress |
| workers/api/src/services/pdf-complexity-detector.ts | Zero-cost PDF structure analysis: images, tables, math fonts per page |
| workers/api/src/services/pdf-preflight.ts | Pre-flight checks: encryption, image-only, corruption, JavaScript |
| workers/api/src/services/chunk-boundary-detector.ts | Section break detection using PDF outline and heading patterns |
| workers/api/src/scheduler/chunk-scheduler.ts | Background job runner: claims, processes, assembles chunks |
| workers/api/src/services/chunk-processor.ts | Single chunk processing using Gemini-first/Claude-fallback per-page |
| workers/api/src/services/chunk-assembler.ts | Stitches chunk fragments into final WCAG-compliant HTML document |
| workers/api/src/services/agentic-vision-converter.ts | Core iterative AI converter: initial pass + screenshot feedback loop |
| workers/api/src/services/smart-cascade-converter.ts | Per-page tiered routing: Marker → Gemini → Claude |
| workers/api/src/services/marker-converter.ts | Datalab Marker/Surya API client |
| workers/api/src/services/mathpix-pdf.ts | Mathpix API client for math-heavy PDFs |
| workers/api/src/services/wcag-validator.ts | WCAG rule checker + auto-fixer (no AI) |
| workers/api/src/services/image-enhancer.ts | AI-powered alt text generation |
| workers/api/src/server.ts | Node.js server entry with ChunkScheduler startup |
| packages/shared/src/types.ts | UploadedFile, ParserOptions, FileStatus, QualityTier types |
| packages/shared/src/constants.ts | TARGET_CHUNK_SIZE_PAGES (20), MAX_CHUNK_SIZE_PAGES (30), CONTEXT_TAIL_CHARS (2500) |