Image Embedding Strategy
Decision
All images in converted HTML output are embedded as inline base64 data URIs. The final deliverable is a single self-contained .html file with no external dependencies.
How It Works
The pipeline handles images at three points:
- Extracted images (Marker, MathPix): `storeAndEmbedImages()` converts each image to a `data:image/png;base64,...` URI and replaces `src` references in the HTML. Images are also stored to R2 for archival, but the R2 copies are never referenced from the HTML.
- Page screenshots (vision converters): When vision converters produce `<img>` tags but no actual image data (`extractedImages.length === 0`), `renderPdfPagesAsDataUris()` renders the original PDF pages as PNGs via Browser Rendering + pdf.js, then `embedPageScreenshots()` injects them as data URIs into the matching `<section data-page-number="N">` blocks.
- AI-enhanced alt text: `enhanceImagesInHtml()` runs before embedding, so images still have their original filenames for matching. Alt text is written to the `<img>` tags, then embedding replaces the `src` with a data URI.
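The embedding step for extracted images can be sketched roughly as follows. This is a minimal illustration, not the code in `convert.ts`: the names `toDataUri()` and `uint8ToBase64()` appear under Key Files, but their signatures here are assumptions, and `embedImages()` is a hypothetical stand-in for the `src`-replacement part of `storeAndEmbedImages()`. Matching is by original filename, which is why alt-text enhancement has to run first.

```typescript
// Sketch under stated assumptions; real signatures in html.ts may differ.

/** Base64-encode raw bytes (btoa is available in Workers, browsers, and Node 16+). */
function uint8ToBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

/** Wrap bytes in a data URI; the MIME type default is an assumption. */
function toDataUri(bytes: Uint8Array, mime: string = "image/png"): string {
  return `data:${mime};base64,${uint8ToBase64(bytes)}`;
}

/**
 * Hypothetical helper: swap src="<filename>" for a data URI wherever bytes
 * for that filename are known. Unknown filenames are left untouched.
 */
function embedImages(html: string, images: Map<string, Uint8Array>): string {
  const imgSrc = /(<img\b[^>]*?\bsrc=")([^"]+)(")/g;
  return html.replace(imgSrc, (match: string, pre: string, src: string, post: string) => {
    const bytes = images.get(src);
    return bytes ? `${pre}${toDataUri(bytes)}${post}` : match;
  });
}
```

A regex is enough for a sketch, but it only handles double-quoted `src` attributes; a production version would more likely use an HTML rewriter or parser.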
Scale factor
Page screenshots use a scale factor of 1.5 (150% of the default viewport). This balances quality against file size: scale 2.0 produces sharper images but renders about 1.8x as many pixels, which roughly doubles the PNG byte count.
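To make the size trade-off concrete, here is a small arithmetic sketch. It assumes a US Letter page whose default viewport is 612 x 792 units (its size in PDF points); the actual viewport depends on the source PDF. Pixel count grows with the square of the scale factor, and PNG byte count follows it only approximately because compression depends on page content.

```typescript
// Illustrative arithmetic only; the page dimensions are an assumption.
interface Size {
  width: number;
  height: number;
}

/** Pixel dimensions of a page rendered at the given scale factor. */
function scaledPixels(page: Size, scale: number): Size {
  return {
    width: Math.round(page.width * scale),
    height: Math.round(page.height * scale),
  };
}

/** How many times more pixels scaleA renders than scaleB. */
function pixelRatio(scaleA: number, scaleB: number): number {
  const linear = scaleA / scaleB;
  return linear * linear;
}

const letter: Size = { width: 612, height: 792 };
console.log(scaledPixels(letter, 1.5)); // { width: 918, height: 1188 }
console.log(scaledPixels(letter, 2.0)); // { width: 1224, height: 1584 }
console.log(pixelRatio(2.0, 1.5).toFixed(2)); // "1.78"
```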
Why Inline Data URIs
| Benefit | Detail |
|---|---|
| Single-file portability | Users download one .html file and open it anywhere — no broken images, no server dependency |
| Offline access | Works completely offline after download |
| No asset management | No CDN, no signed URLs, no expiring links, no CORS |
| Accessibility tools | Screen readers and assistive tech work identically whether the file is local or hosted |
| Simplicity | No need to coordinate HTML + image uploads or generate asset manifests |
Trade-offs
| Cost | Detail |
|---|---|
| File size | Base64 encoding adds ~33% overhead. A full-page PNG at scale 1.5 is typically 500KB–2MB. A 20-page image-heavy PDF can produce a 20–40MB HTML file. |
| Browser performance | Very large data URIs (>50MB total) can slow DOM parsing and increase memory usage |
| Redundant R2 storage | storeAndEmbedImages stores images to R2 and embeds them inline — the R2 copies exist only for archival/debugging |
| No incremental loading | Image bytes are part of the HTML stream, so content after a large image cannot render until that image's data has arrived (no lazy-loading of external images) |
| Cache inefficiency | If the same image appears in multiple conversions, each HTML file embeds its own copy rather than sharing a cached asset |
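The file-size figures above can be checked with a short estimate. This is a sketch that counts only the embedded image bytes and ignores the surrounding HTML; the function names are hypothetical. Padded base64 turns every 3 raw bytes into 4 characters, which is where the ~33% overhead comes from.

```typescript
// Hypothetical estimator; counts image data only, not markup or text.
const DATA_URI_PREFIX = "data:image/png;base64,";

/** Exact length of the padded base64 encoding of `byteLength` raw bytes. */
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

/** Bytes added to the HTML by `pageCount` screenshots of `avgPngBytes` each. */
function embeddedImageBytes(pageCount: number, avgPngBytes: number): number {
  return pageCount * (DATA_URI_PREFIX.length + base64Length(avgPngBytes));
}

const MB = 1e6;
const low = embeddedImageBytes(20, 0.5 * MB); // 20 pages at 500KB each
const high = embeddedImageBytes(20, 2 * MB); // 20 pages at 2MB each
console.log(`${(low / MB).toFixed(1)}MB to ${(high / MB).toFixed(1)}MB`);
// prints "13.3MB to 53.3MB"
```

With the per-page range from the table, a 20-page document lands between roughly 13MB and 53MB, so the 20–40MB figure corresponds to pages averaging about 0.75MB to 1.5MB each.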
Alternative: External R2 References
If the product evolves to include a hosted preview mode (viewing converted documents in-browser without downloading), switching to external R2-referenced URLs would reduce HTML file sizes significantly.
The R2 path users/{userId}/output/{fileId}/assets/ already exists but is unused. An external-image approach would:
- Store images to R2 at `assets/page-{N}.png`
- Reference them as `<img src="/api/assets/{fileId}/page-{N}.png">`
- Require signed URLs or auth middleware for private documents
- Require a zip download option (HTML + images folder) for offline use
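For illustration, the path and URL shapes listed above could be generated as follows. This is only a sketch of the alternative that is not being pursued: the helper names are hypothetical, and signing or auth is deliberately left out.

```typescript
// Hypothetical helpers; only the path and URL shapes come from the text above.

/** R2 object key for a page image under the existing assets/ prefix. */
function assetKey(userId: string, fileId: string, page: number): string {
  return `users/${userId}/output/${fileId}/assets/page-${page}.png`;
}

/** URL the HTML would reference instead of an inline data URI. */
function assetUrl(fileId: string, page: number): string {
  return `/api/assets/${encodeURIComponent(fileId)}/page-${page}.png`;
}

/** Escape a string for use inside a double-quoted HTML attribute. */
function escapeAttr(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;").replace(/</g, "&lt;");
}

/** Build the <img> tag for one page, keeping the alt text. */
function pageImageTag(fileId: string, page: number, alt: string): string {
  return `<img src="${assetUrl(fileId, page)}" alt="${escapeAttr(alt)}">`;
}
```

Because the URL carries only `fileId`, the route handler would have to resolve the owning user from the session before reading the R2 key, which is where the auth requirement in the list comes from.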
Current decision: Not pursuing this. The single-file download model is simpler and meets current user needs. Revisit if file sizes become a user complaint or if a hosted preview feature is added.
Key Files
| File | Role |
|---|---|
| `workers/api/src/routes/convert.ts` | `storeAndEmbedImages()`, `embedPageScreenshots()`, pipeline orchestration |
| `workers/api/src/utils/pdf-to-png.ts` | `renderPdfPageToPng()`, `renderPdfPagesAsDataUris()` |
| `workers/api/src/utils/html.ts` | `toDataUri()`, `uint8ToBase64()` |
| `workers/api/src/services/image-enhancer.ts` | AI alt-text generation (runs before embedding) |