source.html — Debugging Converter Output
For every successful conversion the platform stores three HTML artifacts in R2:
| Artifact | Path | What it is |
|---|---|---|
| `index.html` | `users/{userId}/output/{fileId}/index.html` | Final user-facing HTML, after all post-processing (ux-optimizer, mathml-validator, axe-fix-loop, visual-polish, etc.). This is what the dashboard preview and the WeasyPrint pipeline see. |
| `ir.xhtml` | `users/{userId}/output/{fileId}/ir.xhtml` | XHTML 1.0 Strict version of the same content (EPUB-ready). |
| `source.html` | `users/{userId}/output/{fileId}/source.html` | Raw converter output before any post-processing. This is what Claude vision / Mathpix / Marker actually emitted, before the deterministic clean-up passes mutated it. |
source.html is the most important debugging artifact in the system. Use it whenever a conversion’s output looks wrong.
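
The path layout in the table above can be sketched as a small key builder. This is a local stand-in for throwaway scripts, not the shared helper: `outputKey`, `s3Uri`, and the `Artifact` type are hypothetical names, and only the path shape is taken from the table.

```typescript
// Hypothetical stand-in that mirrors the path layout in the artifact table.
// In real code, prefer the shared R2_PATHS helper.
type Artifact = "index.html" | "ir.xhtml" | "source.html";

// Build the R2 object key for one of the three stored HTML artifacts.
function outputKey(userId: string, fileId: string, artifact: Artifact): string {
  return `users/${userId}/output/${fileId}/${artifact}`;
}

// Turn a bucket name and key into the s3:// URI form used by the AWS CLI.
function s3Uri(bucket: string, key: string): string {
  return `s3://${bucket}/${key}`;
}
```

For example, `s3Uri("accessible-pdf-files", outputKey(userId, fileId, "source.html"))` yields the URI form that the AWS CLI expects.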
Why source.html exists
The post-processing passes are defensive — they catch malformed output and try to make it presentable. That defense erases evidence of the original bug:
- `mathml-validator` wraps `<math>raw LaTeX</math>` blocks in `<code class="math-fallback">`, at which point you can no longer tell whether the converter emitted the bad math or whether some upstream step corrupted it.
- `wrapBareLatex` wraps any plaintext with two or more LaTeX commands in the same `<code class="math-fallback">` shape; the original surrounding context is gone.
- `convertBareLatexToMath` (#527) promotes bare LaTeX to a real `<math>` element. That is useful, but it means you can’t tell from `index.html` whether Claude emitted the `<math>` or whether we synthesised it.
- Image extraction renames files to `images/page-N-img-M.png`; if Claude referenced `images/foo.png` in its raw markup, that’s lost.
source.html is captured before any of those passes run, so it shows exactly the shape Claude (or whatever converter ran) produced. Most converter regressions are diagnosed by diffing source.html against index.html.
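
To make the "two or more LaTeX commands" rule concrete, here is a minimal sketch of such a heuristic. It is an assumption about the general shape of the check, not the actual `wrapBareLatex` code; `countLatexCommands` and `looksLikeBareLatex` are hypothetical names, and the real pass may use a different pattern or threshold logic.

```typescript
// Hypothetical re-creation of the "two or more LaTeX commands" rule.
// A "command" here is a backslash followed by one or more ASCII letters.
const LATEX_COMMAND = /\\[A-Za-z]+/g;

// Count LaTeX-looking commands in a run of plain text.
function countLatexCommands(text: string): number {
  return (text.match(LATEX_COMMAND) ?? []).length;
}

// True when the text would plausibly be treated as bare LaTeX.
function looksLikeBareLatex(text: string): boolean {
  return countLatexCommands(text) >= 2;
}
```

Running this over text nodes in source.html gives a rough idea of which paragraphs a pass of this kind would wrap, which helps when matching `math-fallback` blocks in index.html back to their origin.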
How to retrieve source.html
Option 1 — REST API (recommended)
The export router exposes it at:

    GET /api/export/source-html/:fileId

The request must be authenticated as the file’s owner. The response is the raw HTML with the appropriate MIME type, plus the standard Cache-Control and CORS headers.
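
A minimal sketch of calling this route from a script follows. Only the route path comes from this page; the API base URL, the bearer-token auth scheme, and the function names `sourceHtmlUrl` and `fetchSourceHtml` are assumptions for illustration, so adapt them to however your session is actually authenticated.

```typescript
// Build the export URL for a file. Only the route path is from the docs;
// apiBase is whatever host serves the API (assumption).
function sourceHtmlUrl(apiBase: string, fileId: string): string {
  const base = apiBase.replace(/\/+$/, ""); // drop trailing slashes
  return `${base}/api/export/source-html/${encodeURIComponent(fileId)}`;
}

// Fetch source.html as text. Bearer-token auth is an assumption here;
// the real API may use cookies or another scheme.
async function fetchSourceHtml(
  apiBase: string,
  fileId: string,
  token: string,
): Promise<string> {
  const res = await fetch(sourceHtmlUrl(apiBase, fileId), {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) {
    throw new Error(`source.html request failed: ${res.status} ${res.statusText}`);
  }
  return res.text();
}
```

Throwing on a non-2xx status keeps an authentication or ownership failure from being mistaken for an empty artifact.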
Option 2 — Direct from R2 (admin)
Build the path with the shared helper:

    import { R2_PATHS } from '@accessible-pdf/shared';

    const key = R2_PATHS.sourceHtml(userId, fileId);
    // users/{userId}/output/{fileId}/source.html

Then pull via the AWS S3 client pointed at the R2 endpoint:

    AWS_ACCESS_KEY_ID=$R2_ACCESS_KEY_ID \
    AWS_SECRET_ACCESS_KEY=$R2_SECRET_ACCESS_KEY \
    aws s3 cp \
      s3://accessible-pdf-files/users/{userId}/output/{fileId}/source.html \
      /tmp/source.html \
      --endpoint-url https://c6cce84d1636ec85ec946a19edef0103.r2.cloudflarestorage.com

R2 credentials live in /home/larry/accessible/.env.node-server on 10.1.1.4.
Option 3 — From a recent conversion (quick check)
List artifacts for any file:

    aws s3 ls \
      s3://accessible-pdf-files/users/{userId}/output/{fileId}/ \
      --endpoint-url https://{accountId}.r2.cloudflarestorage.com

Expected listing: `accessible.pdf`, `cost-analysis.html`, `index.html`, `ir.xhtml`, `source.html`, `styled.html`, plus per-conversion images under `assets/`.
If source.html is missing for a recent conversion, file an issue — its absence is the bug.
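
The listing check can also be scripted. The sketch below assumes you have already collected the object names from the listing (for example, the last column of the `aws s3 ls` output); `missingArtifacts` is a hypothetical helper, and the expected names are the ones given in the listing above.

```typescript
// File names expected for a successful conversion, per the listing above.
const EXPECTED_ARTIFACTS = [
  "accessible.pdf",
  "cost-analysis.html",
  "index.html",
  "ir.xhtml",
  "source.html",
  "styled.html",
] as const;

// Return the expected artifacts that do not appear in the listing.
// Accepts bare file names or full keys; only the last path segment is compared.
function missingArtifacts(listedKeys: string[]): string[] {
  const names = new Set(listedKeys.map((key) => key.split("/").pop() ?? key));
  return EXPECTED_ARTIFACTS.filter((name) => !names.has(name));
}
```

A result that contains `source.html` for a recent conversion is the case worth filing.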
How to read it
source.html is what Claude / Mathpix / Marker actually emitted, not a debug rendering. It is valid HTML and can be opened directly in a browser. Diff it against index.html to see what each post-processing pass changed:

    diff <(curl -s .../source.html) <(curl -s .../index.html) | less

Common diagnoses you can read directly from source.html:
| Symptom in index.html | What to check in source.html |
|---|---|
| `<code class="math-fallback">` wrapping a math element | Did the converter emit `<math>raw LaTeX</math>` (validator wrapped it) or bare `\sum…\binom…` text in a `<p>` (`wrapBareLatex` wrapped it)? |
| Image at the wrong location or missing | What `src` attribute does the source `<img>` use? Does the file actually exist in `uploads/{fileId}/images/`? |
| Table rendered as a list of `<p>` paragraphs | Did the converter emit a `<table>` at all, or did it flatten the layout to flowing text? |
| Math renders wrong only in the PDF | Inspect `source.html` (Claude’s output) and check the converter’s MathML shape: quirky attributes like `tml-med-pad` can survive the HTML but trip the Node-side prerender. |
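
Before reading a full diff, it can help to compare a few structural markers between the two files. The following is a rough triage sketch, not part of the pipeline: it counts markers with regular expressions rather than parsing HTML, and `MARKERS`, `countMarkers`, and `markerDelta` are hypothetical names.

```typescript
// Markers that map onto the symptoms in the table above.
// Regex counting is approximate; it does not parse the HTML.
const MARKERS = {
  math: /<math[\s>]/g,
  mathFallback: /class="math-fallback"/g,
  table: /<table[\s>]/g,
  img: /<img[\s>\/]/g,
};

type MarkerName = keyof typeof MARKERS;

// Count how often each marker occurs in one HTML string.
function countMarkers(html: string): Record<MarkerName, number> {
  const counts = {} as Record<MarkerName, number>;
  for (const name of Object.keys(MARKERS) as MarkerName[]) {
    counts[name] = (html.match(MARKERS[name]) ?? []).length;
  }
  return counts;
}

// Per-marker change from source.html to index.html (index minus source).
function markerDelta(sourceHtml: string, indexHtml: string): Record<MarkerName, number> {
  const before = countMarkers(sourceHtml);
  const after = countMarkers(indexHtml);
  const delta = {} as Record<MarkerName, number>;
  for (const name of Object.keys(MARKERS) as MarkerName[]) {
    delta[name] = after[name] - before[name];
  }
  return delta;
}
```

As a reading aid: a positive `mathFallback` delta when source.html contains no `<math>` elements points toward bare LaTeX having been wrapped, while a positive `math` delta suggests `<math>` elements were synthesised after the converter ran. Treat either as a hint to confirm in the diff, not a verdict.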
When the converter pipeline isn’t chunked-vision
source.html is written by every conversion path that goes through storeIrAndHtml — including chunk-assembler (chunked-vision) and the struct-table route (#525, fixed in #529). Mathpix and pure-Marker conversions that take a different storage path may not produce a source.html. If you’re investigating a non-vision conversion and there’s no source.html, that’s a coverage gap to file as an issue.
Don’t expose source.html to end users
source.html is a developer/admin artifact. It contains the unsanitised converter output (including any prompt-injected content if the source PDF was hostile). The export route requires authentication and ownership, but the artifact itself isn’t styled for users — keep it scoped to debugging tools.
Related
- #525: original tracking issue (“persist pre-validator HTML for post-mortem debugging”)
- #529: closed coverage gap in struct-table conversion path
- #527: equation rendering resilience (used `source.html` to diagnose Claude vision variation)
- #530, #531, #532: image and table issues diagnosed from the `source.html` for file 11c5124a-6048-4390-9bd3-e93affa0f7fd