source.html — Debugging Converter Output

For every successful conversion the platform stores three HTML artifacts in R2:

| Artifact | Path | What it is |
| --- | --- | --- |
| `index.html` | `users/{userId}/output/{fileId}/index.html` | Final user-facing HTML, after all post-processing (ux-optimizer, mathml-validator, axe-fix-loop, visual-polish, etc.). This is what the dashboard preview and the WeasyPrint pipeline see. |
| `ir.xhtml` | `users/{userId}/output/{fileId}/ir.xhtml` | XHTML 1.0 Strict version of the same content (EPUB-ready). |
| `source.html` | `users/{userId}/output/{fileId}/source.html` | Raw converter output before any post-processing. This is what Claude vision / Mathpix / Marker actually emitted, before the deterministic clean-up passes mutated it. |

source.html is the most important debugging artifact in the system. Use it whenever a conversion’s output looks wrong.

Why source.html exists

The post-processing passes are defensive — they catch malformed output and try to make it presentable. That defense erases evidence of the original bug:

mathml-validator wraps <math>raw LaTeX</math> blocks in <code class="math-fallback"> — at which point you can no longer tell whether the converter emitted the bad math or whether some upstream step corrupted it.
wrapBareLatex wraps any plaintext with two or more LaTeX commands in the same <code class="math-fallback"> shape — the original surrounding context is gone.
convertBareLatexToMath (#527) promotes bare LaTeX to a real <math> element. That is useful, but it means you can’t tell from index.html whether Claude emitted the <math> or we synthesised it.
Image extraction renames files to images/page-N-img-M.png — if Claude referenced images/foo.png in its raw markup, that’s lost.

source.html is captured before any of those passes run, so it shows exactly the shape Claude (or whatever converter ran) produced. Most converter regressions are diagnosed by diffing source.html against index.html.
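To make the wrapBareLatex rule above concrete ("two or more LaTeX commands"), here is a minimal sketch of such a check. It is not the platform's implementation: the function names and the regular expression are assumptions chosen for illustration, and the real pass may count commands differently.

```typescript
// Hypothetical illustration only: the real wrapBareLatex heuristic may differ.
// A "LaTeX command" is assumed here to be a backslash followed by ASCII letters.
const LATEX_COMMAND = /\\[A-Za-z]+/g;

/** Count backslash-letter sequences such as \sum or \binom in plain text. */
function countLatexCommands(text: string): number {
  const matches = text.match(LATEX_COMMAND);
  return matches === null ? 0 : matches.length;
}

/** True when the text meets the assumed "two or more commands" threshold. */
function looksLikeBareLatex(text: string): boolean {
  return countLatexCommands(text) >= 2;
}

console.log(looksLikeBareLatex("\\sum_{k=0}^{n} \\binom{n}{k}")); // true
console.log(looksLikeBareLatex("Costs rose by \\$5 last year")); // false
```

Running a check like this over the text nodes of source.html gives a quick hint about which paragraphs a bare-LaTeX pass is likely to have wrapped in index.html.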

How to retrieve source.html

Option 1 — REST API (recommended)

The export router exposes it at:

GET /api/export/source-html/:fileId

Requests must be authenticated as the file’s owner. The response is the raw HTML with the proper MIME type, plus the standard Cache-Control and CORS headers.
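A call to this route might look like the sketch below. Only the route path comes from this document. The base URL, the Bearer-token scheme, and the helper names (buildSourceHtmlRequest, fetchSourceHtml) are assumptions for illustration, so substitute whatever authentication the export router actually expects.

```typescript
// Sketch only: the base URL and the Bearer-token auth scheme are assumptions.
// The document states only that the caller must be the file's owner.
interface SourceHtmlRequest {
  url: string;
  headers: Record<string, string>;
}

function buildSourceHtmlRequest(baseUrl: string, fileId: string, token: string): SourceHtmlRequest {
  const trimmed = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${trimmed}/api/export/source-html/${encodeURIComponent(fileId)}`,
    headers: { Authorization: `Bearer ${token}` },
  };
}

// Minimal structural type so the sketch does not depend on DOM typings.
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

async function fetchSourceHtml(
  fetchImpl: FetchLike,
  baseUrl: string,
  fileId: string,
  token: string,
): Promise<string> {
  const { url, headers } = buildSourceHtmlRequest(baseUrl, fileId, token);
  const res = await fetchImpl(url, { headers });
  if (!res.ok) {
    throw new Error(`source.html request failed with HTTP ${res.status}`);
  }
  return res.text();
}
```

On Node 18 or later the global fetch can be passed as fetchImpl. Keeping it injectable lets the helper be exercised in tests without a network call.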

Option 2 — Direct from R2 (admin)

Build the path with the shared helper:

import { R2_PATHS } from '@accessible-pdf/shared';
const key = R2_PATHS.sourceHtml(userId, fileId);
// users/{userId}/output/{fileId}/source.html

Then pull via the AWS S3 client pointed at the R2 endpoint:

ssh -i ~/.ssh/nightly-audit {sshUser}@10.1.1.4 \
  "AWS_ACCESS_KEY_ID=$R2_ACCESS_KEY_ID \
   AWS_SECRET_ACCESS_KEY=$R2_SECRET_ACCESS_KEY \
   aws s3 cp \
     s3://accessible-pdf-files/users/{userId}/output/{fileId}/source.html \
     /tmp/source.html \
     --endpoint-url https://c6cce84d1636ec85ec946a19edef0103.r2.cloudflarestorage.com"

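R2 credentials live in /home/larry/accessible/.env.node-server on 10.1.1.4. Because the remote command is double-quoted, $R2_ACCESS_KEY_ID and $R2_SECRET_ACCESS_KEY are expanded by your local shell before ssh runs, so they must be set locally. The file is written to /tmp/source.html on the remote host, not on your machine.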

Option 3 — From a recent conversion (quick check)

List artifacts for any file:

aws s3 ls \
  s3://accessible-pdf-files/users/{userId}/output/{fileId}/ \
  --endpoint-url https://{accountId}.r2.cloudflarestorage.com

Expected listing: accessible.pdf, cost-analysis.html, index.html, ir.xhtml, source.html, styled.html, plus per-conversion images under assets/.

If source.html is missing for a recent conversion, file an issue — its absence is the bug.
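The listing check can be automated with a small helper like the sketch below. The expected file names are the ones listed above. The function name and the parsing rule (the object name is the last whitespace-separated field of each `aws s3 ls` line) are assumptions, and the rule only holds for names without spaces. The sample listing is made up for illustration.

```typescript
// Expected file names come from the listing described above. Parsing is an
// assumption: the object name is taken as the last whitespace-separated field.
const EXPECTED_ARTIFACTS: string[] = [
  "accessible.pdf",
  "cost-analysis.html",
  "index.html",
  "ir.xhtml",
  "source.html",
  "styled.html",
];

/** Return the expected artifacts that do not appear in `aws s3 ls` output. */
function missingArtifacts(lsOutput: string): string[] {
  const present: string[] = [];
  for (const line of lsOutput.split("\n")) {
    const fields = line.trim().split(/\s+/);
    const name = fields[fields.length - 1];
    if (name) present.push(name);
  }
  return EXPECTED_ARTIFACTS.filter((name) => present.indexOf(name) === -1);
}

// Made-up sample listing, shaped like `aws s3 ls` output.
const sample = [
  "                           PRE assets/",
  "2025-01-01 10:00:00      48213 index.html",
  "2025-01-01 10:00:00      51877 ir.xhtml",
].join("\n");

console.log(missingArtifacts(sample));
// [ 'accessible.pdf', 'cost-analysis.html', 'source.html', 'styled.html' ]
```

If source.html appears in the result for a recent conversion, that is the case to report.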

How to read it

source.html is what Claude / Mathpix / Marker actually emitted, not a debug rendering. It is valid HTML and can be opened directly in a browser. Diff it against index.html to see what each post-processing pass changed:

diff <(curl -s .../source.html) <(curl -s .../index.html) | less

Common diagnoses you can read directly from source.html:

| Symptom in `index.html` | What to check in `source.html` |
| --- | --- |
| `<code class="math-fallback">` wrapping a math element | Did the converter emit `<math>raw LaTeX</math>` (validator wrapped it) or bare `\sum…\binom…` text in a `<p>` (wrapBareLatex wrapped it)? |
| Image at the wrong location or missing | What `src` attribute does the source `<img>` use? Does the file actually exist in `uploads/{fileId}/images/`? |
| Table rendered as a list of `<p>` paragraphs | Did the converter emit a `<table>` at all, or did it flatten the layout to flowing text? |
| Math renders wrong only in the PDF | Inspect the converter’s MathML shape in `source.html` (Claude’s output) and compare it with `index.html`. Quirky attributes like `tml-med-pad` can survive the HTML but trip the Node-side prerender. |
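Before reading a full diff, it can help to triage by counting the elements these diagnoses revolve around in both files. The sketch below is one way to do that. All names are hypothetical, and it counts opening tags with regular expressions rather than a real HTML parser, so treat the numbers as hints only.

```typescript
// Rough triage sketch: regex-based tag counting, not a real HTML parser.
interface MarkerCounts {
  math: number;
  mathFallback: number;
  table: number;
  img: number;
}

function countMatches(html: string, pattern: RegExp): number {
  const found = html.match(pattern);
  return found === null ? 0 : found.length;
}

function countMarkers(html: string): MarkerCounts {
  return {
    math: countMatches(html, /<math[\s>]/gi),
    mathFallback: countMatches(html, /<code[^>]*class="[^"]*\bmath-fallback\b[^"]*"/gi),
    table: countMatches(html, /<table[\s>]/gi),
    img: countMatches(html, /<img[\s>\/]/gi),
  };
}

/** List the markers whose counts differ between source.html and index.html. */
function diffMarkers(source: MarkerCounts, index: MarkerCounts): string[] {
  const keys: (keyof MarkerCounts)[] = ["math", "mathFallback", "table", "img"];
  const changes: string[] = [];
  for (const key of keys) {
    if (source[key] !== index[key]) {
      changes.push(`${key}: ${source[key]} -> ${index[key]}`);
    }
  }
  return changes;
}

const sourceHtml = "<p>\\sum \\binom</p><math><mi>x</mi></math><table></table>";
const indexHtml =
  '<p><code class="math-fallback">\\sum \\binom</code></p><math><mi>x</mi></math><table></table>';
console.log(diffMarkers(countMarkers(sourceHtml), countMarkers(indexHtml)));
// [ 'mathFallback: 0 -> 1' ]
```

A math-fallback count that rises while the `<math>` count stays flat points toward bare LaTeX being wrapped. A `<table>` count of zero in source.html means the converter never emitted a table in the first place.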

When the converter pipeline isn’t chunked-vision

source.html is written by every conversion path that goes through storeIrAndHtml — including chunk-assembler (chunked-vision) and the struct-table route (#525, fixed in #529). Mathpix and pure-Marker conversions that take a different storage path may not produce a source.html. If you’re investigating a non-vision conversion and there’s no source.html, that’s a coverage gap to file as an issue.

Don’t expose source.html to end users

source.html is a developer/admin artifact. It contains the unsanitised converter output (including any prompt-injected content if the source PDF was hostile). The export route requires authentication and ownership, but the artifact itself isn’t styled for users — keep it scoped to debugging tools.

#525 — original tracking issue (“persist pre-validator HTML for post-mortem debugging”)
#529 — closed coverage gap in struct-table conversion path
#527 — equation rendering resilience (used source.html to diagnose Claude vision variation)
#530, #531, #532 — image and table issues diagnosed from the source.html for file 11c5124a-6048-4390-9bd3-e93affa0f7fd