PDF Remediation Pipeline & Conformance Report
How the platform takes raw HTML / source documents, produces a tagged PDF, and reports independent and self-attested conformance against the relevant accessibility standards.
Companion to pdf-accessibility-score.md,
which covers the user-facing 0–100 score in depth. This doc is the
operator/engineer view of the whole pipeline.
1. Standards in scope
| Standard | Role | How we satisfy it |
|---|---|---|
| WCAG 2.1 AA | The legal baseline (ADA Title II 2026-07663, Section 508, EU EN 301 549) | Structural checks + content cleanup. Each check is mapped to the SC(s) it supports. |
| PDF/UA-1 (ISO 14289-1) | Standard implementation recipe for applying WCAG to PDFs | XMP pdfuaid:part="1" claim + structure tree + tags, validated by veraPDF |
| Matterhorn Protocol 1.1 | The 136-failure-condition checklist that operationalizes PDF/UA-1 | Machine-checkable conditions are caught by veraPDF; the ~20% human conditions (alt-text correctness, reading-order sense) are out of scope for any automated scorer |
Bottom line: WCAG 2.1 AA is what the law requires. PDF/UA-1 is how we satisfy it at the PDF format level. Matterhorn is the test plan that proves PDF/UA-1 conformance.
2. The pipeline (HTML → conformant PDF)
    source (PDF / DOCX / image / etc.)
            │
            ▼
    conversion cascade                    produces semantic HTML
            │
            ▼
    weasyprint-generator.ts               HTML → tagged PDF
            │
            ▼
    pdf-accessibility-postprocessor.ts    11 PDF/UA fixes
            │
            ▼
    verapdf-client.ts                     ISO 14289-1 reference validator (soft-fail)
            │
            ▼
    pdf-accessibility-scorer.ts           16 structural checks → 0–100
            │
            ▼
    R2 / S3 + DB row (files.accessible_pdf_*)

2.1 WeasyPrint (workers/api/src/services/weasyprint-generator.ts)
WeasyPrint is the only HTML-to-PDF engine on the market that produces
correctly tagged output by default. Chromium's tagged-PDF mode wraps
unrecognised elements in <NonStruct>, which fails PDF/UA. WeasyPrint
maps semantic HTML (h1–h6, p, ul, ol, table, figure,
section, article, main, nav) to the correct PDF structure
types directly.
What WeasyPrint doesn't do:
- Set /DisplayDocTitle in ViewerPreferences
- Set /Tabs /S on every page (structure-based tab order)
- Generate a document outline (bookmarks)
- Inject XMP with the PDF/UA-1 claim
- Mark out-of-tag artifacts (page numbers, headers/footers) explicitly
- Add /Contents to link annotations
- Inject /Alt on /Figure structure elements
- Inject /Scope on /TH cells
- Normalize bullet labels to a Unicode bullet
All of those land on the post-processor.
The WeasyPrint sidecar runs at weasyprint:5001 on the pdf-net
Docker network and is built from services/weasyprint/.
2.2 Post-processor (pdf-accessibility-postprocessor.ts)
Eleven discrete fixes, all using pdf-lib to mutate the document
object graph. Together they push the structural score from ~35% (raw
WeasyPrint output) to 80%+. They mirror the fixups Adobe Acrobat's
accessibility preflight applies.
| # | Fix | Why it matters |
|---|---|---|
| 1 | /Title on document info dictionary | Required by PDF/UA; assistive tech reads it as the document name |
| 2 | /Lang on document catalog | Required by PDF/UA; sets pronunciation for screen readers |
| 3 | /DisplayDocTitle true in ViewerPreferences | Tells PDF readers to show the title instead of the filename |
| 4 | /Tabs /S on every page | Tab key follows structure order, not annotation order |
| 5 | XMP metadata (dc:title, dc:language, pdfuaid:part="1", producer) | The conformance claim itself; ISO 16684-1 metadata |
| 6 | Bookmarks from heading structure | Required for documents over a few pages (WCAG 2.4.5 Multiple Ways) |
| 7 | Mark untagged content as /Artifact | Page numbers, headers, footers stay out of reading order |
| 8 | Normalize list bullet labels | Replaces engine-specific glyphs with the Unicode bullet |
| 9 | /Contents on link annotations | Acrobat fixup #3; gives screen readers the link's spoken name |
| 10 | /Alt on /Figure structure elements | Pulls alt text from source HTML and attaches it to the PDF figure |
| 11 | /Scope on /TH table headers | Marks each header as /Row or /Column (Matterhorn 15-003) |
Important caveat: the post-processor asserts PDF/UA-1 in XMP. It does not prove it. That's veraPDF's job (next step).
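As an illustration of fix 6 in the table above, the nesting logic behind bookmarks can be sketched without touching pdf-lib. This is a minimal sketch, assuming the headings have already been extracted as a flat list in document order; `Heading`, `OutlineNode`, and `buildOutline` are hypothetical names, not the post-processor's actual API.

```typescript
// Hypothetical types for illustration; the real post-processor may differ.
interface Heading { level: number; title: string }
interface OutlineNode { title: string; level: number; children: OutlineNode[] }

// Nest a flat, document-ordered heading list into an outline tree.
// Each heading becomes a child of the nearest preceding heading with a lower level.
function buildOutline(headings: Heading[]): OutlineNode[] {
  const roots: OutlineNode[] = [];
  const stack: OutlineNode[] = []; // open ancestors, strictly increasing level
  for (const h of headings) {
    const node: OutlineNode = { title: h.title, level: h.level, children: [] };
    // Close every open node at the same or a deeper level.
    while (stack.length > 0 && stack[stack.length - 1].level >= h.level) {
      stack.pop();
    }
    if (stack.length === 0) {
      roots.push(node);
    } else {
      stack[stack.length - 1].children.push(node);
    }
    stack.push(node);
  }
  return roots;
}

// Made-up sample input: two top-level sections, the first with two subsections.
const outline = buildOutline([
  { level: 1, title: "Introduction" },
  { level: 2, title: "Scope" },
  { level: 2, title: "Terms" },
  { level: 1, title: "Results" },
]);
```

A skipped level (an h3 directly under an h1) still nests under the nearest shallower heading, so the outline stays navigable even when heading_hierarchy would flag the source.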
2.3 veraPDF (services/verapdf/, workers/api/src/services/verapdf-client.ts), issue #506
veraPDF is the ISO reference implementation of PDF/UA-1 (and PDF/A) validation. It runs every machine-checkable Matterhorn condition against the bytes and returns a JSON report.
Architecture:
- Sidecar: services/verapdf/{Dockerfile, server.py}: Flask process wrapping the verapdf CLI on top of the upstream verapdf/cli image. Listens on container-internal port 5002. Reachable from the API as http://verapdf:5002 over pdf-net.
- Client: validatePdfUA1(pdf: Uint8Array): Promise<VerapdfResult>. Streams the PDF body to /validate?flavour=ua1, parses the report, returns { passed, failedRules[], durationMs }.
- Wiring: called in all four export call sites between postProcessAccessiblePdf and scorePdfAccessibility: server.ts, index-aws.ts, routes/convert.ts, scheduler/chunk-scheduler.ts.
Soft-fail by design. A veraPDF outage, timeout, or unexpected
report shape MUST NOT block PDF delivery. The catch logs [verapdf] validation skipped: … and the export proceeds. This is the launch
policy (issue #506): promote to hard-fail only after the failure rate
stabilizes near zero.
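The soft-fail policy can be sketched as a thin wrapper around the validator. This is a sketch under stated assumptions, not the code in the four call sites: `validateSoftFail` and the injected `validate` and `log` parameters are hypothetical, and only the result shape and the log prefix come from the text above.

```typescript
// Result shape as described for the client above.
interface VerapdfResult {
  passed: boolean;
  failedRules: unknown[];
  durationMs: number;
}

type Validator = (pdf: Uint8Array) => Promise<VerapdfResult>;

// Hypothetical wrapper: returns null instead of throwing, so the export
// can continue to scoring and delivery when veraPDF is unavailable.
async function validateSoftFail(
  validate: Validator,
  pdf: Uint8Array,
  log: (msg: string) => void = console.warn,
): Promise<VerapdfResult | null> {
  try {
    const result = await validate(pdf);
    // An unexpected report shape is treated as a skip, not as a failure.
    if (typeof result?.passed !== "boolean" || !Array.isArray(result.failedRules)) {
      throw new Error("unexpected report shape");
    }
    return result;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    log(`[verapdf] validation skipped: ${reason}`);
    return null;
  }
}
```

Returning null rather than a synthetic "failed" result keeps a sidecar outage distinguishable from a genuine PDF/UA failure, which matters once the two rates are tracked separately.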
veraPDF only runs where the sidecar is reachable:
| Runtime | Reachable? | Behavior |
|---|---|---|
| Node fleet on 10.1.1.4 (api-node-1, api-node-2, batch-worker) | Yes, same Docker network | Validates and persists summary |
| AWS Lambda (index-aws.ts → api-pdf.theaccessible.org) | No | Soft-skip; no harm. Lambda almost never runs heavy export anyway |
| EC2 spot fleet | Currently desired=0 | When re-enabled, AMI compose file must include the verapdf service |
2.4 Structural scorer (pdf-accessibility-scorer.ts)
Runs 16 byte-level structural checks (no AI, no rendering). Each check
is mapped to the WCAG 2.1 AA success criteria it supports. The result is
a 0–100 score with a banded display. The full check list is documented in
pdf-accessibility-score.md. The scorer is independent of veraPDF, so the
two cross-check each other.
3. What we test for, mapped to standards
3.1 Tier 0: Structural presence (always run, "quick" mode)
| Check | Asks | Fail mode |
|---|---|---|
| extractable_text | Does the PDF contain real text, not just rasterized images? | Scanned PDFs, image-only exports |
| tag_structure | Is there a /StructTreeRoot? | Untagged PDFs |
| image_alt_text | Do /Figure elements have /Alt? | Decorative-only PDFs, missing alt |
| document_language | /Lang on the catalog? | Missing language |
| document_title | /Title in the info dict? | Engine left it as "untitled" |
| table_headers | Are /TH elements present where /Table exists? | Tables that use /TD for headers |
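To make "byte-level" concrete, here is a deliberately naive sketch of two of the presence checks in the table above. It assumes the relevant dictionaries are stored uncompressed; names inside compressed object streams would be invisible to it, and /Lang can also appear on structure elements rather than the catalog. `quickPresenceChecks` and `bytesToString` are hypothetical helpers, not the scorer's real functions.

```typescript
// Map each byte to one character so ASCII PDF name tokens can be searched.
function bytesToString(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

interface PresenceResult {
  tag_structure: boolean;
  document_language: boolean;
}

// Hypothetical pattern-matching version of two Tier 0 checks.
function quickPresenceChecks(bytes: Uint8Array): PresenceResult {
  const text = bytesToString(bytes);
  return {
    // A structure tree root is referenced somewhere in the file.
    tag_structure: /\/StructTreeRoot\b/.test(text),
    // /Lang followed by a literal "(...)" or hex "<...>" string.
    document_language: /\/Lang\s*[(<]/.test(text),
  };
}

// Made-up fragments standing in for real PDF bytes.
const encode = (s: string): Uint8Array => new TextEncoder().encode(s);
const tagged = quickPresenceChecks(
  encode("%PDF-1.7\n1 0 obj\n<< /Type /Catalog /Lang (en-US) /StructTreeRoot 2 0 R >>\nendobj"),
);
const untagged = quickPresenceChecks(
  encode("%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj"),
);
```

Checks of this kind are cheap enough to run on every export, which is the trade-off the quick mode makes against veraPDF's full parse.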
3.2 Tier 1 + 2: Correctness + PDF/UA (full mode adds 10 more)
heading_hierarchy, reading_order, table_header_scope,
list_structure, link_annotations, pdfua_identifier, tab_order,
bookmarks, artifact_marking, display_doc_title.
See WCAG_CRITERIA_MAP in pdf-accessibility-scorer.ts for the SC
mapping. See pdf-accessibility-score.md for the per-check rubric and
deductions.
3.3 What veraPDF adds on top
veraPDF runs the machine-checkable subset of Matterhorn 1.1, about 80% of the 136 failure conditions. Examples our scorer doesn't specifically check but veraPDF will:
- 09-001..09-008: Structure type mapping integrity (every used type must be defined in /RoleMap or be a standard type)
- 02-001: Document permissions don't suppress assistive tech
- 06-002..06-004: Embedded font character mapping completeness
- 14-001..14-007: Rich annotation requirements
- 19-001..19-006: PDF version + structure tree consistency rules
- ISO 32000-1 syntax conformance that PDF/UA inherits
Where the two disagree, veraPDF is authoritative.
3.4 What no automated tool can check
Roughly 20% of Matterhorn is "human-only", that is, judgement-based:
- Is the alt text actually descriptive of the image's meaning?
- Does the reading order make sense to a human reader?
- Are decorative images correctly marked decorative (not load-bearing ones marked decorative to silence the validator)?
- Are equations spoken the way a sighted reader sees them?
- Is the tagging of complex multi-column or multi-table layouts semantically right, even if structurally valid?
veraPDF flags these as "needs human review" rather than pass/fail. For documents that need to defend a conformance claim (e.g., legal filings, course materials shipped to LMS), a human pass is still required. The platform accelerates that pass; it doesn't replace it.
4. Interpreting the conformance report
The dashboard accessibility report (apps/web preview / pipeline /
report pages, plus the LTI course-scanner) shows:
- Score (0–100) + band: from the structural scorer
- 16 individual check results: each with pass/fail, deduction, and the WCAG SCs it maps to
- veraPDF summary (when available): passed/failed-rule count, plus the failed-rule details (clause, test number, description, occurrences)
4.1 The four interpretation cases
| Score | veraPDF | Meaning | Action |
|---|---|---|---|
| ≥ 80 | passed | Strong evidence of conformance. Both independent checks agree. | Ship. Manual sample review for high-stakes docs. |
| ≥ 80 | failed | Score over-reports. veraPDF found PDF/UA violations our scorer doesn't catch (missing role-mapping, font-cmap issue, etc.). | Investigate failed rules. Often a post-processor bug or an upstream HTML quirk. |
| < 80 | passed | Rare. Usually means the post-processor did the structural minimum but the HTML lacked content (no alt text, no headings); veraPDF doesn't grade content quality, just byte conformance. | Improve source HTML; rerun. |
| < 80 | failed | Both agree the document isn't ready. Low-effort wins are usually in image_alt_text, heading_hierarchy, bookmarks. | Fix post-processor failures first; rerun veraPDF. |
| any | skipped | Sidecar unreachable or timed out. Score still valid. | Check [verapdf] log line; restart verapdf container if down. |
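The table above reduces to a small decision function. The sketch below encodes it directly; the `Action` labels and the `interpret` name are hypothetical shorthand for the Action column, and the threshold of 80 is the one the table uses.

```typescript
type VerapdfStatus = "passed" | "failed" | "skipped";

// Hypothetical labels summarizing the Action column of the table.
type Action =
  | "ship"
  | "investigate-failed-rules"
  | "improve-source-html"
  | "fix-postprocessor-and-rerun"
  | "check-sidecar";

function interpret(score: number, verapdf: VerapdfStatus): Action {
  // A skipped validation says nothing about conformance; the score still stands.
  if (verapdf === "skipped") return "check-sidecar";
  const high = score >= 80;
  if (verapdf === "passed") {
    return high ? "ship" : "improve-source-html";
  }
  return high ? "investigate-failed-rules" : "fix-postprocessor-and-rerun";
}
```

For example, `interpret(87, "failed")` yields "investigate-failed-rules": the score over-reports and the veraPDF rule list is the place to look.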
4.2 Reading veraPDF failed rules
Each failed rule has:
- clause: the spec section, e.g. ISO 14289-1
- testNumber: Matterhorn test ID, e.g. 7.1-2
- description: human-readable rule
- occurrences: how many times this rule fired in the document
Map test numbers to Matterhorn checkpoints via the
Matterhorn Protocol 1.1 PDF.
The clause group (7.1, 7.18, etc.) matches the ISO 14289-1 section
that defines the requirement.
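When triaging a long failed-rule list, it can help to total occurrences per clause group first. The sketch below assumes testNumber is a string in the "7.1-2" form shown above and that the clause group is the part before the final hyphen; `FailedRule`, `clauseGroup`, and `occurrencesByClauseGroup` are hypothetical names, and the sample rules are made up.

```typescript
// Field names follow the list above; the exact types are an assumption.
interface FailedRule {
  clause: string;
  testNumber: string;
  description: string;
  occurrences: number;
}

// "7.1-2" -> "7.1": strip the trailing test index.
function clauseGroup(testNumber: string): string {
  const i = testNumber.lastIndexOf("-");
  return i === -1 ? testNumber : testNumber.slice(0, i);
}

// Sum occurrences per clause group so the noisiest section stands out.
function occurrencesByClauseGroup(rules: FailedRule[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const rule of rules) {
    const group = clauseGroup(rule.testNumber);
    totals[group] = (totals[group] ?? 0) + rule.occurrences;
  }
  return totals;
}

// Illustrative input only; descriptions are placeholders.
const totals = occurrencesByClauseGroup([
  { clause: "ISO 14289-1", testNumber: "7.1-2", description: "example rule A", occurrences: 3 },
  { clause: "ISO 14289-1", testNumber: "7.1-5", description: "example rule B", occurrences: 2 },
  { clause: "ISO 14289-1", testNumber: "7.18-1", description: "example rule C", occurrences: 1 },
]);
```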
4.3 Headers exposed by HTTP exports
The two HTTP-streaming endpoints (/html-to-pdf on server.ts and on
index-aws.ts) expose results as response headers, since there's no
DB row to attach to:
- X-PDF-Accessibility-Score: 87
- X-PDF-Accessibility-Band: good
- X-PDF-Accessibility-Details: <base64 JSON of check array>
- X-PDF-Verapdf-Passed: true|false (only when veraPDF ran)
- X-PDF-Verapdf-Failed-Rules: 0 (only when veraPDF ran)
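A sketch of how these headers could be assembled follows. The header names are the ones listed above; `buildAccessibilityHeaders`, `toBase64Json`, and the `CheckResult` shape are assumptions for illustration, and the veraPDF headers are added only when a summary is present.

```typescript
// Hypothetical check shape; the real one lives in the shared types.
interface CheckResult { id: string; passed: boolean; deduction: number }
interface VerapdfSummary { passed: boolean; failedRules: unknown[] }

// Base64-encode JSON via UTF-8 bytes so non-ASCII text survives btoa.
function toBase64Json(value: unknown): string {
  const bytes = new TextEncoder().encode(JSON.stringify(value));
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

function buildAccessibilityHeaders(
  score: number,
  band: string,
  checks: CheckResult[],
  verapdf: VerapdfSummary | null,
): Record<string, string> {
  const headers: Record<string, string> = {
    "X-PDF-Accessibility-Score": String(score),
    "X-PDF-Accessibility-Band": band,
    "X-PDF-Accessibility-Details": toBase64Json(checks),
  };
  // Omit the veraPDF headers entirely when validation was skipped.
  if (verapdf !== null) {
    headers["X-PDF-Verapdf-Passed"] = String(verapdf.passed);
    headers["X-PDF-Verapdf-Failed-Rules"] = String(verapdf.failedRules.length);
  }
  return headers;
}

const sampleChecks: CheckResult[] = [{ id: "tag_structure", passed: true, deduction: 0 }];
const withVerapdf = buildAccessibilityHeaders(87, "good", sampleChecks, { passed: true, failedRules: [] });
const withoutVerapdf = buildAccessibilityHeaders(87, "good", sampleChecks, null);
```

A client can tell "veraPDF failed" from "veraPDF did not run" by checking for the presence of X-PDF-Verapdf-Passed rather than its value.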
4.4 Persisted columns
The async paths (routes/convert.ts, scheduler/chunk-scheduler.ts)
write to public.files:
- accessible_pdf_score: integer
- accessible_pdf_score_details: JSONB array of check results
- accessible_pdf_verapdf: JSONB { passed, failedRules[], durationMs }

These power the dashboard's accessibility-report rendering.
5. Operational notes
5.1 When veraPDF goes down
Symptoms: [verapdf] validation skipped: … log lines on every
export, accessible_pdf_verapdf stays NULL on new files. Score is
unaffected; exports continue. Fix:

    cd ~/accessible
    docker compose ps verapdf
    docker compose logs --tail 100 verapdf
    docker compose restart verapdf   # or up -d --force-recreate verapdf

5.2 When the score and veraPDF disagree
The scorer is a fast, cheap heuristic: pattern matching against PDF bytes. veraPDF is the slow, correct, ISO reference. When they disagree, the scorer is wrong (in the precision sense). Open an issue with both reports attached so we can either tighten the structural check or accept the divergence as a known limitation of byte-pattern scoring.
5.3 Promoting veraPDF from soft-fail to hard-fail
Per issue #506, the launch plan is:
- Ship soft-fail (current state)
- Watch the failure rate for ~1 week: both veraPDF outage rate AND PDF/UA failure rate among successful runs
- Once the failure-rate metric stabilizes near zero, promote to hard-fail by changing the soft-skip catch into an accessiblePdfStatus = 'failed' write
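The two rates named in the second step use different denominators, which is easy to get wrong. A minimal sketch, assuming each export is recorded as one of three outcomes; `Outcome`, `Rates`, and `verapdfRates` are hypothetical names and the sample week is made up.

```typescript
type Outcome = "passed" | "failed" | "skipped";

interface Rates {
  outageRate: number;       // skipped / all exports
  pdfuaFailureRate: number; // failed / exports where veraPDF actually ran
}

function verapdfRates(outcomes: Outcome[]): Rates {
  const total = outcomes.length;
  const skipped = outcomes.filter((o) => o === "skipped").length;
  const failed = outcomes.filter((o) => o === "failed").length;
  const ran = total - skipped;
  return {
    outageRate: total === 0 ? 0 : skipped / total,
    // Skipped runs are excluded so an outage does not mask real failures.
    pdfuaFailureRate: ran === 0 ? 0 : failed / ran,
  };
}

// Illustrative sample: 8 passed, 1 failed, 1 skipped.
const sampleOutcomes: Outcome[] = [
  "passed", "passed", "passed", "passed",
  "passed", "passed", "passed", "passed",
  "failed", "skipped",
];
const rates = verapdfRates(sampleOutcomes);
```

With this sample the outage rate is 1/10 and the PDF/UA failure rate is 1/9. What counts as "near zero" for either rate is a policy decision the text leaves open.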
5.4 What the platform deliberately does NOT do
- We don't score the visual fidelity of the PDF here
- We don't run a Matterhorn human-pass; that's a service offering, not an automated check
- We don't validate PDF/UA-2 (veraPDF can't yet, and our XMP claims PDF/UA-1)
- We don't validate PDF/A flavours by default (the sidecar supports it, via flavour=1b etc., but no caller enables that)
6. Source-of-truth pointers
| Topic | File |
|---|---|
| WeasyPrint client | workers/api/src/services/weasyprint-generator.ts |
| Post-processor | workers/api/src/services/pdf-accessibility-postprocessor.ts |
| Structural scorer | workers/api/src/services/pdf-accessibility-scorer.ts |
| veraPDF client | workers/api/src/services/verapdf-client.ts |
| veraPDF sidecar | services/verapdf/{Dockerfile,server.py,requirements.txt} |
| Score → WCAG mapping | WCAG_CRITERIA_MAP in scorer |
| Shared types | packages/shared/src/types.ts (AccessibilityScore, VerapdfScoreSummary) |
| Compose | docker-compose.yml (services weasyprint, verapdf) |
| DB columns | migration 20260403_059_accessible_pdf_export.sql (score) + 20260505_096_files_accessible_pdf_verapdf.sql (verapdf summary) |