PDF Accessibility Score

This document explains the accessibility score shown in apps/web (e.g. on the preview, pipeline, report, and LTI course-scanner pages) for every PDF the app ingests or produces.

Compliance framing — WCAG 2.1 AA is the baseline, not PDF/UA

The laws in scope do not all cite the same WCAG version. ADA Title II (DOJ rule 2026-07663) and EU EN 301 549 adopt WCAG 2.1 Level AA as the normative technical standard for web content and PDFs; the Section 508 Revised Standards incorporate WCAG 2.0 Level AA, which WCAG 2.1 AA conformance also satisfies because 2.1 is backward-compatible with 2.0. PDF/UA-1 (ISO 14289-1) and the Matterhorn Protocol 1.1 are the standard implementation / validation checklists for applying WCAG to PDFs, but they are not themselves what the law requires.

Our scorer reflects this:

The headline score and each of the 16 structural checks are labeled with the WCAG 2.1 AA success criteria they support (see WCAG_CRITERIA_MAP in pdf-accessibility-scorer.ts).
The underlying byte-level checks derive from PDF/UA-1 / Matterhorn, which is how those SCs are satisfied at the PDF-format level.
Passing a structural check is evidence toward — not proof of — SC conformance. We do not stamp a PDF/UA-1 claim just because the structural checks pass; that needs a Matterhorn-complete validator (veraPDF / PAC 2024).

1. What the score is

A single integer 0–100 representing how well the PDF meets a set of structural accessibility checks. It is a structural, PDF/UA-oriented score, not a full WCAG page-content audit.

Implementation: workers/api/src/services/pdf-accessibility-scorer.ts (function scorePdfAccessibility)
Shared types / banding: packages/shared/src/types.ts (AccessibilityScore, AccessibilityCheckResult, getScoreBand)
No AI, no rendering. The scorer reads raw PDF bytes as Latin-1, counts dictionary patterns (/StructTreeRoot, /S /Figure, /MCID …, etc.) and deducts points per failing check.
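
As a rough illustration of the byte-level approach described above, the sketch below decodes bytes as Latin-1 (one character per byte) and counts literal occurrences of a pattern. `decodeLatin1` and `countOccurrences` are hypothetical names, not the scorer's actual helpers, and exact-string matching is an assumption; the real scorer may tolerate whitespace variations such as `/S/Figure`.

```typescript
// Hypothetical helpers for illustration; not the scorer's real internals.

/** Decode bytes as Latin-1: every byte becomes the code point of the same value. */
function decodeLatin1(bytes: Uint8Array): string {
  const chars: string[] = [];
  bytes.forEach((byte) => {
    chars.push(String.fromCharCode(byte));
  });
  return chars.join("");
}

/** Count non-overlapping literal occurrences of `needle` in `haystack`. */
function countOccurrences(haystack: string, needle: string): number {
  if (needle.length === 0) return 0; // avoid an infinite loop on empty needles
  let count = 0;
  let index = haystack.indexOf(needle);
  while (index !== -1) {
    count++;
    index = haystack.indexOf(needle, index + needle.length);
  }
  return count;
}

// Usage: a tiny fake PDF fragment encoded as bytes.
const fragment =
  "<< /StructTreeRoot 5 0 R >> << /S /Figure >> << /S /Figure /Alt (Logo) >>";
const fragmentBytes = new Uint8Array(fragment.length);
for (let i = 0; i < fragment.length; i++) {
  fragmentBytes[i] = fragment.charCodeAt(i);
}
const fragmentText = decodeLatin1(fragmentBytes);
const figureCount = countOccurrences(fragmentText, "/S /Figure");
const hasStructTree = countOccurrences(fragmentText, "/StructTreeRoot") > 0;
```

Because Latin-1 maps each byte to exactly one character, string offsets in the decoded text equal byte offsets in the file, which keeps any byte-based window or limit easy to reason about.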

Score bands (`getScoreBand`)

Range	Band	Meaning
0–33	`red`	Not usable with assistive tech
34–66	`orange`	Partial — major gaps
67–99	`green`	Good — minor gaps
100	`dark-green`	All structural checks pass

The 50% scores you typically see sit in the orange band — the PDF has text but is missing most of the PDF/UA structural metadata.
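
The band table above can be read as a simple threshold function. This is a minimal sketch that mirrors the table only; `scoreBandFor` and `ScoreBand` are hypothetical names, and the real `getScoreBand` in packages/shared/src/types.ts may differ in signature or edge-case handling.

```typescript
// Hypothetical re-statement of the band table; not the shared package's code.
type ScoreBand = "red" | "orange" | "green" | "dark-green";

function scoreBandFor(score: number): ScoreBand {
  if (score >= 100) return "dark-green"; // all structural checks pass
  if (score >= 67) return "green";       // good, minor gaps
  if (score >= 34) return "orange";      // partial, major gaps
  return "red";                          // not usable with assistive tech
}
```

With these thresholds a score of 50 lands in `orange`, and only a perfect 100 reaches `dark-green`.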

Two scoring modes

scorePdfAccessibility(bytes, { mode }) runs either:

quick — 6 structural presence checks (Tier 0). Used for freemium intake and LTI course scans (workers/api/src/routes/lti-course.ts:157).
full — all 16 checks (Tier 0 + Tier 1 correctness + Tier 2 PDF/UA). Used on generated PDFs before export in workers/api/src/routes/convert.ts:281 and workers/api/src/scheduler/chunk-scheduler.ts:806.

A source PDF scored with quick and the same PDF scored with full can produce different numbers. Full mode runs every quick-mode check plus ten more, so the full score is never higher than the quick score and is usually lower.

2. How it is calculated

score = max(0, 100 − sum(deductions from each failing check))

Each check has a weight (the maximum it can deduct) and a deduction (what it actually took off this run). Checks that don’t apply (e.g. no tables, no images) pass for free with deduction 0.
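
The formula and the weight/deduction distinction can be sketched as follows. `CheckOutcome` and `computeScore` are hypothetical names, and clamping each deduction to its weight is an assumption drawn from the description of weight as a maximum; the real `AccessibilityCheckResult` type may carry different fields.

```typescript
// Hypothetical shape; the real AccessibilityCheckResult may carry more fields.
interface CheckOutcome {
  id: string;
  weight: number;    // maximum this check can deduct
  deduction: number; // what it actually deducted (0 when passing or not applicable)
}

function computeScore(outcomes: CheckOutcome[]): number {
  const totalDeduction = outcomes.reduce(
    // Clamp to [0, weight] so no check deducts more than its weight (assumption).
    (sum, outcome) => sum + Math.min(Math.max(outcome.deduction, 0), outcome.weight),
    0,
  );
  return Math.max(0, 100 - totalDeduction); // floor at 0
}

// Usage: a scanned PDF with no text layer and no tags.
const scannedPdfOutcomes: CheckOutcome[] = [
  { id: "extractable_text", weight: 40, deduction: 40 },
  { id: "tag_structure", weight: 20, deduction: 20 },
  { id: "table_headers", weight: 10, deduction: 0 }, // no tables: passes for free
];
const scannedPdfScore = computeScore(scannedPdfOutcomes); // 100 - 60 = 40
```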

Tier 0 — Structural presence (always run, both modes)

#	Check ID	Weight	Deduction rule
1	`extractable_text`	40	−40 if fewer than 3 `Tj`/`TJ` text-showing operators found (scanned/image-only PDF)
2	`tag_structure`	20	−20 if `/MarkInfo` or `/StructTreeRoot` is missing
3	`image_alt_text`	30	−15 per image missing `/Alt` inside a `/S /Figure` region, capped at −30
4	`document_language`	10	−10 if no `/Lang` entry (or value shorter than 2 chars)
5	`document_title`	5	−5 if `/Title` missing or empty in the info dictionary
6	`table_headers`	10	−10 if tables exist (`/S /Table`) but no `/S /TH` header cells anywhere

Max Tier-0 deduction: 115. Because the score is floored at 0, failing just the first two (no text + no tags) already pins the score to ≤40. This is why scanned PDFs score so low.
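
The Tier-0 arithmetic above can be checked with a few lines. The weights are copied from the table; `imageAltTextDeduction` is a hypothetical helper that restates the "15 per image, capped at 30" rule and is not the scorer's actual function.

```typescript
// Weights copied from the Tier-0 table above.
const TIER0_CHECKS = [
  { id: "extractable_text", weight: 40 },
  { id: "tag_structure", weight: 20 },
  { id: "image_alt_text", weight: 30 },
  { id: "document_language", weight: 10 },
  { id: "document_title", weight: 5 },
  { id: "table_headers", weight: 10 },
];

// Sum of all Tier-0 weights: 115.
const maxTier0Deduction = TIER0_CHECKS.reduce((sum, check) => sum + check.weight, 0);

// Hypothetical helper: 15 points per figure missing /Alt, capped at the weight.
function imageAltTextDeduction(figuresMissingAlt: number): number {
  const perImage = 15;
  const cap = 30;
  return Math.min(cap, perImage * Math.max(0, figuresMissingAlt));
}

// No text plus no tags: 100 - 40 - 20 = 40 before any other check is considered.
const ceilingWithoutTextOrTags = 100 - 40 - 20;
```

The cap means a document with two untagged images is penalized as heavily as one with twenty, so the alt-text deduction alone says little about how many images need work.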

Tier 1 — Structure correctness (full mode only)

#	Check ID	Weight	WCAG SC	Deduction rule
7	`heading_hierarchy`	10	1.3.1, 2.4.6, 2.4.10	−5 for one issue (no H1, or one skipped level); −10 for two or more
8	`reading_order`	15	1.3.2	−15 if no `/StructTreeRoot`; −15 if `/StructTreeRoot` has no `/ParentTree`; −10 if pages have no `/StructParents`
9	`table_header_scope`	10	1.3.1	−5 if more than half (but not all) of `/S /TH` cells have `/Scope`; −10 if half or fewer do
10	`list_structure`	5	1.3.1	−5 if list tags (`/S /L`) exist with no list items (`/S /LI`)
11	`link_annotations`	5	1.3.1, 2.4.4, 4.1.2	−5 if the number of `/Subtype /Link` annotations exceeds `/S /Link` structure tags

Reading-order change (2026-04-23): this check used to count MCID integer inversions, which false-positived on multi-column and floated-figure layouts (valid PDFs with MCIDs emitted out of visual order but read correctly via the structure tree). It now verifies that the structure tree defines the reading order per ISO 32000-1 §14.7: /StructTreeRoot contains a /ParentTree, and pages reference it via /StructParents.
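
The structure-based rule described above might look roughly like the following. This is a sketch under assumptions: `readingOrderDeduction` is a hypothetical name, and it uses whole-file substring checks, whereas the real `checkReadingOrder` may inspect the `/StructTreeRoot` dictionary and individual page objects more precisely.

```typescript
// Hypothetical sketch of the post-2026-04-23 rule; not the scorer's real code.
function readingOrderDeduction(pdfText: string): number {
  // No structure tree at all: nothing defines reading order.
  if (pdfText.indexOf("/StructTreeRoot") === -1) return 15;
  // A structure tree without /ParentTree cannot map content back to structure.
  if (pdfText.indexOf("/ParentTree") === -1) return 15;
  // Pages must reference the parent tree via /StructParents.
  if (pdfText.indexOf("/StructParents") === -1) return 10;
  return 0;
}
```

Note that MCID values never enter the calculation, which is why multi-column layouts no longer trigger false positives.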

Tier 2 — PDF/UA compliance (full mode only)

#	Check ID	Weight	Deduction rule
12	`pdfua_identifier`	10	−10 if XMP metadata does not contain `pdfuaid:part`
13	`tab_order`	5	−5 if any page lacks `/Tabs /S` (fewer `/Tabs /S` occurrences than estimated pages)
14	`bookmarks`	5	−5 if the document has headings but no `/Type /Outlines` (bookmarks) entry
15	`artifact_marking`	5	−5 if a multi-page document has zero `/Artifact` markers (headers/footers/page numbers will leak into reading order)
16	`display_doc_title`	3	−3 if `/DisplayDocTitle true` is missing from `ViewerPreferences`
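
To make the `tab_order` rule concrete, the sketch below compares the number of `/Tabs /S` entries with an estimated page count. `estimatePages` is a hypothetical stand-in for the scorer's `estimatePageCount`, and both regular expressions are assumptions about how the patterns are matched.

```typescript
// Hypothetical stand-in for estimatePageCount: count "/Type /Page" entries,
// using a negative lookahead so "/Type /Pages" (the page tree node) is excluded.
function estimatePages(pdfText: string): number {
  const matches = pdfText.match(/\/Type\s*\/Page(?![A-Za-z])/g);
  return matches ? matches.length : 0;
}

// Deduct 5 when there are fewer "/Tabs /S" entries than estimated pages.
function tabOrderDeduction(pdfText: string): number {
  const tabMatches = pdfText.match(/\/Tabs\s*\/S(?![A-Za-z])/g);
  const tabCount = tabMatches ? tabMatches.length : 0;
  return tabCount < estimatePages(pdfText) ? 5 : 0;
}

// Usage: two pages, only one of which declares structure tab order.
const partialTabsDoc =
  "<< /Type /Pages /Count 2 >> << /Type /Page /Tabs /S >> << /Type /Page >>";
const partialTabsDeduction = tabOrderDeduction(partialTabsDoc); // 5
```

Since the comparison is count-based, an over- or under-estimated page count changes the outcome of this check directly.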

Worked example — why a healthy-looking PDF scores ~50

A PDF from Word that has text + headings but no tag tree typically fails:

tag_structure (−20)
image_alt_text (−30 if images)
document_language (−10)
document_title (−5)
In full mode, also pdfua_identifier (−10), tab_order (−5), bookmarks (−5), artifact_marking (−5), display_doc_title (−3), reading_order cascade failure (−15)

Quick mode: 100 − 65 = 35. Full mode: 100 − 108 = −8, floored to 0. The 50-ish case is an “okay but untagged” PDF where only some of the checks above fail and most Tier-1 checks pass.
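
The worked example can be reproduced in code. The deductions are copied from the list above; `FailedCheck` and `scoreForMode` are hypothetical names, and treating quick mode as "Tier 0 only" follows the description of the two scoring modes.

```typescript
type Tier = 0 | 1 | 2;

interface FailedCheck {
  id: string;
  tier: Tier;
  deduction: number;
}

// Failures from the Word-export example above.
const wordExportFailures: FailedCheck[] = [
  { id: "tag_structure", tier: 0, deduction: 20 },
  { id: "image_alt_text", tier: 0, deduction: 30 },
  { id: "document_language", tier: 0, deduction: 10 },
  { id: "document_title", tier: 0, deduction: 5 },
  { id: "reading_order", tier: 1, deduction: 15 },
  { id: "pdfua_identifier", tier: 2, deduction: 10 },
  { id: "tab_order", tier: 2, deduction: 5 },
  { id: "bookmarks", tier: 2, deduction: 5 },
  { id: "artifact_marking", tier: 2, deduction: 5 },
  { id: "display_doc_title", tier: 2, deduction: 3 },
];

// Quick mode counts Tier-0 failures only; full mode counts every tier.
function scoreForMode(failures: FailedCheck[], mode: "quick" | "full"): number {
  const counted = failures.filter((f) => mode === "full" || f.tier === 0);
  const total = counted.reduce((sum, f) => sum + f.deduction, 0);
  return Math.max(0, 100 - total);
}

const quickScore = scoreForMode(wordExportFailures, "quick"); // 100 - 65 = 35
const fullScore = scoreForMode(wordExportFailures, "full");   // 100 - 108, floored to 0
```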

Known scorer caveats

Byte-level string matching can miss checks that live inside compressed object streams. A PDF that is genuinely tagged but uses /ObjStm compression may score lower than it deserves. (pdf-struct-cleaner.ts:67 calls this out.)
estimatePageCount counts /Type /Page occurrences — a few edge-case PDFs can over- or under-count, which feeds into tab_order and artifact_marking.
document_title, document_language, pdfua_identifier, and display_doc_title use substring searches against the first 5 MB of the file (maxAnalyzeBytes default = 5 MB), so large PDFs that store this metadata late in the file can fail these checks even when the metadata is present.
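
Two of these caveats are easy to demonstrate. In the sketch below, `foundInAnalysisWindow` and `mayBeUnderScored` are hypothetical helpers, and reading "5 MB" as 5 × 1024 × 1024 bytes is an assumption; the real `maxAnalyzeBytes` handling may differ.

```typescript
// Assumption: "5 MB" means 5 MiB.
const DEFAULT_MAX_ANALYZE_BYTES = 5 * 1024 * 1024;

// Search only the first maxAnalyzeBytes characters. With Latin-1 decoding,
// one character equals one byte, so slicing the string slices the bytes.
function foundInAnalysisWindow(
  pdfText: string,
  needle: string,
  maxAnalyzeBytes: number = DEFAULT_MAX_ANALYZE_BYTES,
): boolean {
  return pdfText.slice(0, maxAnalyzeBytes).indexOf(needle) !== -1;
}

// Flag files that use object streams, where tags may be hidden from
// byte-level matching and the score may be lower than deserved.
function mayBeUnderScored(pdfText: string): boolean {
  return pdfText.indexOf("/ObjStm") !== -1;
}

// Usage: metadata placed after 100 bytes of filler, scanned with a tiny window.
const filler = new Array(101).join("x"); // 100 filler characters
const lateMetadataPdf = filler + "/Lang (en-US)";
const seenWithSmallWindow = foundInAnalysisWindow(lateMetadataPdf, "/Lang", 50);  // false
const seenWithLargeWindow = foundInAnalysisWindow(lateMetadataPdf, "/Lang", 200); // true
```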

3. How we improve the score

We already run a post-processor on every generated PDF that targets exactly these checks:

workers/api/src/services/pdf-accessibility-postprocessor.ts → postProcessAccessiblePdf is invoked from server.ts:698, index-aws.ts:470, chunk-scheduler.ts:794, and routes/convert.ts:269. It fixes: XMP metadata, /DisplayDocTitle, /Tabs /S, outline/bookmarks, /Lang, /Title, artifact marking, bullet normalization, and link /Contents. Per its own header comment, it “pushes scores from ~35% to 80%+”.

Leverage per check (highest-impact first)

For source PDFs (what the user uploaded — we don’t control these, but the score tells the user why remediation is needed):

extractable_text (−40) — OCR the PDF. This is the single biggest lever. Scanned PDFs can’t score above 60 no matter what else we do.
tag_structure (−20) — run through our conversion pipeline; WeasyPrint + post-processor adds the tag tree.
image_alt_text (−30) — AI-generated alt text is already part of the pipeline (image-extractor.ts, image description prompts). Verify every <img> in the converted HTML has a meaningful, non-empty alt attribute.
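
For the alt-text item above, a lightweight lint over the converted HTML could catch the obvious failures. This is only a sketch: `imagesMissingAlt` is a hypothetical helper, it uses regular expressions rather than a real HTML parser, and it can detect a missing or blank alt attribute but cannot judge whether the text is meaningful.

```typescript
// Matches a quoted alt attribute containing at least one non-space character.
// Unquoted values (alt=foo) are not recognized and would be reported as missing.
const NON_EMPTY_ALT = /\salt\s*=\s*(?:"\s*[^"\s][^"]*"|'\s*[^'\s][^']*')/i;

// Hypothetical lint: return every <img> tag whose alt is missing or blank.
// Limitation: a ">" inside an attribute value would cut the tag short.
function imagesMissingAlt(html: string): string[] {
  const imgTags: string[] = html.match(/<img\b[^>]*>/gi) || [];
  return imgTags.filter((tag) => !NON_EMPTY_ALT.test(tag));
}

// Usage
const convertedHtml =
  '<p><img src="a.png" alt="Bar chart of Q3 revenue">' +
  '<img src="b.png" alt=""><img src="c.png"></p>';
const flaggedImages = imagesMissingAlt(convertedHtml);
```

Here `flaggedImages` contains the `b.png` and `c.png` tags, while the described chart passes.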

For our generated PDFs (what we ship to the user), the remaining gaps to close:

pdfua_identifier (−10) — the postprocessor header explicitly says it does not claim PDF/UA-1. To claim it we must also verify all Tier-1 checks pass; adding the XMP claim alone without passing structure checks fails Acrobat’s preflight.
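reading_order (−15) — ensure the generated PDF’s /StructTreeRoot contains a /ParentTree and every page references it via /StructParents. MCID emission order no longer affects this check, so multi-column layouts and floated figures are not penalized by the scorer; still audit them by opening a generated PDF in Acrobat Pro → Accessibility → Reading Order.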
table_header_scope (−10) — emit <th scope="col"> / <th scope="row"> in the converted HTML so WeasyPrint propagates /Scope into the PDF tag tree.
heading_hierarchy (−5 to −10) — TOC detection already produces an H1; make sure downstream chunks don’t restart at H1 or skip from H1 → H3.
list_structure (−5) — ensure <ul>/<ol> survive conversion; don’t emit bare <p>• item</p>.
link_annotations (−5) — verify every <a href> in the HTML becomes both a /Subtype /Link annotation and a /S /Link structure element.
artifact_marking (−5) — the postprocessor’s “mark untagged content as Artifact” fixup handles this; check it’s running for multi-page outputs.
document_title, document_language, display_doc_title, bookmarks, tab_order — all fully handled by the postprocessor today. If one regresses, check postProcessAccessiblePdf is being called on that code path.
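
The heading_hierarchy item above can be guarded with a small check over the sequence of heading levels in the converted HTML. This sketch rests on assumptions: `headingIssues` and `headingHierarchyDeduction` are hypothetical names, and counting one issue for a missing H1 plus one per skipped level is an interpretation of the Tier-1 rule, not the scorer's confirmed logic.

```typescript
// Hypothetical lint: list hierarchy problems in a sequence of heading levels.
function headingIssues(levels: number[]): string[] {
  const issues: string[] = [];
  if (levels.length === 0) return issues; // no headings: check does not apply
  if (levels.indexOf(1) === -1) issues.push("no H1");
  let previous: number | null = null;
  for (const level of levels) {
    // Going deeper by more than one level (e.g. H1 -> H3) is a skip.
    if (previous !== null && level > previous + 1) {
      issues.push("skipped from H" + previous + " to H" + level);
    }
    previous = level;
  }
  return issues;
}

// Map the issue count to the documented deduction: 5 for one, 10 for two or more.
function headingHierarchyDeduction(levels: number[]): number {
  const count = headingIssues(levels).length;
  if (count === 0) return 0;
  return count === 1 ? 5 : 10;
}
```

For example, the level sequence `[1, 3]` yields one issue and a deduction of 5, while `[2, 4]` has no H1 and a skipped level, so it deducts 10.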

Implemented follow-ups (2026-04-23)

✅ Per-check breakdown surfaced in the UI. Each AccessibilityCheckResult now carries a wcagCriteria: string[], and apps/web/src/components/lti/ScoreBreakdown.tsx renders failing checks with their WCAG 2.1 AA SC tags. Wired into FileScoreRow via a “Why?” disclosure.
✅ Round-trip regression test. workers/api/src/__tests__/services/pdf-accessibility-roundtrip.test.ts builds synthetic “before” PDFs with pdf-lib, runs postProcessAccessiblePdf, then re-scores — asserting the score strictly increases, no check regresses from pass to fail, and total deductions monotonically decrease.
✅ StructParents-based reading-order check. checkReadingOrder no longer uses MCID monotonicity. It now validates that the structure tree actually expresses reading order (via /ParentTree + /StructParents), which is what ISO 32000-1 §14.7 defines as authoritative.
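
The three round-trip invariants named above (score increases, no pass-to-fail regression, deductions do not grow) can be expressed as a small comparison. This is a sketch only: `ScoreSnapshot`, `CheckSnapshot`, and `roundTripViolations` are hypothetical names, and the actual test file may structure its assertions differently.

```typescript
// Hypothetical shapes; the real AccessibilityScore type may differ.
interface CheckSnapshot {
  id: string;
  passed: boolean;
  deduction: number;
}

interface ScoreSnapshot {
  score: number;
  checks: CheckSnapshot[];
}

// Return a human-readable list of violated invariants (empty means OK).
function roundTripViolations(before: ScoreSnapshot, after: ScoreSnapshot): string[] {
  const violations: string[] = [];
  const totalDeduction = (snapshot: ScoreSnapshot): number =>
    snapshot.checks.reduce((sum, check) => sum + check.deduction, 0);

  if (after.score <= before.score) {
    violations.push("score did not increase: " + before.score + " -> " + after.score);
  }
  if (totalDeduction(after) > totalDeduction(before)) {
    violations.push("total deductions increased");
  }
  for (const check of before.checks) {
    const regressed =
      check.passed && after.checks.some((c) => c.id === check.id && !c.passed);
    if (regressed) violations.push(check.id + " regressed from pass to fail");
  }
  return violations;
}
```

Returning a list rather than throwing on the first failure lets a test report every broken invariant in one run.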

Remaining follow-ups

Gate the PDF/UA-1 claim. The post-processor currently stamps <pdfuaid:part>1</pdfuaid:part> in XMP unconditionally (see injectXmpMetadata in pdf-accessibility-postprocessor.ts). It should only be stamped when all 16 structural checks pass — otherwise downstream validators (veraPDF, Acrobat Preflight) will flag a false claim.
Run veraPDF / PAC 2024 in CI on a representative output as a Matterhorn-complete external check. Our scorer is a subset.
Add a WCAG 2.2 AA pass alongside 2.1 AA — backward-compatible, forward-looking for procurement language.
Uncompress /ObjStm before scoring so tag presence isn’t missed when writers use object-stream compression.
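
The first follow-up, gating the PDF/UA-1 claim, could be sketched as a predicate evaluated before the XMP identifier is written. `GateCheck` and `mayStampPdfUaClaim` are hypothetical names. One assumption is flagged explicitly: the sketch ignores `pdfua_identifier` itself, because that check cannot pass until the claim has been stamped, so the gate would otherwise be circular.

```typescript
// Hypothetical shape; only the fields the gate needs.
interface GateCheck {
  id: string;
  passed: boolean;
}

// Allow the PDF/UA-1 claim only when every other structural check passes.
// Assumption: pdfua_identifier is excluded, since it fails until the claim exists.
function mayStampPdfUaClaim(checks: GateCheck[]): boolean {
  const prerequisites = checks.filter((check) => check.id !== "pdfua_identifier");
  // Refuse to stamp when there is no evidence at all.
  return prerequisites.length > 0 && prerequisites.every((check) => check.passed);
}
```

A caller would run full-mode scoring first, pass the resulting checks to this predicate, and skip the `pdfuaid:part` stamp whenever it returns false.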

4. Quick reference — file map

Concern	File
Scoring logic (16 checks, deductions)	`workers/api/src/services/pdf-accessibility-scorer.ts`
Score-band thresholds + shared types	`packages/shared/src/types.ts` (`getScoreBand`, `AccessibilityScore`)
PDF fixups that raise the score	`workers/api/src/services/pdf-accessibility-postprocessor.ts`
Full-mode scoring call sites	`routes/convert.ts:281`, `scheduler/chunk-scheduler.ts:806`
Quick-mode scoring call site (LTI scan)	`routes/lti-course.ts:157`
Tests	`workers/api/src/__tests__/services/pdf-accessibility-scorer.test.ts`