PDF Accessibility Score

This document explains the accessibility score shown in apps/web (e.g. on the preview, pipeline, report, and LTI course-scanner pages) for every PDF the app ingests or produces.

Compliance framing: WCAG 2.1 AA is the baseline, not PDF/UA

The laws in scope (ADA Title II DOJ rule 2026-07663, Section 508 Revised Standards, EU EN 301 549) all adopt WCAG 2.1 Level AA as the normative technical standard for web content and PDFs. PDF/UA-1 (ISO 14289-1) and the Matterhorn Protocol 1.1 are the standard implementation / validation checklists for applying WCAG to PDFs, but they are not themselves what the law requires.

Our scorer reflects this:

  • The headline score and each of the 16 structural checks are labeled with the WCAG 2.1 AA success criteria they support (see WCAG_CRITERIA_MAP in pdf-accessibility-scorer.ts).
  • The underlying byte-level checks derive from PDF/UA-1 / Matterhorn, which is how those SCs are satisfied at the PDF-format level.
  • Passing a structural check is evidence toward, not proof of, SC conformance. We do not stamp a PDF/UA-1 claim just because the structural checks pass; that needs a Matterhorn-complete validator (veraPDF / PAC 2024).

1. What the score is

A single integer 0–100 representing how well the PDF meets a set of structural accessibility checks. It is a structural / PDF-UA-oriented score, not a full WCAG page-content audit.

  • Implementation: workers/api/src/services/pdf-accessibility-scorer.ts (function scorePdfAccessibility)
  • Shared types / banding: packages/shared/src/types.ts (AccessibilityScore, AccessibilityCheckResult, getScoreBand)
  • No AI, no rendering. The scorer reads raw PDF bytes as Latin-1, counts dictionary patterns (/StructTreeRoot, /S /Figure, /MCID …, etc.) and deducts points per failing check.
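The byte-level approach can be sketched as follows. This is a minimal illustration, not the scorer's actual code: decodeLatin1 and countPdfPatterns are hypothetical names, and the regular expressions are assumptions about what "counting dictionary patterns" might look like.

```typescript
// Hypothetical sketch, not the real implementation in pdf-accessibility-scorer.ts.
// Latin-1 maps every byte to exactly one character, so binary stream data
// cannot make the decode fail the way invalid UTF-8 sequences could.
function decodeLatin1(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

interface PatternCounts {
  hasStructTreeRoot: boolean;
  figureTags: number;
  mcidRefs: number;
}

function countPdfPatterns(bytes: Uint8Array): PatternCounts {
  const text = decodeLatin1(bytes);
  // Count non-overlapping matches of a global regex.
  const count = (re: RegExp): number => (text.match(re) || []).length;
  return {
    hasStructTreeRoot: text.indexOf("/StructTreeRoot") !== -1,
    figureTags: count(/\/S\s*\/Figure\b/g),
    mcidRefs: count(/\/MCID\s+\d+/g),
  };
}
```

Because nothing is parsed or rendered, the cost is a linear scan of the bytes; the trade-off is that patterns hidden inside compressed streams are invisible to this kind of matching.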

Score bands (getScoreBand)

Range | Band | Meaning
0–33 | red | Not usable with assistive tech
34–66 | orange | Partial: major gaps
67–99 | green | Good: minor gaps
100 | dark-green | All structural checks pass

The 50% scores you typically see sit in the orange band: the PDF has text but is missing most of the PDF/UA structural metadata.
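The banding can be sketched as a simple threshold function. The thresholds come from the table above; the function name sketchScoreBand and the ScoreBand type are illustrative assumptions, and the real getScoreBand in packages/shared/src/types.ts may be written differently.

```typescript
// Hypothetical sketch of the banding logic; thresholds taken from the score-band table.
type ScoreBand = "red" | "orange" | "green" | "dark-green";

function sketchScoreBand(score: number): ScoreBand {
  if (score >= 100) return "dark-green"; // all structural checks pass
  if (score >= 67) return "green";       // minor gaps
  if (score >= 34) return "orange";      // major gaps
  return "red";                          // not usable with assistive tech
}
```

Under these thresholds a score of 50 maps to "orange", matching the typical case described above.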

Two scoring modes

scorePdfAccessibility(bytes, { mode }) runs either:

  • quick: 6 structural presence checks (Tier 0). Used for freemium intake and LTI course scans (workers/api/src/routes/lti-course.ts:157).
  • full: all 16 checks (Tier 0 + Tier 1 correctness + Tier 2 PDF/UA). Used on generated PDFs before export in workers/api/src/routes/convert.ts:281 and workers/api/src/scheduler/chunk-scheduler.ts:806.

A source PDF scored with quick and the same PDF scored with full will produce different numbers: the full score is stricter because it runs more checks.


2. How it is calculated

score = max(0, 100 − sum(deductions from each failing check))

Each check has a weight (the maximum it can deduct) and a deduction (what it actually took off this run). Checks that don't apply (e.g. no tables, no images) pass for free with deduction 0.
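A minimal sketch of the formula, assuming a hypothetical CheckOutcome shape (the real AccessibilityCheckResult type may carry different fields):

```typescript
// Hypothetical shape for illustration only.
interface CheckOutcome {
  id: string;
  weight: number;    // maximum this check can deduct
  deduction: number; // what it deducted this run (0 when passing or not applicable)
}

// score = max(0, 100 - sum of deductions)
function totalScore(checks: CheckOutcome[]): number {
  const deducted = checks.reduce((sum, c) => sum + c.deduction, 0);
  return Math.max(0, 100 - deducted);
}
```

The floor at 0 matters because the weights add up to more than 100, so a PDF that fails many checks would otherwise go negative.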

Tier 0: Structural presence (always run, both modes)

# | Check ID | Weight | Deduction rule
1 | extractable_text | 40 | −40 if fewer than 3 Tj/TJ text-showing operators found (scanned/image-only PDF)
2 | tag_structure | 20 | −20 if /MarkInfo or /StructTreeRoot is missing
3 | image_alt_text | 30 | −15 per image missing /Alt inside a /S /Figure region, capped at −30
4 | document_language | 10 | −10 if no /Lang entry (or value shorter than 2 chars)
5 | document_title | 5 | −5 if /Title missing or empty in the info dictionary
6 | table_headers | 10 | −10 if tables exist (/S /Table) but no /S /TH header cells anywhere

Max Tier-0 deduction: 115, which exceeds 100, so the floor at 0 can come into play. Failing just the first two checks (no text + no tags) deducts 60 and already pins the score to ≤40. This is why scanned PDFs score so low.
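Two of the Tier-0 rules can be sketched as below. The function names and the regular expression are assumptions made for illustration; only the thresholds (fewer than 3 operators, 15 per image, cap of 30) come from the table.

```typescript
// Hypothetical sketch of two Tier-0 deduction rules.

// extractable_text: -40 when fewer than 3 Tj/TJ text-showing operators are found.
function extractableTextDeduction(pdfText: string): number {
  const operators = (pdfText.match(/\b(?:Tj|TJ)\b/g) || []).length;
  return operators < 3 ? 40 : 0;
}

// image_alt_text: -15 per image missing /Alt, capped at the check's weight of 30.
function imageAltTextDeduction(imagesMissingAlt: number): number {
  return Math.min(30, 15 * Math.max(0, imagesMissingAlt));
}
```

The cap means that a document with ten undescribed images loses the same 30 points as one with two, so the check signals the presence of the problem rather than its scale.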

Tier 1: Structure correctness (full mode only)

# | Check ID | Weight | WCAG SC | Deduction rule
7 | heading_hierarchy | 10 | 1.3.1, 2.4.6, 2.4.10 | −5 for one issue (no H1, or one skipped level); −10 for two or more
8 | reading_order | 15 | 1.3.2 | −15 if no /StructTreeRoot; −15 if /StructTreeRoot has no /ParentTree; −10 if pages have no /StructParents
9 | table_header_scope | 10 | 1.3.1 | −5 if more than half of /S /TH cells have /Scope; −10 if fewer
10 | list_structure | 5 | 1.3.1 | −5 if list tags (/S /L) exist with no list items (/S /LI)
11 | link_annotations | 5 | 1.3.1, 2.4.4, 4.1.2 | −5 if the number of /Subtype /Link annotations exceeds /S /Link structure tags

Reading-order change (2026-04-23): this check used to count MCID integer inversions, which produced false positives on multi-column and floated-figure layouts (valid PDFs with MCIDs emitted out of visual order but read correctly via the structure tree). It now verifies that the structure tree defines the reading order per ISO 32000-1 §14.7: /StructTreeRoot contains a /ParentTree, and pages reference it via /StructParents.
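The reading_order rule can be sketched as a cascade of presence tests. This is a simplification under stated assumptions: readingOrderDeduction is a hypothetical name, and the sketch looks for /ParentTree anywhere in the text rather than specifically inside the /StructTreeRoot dictionary, which the real checkReadingOrder may handle more precisely.

```typescript
// Hypothetical sketch of the reading_order deduction cascade.
function readingOrderDeduction(pdfText: string): number {
  // No structure tree at all: full deduction.
  if (!/\/StructTreeRoot\b/.test(pdfText)) return 15;
  // \b prevents /ParentTreeNextKey from being mistaken for /ParentTree.
  if (!/\/ParentTree\b/.test(pdfText)) return 15;
  // Pages must point into the parent tree via /StructParents.
  if (!/\/StructParents\b/.test(pdfText)) return 10;
  return 0;
}
```

Note that nothing here inspects MCID ordering, so a multi-column layout with out-of-order MCIDs passes as long as the structure tree is wired up.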

Tier 2: PDF/UA compliance (full mode only)

# | Check ID | Weight | Deduction rule
12 | pdfua_identifier | 10 | −10 if XMP metadata does not contain pdfuaid:part
13 | tab_order | 5 | −5 if any page lacks /Tabs /S (fewer /Tabs /S occurrences than estimated pages)
14 | bookmarks | 5 | −5 if the document has headings but no /Type /Outlines (bookmarks) entry
15 | artifact_marking | 5 | −5 if a multi-page document has zero /Artifact markers (headers/footers/page numbers will leak into reading order)
16 | display_doc_title | 3 | −3 if /DisplayDocTitle true is missing from ViewerPreferences
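As an illustration of how a Tier-2 rule compares counts, here is a sketch of the tab_order rule. The names estimatePagesSketch and tabOrderDeduction and both regular expressions are assumptions; the real scorer may estimate pages differently.

```typescript
// Hypothetical sketch of the tab_order rule.

// Estimate pages by counting "/Type /Page" dictionaries.
// \b keeps "/Type /Pages" (the page-tree node) from being counted as a page.
function estimatePagesSketch(pdfText: string): number {
  return (pdfText.match(/\/Type\s*\/Page\b/g) || []).length;
}

// -5 when there are fewer "/Tabs /S" entries than estimated pages.
function tabOrderDeduction(pdfText: string): number {
  const pages = estimatePagesSketch(pdfText);
  const tabsS = (pdfText.match(/\/Tabs\s*\/S\b/g) || []).length;
  return tabsS < pages ? 5 : 0;
}
```

Because the comparison is count-based, a miscounted page total shifts the result in either direction, which is worth keeping in mind when a tab_order failure looks surprising.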

Worked example: why a healthy-looking PDF scores ~50

A PDF from Word that has text + headings but no tag tree typically fails:

  • tag_structure (−20)
  • image_alt_text (−30 if images)
  • document_language (−10)
  • document_title (−5)
  • In full mode, also pdfua_identifier (−10), tab_order (−5), bookmarks (−5), artifact_marking (−5), display_doc_title (−3), reading_order cascade failure (−15)

Quick mode: 100 − 65 = 35. Full mode: 100 − 108, floored = 0. An "okay but untagged" PDF is the 50-ish case: some of the above fail, most Tier-1 pass.
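The arithmetic above can be reproduced with a short sketch. The FailedCheck shape and the scoreFor helper are hypothetical; the deduction values are the ones listed in the worked example.

```typescript
// Hypothetical sketch reproducing the worked example's arithmetic.
type Mode = "quick" | "full";

interface FailedCheck {
  id: string;
  deduction: number;
  tier: 0 | 1 | 2;
}

// Failing checks for the untagged Word export (images present).
const failedChecks: FailedCheck[] = [
  { id: "tag_structure", deduction: 20, tier: 0 },
  { id: "image_alt_text", deduction: 30, tier: 0 },
  { id: "document_language", deduction: 10, tier: 0 },
  { id: "document_title", deduction: 5, tier: 0 },
  { id: "reading_order", deduction: 15, tier: 1 },
  { id: "pdfua_identifier", deduction: 10, tier: 2 },
  { id: "tab_order", deduction: 5, tier: 2 },
  { id: "bookmarks", deduction: 5, tier: 2 },
  { id: "artifact_marking", deduction: 5, tier: 2 },
  { id: "display_doc_title", deduction: 3, tier: 2 },
];

// Quick mode only sees Tier 0; full mode sees every tier.
function scoreFor(mode: Mode, failed: FailedCheck[]): number {
  const applicable = failed.filter((c) => mode === "full" || c.tier === 0);
  const total = applicable.reduce((sum, c) => sum + c.deduction, 0);
  return Math.max(0, 100 - total);
}
```

With these inputs, scoreFor("quick", failedChecks) evaluates to 35 and scoreFor("full", failedChecks) to 0, since the 108 points of deductions are floored.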

Known scorer caveats

  • Byte-level string matching can miss checks that live inside compressed object streams. A PDF that is genuinely tagged but uses /ObjStm compression may score lower than it deserves. (pdf-struct-cleaner.ts:67 calls this out.)
  • estimatePageCount counts /Type /Page occurrences. A few edge-case PDFs can over- or under-count, which feeds into tab_order and artifact_marking.
  • document_title, document_language, pdfua_identifier, and display_doc_title use substring searches against the first 5 MB of the file, so large PDFs with late-file metadata may miss (maxAnalyzeBytes default = 5 MB).

3. How we improve the score

We already run a post-processor on every generated PDF that targets exactly these checks:

  • workers/api/src/services/pdf-accessibility-postprocessor.ts → postProcessAccessiblePdf is invoked from server.ts:698, index-aws.ts:470, chunk-scheduler.ts:794, and routes/convert.ts:269. It fixes: XMP metadata, /DisplayDocTitle, /Tabs /S, outline/bookmarks, /Lang, /Title, artifact marking, bullet normalization, and link /Contents. Per its own header comment, it "pushes scores from ~35% to 80%+".

Leverage per check (highest-impact first)

For source PDFs (what the user uploaded; we don't control these, but the score tells the user why remediation is needed):

  1. extractable_text (−40): OCR the PDF. This is the single biggest lever. Scanned PDFs can't score above 60 no matter what else we do.
  2. tag_structure (−20): run through our conversion pipeline; WeasyPrint + post-processor adds the tag tree.
  3. image_alt_text (−30): AI-generated alt text is already part of the pipeline (image-extractor.ts, image description prompts). Verify every <img> in the converted HTML has meaningful alt="".

For our generated PDFs (what we ship to the user), the remaining gaps to close:

  1. pdfua_identifier (−10): the postprocessor header explicitly says it does not claim PDF/UA-1, although injectXmpMetadata currently stamps the identifier unconditionally (see Remaining follow-ups). For the claim to be valid we must also verify all Tier-1 checks pass; adding the XMP claim alone without passing structure checks fails Acrobat's preflight.
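  2. reading_order (−15): ensure every generated PDF has a /ParentTree under /StructTreeRoot and /StructParents on its pages, since that is what the check verifies as of 2026-04-23. MCID emission order no longer affects the score, but multi-column layouts and floated figures are still worth auditing by opening a generated PDF in Acrobat Pro → Accessibility → Reading Order.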
  3. table_header_scope (−10): emit <th scope="col"> / scope="row" in the converted HTML so WeasyPrint propagates /Scope into the PDF tag tree.
  4. heading_hierarchy (−5 to −10): TOC detection already produces an H1; make sure downstream chunks don't restart at H1 or skip from H1 → H3.
  5. list_structure (−5): ensure <ul>/<ol> survive conversion; don't emit bare <p>• item</p>.
  6. link_annotations (−5): verify every <a href> in the HTML becomes both a /Subtype /Link annotation and a /S /Link structure element.
  7. artifact_marking (−5): the postprocessor's "mark untagged content as Artifact" fixup handles this; check it's running for multi-page outputs.
  8. document_title, document_language, display_doc_title, bookmarks, tab_order: all fully handled by the postprocessor today. If one regresses, check postProcessAccessiblePdf is being called on that code path.

Implemented follow-ups (2026-04-23)

  • ✅ Per-check breakdown surfaced in the UI. Each AccessibilityCheckResult now carries a wcagCriteria: string[], and apps/web/src/components/lti/ScoreBreakdown.tsx renders failing checks with their WCAG 2.1 AA SC tags. Wired into FileScoreRow via a "Why?" disclosure.
  • ✅ Round-trip regression test. workers/api/src/__tests__/services/pdf-accessibility-roundtrip.test.ts builds synthetic "before" PDFs with pdf-lib, runs postProcessAccessiblePdf, then re-scores, asserting that the score strictly increases, no check regresses from pass to fail, and total deductions monotonically decrease.
  • ✅ StructParents-based reading-order check. checkReadingOrder no longer uses MCID monotonicity. It now validates that the structure tree actually expresses reading order (via /ParentTree + /StructParents), which is what ISO 32000-1 §14.7 defines as authoritative.
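The three round-trip invariants can be expressed as a small comparison helper. This is a sketch, not the contents of the actual test file: the ScoreSketch and CheckResultSketch shapes and the roundTripViolations name are hypothetical, and the real AccessibilityScore type may use different field names.

```typescript
// Hypothetical shapes for illustration; the real types live in packages/shared/src/types.ts.
interface CheckResultSketch {
  id: string;
  passed: boolean;
  deduction: number;
}

interface ScoreSketch {
  score: number;
  checks: CheckResultSketch[];
}

// Returns a list of violated invariants; an empty list means the round trip is healthy.
function roundTripViolations(before: ScoreSketch, after: ScoreSketch): string[] {
  const violations: string[] = [];

  // 1. The score must strictly increase.
  if (after.score <= before.score) {
    violations.push("score did not strictly increase");
  }

  // 2. No check may regress from pass to fail.
  const afterById: { [id: string]: CheckResultSketch } = {};
  for (const check of after.checks) afterById[check.id] = check;
  for (const prev of before.checks) {
    const next = afterById[prev.id];
    if (next !== undefined && prev.passed && !next.passed) {
      violations.push(prev.id + " regressed from pass to fail");
    }
  }

  // 3. Total deductions must not grow.
  const sum = (checks: CheckResultSketch[]): number =>
    checks.reduce((total, c) => total + c.deduction, 0);
  if (sum(after.checks) > sum(before.checks)) {
    violations.push("total deductions increased");
  }

  return violations;
}
```

Returning a list of violations instead of throwing on the first one makes a failing run report every broken invariant at once.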

Remaining follow-ups

  • Gate the PDF/UA-1 claim. The post-processor currently stamps <pdfuaid:part>1</pdfuaid:part> in XMP unconditionally (see injectXmpMetadata in pdf-accessibility-postprocessor.ts). It should only be stamped when all 16 structural checks pass; otherwise downstream validators (veraPDF, Acrobat Preflight) will flag a false claim.
  • Run veraPDF / PAC 2024 in CI on a representative output as a Matterhorn-complete external check. Our scorer is a subset.
  • Add a WCAG 2.2 AA pass alongside 2.1 AA: backward-compatible, forward-looking for procurement language.
  • Uncompress /ObjStm before scoring so tag presence isn't missed when writers use object-stream compression.
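The first follow-up, gating the PDF/UA-1 claim, could look roughly like this. It is a sketch of the proposed behavior only: pdfUaXmpFragment and GateCheckSketch are hypothetical names, and how the fragment would be threaded into injectXmpMetadata is not specified here.

```typescript
// Hypothetical sketch of gating the PDF/UA-1 XMP claim on the scorer's results.
interface GateCheckSketch {
  id: string;
  passed: boolean;
}

const PDFUA_PART_XMP = "<pdfuaid:part>1</pdfuaid:part>";

// Emit the claim only when there are results and every structural check passed;
// otherwise emit nothing so validators do not see a false claim.
function pdfUaXmpFragment(checks: GateCheckSketch[]): string {
  const allPass = checks.length > 0 && checks.every((c) => c.passed);
  return allPass ? PDFUA_PART_XMP : "";
}
```

Treating an empty result list as "not passing" is a deliberate choice in this sketch, so that a skipped or failed scoring run can never result in a stamped claim.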

4. Quick reference β€” file map

Concern | File
Scoring logic (16 checks, deductions) | workers/api/src/services/pdf-accessibility-scorer.ts
Score-band thresholds + shared types | packages/shared/src/types.ts (getScoreBand, AccessibilityScore)
PDF fixups that raise the score | workers/api/src/services/pdf-accessibility-postprocessor.ts
Full-mode scoring call sites | routes/convert.ts:281, scheduler/chunk-scheduler.ts:806
Quick-mode scoring call site (LTI scan) | routes/lti-course.ts:157
Tests | workers/api/src/__tests__/services/pdf-accessibility-scorer.test.ts