PDF Accessibility Evaluator — Implementation Plan

Goal

Extend the existing pdf-accessibility-scorer.ts into a comprehensive PDF accessibility evaluator that validates not just the presence of accessibility features but their correctness. This is a prerequisite for the “export as accessible PDF” feature — we need to measure quality before we ship.

Current State

workers/api/src/services/pdf-accessibility-scorer.ts performs 6 structural checks via raw byte inspection:

Check	Deduction	What it measures
Extractable text	-40	Is there text at all, or image-only?
Tag structure	-20	Do `/MarkInfo` + `/StructTreeRoot` exist?
Image alt text	-15 each (max -30)	Do `/S /Figure` elements have `/Alt`?
Document language	-10	Does `/Lang` exist?
Document title	-5	Does `/Title` exist?
Table headers	-10	Do `/S /Table` have `/S /TH`?

Limitation: These are presence checks only. A PDF could have tags that are completely wrong (headings out of order, tables with no scope, reading order jumbled) and still score 100.

Design Principles

No AI, no rendering — keep it fast and free (byte-level inspection only)
Deduction-based scoring — same pattern as existing scorer
Backward compatible — extend AccessibilityScore type, don’t break callers
Two modes — quick mode (existing 6 checks, for freemium) and full mode (all checks, for generated PDFs)
Actionable results — each check explains what’s wrong and how to fix it

New Checks

Tier 1: Structure Correctness (add to existing scorer)

1.1 Heading Hierarchy (-10)

Validate that heading tags follow a logical order without skips.

Scan for /S /H1, /S /H2, ... /S /H6 in the structure tree.
Build the sequence. Fail if:
  - No H1 exists (document has no top-level heading)
  - Levels are skipped (H1 → H3 with no H2)
  - Multiple H1s (unless document has clear sections)

Byte pattern: Search for /S /H1 through /S /H6, record order of appearance.

Deduction: -10 if hierarchy is invalid, -5 if minor skip (e.g., H2 → H4).
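
A minimal sketch of this check, under stated assumptions: deductions are stored as positive numbers, byte order of the /S /Hn tags approximates structure-tree order, and a single skip is treated as "minor" (-5) while a missing H1 or repeated skips count as invalid (-10). The multiple-H1 rule is left out because "clear sections" has no byte-level definition yet. The CheckResult interface is a hypothetical stand-in for the scorer's real type.

```typescript
// Hypothetical result shape, following the { passed, weight, deduction, detail }
// shape this plan describes. Deductions are stored as positive numbers.
interface CheckResult {
  passed: boolean;
  weight: number;
  deduction: number;
  detail: string;
}

// Collect heading levels (1-6) in order of appearance in the raw PDF string.
// Assumption: byte order approximates structure-tree order.
function extractHeadingLevels(pdfStr: string): number[] {
  const pattern = /\/S\s*\/H([1-6])(?![0-9A-Za-z])/g;
  const levels: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(pdfStr)) !== null) {
    levels.push(Number(match[1]));
  }
  return levels;
}

function validateHeadingHierarchy(pdfStr: string): CheckResult {
  const weight = 10;
  const levels = extractHeadingLevels(pdfStr);
  if (levels.indexOf(1) === -1) {
    return { passed: false, weight, deduction: 10, detail: 'No H1 found; add a top-level heading.' };
  }
  // A skip is any step deeper by more than one level (e.g. H1 -> H3).
  let skips = 0;
  for (let i = 1; i < levels.length; i++) {
    if (levels[i] - levels[i - 1] > 1) skips++;
  }
  if (skips === 0) {
    return { passed: true, weight, deduction: 0, detail: 'Heading levels are sequential.' };
  }
  // Assumption: one skip is "minor" (-5); repeated skips are invalid (-10).
  const deduction = skips === 1 ? 5 : 10;
  return {
    passed: false,
    weight,
    deduction,
    detail: `${skips} skipped heading level(s); insert the missing intermediate headings.`,
  };
}
```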

1.2 Reading Order Validation (-15)

Check that the structure tree tag order is plausible. In a tagged PDF, the order of structure elements determines screen reader reading order.

Extract the sequence of /S /xxx tags from the structure tree.
Fail if:
  - Document has tags but they appear in a clearly wrong order
    (e.g., all tags appear in reverse or random page order)
  - Content is tagged but paragraphs are interleaved from
    different columns (multi-column layout issue)

Approach: Compare the order in which the structure tree references marked-content IDs (MCIDs) against the pages that content sits on. If the tree reaches page 2 content before page 1 content, reading order is suspect.

Deduction: -15 if reading order appears scrambled, -5 if minor issues.

Note: This is the hardest check to do via byte inspection alone. A simpler v1 heuristic: verify that structure tree elements reference MCIDs in ascending page order (page 1 MCIDs before page 2 MCIDs, etc.).
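
The v1 heuristic could be sketched as below. It assumes a hypothetical upstream step has already resolved each structure element to a zero-based page index, listed in structure-tree order; that extraction is the hard part and is not shown. The 10% threshold separating "minor" from "scrambled" is an arbitrary placeholder, not a tuned value.

```typescript
// Hypothetical result shape; deductions are stored as positive numbers.
interface CheckResult {
  passed: boolean;
  weight: number;
  deduction: number;
  detail: string;
}

// pageSequence: page index of each structure element, in structure-tree order.
function scoreReadingOrder(pageSequence: number[]): CheckResult {
  const weight = 15;
  // A regression is any step from a later page back to an earlier one.
  let regressions = 0;
  for (let i = 1; i < pageSequence.length; i++) {
    if (pageSequence[i] < pageSequence[i - 1]) regressions++;
  }
  if (regressions === 0) {
    return { passed: true, weight, deduction: 0, detail: 'Structure elements follow ascending page order.' };
  }
  const transitions = pageSequence.length - 1;
  // Assumed threshold: up to 10% backward steps counts as a minor issue.
  const minor = regressions / transitions <= 0.1;
  return {
    passed: false,
    weight,
    deduction: minor ? 5 : 15,
    detail: `${regressions} of ${transitions} transitions move to an earlier page; retag so content follows page order.`,
  };
}
```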

1.3 Table Header Scope (-10)

Check that table header cells have proper scope attributes.

For each /S /TH element in the structure tree:
  - Look for /A dictionary with /Scope attribute
  - Valid values: /Column, /Row, /Both

Byte pattern: Find /S /TH (excluding /S /THead, which shares the prefix), then look for /Scope within the attribute dictionary.

Deduction: -10 if tables have TH without scope, -5 if some TH have scope but not all.
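
One way this could look, assuming structure elements are stored as ordinary uncompressed indirect objects (elements packed into object streams would be invisible to a raw scan). The object-splitting regex is a simplification, and CheckResult is a hypothetical stand-in for the scorer's real type.

```typescript
// Hypothetical result shape; deductions are stored as positive numbers.
interface CheckResult {
  passed: boolean;
  weight: number;
  deduction: number;
  detail: string;
}

function validateTableHeaderScope(pdfStr: string): CheckResult {
  const weight = 10;
  // Simplified: treat everything between "obj" and "endobj" as one object body.
  const objectPattern = /\bobj\b([\s\S]*?)\bendobj\b/g;
  // The lookahead keeps /S /THead from matching as /S /TH.
  const thPattern = /\/S\s*\/TH(?![0-9A-Za-z])/;
  const scopePattern = /\/Scope\s*\/(Column|Row|Both)(?![0-9A-Za-z])/;

  let total = 0;
  let scoped = 0;
  let match: RegExpExecArray | null;
  while ((match = objectPattern.exec(pdfStr)) !== null) {
    const body = match[1];
    if (!thPattern.test(body)) continue;
    total++;
    if (scopePattern.test(body)) scoped++;
  }

  if (total === 0 || scoped === total) {
    return { passed: true, weight, deduction: 0, detail: 'All TH cells declare a scope (or no TH cells found).' };
  }
  const deduction = scoped === 0 ? 10 : 5;
  return {
    passed: false,
    weight,
    deduction,
    detail: `${total - scoped} of ${total} TH cells lack /Scope; set it to /Column, /Row, or /Both.`,
  };
}
```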

1.4 List Structure (-5)

Validate that lists use proper L/LI/Lbl/LBody structure.

Search for /S /L (list), /S /LI (list item),
  /S /Lbl (label/bullet), /S /LBody (list body).
If content appears to be a list (bullet characters in text)
  but no /S /L tags exist, deduct.
If /S /L exists, verify it contains /S /LI children.

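Byte pattern: Count /S /L and /S /LI, matching /S /L only when a delimiter follows so that /S /LI, /S /Lbl, /S /LBody, and /S /Link are not counted as lists. Check for bullet characters (•, –, numbered patterns) in text without corresponding list tags.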

Deduction: -5 if lists exist but aren’t tagged, or if /S /L elements have no /S /LI children.
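
A rough sketch of the counting and the bullet heuristic. Two assumptions are baked in: the text passed as extractedText is a hypothetical input (for example, whatever the existing extractable-text check already recovers), and an /S /L with no /S /LI anywhere in the file is treated as a failure with the same -5 deduction. Requiring at least two bullet-like lines is an arbitrary choice to cut false positives.

```typescript
// Hypothetical result shape; deductions are stored as positive numbers.
interface CheckResult {
  passed: boolean;
  weight: number;
  deduction: number;
  detail: string;
}

// Count exact occurrences of a structure type; the lookahead stops "L"
// from matching LI, Lbl, LBody, or Link.
function countStructTag(pdfStr: string, tag: string): number {
  const pattern = new RegExp('/S\\s*/' + tag + '(?![0-9A-Za-z])', 'g');
  const matches = pdfStr.match(pattern);
  return matches ? matches.length : 0;
}

// Heuristic: at least two lines starting with a bullet, dash, or "1." / "1)".
function looksLikeList(text: string): boolean {
  const bulletLine = /^\s*(?:[\u2022\u2013-]|\d+[.)])\s+\S/gm;
  const matches = text.match(bulletLine);
  return matches !== null && matches.length >= 2;
}

function validateListStructure(pdfStr: string, extractedText: string): CheckResult {
  const weight = 5;
  const lists = countStructTag(pdfStr, 'L');
  const items = countStructTag(pdfStr, 'LI');
  if (lists === 0 && looksLikeList(extractedText)) {
    return { passed: false, weight, deduction: 5, detail: 'Text looks like a list but no /S /L tags exist; tag it as L/LI/Lbl/LBody.' };
  }
  if (lists > 0 && items === 0) {
    return { passed: false, weight, deduction: 5, detail: 'List tags found without /S /LI children.' };
  }
  return { passed: true, weight, deduction: 0, detail: 'List structure looks consistent.' };
}
```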

1.5 Link Annotations (-5)

Check that hyperlinks are tagged as /Link with meaningful content.

Count /Subtype /Link annotations.
Check if corresponding /S /Link structure elements exist.

Byte pattern: Count /Subtype /Link vs /S /Link.

Deduction: -5 if link annotations exist but no /S /Link structure tags.
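
As a sketch, the comparison reduces to two counts. The helper name countMatches and the CheckResult interface are hypothetical; only the presence of structure links is tested here, not whether each annotation maps to its own element.

```typescript
// Hypothetical result shape; deductions are stored as positive numbers.
interface CheckResult {
  passed: boolean;
  weight: number;
  deduction: number;
  detail: string;
}

function countMatches(pdfStr: string, pattern: RegExp): number {
  const matches = pdfStr.match(pattern);
  return matches ? matches.length : 0;
}

function validateLinkAnnotations(pdfStr: string): CheckResult {
  const weight = 5;
  const annotations = countMatches(pdfStr, /\/Subtype\s*\/Link(?![0-9A-Za-z])/g);
  const structLinks = countMatches(pdfStr, /\/S\s*\/Link(?![0-9A-Za-z])/g);
  if (annotations > 0 && structLinks === 0) {
    return {
      passed: false,
      weight,
      deduction: 5,
      detail: `${annotations} link annotation(s) but no /S /Link structure elements; tag each link.`,
    };
  }
  return { passed: true, weight, deduction: 0, detail: `${annotations} annotation(s), ${structLinks} /S /Link element(s).` };
}
```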

Tier 2: PDF/UA Compliance (for generated PDFs)

2.1 PDF/UA Identifier (-10)

Check for PDF/UA conformance declaration in XMP metadata.

Search for pdfuaid:part in the metadata stream.
PDF/UA-1 requires: <pdfuaid:part>1</pdfuaid:part>

Byte pattern: Search for pdfuaid:part in the document.

Deduction: -10 if missing (required for PDF/UA compliance).
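
A possible shape for this check. It goes slightly beyond a presence test by reading the declared part number, and it accepts both the element form shown above and the attribute form (pdfuaid:part="1") that some XMP writers emit. The helper name findPdfUaPart is hypothetical, and the sketch assumes the metadata stream is stored uncompressed.

```typescript
// Hypothetical result shape; deductions are stored as positive numbers.
interface CheckResult {
  passed: boolean;
  weight: number;
  deduction: number;
  detail: string;
}

// Returns the declared PDF/UA part number, or null if none is found.
function findPdfUaPart(pdfStr: string): number | null {
  // Element form: <pdfuaid:part>1</pdfuaid:part>
  const element = /<pdfuaid:part>\s*(\d+)\s*<\/pdfuaid:part>/.exec(pdfStr);
  if (element) return Number(element[1]);
  // Attribute form: pdfuaid:part="1"
  const attribute = /pdfuaid:part\s*=\s*["'](\d+)["']/.exec(pdfStr);
  if (attribute) return Number(attribute[1]);
  return null;
}

function checkPdfUaIdentifier(pdfStr: string): CheckResult {
  const weight = 10;
  const part = findPdfUaPart(pdfStr);
  if (part === null) {
    return { passed: false, weight, deduction: 10, detail: 'No pdfuaid:part declaration; add it to the XMP metadata.' };
  }
  return { passed: true, weight, deduction: 0, detail: `Declares PDF/UA part ${part}.` };
}
```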

2.2 Tab Order (-5)

Each page should specify /Tabs /S (structure order) so keyboard tab order follows the tag tree.

For each /Type /Page dictionary, check for /Tabs /S.

Byte pattern: Within page dictionaries (/Type /Page, excluding the /Type /Pages tree nodes that share the prefix), look for /Tabs /S.

Deduction: -5 if any page lacks /Tabs /S.
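
A sketch of the per-page scan, assuming page dictionaries are uncompressed indirect objects and using a simplified obj/endobj split. The lookahead keeps /Type /Pages tree nodes from being counted as pages. CheckResult is a hypothetical stand-in for the scorer's real type.

```typescript
// Hypothetical result shape; deductions are stored as positive numbers.
interface CheckResult {
  passed: boolean;
  weight: number;
  deduction: number;
  detail: string;
}

function checkTabOrder(pdfStr: string): CheckResult {
  const weight = 5;
  // Simplified: treat everything between "obj" and "endobj" as one object body.
  const objectPattern = /\bobj\b([\s\S]*?)\bendobj\b/g;
  const pagePattern = /\/Type\s*\/Page(?![0-9A-Za-z])/;
  const tabsPattern = /\/Tabs\s*\/S(?![0-9A-Za-z])/;

  let pages = 0;
  let missing = 0;
  let match: RegExpExecArray | null;
  while ((match = objectPattern.exec(pdfStr)) !== null) {
    const body = match[1];
    if (!pagePattern.test(body)) continue;
    pages++;
    if (!tabsPattern.test(body)) missing++;
  }

  if (missing === 0) {
    return { passed: true, weight, deduction: 0, detail: `All ${pages} page(s) specify /Tabs /S.` };
  }
  return {
    passed: false,
    weight,
    deduction: 5,
    detail: `${missing} of ${pages} page(s) lack /Tabs /S; set it on every page dictionary.`,
  };
}
```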

2.3 Bookmarks / Document Outline (-5)

Check that the document has bookmarks derived from headings.

Look for an /Outlines entry in the catalog, or /Type /Outlines on the outline dictionary it points to.
If the document has headings (H1-H6 tags) but no outlines,
  deduct points.

Byte pattern: Search for /Type /Outlines or /Outlines in catalog dictionary.

Deduction: -5 if headings exist but no bookmarks.
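
In sketch form the check is conditional: outlines only matter when headings exist. The pattern below accepts either /Type /Outlines or a bare /Outlines key anywhere in the file, which is looser than confirming the key sits in the catalog. CheckResult is a hypothetical stand-in.

```typescript
// Hypothetical result shape; deductions are stored as positive numbers.
interface CheckResult {
  passed: boolean;
  weight: number;
  deduction: number;
  detail: string;
}

function checkBookmarks(pdfStr: string): CheckResult {
  const weight = 5;
  const hasHeadings = /\/S\s*\/H[1-6](?![0-9A-Za-z])/.test(pdfStr);
  // Matches both "/Type /Outlines" and a catalog entry like "/Outlines 5 0 R".
  const hasOutlines = /\/Outlines(?![0-9A-Za-z])/.test(pdfStr);
  if (hasHeadings && !hasOutlines) {
    return { passed: false, weight, deduction: 5, detail: 'Headings exist but no outline; generate bookmarks from the headings.' };
  }
  return { passed: true, weight, deduction: 0, detail: hasOutlines ? 'Document outline present.' : 'No headings, so no outline expected.' };
}
```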

2.4 Artifact Marking (-5)

Decorative elements (headers, footers, page numbers, watermarks) should be marked as artifacts, not tagged as content.

Look for /Type /Pagination or BMC/BDC artifact operators
  in content streams.
A document with many pages but no artifact markers likely
  has repeated header/footer content polluting the tag tree.

Byte pattern: Search for /Artifact in content stream BDC operators.

Deduction: -5 if multi-page document has no artifact markers.
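
A sketch of the heuristic, with one large caveat stated as an assumption: content streams are usually compressed, so this only works if pdfStr contains decoded stream data or the producer wrote streams uncompressed. "Multi-page" is read here as more than one page. CheckResult and countPages are hypothetical names.

```typescript
// Hypothetical result shape; deductions are stored as positive numbers.
interface CheckResult {
  passed: boolean;
  weight: number;
  deduction: number;
  detail: string;
}

// Count /Type /Page dictionaries; the lookahead excludes /Type /Pages.
function countPages(pdfStr: string): number {
  const matches = pdfStr.match(/\/Type\s*\/Page(?![0-9A-Za-z])/g);
  return matches ? matches.length : 0;
}

function checkArtifactMarking(pdfStr: string): CheckResult {
  const weight = 5;
  const pages = countPages(pdfStr);
  // Covers both "/Artifact BMC" and "/Artifact << /Type /Pagination >> BDC".
  const hasArtifacts = /\/Artifact(?![0-9A-Za-z])/.test(pdfStr);
  if (pages > 1 && !hasArtifacts) {
    return {
      passed: false,
      weight,
      deduction: 5,
      detail: `${pages} pages but no /Artifact markers; mark headers, footers, and page numbers as artifacts.`,
    };
  }
  return { passed: true, weight, deduction: 0, detail: hasArtifacts ? 'Artifact markers found.' : 'Single-page document.' };
}
```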

2.5 Display Title Flag (-3)

The catalog should specify /ViewerPreferences << /DisplayDocTitle true >> so the title bar shows the document title instead of the filename.

Look for /DisplayDocTitle true in /ViewerPreferences.

Byte pattern: Search for /DisplayDocTitle followed by true.

Deduction: -3 if missing.
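
This one is nearly a one-liner; a sketch with the hypothetical CheckResult shape, assuming the flag is written inline rather than through an indirect reference:

```typescript
// Hypothetical result shape; deductions are stored as positive numbers.
interface CheckResult {
  passed: boolean;
  weight: number;
  deduction: number;
  detail: string;
}

function checkDisplayTitle(pdfStr: string): CheckResult {
  const weight = 3;
  // "/DisplayDocTitle false" must not pass, so the value is part of the pattern.
  const enabled = /\/DisplayDocTitle\s+true(?![0-9A-Za-z])/.test(pdfStr);
  if (enabled) {
    return { passed: true, weight, deduction: 0, detail: 'Title bar shows the document title.' };
  }
  return { passed: false, weight, deduction: 3, detail: 'Set /ViewerPreferences << /DisplayDocTitle true >> in the catalog.' };
}
```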

Scoring Summary

Existing checks (Tier 0): max -115

Check	Max deduction
Extractable text	-40
Tag structure	-20
Image alt text	-30
Document language	-10
Document title	-5
Table headers	-10

New Tier 1 checks: max -45

Check	Max deduction
Heading hierarchy	-10
Reading order	-15
Table header scope	-10
List structure	-5
Link annotations	-5

New Tier 2 checks (PDF/UA): max -28

Check	Max deduction
PDF/UA identifier	-10
Tab order	-5
Bookmarks	-5
Artifact marking	-5
Display title flag	-3

Scoring modes

type ScoringMode = 'quick' | 'full';

// quick: Tier 0 only (existing 6 checks) — fast, for freemium intake scoring
// full:  Tier 0 + 1 + 2 (all checks) — for evaluating generated PDFs

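Score remains max(0, 100 - deductions). Across all three tiers the possible deductions total 188 (115 + 45 + 28), so the result is clamped at 0 and the scale stays 0 to 100. The added checks make deductions more granular without changing the range, and a truly inaccessible PDF still drops quickly (no text + no tags = -60 already).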
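
The clamp can be illustrated with a few lines. The function name computeScore is hypothetical, and deductions are passed as positive numbers; the tier maxima come from the tables above.

```typescript
// Tier maxima from the scoring summary tables in this plan.
const MAX_DEDUCTIONS = { tier0: 115, tier1: 45, tier2: 28 };

// Sum the deductions and clamp the result at 0.
function computeScore(deductions: number[]): number {
  let total = 0;
  for (const d of deductions) total += d;
  return Math.max(0, 100 - total);
}

// No text (-40) plus no tags (-20) leaves 40 points.
const noTextNoTags = computeScore([40, 20]);

// Failing every check in every tier exceeds 100 and is clamped to 0.
const worstCase = computeScore([
  MAX_DEDUCTIONS.tier0,
  MAX_DEDUCTIONS.tier1,
  MAX_DEDUCTIONS.tier2,
]);
```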

Implementation

File changes

File	Change
`workers/api/src/services/pdf-accessibility-scorer.ts`	Add new check functions, add `mode` parameter
`@accessible-pdf/shared` types	Add new check IDs to `AccessibilityCheckResult` type

Function signature change

export function scorePdfAccessibility(
  pdfBytes: Uint8Array,
  options?: ScorerOptions & { mode?: ScoringMode },
): AccessibilityScore;

mode: 'quick' — existing 6 checks (default, backward compatible)
mode: 'full' — all checks including heading hierarchy, reading order, PDF/UA

New helper functions

// Tier 1
function validateHeadingHierarchy(pdfStr: string): CheckResult;
function validateReadingOrder(pdfStr: string): CheckResult;
function validateTableHeaderScope(pdfStr: string): CheckResult;
function validateListStructure(pdfStr: string): CheckResult;
function validateLinkAnnotations(pdfStr: string): CheckResult;

// Tier 2
function checkPdfUaIdentifier(pdfStr: string): CheckResult;
function checkTabOrder(pdfStr: string): CheckResult;
function checkBookmarks(pdfStr: string): CheckResult;
function checkArtifactMarking(pdfStr: string): CheckResult;
function checkDisplayTitle(pdfStr: string): CheckResult;

Each returns the same { passed, weight, deduction, detail } shape.
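
One way the mode switch could dispatch to these helpers, sketched with a hypothetical runChecks function and Check type. Quick mode runs only the Tier 0 list; full mode appends the Tier 1 and Tier 2 list. The default of 'quick' matches the backward-compatibility requirement; how the real scorer organizes its checks internally is not known here.

```typescript
type ScoringMode = 'quick' | 'full';

// Hypothetical result shape; deductions are stored as positive numbers.
interface CheckResult {
  passed: boolean;
  weight: number;
  deduction: number;
  detail: string;
}

type Check = (pdfStr: string) => CheckResult;

// Runs Tier 0 checks in quick mode, and Tier 0 plus the extended checks in full mode.
function runChecks(
  pdfStr: string,
  tier0: Check[],
  extended: Check[],
  mode: ScoringMode = 'quick',
): CheckResult[] {
  const checks = mode === 'full' ? tier0.concat(extended) : tier0;
  const results: CheckResult[] = [];
  for (const check of checks) {
    results.push(check(pdfStr));
  }
  return results;
}
```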

Usage for PDF Export Validation

Once built, the evaluator integrates into the PDF export pipeline:

1. Generate accessible HTML (existing pipeline)
2. Render HTML → tagged PDF via Puppeteer (generateAccessiblePdfFromHtml)
3. Run scorePdfAccessibility(pdfBytes, { mode: 'full' })
4. If score < 80:
   - Log deficiencies
   - Apply post-processing fixes (pdf-lib) for fixable issues:
     - Set /Lang if missing
     - Set /Title if missing
     - Add /DisplayDocTitle
     - Set /Tabs /S on pages
     - Add PDF/UA identifier to XMP metadata
   - Re-score after fixes
5. Return PDF with score metadata

This creates a feedback loop: generate, measure, fix, verify. The scorer tells us exactly where Chrome’s tagged PDF output falls short so we can target post-processing.
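
The loop in steps 3 to 5 might look like the sketch below. Everything here is hypothetical scaffolding: ScoreLike stands in for whatever AccessibilityScore exposes, deps.score stands in for scorePdfAccessibility(pdfBytes, { mode: 'full' }), and deps.applyFixes stands in for the pdf-lib post-processing. The fixer is modeled as synchronous for brevity; pdf-lib loads and saves documents asynchronously, so a real version would await those calls.

```typescript
// Hypothetical minimal view of the scorer's return value.
interface ScoreLike {
  score: number;
}

// Injected dependencies so the loop can be exercised without real PDFs.
interface ExportDeps {
  score: (pdf: Uint8Array) => ScoreLike;
  applyFixes: (pdf: Uint8Array) => Uint8Array;
}

interface ExportResult {
  pdf: Uint8Array;
  score: number;
  fixed: boolean;
}

// Score, fix once if below the threshold, then re-score.
function validateAndFix(pdf: Uint8Array, deps: ExportDeps, threshold = 80): ExportResult {
  const initial = deps.score(pdf);
  if (initial.score >= threshold) {
    return { pdf, score: initial.score, fixed: false };
  }
  const fixedPdf = deps.applyFixes(pdf);
  const rescored = deps.score(fixedPdf);
  return { pdf: fixedPdf, score: rescored.score, fixed: true };
}
```

Injecting the scorer and fixer keeps the control flow testable with fakes before the real post-processing exists.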

Testing

Unit tests

Test each check function with crafted PDF byte patterns
Test scoring with known-accessible PDFs (from PAC-verified sources)
Test scoring with known-inaccessible PDFs (untagged, no lang, etc.)
Verify backward compatibility: mode: 'quick' produces same results as current scorer

Integration tests

Generate a PDF via generateAccessiblePdfFromHtml() with known accessible HTML
Score it with mode: 'full'
Verify the score and identify any consistent gaps from Chrome’s output
Document which checks Chrome passes/fails so post-processing can target the gaps

Test fixtures

Create a set of PDF test files:

tagged-accessible.pdf — fully tagged, all checks pass
untagged.pdf — no structure tree
scanned.pdf — image-only
bad-headings.pdf — H1 → H3 skip
no-table-scope.pdf — TH without scope attributes
chrome-generated.pdf — output from generateAccessiblePdfFromHtml() to benchmark Chrome’s baseline

Effort Estimate

Phase	Effort
Tier 1 checks (heading, reading order, table scope, lists, links)	2-3 days
Tier 2 checks (PDF/UA, tab order, bookmarks, artifacts, display title)	1-2 days
Tests + fixtures	1-2 days
Integration with PDF export pipeline	1 day
Total	5-8 days

Sequencing

Build the enhanced scorer (this plan)
Run it against Chrome tagged: true output to identify gaps
Build targeted post-processing fixes for those gaps (pdf-lib)
Wire up the PDF export endpoint with score validation
Ship the “Export as Accessible PDF” feature