
PDF Accessibility Evaluator: Implementation Plan

Goal

Extend the existing pdf-accessibility-scorer.ts into a comprehensive PDF accessibility evaluator that validates not just the presence of accessibility features but their correctness. This is a prerequisite for the "export as accessible PDF" feature: we need to measure quality before we ship.

Current State

workers/api/src/services/pdf-accessibility-scorer.ts performs 6 structural checks via raw byte inspection:

| Check | Deduction | What it measures |
| --- | --- | --- |
| Extractable text | -40 | Is there text at all, or image-only? |
| Tag structure | -20 | Do /MarkInfo + /StructTreeRoot exist? |
| Image alt text | -15 each (max -30) | Do /S /Figure elements have /Alt? |
| Document language | -10 | Does /Lang exist? |
| Document title | -5 | Does /Title exist? |
| Table headers | -10 | Do /S /Table have /S /TH? |

Limitation: These are presence checks only. A PDF could have tags that are completely wrong (headings out of order, tables with no scope, reading order jumbled) and still score 100.

Design Principles

  1. No AI, no rendering: keep it fast and free (byte-level inspection only)
  2. Deduction-based scoring: same pattern as existing scorer
  3. Backward compatible: extend AccessibilityScore type, don't break callers
  4. Two modes: quick mode (existing 6 checks, for freemium) and full mode (all checks, for generated PDFs)
  5. Actionable results: each check explains what's wrong and how to fix it

New Checks

Tier 1: Structure Correctness (add to existing scorer)

1.1 Heading Hierarchy (-10)

Validate that heading tags follow a logical order without skips.

Scan for /S /H1, /S /H2, ... /S /H6 in the structure tree.
Build the sequence. Fail if:
- No H1 exists (document has no top-level heading)
- Levels are skipped (H1 → H3 with no H2)
- Multiple H1s (unless document has clear sections)

Byte pattern: Search for /S /H1 through /S /H6, record order of appearance.

Deduction: -10 if hierarchy is invalid, -5 if minor skip (e.g., H2 → H4).
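A minimal sketch of the heading check, under stated assumptions. The plan does not say where "invalid" ends and "minor skip" begins, so this sketch assumes that a missing H1 or a skip directly below H1 (H1 → H3) costs -10 and a skip deeper in the tree (H2 → H4) costs -5. The "multiple H1s" rule is left out because "clear sections" cannot be judged from byte patterns alone. The field types of CheckResult are also assumed.

```typescript
// Assumed shape, based on the plan's { passed, weight, deduction, detail }.
type CheckResult = { passed: boolean; weight: number; deduction: number; detail: string };

function validateHeadingHierarchy(pdfStr: string): CheckResult {
  const weight = 10;
  // Record heading levels in order of appearance in the raw bytes.
  const levels: number[] = [];
  const re = /\/S\s*\/H([1-6])(?![0-9A-Za-z])/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(pdfStr)) !== null) {
    levels.push(Number(m[1]));
  }

  if (levels.indexOf(1) === -1) {
    return { passed: false, weight, deduction: 10, detail: 'No H1 found. Add a top-level heading.' };
  }

  // Assumption: a skip right below H1 is "invalid", deeper skips are "minor".
  let major = false;
  let minor = false;
  let prev = 0;
  for (const cur of levels) {
    if (prev !== 0 && cur - prev > 1) {
      if (prev === 1) major = true;
      else minor = true;
    }
    prev = cur;
  }

  if (major) {
    return { passed: false, weight, deduction: 10, detail: 'Heading level skipped directly below H1.' };
  }
  if (minor) {
    return { passed: false, weight, deduction: 5, detail: 'Minor heading level skip below H2.' };
  }
  return { passed: true, weight, deduction: 0, detail: 'Heading levels are sequential.' };
}
```

Order of appearance in the file is only a proxy for structure tree order; a PDF writer is free to emit objects in any order, so a later version may need to walk /K arrays instead.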

1.2 Reading Order Validation (-15)

Check that the structure tree tag order is plausible. In a tagged PDF, the order of structure elements determines screen reader reading order.

Extract the sequence of /S /xxx tags from the structure tree.
Fail if:
- Document has tags but they appear in a clearly wrong order
(e.g., all tags appear in reverse or random page order)
- Content is tagged but paragraphs are interleaved from
different columns (multi-column layout issue)

Approach: Compare the order of marked-content IDs (MCIDs) against their page positions. If MCIDs on page 1 reference content that appears after page 2 content, reading order is suspect.

Deduction: -15 if reading order appears scrambled, -5 if minor issues.

Note: This is the hardest check to do via byte inspection alone. A simpler v1 heuristic: verify that structure tree elements reference MCIDs in ascending page order (page 1 MCIDs before page 2 MCIDs, etc.).
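The v1 heuristic can be sketched once the page index of each MCID reference is known. The sketch below assumes that extraction has already happened and takes the page indices in structure tree order. The function name scoreReadingOrder, the input shape, and the 10% threshold separating "minor" from "scrambled" are all assumptions, not part of the plan.

```typescript
// Assumed shape, based on the plan's { passed, weight, deduction, detail }.
type CheckResult = { passed: boolean; weight: number; deduction: number; detail: string };

// pagesInTreeOrder: for each MCID reference, in structure tree order,
// the 0-based index of the page it belongs to (hypothetical pre-extracted input).
function scoreReadingOrder(pagesInTreeOrder: number[]): CheckResult {
  const weight = 15;
  let backwardJumps = 0;
  let prev = -1;
  for (const page of pagesInTreeOrder) {
    if (page < prev) backwardJumps++; // tree order moved to an earlier page
    prev = page;
  }
  if (backwardJumps === 0) {
    return { passed: true, weight, deduction: 0, detail: 'MCIDs are referenced in ascending page order.' };
  }
  const transitions = Math.max(1, pagesInTreeOrder.length - 1);
  const ratio = backwardJumps / transitions;
  // Assumption: up to 10% backward jumps counts as "minor issues".
  if (ratio <= 0.1) {
    return { passed: false, weight, deduction: 5, detail: backwardJumps + ' backward page jump(s) in tag order.' };
  }
  return { passed: false, weight, deduction: 15, detail: 'Tag order does not follow page order; reading order looks scrambled.' };
}
```

This only catches cross-page problems. Interleaved columns on a single page pass unnoticed, which matches the plan's note that the full check is hard to do from bytes alone.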

1.3 Table Header Scope (-10)

Check that table header cells have proper scope attributes.

For each /S /TH element in the structure tree:
- Look for /A dictionary with /Scope attribute
- Valid values: /Column, /Row, /Both

Byte pattern: Find /S /TH, look for /Scope within the attribute dictionary.

Deduction: -10 if tables have TH without scope, -5 if some TH have scope but not all.
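A rough sketch of the scope check. It assumes each TH structure element is an uncompressed indirect object and that its attribute dictionary is written inline in that same object; a /A entry that is an indirect reference, or structure elements packed into object streams, would be missed. The splitObjects helper is hypothetical.

```typescript
// Assumed shape, based on the plan's { passed, weight, deduction, detail }.
type CheckResult = { passed: boolean; weight: number; deduction: number; detail: string };

// Hypothetical helper: returns the body of every "N G obj ... endobj" block.
function splitObjects(pdfStr: string): string[] {
  const bodies: string[] = [];
  const re = /\d+\s+\d+\s+obj\b([\s\S]*?)\bendobj/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(pdfStr)) !== null) {
    bodies.push(m[1]);
  }
  return bodies;
}

function validateTableHeaderScope(pdfStr: string): CheckResult {
  const weight = 10;
  const isTh = /\/S\s*\/TH(?![A-Za-z])/; // excludes /THead
  const hasScope = /\/Scope\s*\/(Column|Row|Both)(?![A-Za-z])/;
  let total = 0;
  let scoped = 0;
  for (const body of splitObjects(pdfStr)) {
    if (!isTh.test(body)) continue;
    total++;
    if (hasScope.test(body)) scoped++;
  }
  if (total === 0 || scoped === total) {
    return { passed: true, weight, deduction: 0, detail: scoped + ' of ' + total + ' TH cells declare /Scope.' };
  }
  const deduction = scoped === 0 ? 10 : 5;
  return {
    passed: false,
    weight,
    deduction,
    detail: (total - scoped) + ' of ' + total + ' TH cells lack /Scope. Add /Scope /Column, /Row, or /Both.',
  };
}
```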

1.4 List Structure (-5)

Validate that lists use proper L/LI/Lbl/LBody structure.

Search for /S /L (list), /S /LI (list item),
/S /Lbl (label/bullet), /S /LBody (list body).
If content appears to be a list (bullet characters in text)
but no /S /L tags exist, deduct.
If /S /L exists, verify it contains /S /LI children.

Byte pattern: Count /S /L, /S /LI, check for bullet characters (•, –, numbered patterns) in text without corresponding list tags.

Deduction: -5 if lists exist but aren't tagged.
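A possible shape for the list check. Page text is usually compressed or font-encoded inside content streams, so this sketch assumes the text has already been extracted and is passed in as a second argument, which differs from the single-argument signature planned later. The "at least two bullet-like lines" threshold and the -5 for an /L without /LI children are assumptions.

```typescript
// Assumed shape, based on the plan's { passed, weight, deduction, detail }.
type CheckResult = { passed: boolean; weight: number; deduction: number; detail: string };

function countMatches(text: string, re: RegExp): number {
  const found = text.match(re); // re must carry the g flag
  return found ? found.length : 0;
}

function validateListStructure(pdfStr: string, extractedText: string): CheckResult {
  const weight = 5;
  // The lookahead keeps /L from matching /LI, /Lbl, /LBody, or /Link.
  const listCount = countMatches(pdfStr, /\/S\s*\/L(?![A-Za-z])/g);
  const itemCount = countMatches(pdfStr, /\/S\s*\/LI(?![A-Za-z])/g);
  // Lines starting with a bullet, en dash, hyphen, or "1." / "1)" style numbering.
  const bulletLines = countMatches(extractedText, /^\s*(?:[\u2022\u2013-]|\d+[.)])\s+\S/gm);

  if (listCount > 0 && itemCount === 0) {
    return { passed: false, weight, deduction: 5, detail: '/L elements found without /LI children.' };
  }
  if (listCount === 0 && bulletLines >= 2) {
    return { passed: false, weight, deduction: 5, detail: 'Text looks like a list but no /L tags exist. Tag it as L/LI/Lbl/LBody.' };
  }
  return { passed: true, weight, deduction: 0, detail: listCount + ' list(s), ' + itemCount + ' item(s) tagged.' };
}
```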

1.5 Link Annotations (-5)

Check that hyperlinks are tagged as /Link with meaningful content.

Count /Subtype /Link annotations.
Check if corresponding /S /Link structure elements exist.

Byte pattern: Count /Subtype /Link vs /S /Link.

Deduction: -5 if link annotations exist but no /S /Link structure tags.
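The count comparison for links is small enough to sketch directly. As with the other checks, it assumes the relevant dictionaries are visible in the raw bytes, and the CheckResult field types are assumed.

```typescript
// Assumed shape, based on the plan's { passed, weight, deduction, detail }.
type CheckResult = { passed: boolean; weight: number; deduction: number; detail: string };

function validateLinkAnnotations(pdfStr: string): CheckResult {
  const weight = 5;
  const annotations = (pdfStr.match(/\/Subtype\s*\/Link(?![A-Za-z])/g) || []).length;
  const structTags = (pdfStr.match(/\/S\s*\/Link(?![A-Za-z])/g) || []).length;

  if (annotations > 0 && structTags === 0) {
    return {
      passed: false,
      weight,
      deduction: 5,
      detail: annotations + ' link annotation(s) but no /S /Link structure elements.',
    };
  }
  return {
    passed: true,
    weight,
    deduction: 0,
    detail: annotations + ' link annotation(s), ' + structTags + ' /Link tag(s).',
  };
}
```

This compares counts only; whether each link's tagged content is meaningful is not something the byte pattern can tell.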

Tier 2: PDF/UA Compliance (for generated PDFs)

2.1 PDF/UA Identifier (-10)

Check for PDF/UA conformance declaration in XMP metadata.

Search for pdfuaid:part in the metadata stream.
PDF/UA-1 requires: <pdfuaid:part>1</pdfuaid:part>

Byte pattern: Search for pdfuaid:part in the document.

Deduction: -10 if missing (required for PDF/UA compliance).
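A short sketch of the identifier check. Besides the element form shown above, it also accepts the attribute form (pdfuaid:part="1") that XMP serializers can emit; accepting that form is an assumption beyond what the plan specifies.

```typescript
// Assumed shape, based on the plan's { passed, weight, deduction, detail }.
type CheckResult = { passed: boolean; weight: number; deduction: number; detail: string };

function checkPdfUaIdentifier(pdfStr: string): CheckResult {
  const weight = 10;
  const asElement = /<pdfuaid:part>\s*(\d+)\s*<\/pdfuaid:part>/.exec(pdfStr);
  const asAttribute = /pdfuaid:part\s*=\s*["'](\d+)["']/.exec(pdfStr);
  const found = asElement || asAttribute;

  if (found) {
    return { passed: true, weight, deduction: 0, detail: 'Declares PDF/UA part ' + found[1] + '.' };
  }
  return {
    passed: false,
    weight,
    deduction: 10,
    detail: 'No pdfuaid:part in XMP metadata. Add <pdfuaid:part>1</pdfuaid:part>.',
  };
}
```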

2.2 Tab Order (-5)

Each page should specify /Tabs /S (structure order) so keyboard tab order follows the tag tree.

For each /Type /Page dictionary, check for /Tabs /S.

Byte pattern: Within page dictionaries, look for /Tabs /S.

Deduction: -5 if any page lacks /Tabs /S.
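One way to scope the search to page dictionaries is to split the file into indirect objects first. This sketch assumes page objects are stored uncompressed (not inside object streams); the inline object-splitting regex is a simplification, not a full parser.

```typescript
// Assumed shape, based on the plan's { passed, weight, deduction, detail }.
type CheckResult = { passed: boolean; weight: number; deduction: number; detail: string };

function checkTabOrder(pdfStr: string): CheckResult {
  const weight = 5;
  const objectRe = /\d+\s+\d+\s+obj\b([\s\S]*?)\bendobj/g;
  const isPage = /\/Type\s*\/Page(?![A-Za-z])/; // lookahead excludes /Pages
  const hasTabsS = /\/Tabs\s*\/S(?![A-Za-z])/;
  let pages = 0;
  let missing = 0;
  let m: RegExpExecArray | null;
  while ((m = objectRe.exec(pdfStr)) !== null) {
    const body = m[1];
    if (!isPage.test(body)) continue;
    pages++;
    if (!hasTabsS.test(body)) missing++;
  }
  if (missing > 0) {
    return { passed: false, weight, deduction: 5, detail: missing + ' of ' + pages + ' page(s) lack /Tabs /S.' };
  }
  return { passed: true, weight, deduction: 0, detail: pages + ' page(s) checked, all set /Tabs /S.' };
}
```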

2.3 Bookmarks / Document Outline (-5)

Check that the document has bookmarks derived from headings.

Look for /Type /Outlines in the catalog.
If the document has headings (H1-H6 tags) but no outlines,
deduct points.

Byte pattern: Search for /Type /Outlines or /Outlines in catalog dictionary.

Deduction: -5 if headings exist but no bookmarks.
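A minimal sketch of the bookmark check. It treats any /Outlines key as evidence of bookmarks and does not verify that the outline actually contains entries; that simplification is an assumption.

```typescript
// Assumed shape, based on the plan's { passed, weight, deduction, detail }.
type CheckResult = { passed: boolean; weight: number; deduction: number; detail: string };

function checkBookmarks(pdfStr: string): CheckResult {
  const weight = 5;
  const hasHeadings = /\/S\s*\/H[1-6](?![0-9A-Za-z])/.test(pdfStr);
  // Matches both "/Type /Outlines" and a catalog entry such as "/Outlines 5 0 R".
  const hasOutlines = /\/Outlines(?![A-Za-z])/.test(pdfStr);

  if (hasHeadings && !hasOutlines) {
    return { passed: false, weight, deduction: 5, detail: 'Headings exist but the document has no outline. Generate bookmarks from headings.' };
  }
  return { passed: true, weight, deduction: 0, detail: hasOutlines ? 'Document outline present.' : 'No headings, so no bookmarks expected.' };
}
```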

2.4 Artifact Marking (-5)

Decorative elements (headers, footers, page numbers, watermarks) should be marked as artifacts, not tagged as content.

Look for /Type /Pagination or BMC/BDC artifact operators
in content streams.
A document with many pages but no artifact markers likely
has repeated header/footer content polluting the tag tree.

Byte pattern: Search for /Artifact in content stream BDC operators.

Deduction: -5 if multi-page document has no artifact markers.
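A sketch of the artifact heuristic, with one important caveat: content streams are usually Flate-compressed, so the pattern below only finds artifact operators when the input string contains uncompressed or already-inflated stream data. That precondition, and counting /Type /Page objects to decide "multi-page", are assumptions.

```typescript
// Assumed shape, based on the plan's { passed, weight, deduction, detail }.
type CheckResult = { passed: boolean; weight: number; deduction: number; detail: string };

function checkArtifactMarking(pdfStr: string): CheckResult {
  const weight = 5;
  const pageCount = (pdfStr.match(/\/Type\s*\/Page(?![A-Za-z])/g) || []).length;
  // "/Artifact BMC", "/Artifact <<...>> BDC", or "/Artifact /Name BDC".
  const artifactRe = /\/Artifact\s*(?:<<[^>]*>>|\/[^\s\/<>\[\]()]+)?\s*(?:BDC|BMC)(?![A-Za-z])/;
  const hasArtifacts = artifactRe.test(pdfStr);

  if (pageCount > 1 && !hasArtifacts) {
    return {
      passed: false,
      weight,
      deduction: 5,
      detail: pageCount + ' pages but no artifact markers. Mark headers, footers, and page numbers as /Artifact.',
    };
  }
  return { passed: true, weight, deduction: 0, detail: hasArtifacts ? 'Artifact markers found.' : 'Single-page document; artifacts not required.' };
}
```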

2.5 Display Title Flag (-3)

The catalog should specify /ViewerPreferences << /DisplayDocTitle true >> so the title bar shows the document title instead of the filename.

Look for /DisplayDocTitle true in /ViewerPreferences.

Byte pattern: Search for /DisplayDocTitle followed by true.

Deduction: -3 if missing.
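The display title check reduces to a single pattern; a sketch, assuming the /ViewerPreferences dictionary is visible in the raw bytes:

```typescript
// Assumed shape, based on the plan's { passed, weight, deduction, detail }.
type CheckResult = { passed: boolean; weight: number; deduction: number; detail: string };

function checkDisplayTitle(pdfStr: string): CheckResult {
  const weight = 3;
  // Whitespace is required between the name and the boolean keyword.
  const enabled = /\/DisplayDocTitle\s+true(?![A-Za-z])/.test(pdfStr);
  if (enabled) {
    return { passed: true, weight, deduction: 0, detail: '/DisplayDocTitle is true.' };
  }
  return {
    passed: false,
    weight,
    deduction: 3,
    detail: 'Set /ViewerPreferences << /DisplayDocTitle true >> in the catalog.',
  };
}
```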


Scoring Summary

Existing checks (Tier 0): max -115

| Check | Max deduction |
| --- | --- |
| Extractable text | -40 |
| Tag structure | -20 |
| Image alt text | -30 |
| Document language | -10 |
| Document title | -5 |
| Table headers | -10 |

New Tier 1 checks: max -45

| Check | Max deduction |
| --- | --- |
| Heading hierarchy | -10 |
| Reading order | -15 |
| Table header scope | -10 |
| List structure | -5 |
| Link annotations | -5 |

New Tier 2 checks (PDF/UA): max -28

| Check | Max deduction |
| --- | --- |
| PDF/UA identifier | -10 |
| Tab order | -5 |
| Bookmarks | -5 |
| Artifact marking | -5 |
| Display title flag | -3 |

Scoring modes

type ScoringMode = 'quick' | 'full';
// quick: Tier 0 only (existing 6 checks), fast, for freemium intake scoring
// full: Tier 0 + 1 + 2 (all checks), for evaluating generated PDFs

Score remains max(0, 100 - deductions). Full mode can deduct up to 188 points in total (115 + 45 + 28), but the score stays in the 0-100 range because it is clamped at 0. The extra checks make deductions more granular; a truly inaccessible PDF still loses points quickly (no text + no tags = -60 already).
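The clamping arithmetic, as a small worked example. The computeScore name is hypothetical; deductions are passed as positive numbers.

```typescript
// Score = max(0, 100 - sum of deductions).
function computeScore(deductions: number[]): number {
  let total = 0;
  for (const d of deductions) total += d;
  return Math.max(0, 100 - total);
}

// No extractable text (-40) and no tag structure (-20):
const untaggedScan = computeScore([40, 20]); // 40

// Every check failing at its maximum: 115 + 45 + 28 = 188, clamped to 0.
const worstCase = computeScore([115, 45, 28]); // 0

// A generated PDF missing only tab order (-5) and the display title flag (-3):
const nearMiss = computeScore([5, 3]); // 92
```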


Implementation

File changes

| File | Change |
| --- | --- |
| workers/api/src/services/pdf-accessibility-scorer.ts | Add new check functions, add mode parameter |
| @accessible-pdf/shared types | Add new check IDs to AccessibilityCheckResult type |

Function signature change

export function scorePdfAccessibility(
  pdfBytes: Uint8Array,
  options?: ScorerOptions & { mode?: 'quick' | 'full' },
): AccessibilityScore;

  • mode: 'quick': existing 6 checks (default, backward compatible)
  • mode: 'full': all checks including heading hierarchy, reading order, PDF/UA

New helper functions

// Tier 1
function validateHeadingHierarchy(pdfStr: string): CheckResult;
function validateReadingOrder(pdfStr: string): CheckResult;
function validateTableHeaderScope(pdfStr: string): CheckResult;
function validateListStructure(pdfStr: string): CheckResult;
function validateLinkAnnotations(pdfStr: string): CheckResult;
// Tier 2
function checkPdfUaIdentifier(pdfStr: string): CheckResult;
function checkTabOrder(pdfStr: string): CheckResult;
function checkBookmarks(pdfStr: string): CheckResult;
function checkArtifactMarking(pdfStr: string): CheckResult;
function checkDisplayTitle(pdfStr: string): CheckResult;

Each returns the same { passed, weight, deduction, detail } shape.
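How the mode switch might select checks, as a sketch. The Check wrapper, the scoreWithMode name, and the returned { score, checks } object are hypothetical stand-ins; the real function takes pdfBytes and returns the existing AccessibilityScore type, whose fields are not shown in this plan.

```typescript
// Assumed shape, based on the plan's { passed, weight, deduction, detail }.
type CheckResult = { passed: boolean; weight: number; deduction: number; detail: string };
type ScoringMode = 'quick' | 'full';
// Hypothetical wrapper pairing a check ID with its function.
type Check = { id: string; run: (pdfStr: string) => CheckResult };

function scoreWithMode(
  pdfStr: string,
  tier0: Check[],
  tier1And2: Check[],
  options: { mode?: ScoringMode } = {},
): { score: number; checks: Array<CheckResult & { id: string }> } {
  // Default to 'quick' so existing callers keep their current behavior.
  const mode: ScoringMode = options.mode || 'quick';
  const active = mode === 'full' ? tier0.concat(tier1And2) : tier0;

  let total = 0;
  const checks: Array<CheckResult & { id: string }> = [];
  for (const check of active) {
    const r = check.run(pdfStr);
    total += r.deduction;
    checks.push({ id: check.id, passed: r.passed, weight: r.weight, deduction: r.deduction, detail: r.detail });
  }
  return { score: Math.max(0, 100 - total), checks };
}
```

Keeping the Tier 0 list untouched and only appending to it in full mode is what makes the quick-mode output comparable to the current scorer.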


Usage for PDF Export Validation

Once built, the evaluator integrates into the PDF export pipeline:

1. Generate accessible HTML (existing pipeline)
2. Render HTML → tagged PDF via Puppeteer (generateAccessiblePdfFromHtml)
3. Run scorePdfAccessibility(pdfBytes, { mode: 'full' })
4. If score < 80:
   - Log deficiencies
   - Apply post-processing fixes (pdf-lib) for fixable issues:
     - Set /Lang if missing
     - Set /Title if missing
     - Add /DisplayDocTitle
     - Set /Tabs /S on pages
     - Add PDF/UA identifier to XMP metadata
   - Re-score after fixes
5. Return PDF with score metadata

This creates a feedback loop: generate, measure, fix, verify. The scorer tells us exactly where Chrome's tagged PDF output falls short so we can target post-processing.
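The measure, fix, verify part of the loop could look roughly like this. The scorer and fixers are injected so the sketch stays self-contained; validateAndFix, ScoreSummary, and the check IDs are hypothetical, and the fixers are synchronous here for brevity even though real pdf-lib based fixers would be async.

```typescript
// Hypothetical summary of a scoring run: the score plus IDs of failed checks.
type ScoreSummary = { score: number; failedIds: string[] };
type Scorer = (pdf: Uint8Array) => ScoreSummary;
type Fixer = (pdf: Uint8Array) => Uint8Array;

function validateAndFix(
  pdf: Uint8Array,
  scorer: Scorer,
  fixers: { [checkId: string]: Fixer },
  threshold: number = 80,
): { pdf: Uint8Array; before: ScoreSummary; after: ScoreSummary; applied: string[] } {
  const before = scorer(pdf);
  const applied: string[] = [];
  if (before.score >= threshold) {
    return { pdf, before, after: before, applied };
  }

  // Apply a fixer for each failed check that has one; the rest are only reported.
  let current = pdf;
  for (const id of before.failedIds) {
    const fix = fixers[id];
    if (fix) {
      current = fix(current);
      applied.push(id);
    }
  }

  // Re-score so the returned metadata reflects the post-processed file.
  const after = scorer(current);
  return { pdf: current, before, after, applied };
}
```

Returning both the before and after summaries makes it easy to log which deficiencies came from Chrome's output and which survived post-processing.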


Testing

Unit tests

  • Test each check function with crafted PDF byte patterns
  • Test scoring with known-accessible PDFs (from PAC-verified sources)
  • Test scoring with known-inaccessible PDFs (untagged, no lang, etc.)
  • Verify backward compatibility: mode: 'quick' produces same results as current scorer

Integration tests

  • Generate a PDF via generateAccessiblePdfFromHtml() with known accessible HTML
  • Score it with mode: 'full'
  • Verify the score and identify any consistent gaps from Chrome's output
  • Document which checks Chrome passes/fails so post-processing can target the gaps

Test fixtures

Create a set of PDF test files:

  • tagged-accessible.pdf: fully tagged, all checks pass
  • untagged.pdf: no structure tree
  • scanned.pdf: image-only
  • bad-headings.pdf: H1 → H3 skip
  • no-table-scope.pdf: TH without scope attributes
  • chrome-generated.pdf: output from generateAccessiblePdfFromHtml() to benchmark Chrome's baseline

Effort Estimate

| Phase | Effort |
| --- | --- |
| Tier 1 checks (heading, reading order, table scope, lists, links) | 2-3 days |
| Tier 2 checks (PDF/UA, tab order, bookmarks, artifacts, display title) | 1-2 days |
| Tests + fixtures | 1-2 days |
| Integration with PDF export pipeline | 1 day |
| Total | 5-8 days |

Sequencing

  1. Build the enhanced scorer (this plan)
  2. Run it against Chrome tagged: true output to identify gaps
  3. Build targeted post-processing fixes for those gaps (pdf-lib)
  4. Wire up the PDF export endpoint with score validation
  5. Ship the "Export as Accessible PDF" feature