Build "Accessible Forms": PDF-to-HTML Form Conversion Product

Overview

Build a new product called Accessible Forms within the existing accessible monorepo at /Users/larryanglin/Projects/accessible/. This product converts PDF forms (AcroForms and XFA) into accessible, functional HTML forms with WCAG 2.2 AA compliance. It lives at forms.theaccessible.org (standalone) and theaccessible.org/forms (marketing entry point).

The key innovation over the existing premium-form-converter is a hybrid approach: programmatic extraction of PDF form structure (field types, names, options, values, positions, validation rules) combined with vision-model refinement for layout and styling. This replaces the current 100%-vision approach that guesses form structure from pixels.


Architecture & Stack

Follow the exact same patterns as the existing apps (links, music, photos). This is a monorepo; do not create a separate repository.

New files/directories to create:

apps/forms/ # Next.js 14 App Router frontend
workers/api/src/routes/forms.ts # Cloudflare Worker API routes (light operations)
workers/api/src/services/acroform-extractor.ts # AcroForm field extraction
workers/api/src/services/xfa-extractor.ts # XFA form extraction
workers/api/src/services/form-field-mapper.ts # Map extracted fields → HTML
workers/api/src/services/form-hybrid-converter.ts # Hybrid pipeline orchestrator
packages/shared/src/form-types.ts # Shared form domain types
supabase/migrations/YYYYMMDD_forms_*.sql # Database tables

Extend existing:

workers/api/src/index.ts # Mount new /api/forms/* routes
workers/api/src/types/env.ts # Add R2_FORMS_BUCKET binding
workers/api/wrangler.toml # Add R2 bucket binding
packages/shared/src/index.ts # Export new form types

Stack (matching existing products):

Layer            | Technology                                   | Notes
-----------------|----------------------------------------------|------------------------------------------------------------
Frontend         | Next.js 14, App Router, TailwindCSS          | apps/forms/
UI Library       | @anglinai/ui + @accessible-org/ui            | CorporateHeader, CorporateFooter, ThemeProvider
Auth             | Supabase Auth (same instance as other apps)  | Google + email/password, shared auth-context.tsx pattern
Database         | Supabase PostgreSQL (same instance)          | New tables for form jobs, form field metadata
Light API        | Cloudflare Workers (Hono)                    | Extend existing workers/api/: add routes/forms.ts
Heavy Processing | Existing Node.js worker                      | Extend with form-specific endpoints; already has Puppeteer, pdf-lib, unpdf
Storage          | Cloudflare R2                                | New bucket accessible-forms for uploaded PDFs + output HTML
AI/Vision        | Claude API (Anthropic)                       | For vision refinement passes (2-3 iterations, not 8)
Payments         | Stripe (existing credit system)              | Same credit_balances / credit_transactions tables

Phase 1: AcroForm Extractor (acroform-extractor.ts)

This is the foundational service. Build it first.

What it does:

Uses pdf-lib (already installed) to read the AcroForm dictionary from a PDF and extract structured metadata for every field.

Output type (FormField in packages/shared/src/form-types.ts):

export interface FormField {
  /** Unique field name from PDF (e.g., "topmostSubform[0].Page1[0].f1_01[0]") */
  name: string;
  /** Human-readable alternate name / tooltip (from /TU entry) */
  alternativeName?: string;
  /** Field type */
  type: 'text' | 'checkbox' | 'radio' | 'dropdown' | 'listbox' | 'signature' | 'button' | 'barcode';
  /** Current value (pre-filled data) */
  value?: string | boolean | string[];
  /** Default value */
  defaultValue?: string | boolean | string[];
  /** For dropdowns/listboxes: available options */
  options?: { displayValue: string; exportValue: string }[];
  /** Bounding box in PDF coordinates [x1, y1, x2, y2] */
  rect: [number, number, number, number];
  /** 1-based page number */
  page: number;
  /** Tab order index (if specified in PDF) */
  tabIndex?: number;
  /** Validation constraints */
  validation: {
    required: boolean;
    readOnly: boolean;
    maxLength?: number;
    /** Format category from PDF actions (e.g., 'date', 'number', 'ssn', 'zip', 'phone', 'email') */
    formatType?: string;
    /** Raw format mask/pattern */
    formatMask?: string;
  };
  /** For radio buttons: the group name (all radios in group share this) */
  radioGroupName?: string;
  /** For radio buttons: this button's export value within the group */
  radioExportValue?: string;
  /** Font info from default appearance string */
  appearance?: {
    fontSize?: number;
    fontName?: string;
    textColor?: string;
    alignment?: 'left' | 'center' | 'right';
  };
  /** Calculation script (if field is calculated) */
  calculationScript?: string;
}

export interface FormExtractionResult {
  fields: FormField[];
  /** Total pages in the PDF */
  pageCount: number;
  /** Whether this PDF uses XFA (vs AcroForm) */
  isXFA: boolean;
  /** Page dimensions for coordinate mapping */
  pageDimensions: { page: number; width: number; height: number }[];
  /** Document-level metadata */
  metadata: {
    title?: string;
    author?: string;
    language?: string;
  };
  /** Warnings encountered during extraction */
  warnings: string[];
}

Implementation notes:

  • Use pdf-lib’s PDFDocument.load() and traverse the AcroForm dictionary
  • Access field widgets via doc.catalog.lookup(PDFName.of('AcroForm')) and iterate the /Fields array
  • Each field’s /FT (field type) maps to: /Tx → text, /Btn → checkbox/radio/button, /Ch → dropdown/listbox, /Sig → signature
  • Distinguish /Btn subtypes via the /Ff flags (bit 16 = radio, bit 17 = pushbutton, neither = checkbox) and /Ch subtypes via bit 18 (set = dropdown, clear = listbox)
  • Extract /Opt array for dropdown/listbox options (each entry may be a string or [exportValue, displayValue] pair)
  • Extract /V (current value), /DV (default value), /TU (tooltip/alt name), /Rect (position)
  • Parse /AA (additional actions) for calculation and validation scripts
  • Parse /DA (default appearance) for font/size/color
  • Extract /MaxLen for text field max length
  • Check /Ff flag bits: bit 1 = readOnly, bit 2 = required
  • Handle field hierarchies (parent/child fields in the AcroForm tree): the fully qualified field name joins each ancestor’s /T with dots, and /FT and /Ff may be inherited from a parent field
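
The flag and option rules in the notes above can be sketched as pure functions, independent of how the dictionary values are read out of pdf-lib. This is a minimal sketch, not the final extractor: the function names are hypothetical, and the pushbutton (bit 17) and combo (bit 18) masks come from the PDF specification rather than from this document.

```typescript
// Bit positions in the PDF spec are 1-based, so bit N is the mask 1 << (N - 1).
const FF_READ_ONLY = 1 << 0;   // bit 1
const FF_REQUIRED = 1 << 1;    // bit 2
const FF_RADIO = 1 << 15;      // bit 16
const FF_PUSHBUTTON = 1 << 16; // bit 17 (PDF spec; assumption relative to this brief)
const FF_COMBO = 1 << 17;      // bit 18 (PDF spec; assumption relative to this brief)

type ExtractedType =
  | 'text' | 'checkbox' | 'radio' | 'dropdown' | 'listbox' | 'signature' | 'button';

/** Map a field's /FT name (without the slash) and /Ff integer to a field type. */
function classifyFieldType(ft: string, ff: number): ExtractedType | undefined {
  switch (ft) {
    case 'Tx':
      return 'text';
    case 'Sig':
      return 'signature';
    case 'Ch':
      return (ff & FF_COMBO) !== 0 ? 'dropdown' : 'listbox';
    case 'Btn':
      if ((ff & FF_PUSHBUTTON) !== 0) return 'button';
      return (ff & FF_RADIO) !== 0 ? 'radio' : 'checkbox';
    default:
      return undefined; // unknown /FT: the caller should record a warning
  }
}

/** Decode the two validation-related bits of /Ff. */
function decodeValidationFlags(ff: number): { readOnly: boolean; required: boolean } {
  return {
    readOnly: (ff & FF_READ_ONLY) !== 0,
    required: (ff & FF_REQUIRED) !== 0,
  };
}

type RawOption = string | [string, string];

/** Normalize /Opt entries: a plain string, or an [exportValue, displayValue] pair. */
function normalizeOptions(opt: RawOption[]): { displayValue: string; exportValue: string }[] {
  return opt.map((entry) =>
    typeof entry === 'string'
      ? { displayValue: entry, exportValue: entry }
      : { exportValue: entry[0], displayValue: entry[1] },
  );
}
```

Keeping these helpers free of pdf-lib types makes them easy to unit test without PDF fixtures; the extractor would call them after resolving inherited /FT and /Ff values.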

Test coverage:

  • Write tests using real-world PDF form fixtures (create small test PDFs with pdf-lib that have each field type)
  • Test: text fields, checkboxes, radio groups, dropdowns with options, signature fields, required fields, read-only fields, pre-filled values, multi-page forms, nested field hierarchies

Phase 2: XFA Extractor (xfa-extractor.ts)

What it does:

Reads the XFA stream from the PDF catalog, parses the XML, and extracts field definitions into the same FormField[] structure.

Implementation notes:

  • Check for /XFA key in the PDF catalog’s AcroForm dictionary
  • XFA data is stored as XML streams (may be segmented: template, datasets, config, localeSet)
  • The template XML contains field definitions: <field>, <subform>, <draw>, <exclGroup> (radio groups)
  • Parse with a fast XML parser (add fast-xml-parser as a dependency)
  • Map XFA field types to our FormField.type:
    • <field> with <ui><textEdit> → text
    • <field> with <ui><checkButton> → checkbox
    • <field> with <ui><choiceList> → dropdown or listbox
    • <field> with <ui><dateTimeEdit> → text with formatType ‘date’
    • <field> with <ui><signature> → signature
    • <exclGroup> → radio group
  • Extract <items> children for dropdown options
  • Extract <validate> elements for validation rules
  • Extract <calculate> elements for calculated fields
  • Map XFA coordinate system to page coordinates using <contentArea> dimensions
  • Handle dynamic XFA (growable subforms, repeatable rows): flag these in warnings since HTML can’t fully replicate dynamic XFA behavior
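
The type mapping listed above might be sketched as a small lookup that runs after the template XML has been parsed. The function and parameter names are hypothetical. Choosing between dropdown and listbox from the <choiceList> open attribute ('always' and 'multiSelect' treated as listbox) is an assumption based on a reading of the XFA specification and should be verified against real forms.

```typescript
type XfaMappedType = 'text' | 'checkbox' | 'radio' | 'dropdown' | 'listbox' | 'signature';

interface XfaMapping {
  type: XfaMappedType;
  formatType?: string;
}

/**
 * uiElement: name of the child element of <ui>, e.g. 'textEdit'.
 * insideExclGroup: true when the <field> is a child of an <exclGroup>.
 * choiceListOpen: value of the open attribute on <choiceList>, if present.
 */
function mapXfaUi(
  uiElement: string,
  insideExclGroup: boolean,
  choiceListOpen?: string,
): XfaMapping | undefined {
  switch (uiElement) {
    case 'textEdit':
      return { type: 'text' };
    case 'dateTimeEdit':
      return { type: 'text', formatType: 'date' };
    case 'signature':
      return { type: 'signature' };
    case 'checkButton':
      // Check buttons grouped under <exclGroup> behave as one radio group.
      return { type: insideExclGroup ? 'radio' : 'checkbox' };
    case 'choiceList': {
      // Assumption: a list that is always displayed maps to a listbox.
      const alwaysOpen = choiceListOpen === 'always' || choiceListOpen === 'multiSelect';
      return { type: alwaysOpen ? 'listbox' : 'dropdown' };
    }
    default:
      return undefined; // unsupported widget: the caller should add a warning
  }
}
```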

XFA detection in preflight:

Update pdf-preflight.ts to detect and flag XFA forms separately from AcroForms. Set isXFA: true in the extraction result.


Phase 3: Form Field Mapper (form-field-mapper.ts)

What it does:

Takes a FormField[] array and generates a skeleton HTML form with correct semantic elements, field types, attributes, groupings, and basic CSS positioning.

Output:

A complete <form> HTML string with:

  • Proper <input>, <select>, <textarea> elements matching field types
  • <label for="id"> associations (using alternativeName or name as label text)
  • <fieldset> + <legend> wrapping radio/checkbox groups
  • Pre-filled value, checked, selected attributes from extracted data
  • required, readonly, maxlength, pattern, type (email/date/tel/number) from validation
  • autocomplete attributes based on field name heuristics (name, email, phone, address, etc.)
  • inputmode attributes for mobile keyboards
  • tabindex matching PDF tab order
  • CSS positioning derived from field rect coordinates mapped to relative page layout
  • Signature fields rendered with a clear “Sign here” visual treatment and role="img" or canvas placeholder
  • DOM order matching visual reading order (top-to-bottom, left-to-right within rows)
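
As an illustration of the label association and attribute rules above, the sketch below renders one text field or dropdown. It uses a reduced, hypothetical MiniField type rather than the full FormField, and the id scheme (an "f-" prefix plus a sanitized field name) is an assumption; a real mapper would also need to guarantee that ids are unique.

```typescript
interface MiniField {
  name: string;
  label: string;
  type: 'text' | 'dropdown';
  value?: string;
  required?: boolean;
  maxLength?: number;
  options?: { displayValue: string; exportValue: string }[];
}

/** Escape text for use in HTML content and double-quoted attributes. */
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

/** Derive an id that is safe in HTML and CSS selectors from a PDF field name. */
function toElementId(fieldName: string): string {
  return 'f-' + fieldName.replace(/[^A-Za-z0-9_-]+/g, '-').replace(/^-+|-+$/g, '');
}

/** Render a label plus its control; the label's "for" matches the control's id. */
function renderField(field: MiniField): string {
  const id = toElementId(field.name);
  const common =
    ` id="${id}" name="${escapeHtml(field.name)}"` + (field.required ? ' required' : '');
  const label = `<label for="${id}">${escapeHtml(field.label)}</label>`;
  if (field.type === 'dropdown') {
    const options = (field.options || []).map((o) => {
      const selected = o.exportValue === field.value ? ' selected' : '';
      return `<option value="${escapeHtml(o.exportValue)}"${selected}>${escapeHtml(o.displayValue)}</option>`;
    });
    return `${label}<select${common}>${options.join('')}</select>`;
  }
  const maxLength = field.maxLength !== undefined ? ` maxlength="${field.maxLength}"` : '';
  const value = field.value !== undefined ? ` value="${escapeHtml(field.value)}"` : '';
  return `${label}<input type="text"${common}${maxLength}${value}>`;
}
```

The original PDF field name is kept in the name attribute so that later stages can check it against the extracted FormField[] data.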

Field name → label heuristic:

PDF field names are often cryptic (f1_01, topmostSubform[0].Page1[0].SSN[0]). Use the alternativeName (tooltip) first. If unavailable, apply heuristics:

  • Strip topmostSubform[0].PageN[0]. prefixes
  • Convert camelCase/PascalCase to spaces
  • Strip trailing [0] array indices
  • Flag fields with no usable label text β€” these will need vision-model label extraction
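
A minimal sketch of the fallback heuristics above, for the case where no alternativeName is available. The function name is hypothetical, and the rule that treats generated names such as f1_01 as having no usable label is an assumption added for illustration; a null result marks the field for vision-model label extraction.

```typescript
/** Returns a label candidate, or null when the field name carries no usable text. */
function labelFromFieldName(fieldName: string): string | null {
  const leaf = fieldName
    .replace(/^topmostSubform\[\d+\]\./, '') // strip the root subform prefix
    .replace(/^Page\d+\[\d+\]\./, '')        // strip the PageN[0]. prefix
    .replace(/(\[\d+\])+$/, '');             // strip trailing array indices
  // Assumption: names like f1_01 or c2_3 are generated and carry no label text.
  if (leaf === '' || /^[a-z]\d+(_\d+)*$/i.test(leaf)) return null;
  const words = leaf
    .replace(/\[\d+\]/g, '')                    // drop any remaining indices
    .replace(/[._]+/g, ' ')                     // separators become spaces
    .replace(/([a-z0-9])([A-Z])/g, '$1 $2')     // camelCase boundary
    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2')  // acronym followed by a word
    .replace(/\s+/g, ' ')
    .trim();
  if (words === '') return null;
  return words.charAt(0).toUpperCase() + words.slice(1);
}
```

For example, under these rules "applicantFirstName" becomes "Applicant First Name", while "topmostSubform[0].Page1[0].f1_01[0]" yields null.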

Coordinate mapping:

  • PDF coordinates: origin at bottom-left, units in points (1/72 inch)
  • HTML coordinates: origin at top-left, units in pixels
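  • Convert: htmlX = x1 * scale, htmlY = (pageHeight - y2) * scale, where y2 is the rect’s upper edge in PDF space (it becomes the element’s top in HTML); scale = 96/72 maps points to CSS pixels at 100% zoom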
  • Group fields on the same horizontal band into flex rows
  • Use relative positioning within a page container, not absolute positioning
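
The conversion and row grouping above could look roughly like this. The names and the tolerance parameter are hypothetical; the sketch normalizes each rect so it does not matter which corner the PDF lists first, and it groups boxes into a row when their top edges are within the given tolerance of the row's first box.

```typescript
type PdfRect = [number, number, number, number]; // [x1, y1, x2, y2] in points

interface HtmlBox {
  left: number;
  top: number;
  width: number;
  height: number;
}

/** Convert a PDF rect (origin bottom-left) to a box with a top-left origin. */
function pdfRectToHtmlBox(rect: PdfRect, pageHeight: number, scale: number): HtmlBox {
  const [x1, y1, x2, y2] = rect;
  const upperY = Math.max(y1, y2); // the top edge has the larger PDF y value
  return {
    left: Math.min(x1, x2) * scale,
    top: (pageHeight - upperY) * scale,
    width: Math.abs(x2 - x1) * scale,
    height: Math.abs(y2 - y1) * scale,
  };
}

/** Group boxes into rows (top to bottom), each row sorted left to right. */
function groupIntoRows(boxes: HtmlBox[], tolerance: number): HtmlBox[][] {
  const sorted = boxes.slice().sort((a, b) => a.top - b.top || a.left - b.left);
  const rows: HtmlBox[][] = [];
  for (const box of sorted) {
    const current = rows[rows.length - 1];
    if (current !== undefined && Math.abs(box.top - current[0].top) <= tolerance) {
      current.push(box);
    } else {
      rows.push([box]);
    }
  }
  for (const row of rows) {
    row.sort((a, b) => a.left - b.left);
  }
  return rows;
}
```

Each resulting row can then be emitted as one flex container, which keeps DOM order aligned with visual reading order.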

Phase 4: Hybrid Converter (form-hybrid-converter.ts)

What it does:

Orchestrates the two-phase pipeline:

Phase A - Structural extraction (programmatic, fast, cheap):

  1. Run AcroForm or XFA extractor → get FormField[]
  2. Run form-field-mapper → generate skeleton HTML form
  3. This skeleton has correct field types, names, groups, options, values, validation, but may have imperfect labels and layout

Phase B - Vision refinement (LLM, 1-3 iterations):

  1. Render skeleton HTML in browser → screenshot
  2. Send to Claude: [Original PDF] + [Screenshot] + [Skeleton HTML] + [Extracted FormField[] JSON]
  3. Prompt: “The skeleton HTML was generated from programmatic extraction. The field types, names, options, and values are correct. Your job is to: (a) fix label text using the PDF as reference, (b) adjust layout/alignment to match the PDF, (c) add section headings and visual structure, (d) improve CSS styling. Do NOT change field types, names, option lists, or values; those are authoritative.”
  4. Iterate until NO_CHANGES_NEEDED or max 3 iterations
  5. Run axe-core WCAG 2.2 AA validation → remediation pass if needed
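
Because the prompt tells the model that field names are authoritative, it may be worth verifying that instruction after each pass instead of trusting it. The sketch below shows one possible guard plus the loop's stop condition. All names are hypothetical, and the regex-based attribute scan is a simplification (it only sees double-quoted name attributes and does not decode entities); a real implementation would more likely query the rendered DOM.

```typescript
const MAX_REFINEMENT_ITERATIONS = 3;

/** Collect the values of all double-quoted name="..." attributes in an HTML string. */
function collectNameAttributes(html: string): string[] {
  const names: string[] = [];
  const pattern = /\sname\s*=\s*"([^"]*)"/g;
  let match: RegExpExecArray | null = pattern.exec(html);
  while (match !== null) {
    names.push(match[1]);
    match = pattern.exec(html);
  }
  return names;
}

/** Return extracted field names that the refined HTML no longer contains. */
function findDroppedFieldNames(extractedNames: string[], refinedHtml: string): string[] {
  const present = collectNameAttributes(refinedHtml);
  return extractedNames.filter((fieldName) => present.indexOf(fieldName) === -1);
}

/**
 * Decide whether to run another pass.
 * completedIterations: number of refinement passes already finished.
 * modelReply: the model's latest reply (either new HTML or the sentinel).
 */
function shouldContinueRefining(completedIterations: number, modelReply: string): boolean {
  return (
    completedIterations < MAX_REFINEMENT_ITERATIONS &&
    modelReply.trim() !== 'NO_CHANGES_NEEDED'
  );
}
```

If findDroppedFieldNames returns a non-empty list, one option is to discard that iteration's output and keep the previous HTML, since the structural data is the part the pipeline trusts most.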

Key differences from existing premium-form-converter:

  • Starts from a structurally correct skeleton, not raw converted HTML
  • LLM only handles label text + layout + styling (not field type guessing)
  • 3 iterations max instead of 8 (structure is already right)
  • Passes extracted FormField[] JSON as context so the LLM knows what’s authoritative
  • Expected to be roughly 60-70% cheaper per form, mainly because fewer vision iterations are needed

Fork the existing code:

Copy premium-form-converter.ts as a starting point. Replace the iteration prompt with the hybrid-specific prompt described above. Keep the axe-core validation pass, progress callbacks, and cost tracking.


Phase 5: Database Schema

Create a migration file supabase/migrations/YYYYMMDD_forms_tables.sql:

-- Form conversion jobs
CREATE TABLE public.form_conversions (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
  original_name TEXT NOT NULL,
  file_size_bytes BIGINT NOT NULL,
  page_count INTEGER,
  field_count INTEGER,
  is_xfa BOOLEAN DEFAULT FALSE,
  -- R2 storage keys
  input_r2_key TEXT NOT NULL,
  skeleton_r2_key TEXT,
  output_r2_key TEXT,
  -- Status tracking
  status TEXT NOT NULL DEFAULT 'pending' CHECK (status IN ('pending', 'extracting', 'mapping', 'refining', 'validating', 'completed', 'failed')),
  progress INTEGER DEFAULT 0 CHECK (progress >= 0 AND progress <= 100),
  phase TEXT,
  error TEXT,
  -- Conversion metrics
  extraction_duration_ms INTEGER,
  refinement_iterations INTEGER,
  total_duration_ms INTEGER,
  input_tokens INTEGER,
  output_tokens INTEGER,
  estimated_cost_usd NUMERIC(10,6),
  credits_charged INTEGER,
  -- Quality metrics
  wcag_violations_found INTEGER,
  wcag_violations_fixed INTEGER,
  fields_extracted INTEGER,
  fields_in_output INTEGER,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  completed_at TIMESTAMPTZ
);

-- Extracted form fields (for analytics and debugging)
CREATE TABLE public.form_fields (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  conversion_id UUID NOT NULL REFERENCES public.form_conversions(id) ON DELETE CASCADE,
  field_name TEXT NOT NULL,
  field_type TEXT NOT NULL,
  page_number INTEGER NOT NULL,
  has_label BOOLEAN DEFAULT FALSE,
  has_value BOOLEAN DEFAULT FALSE,
  has_options BOOLEAN DEFAULT FALSE,
  option_count INTEGER DEFAULT 0,
  is_required BOOLEAN DEFAULT FALSE,
  is_readonly BOOLEAN DEFAULT FALSE,
  rect JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Indexes
CREATE INDEX idx_form_conversions_user ON public.form_conversions(user_id, created_at DESC);
CREATE INDEX idx_form_conversions_status ON public.form_conversions(status);
CREATE INDEX idx_form_fields_conversion ON public.form_fields(conversion_id);

-- RLS
ALTER TABLE public.form_conversions ENABLE ROW LEVEL SECURITY;
ALTER TABLE public.form_fields ENABLE ROW LEVEL SECURITY;

CREATE POLICY "Users can view own conversions" ON public.form_conversions
  FOR SELECT USING (auth.uid() = user_id);
CREATE POLICY "Users can insert own conversions" ON public.form_conversions
  FOR INSERT WITH CHECK (auth.uid() = user_id);
CREATE POLICY "Users can update own conversions" ON public.form_conversions
  FOR UPDATE USING (auth.uid() = user_id);
CREATE POLICY "Users can view own form fields" ON public.form_fields
  FOR SELECT USING (
    EXISTS (SELECT 1 FROM public.form_conversions fc WHERE fc.id = conversion_id AND fc.user_id = auth.uid())
  );
-- Service role handles inserts/updates to form_fields (from the worker)

Phase 6: API Routes (workers/api/src/routes/forms.ts)

Mount at /api/forms/* in the existing Hono worker.

Endpoints:

POST   /api/forms/upload            Upload a PDF form → returns jobId
GET    /api/forms/:jobId            Get conversion status + metadata
GET    /api/forms/:jobId/download   Download converted HTML
GET    /api/forms/:jobId/fields     Get extracted field metadata (for debugging/preview)
DELETE /api/forms/:jobId            Delete a conversion and its R2 files
GET    /api/forms/history           List user's past conversions (paginated)
POST   /api/forms/:jobId/retry      Retry a failed conversion

Note: Hono runs handlers in registration order, so register GET /api/forms/history before GET /api/forms/:jobId; otherwise "history" is captured as a jobId.

Upload flow:

  1. Validate file (PDF, under 50MB, not encrypted)
  2. Run preflight to detect form type (AcroForm vs XFA) and field count
  3. Calculate credit cost: Math.ceil(pageCount * FORM_CREDIT_MULTIPLIER). Define FORM_CREDIT_MULTIPLIER = 3 in shared constants (cheaper than premium-form’s 8 because hybrid is more efficient)
  4. Check credit balance, deduct credits
  5. Store PDF in R2 at forms/{userId}/{jobId}/input.pdf
  6. Create form_conversions row with status ‘pending’
  7. Dispatch to Node worker for heavy processing (or use waitUntil for Cloudflare background)
  8. Return { jobId, status: 'pending', fieldCount, isXFA, creditsCharged }
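
Steps 1, 3, 4, and 5 of the upload flow reduce to a few pure checks that are easy to test in isolation. The sketch below is one way to express them; the function names and the reason codes are hypothetical, and reading "50MB" as 50 * 1024 * 1024 bytes is an assumption.

```typescript
const FORM_CREDIT_MULTIPLIER = 3;
const MAX_UPLOAD_BYTES = 50 * 1024 * 1024; // assumption: "50MB" read as 50 MiB

/** Credit cost for a form with the given page count. */
function calculateFormCredits(pageCount: number): number {
  if (pageCount < 1 || Math.floor(pageCount) !== pageCount) {
    throw new RangeError('pageCount must be a positive integer');
  }
  return Math.ceil(pageCount * FORM_CREDIT_MULTIPLIER);
}

/** R2 object key following the forms/{userId}/{jobId}/... layout. */
function r2Key(
  userId: string,
  jobId: string,
  file: 'input.pdf' | 'skeleton.html' | 'output.html',
): string {
  return 'forms/' + userId + '/' + jobId + '/' + file;
}

interface UploadDecision {
  ok: boolean;
  credits: number;
  reason?: string;
}

/** Decide whether an upload may proceed, before anything is written to R2. */
function decideUpload(
  fileSizeBytes: number,
  pageCount: number,
  creditBalance: number,
): UploadDecision {
  if (fileSizeBytes >= MAX_UPLOAD_BYTES) {
    return { ok: false, credits: 0, reason: 'file_too_large' };
  }
  const credits = calculateFormCredits(pageCount);
  if (creditBalance < credits) {
    return { ok: false, credits, reason: 'insufficient_credits' };
  }
  return { ok: true, credits };
}
```

With a multiplier of 3, a four-page form costs 12 credits, so a user holding 10 credits would be rejected with insufficient_credits before any storage or processing happens.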

Processing pipeline (runs async):

  1. Status → ‘extracting’: Run AcroForm or XFA extractor
  2. Status → ‘mapping’: Run form-field-mapper to generate skeleton HTML
  3. Store skeleton in R2 at forms/{userId}/{jobId}/skeleton.html
  4. Status → ‘refining’: Run hybrid converter (up to 3 vision iterations)
  5. Status → ‘validating’: Run axe-core WCAG validation + remediation
  6. Status → ‘completed’: Store final HTML in R2 at forms/{userId}/{jobId}/output.html
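
The status sequence above, together with the CHECK constraint on form_conversions.status, suggests a small state machine that the worker could consult before each update. This is a sketch under assumptions: the function names are hypothetical, the evenly spaced progress values are illustrative only, and allowing failed → pending models the retry endpoint.

```typescript
type ConversionStatus =
  | 'pending' | 'extracting' | 'mapping' | 'refining' | 'validating' | 'completed' | 'failed';

// Happy-path order of the pipeline; 'failed' sits outside this sequence.
const PIPELINE_ORDER: ConversionStatus[] = [
  'pending', 'extracting', 'mapping', 'refining', 'validating', 'completed',
];

/** True when moving from one status to another is allowed. */
function canTransition(from: ConversionStatus, to: ConversionStatus): boolean {
  if (from === 'failed') return to === 'pending'; // assumption: retry restarts the job
  if (from === 'completed') return false;         // terminal state
  if (to === 'failed') return true;               // any in-flight step may fail
  return PIPELINE_ORDER.indexOf(to) === PIPELINE_ORDER.indexOf(from) + 1;
}

/** Rough value for the progress column (0-100), evenly spaced across steps. */
function progressFor(current: ConversionStatus): number {
  const index = PIPELINE_ORDER.indexOf(current);
  if (index === -1) return 0;
  return Math.round((index / (PIPELINE_ORDER.length - 1)) * 100);
}
```

Rejecting out-of-order updates in one place helps keep the polling UI from ever seeing a job move backwards.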

Phase 7: Frontend (apps/forms/)

Structure:

Follow the exact same pattern as apps/music/ or apps/links/:

apps/forms/
├── src/
│   ├── app/
│   │   ├── layout.tsx  # ThemeProvider, AuthProvider, CorporateHeader, CorporateFooter
│   │   ├── page.tsx  # Landing page (marketing + upload CTA)
│   │   ├── globals.css  # @anglinai/ui theme imports + Tailwind
│   │   ├── auth/
│   │   │   └── callback/route.ts  # Supabase auth callback
│   │   ├── dashboard/
│   │   │   ├── page.tsx  # List of past conversions
│   │   │   └── [jobId]/
│   │   │       ├── page.tsx  # Conversion detail + download
│   │   │       └── preview/
│   │   │           └── page.tsx  # Live preview of converted form
│   │   ├── pricing/
│   │   │   └── page.tsx  # Credit packages + pricing
│   │   └── docs/
│   │       └── page.tsx  # Documentation / how it works
│   ├── components/
│   │   ├── layout/
│   │   │   ├── AppHeader.tsx  # CorporateHeader with forms nav links
│   │   │   ├── SiteFooter.tsx  # CorporateFooter
│   │   │   └── ServiceBanner.tsx
│   │   ├── upload/
│   │   │   ├── FormDropZone.tsx  # Drag-and-drop PDF upload
│   │   │   └── UploadProgress.tsx
│   │   ├── conversion/
│   │   │   ├── ConversionStatus.tsx  # Real-time status with progress bar
│   │   │   ├── FieldPreview.tsx  # Show extracted fields before conversion
│   │   │   └── FormPreview.tsx  # Iframe preview of converted HTML
│   │   └── dashboard/
│   │       └── ConversionHistory.tsx  # Table of past conversions
│   ├── lib/
│   │   ├── supabase.ts  # Supabase client (copy from music/links)
│   │   ├── auth-context.tsx  # Auth context (copy from music/links)
│   │   ├── api.ts  # API client for /api/forms/*
│   │   └── strings.ts  # i18n string keys
│   ├── hooks/
│   │   ├── useConversion.ts  # Poll conversion status
│   │   └── useCredits.ts  # Credit balance hook
│   ├── locales/
│   │   └── en.json  # All UI strings externalized
│   └── __tests__/
│       ├── components/
│       ├── a11y/
│       └── hooks/
├── public/
│   ├── favicon.ico
│   ├── favicon.svg
│   └── site.webmanifest
├── tailwind.config.js  # @anglinai/ui preset + primary colors
├── next.config.js
├── tsconfig.json
├── package.json
├── vitest.config.ts
└── playwright.config.ts

Landing page features:

  • Hero: “Convert PDF Forms to Accessible HTML” with upload dropzone
  • How it works: 3-step visual (Upload → Extract → Download)
  • Feature highlights: AcroForm + XFA support, WCAG 2.2 AA, pre-filled data preservation, field validation
  • Before/after comparison slider showing PDF → HTML form
  • Pricing section (credit packages)
  • FAQ section

Dashboard features:

  • Table of past conversions with status, date, page count, field count
  • Click to view details: extracted fields, download HTML, preview in iframe
  • Upload new form button

Conversion detail page:

  • Real-time progress indicator during conversion
  • After completion: side-by-side preview (original PDF vs converted HTML)
  • Download button for HTML output
  • Field extraction summary (X text fields, Y checkboxes, Z dropdowns, etc.)
  • WCAG compliance badge (pass/fail with details)

Phase 8: Form Submission & Data Export

The converted HTML form should be functional, not just visual. Add these capabilities to the output HTML:

Client-side (embedded in the HTML output):

  • A <script> block at the bottom of the HTML that provides:
    • “Download as JSON” button: serializes all form field values to JSON and triggers download
    • “Download as CSV” button: serializes to CSV
    • “Print” button: triggers window.print() with print-optimized CSS
    • “Reset” button: clears all fields
  • These scripts are self-contained (no external dependencies) so the HTML works as a standalone file
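
The serialization step behind the JSON and CSV buttons can be sketched without any DOM access. The embedded script itself would be plain JavaScript that first collects values from the form (for example via FormData); the TypeScript below only illustrates the data shaping. The function names and the "; " joiner for multi-select values are assumptions.

```typescript
type FormValues = { [fieldName: string]: string | boolean | string[] };

/** Quote a CSV cell when it contains a comma, a quote, or a line break. */
function csvEscape(cell: string): string {
  return /[",\r\n]/.test(cell) ? '"' + cell.replace(/"/g, '""') + '"' : cell;
}

/** One header row of field names followed by one row of values. */
function valuesToCsv(values: FormValues): string {
  const fieldNames = Object.keys(values);
  const cells = fieldNames.map((fieldName) => {
    const value = values[fieldName];
    // Multi-select values are flattened into a single cell.
    return Array.isArray(value) ? value.join('; ') : String(value);
  });
  return (
    fieldNames.map(csvEscape).join(',') + '\r\n' +
    cells.map(csvEscape).join(',') + '\r\n'
  );
}

/** Pretty-printed JSON for the "Download as JSON" button. */
function valuesToJson(values: FormValues): string {
  return JSON.stringify(values, null, 2);
}
```

In the browser, either string can be wrapped in a Blob and offered through a temporary download link, which keeps the output file free of external dependencies.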

Optional webhook (future):

  • Allow users to configure a <form action="https://..."> POST target
  • Not in MVP; ship only the client-side export buttons

Cross-Cutting Requirements

Testing (80% coverage minimum):

  • Unit tests for AcroForm extractor (test each field type, edge cases)
  • Unit tests for XFA extractor (test XML parsing, field mapping)
  • Unit tests for form-field-mapper (test HTML generation, label heuristics, coordinate mapping)
  • Integration tests for hybrid converter (mock vision model, verify iteration loop)
  • API route tests (upload, status polling, download, error handling)
  • Frontend component tests (upload flow, status display, preview)
  • Accessibility tests (axe-core in Vitest for all rendered components)
  • E2E tests with Playwright (upload a PDF, wait for conversion, download result)
  • Mobile tests (iPhone 14, iPad, Pixel 7 viewports)

Accessibility (WCAG 2.2 AA):

  • The product UI itself must be fully accessible (not just the output)
  • All form upload interactions keyboard-navigable
  • Progress indicators announced to screen readers (role="progressbar", aria-live)
  • Preview iframe has proper title attribute
  • Skip links, focus management on route changes
  • Color contrast AA on all text

i18n:

  • All UI strings in locales/en.json
  • Use next-intl or equivalent
  • No hardcoded user-facing strings in components

Performance:

  • File upload: stream to R2, don’t buffer entire file in memory
  • Status polling: use exponential backoff (1s → 2s → 4s → 8s, cap at 10s)
  • Dashboard: paginate with cursor-based pagination for large histories
  • Extracted fields: cache in Supabase, don’t re-extract on every view
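
The polling schedule above is a doubling delay with a ceiling. A minimal sketch, with a hypothetical function name, that a hook such as useConversion could call between requests:

```typescript
const POLL_BASE_MS = 1000;
const POLL_CAP_MS = 10000;

/**
 * Delay before the next status request.
 * attempt: zero-based count of polls already made (0 gives the first delay).
 */
function pollDelayMs(attempt: number): number {
  const safeAttempt = Math.max(0, Math.floor(attempt));
  return Math.min(POLL_BASE_MS * Math.pow(2, safeAttempt), POLL_CAP_MS);
}
```

Attempts 0 through 3 give 1000, 2000, 4000, and 8000 ms; every later attempt is held at the 10000 ms cap.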

Security:

  • Validate uploaded files are actually PDFs (magic bytes check)
  • Enforce max file size (50MB)
  • Rate limit uploads (10/minute per user)
  • Sanitize output HTML (strip any residual <script> from LLM output, except our own export scripts)
  • RLS on all database tables
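
The magic-bytes check from the list above can be as small as comparing the first five bytes with "%PDF-". This is a sketch with a hypothetical function name. It is deliberately strict about the header sitting at offset 0; some PDF readers tolerate a few leading junk bytes, so whether to accept those files is a product decision, not something this snippet settles.

```typescript
// ASCII bytes for "%PDF-", the start of every PDF header line.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

/** True when the buffer starts with the PDF header signature. */
function looksLikePdf(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_MAGIC.length) return false;
  for (let i = 0; i < PDF_MAGIC.length; i++) {
    if (bytes[i] !== PDF_MAGIC[i]) return false;
  }
  return true;
}
```

Only the first few bytes of the upload are needed, so this check fits with streaming the body to R2 instead of buffering the whole file.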

SEO & Meta:

  • Landing page: unique title, description, OG tags
  • sitemap.xml via Next.js app/sitemap.ts
  • robots.txt (allow landing + docs, block dashboard)
  • JSON-LD structured data on landing page

Deployment

DNS & Routing:

  • forms.theaccessible.org → Cloudflare Pages (apps/forms)
  • theaccessible.org/forms → redirect to forms.theaccessible.org (add to existing web app’s next.config.js rewrites/redirects)
  • API calls from the frontend go to the existing api-pdf.theaccessible.org worker at /api/forms/*

R2 Bucket:

  • Create new bucket accessible-forms in Cloudflare
  • Add binding R2_FORMS_BUCKET to workers/api/wrangler.toml

Build Counter:

Register the new app in the tagzen Supabase build_counters table:

INSERT INTO public.build_counters (app_id, counter, prefix, description)
VALUES ('accessible-forms', 0, '1.0.0', 'Accessible Forms - PDF to HTML form converter');

Environment Variables (apps/forms/.env.local):

NEXT_PUBLIC_SUPABASE_URL=<same as other apps>
NEXT_PUBLIC_SUPABASE_ANON_KEY=<same as other apps>
NEXT_PUBLIC_API_URL=https://api-pdf.theaccessible.org
NEXT_PUBLIC_APP_ENV=development

What NOT to build (out of scope for this prompt):

  • PDF re-generation (filling converted HTML back into a PDF)
  • Real-time collaborative form filling
  • Form builder/designer UI
  • Custom branding on output forms (beyond basic styling)
  • Multi-language form conversion (translate field labels)
  • OCR for scanned paper forms (the existing vision pipeline handles this)

Order of Operations

Build in this sequence; each step builds on the ones before it:

  1. Shared types (packages/shared/src/form-types.ts): FormField, FormExtractionResult, etc.
  2. AcroForm extractor (workers/api/src/services/acroform-extractor.ts) + tests
  3. XFA extractor (workers/api/src/services/xfa-extractor.ts) + tests
  4. Form field mapper (workers/api/src/services/form-field-mapper.ts) + tests
  5. Database migration: form_conversions, form_fields tables
  6. API routes (workers/api/src/routes/forms.ts): upload, status, download, history
  7. Hybrid converter (workers/api/src/services/form-hybrid-converter.ts): fork premium-form-converter, integrate extractor + mapper
  8. Frontend app (apps/forms/): landing page, upload, dashboard, preview
  9. Data export scripts: embedded JSON/CSV/print in output HTML
  10. E2E tests: full upload-to-download flow
  11. Accessibility audit: axe-core on all pages, fix violations
  12. Mobile tests: Playwright device emulation
  13. Deploy: Cloudflare Pages, R2 bucket, DNS, build counter registration

Start with steps 1-4 (shared types and the extraction engine) since they’re the foundation. Once those core services exist, the frontend and API can be built in parallel.