Word Document (.docx) Support

Overview

The converter accepts .docx files (Word 2007+, Google Docs export, LibreOffice export) and converts them to accessible HTML using mammoth.js. The converted HTML then flows through the same accessibility pipeline as PDF conversions: WCAG validation, UX optimization, axe-core auto-fixes, and R2 storage.

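Unlike PDF conversion, DOCX conversion requires no external API keys: mammoth runs entirely locally. This makes it the cheapest conversion path, at $0 per document for the conversion itself. The only step that can call an external API is the optional AI-generated alt text for images.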

How It Works

Conversion Pipeline

.docx upload
-> mammoth.js (DOCX -> semantic HTML + image extraction)
-> structurePages (page header/footer detection)
-> optimizeDeterministic (CSS injection, table headers, SVG labels, LaTeX->MathML)
-> enhanceAccessibility (DOCTYPE, lang, title, skip-link, landmarks)
-> validateAndFix (WCAG 2.1 AA validation + auto-fix loop)
-> R2 storage
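
The chain above can be read as a series of HTML-to-HTML stages applied in order. The following is a minimal sketch of that idea, not the actual implementation: the stage names come from the pipeline above, but the `Stage` type, the `runPipeline` helper, and the stage bodies are hypothetical stand-ins, and the stages are synchronous here for brevity even though the real ones are probably asynchronous.

```typescript
// A stage takes HTML and returns transformed HTML (hypothetical shape).
type Stage = { name: string; run: (html: string) => string };

// Runs the stages in order, feeding each stage's output into the next,
// and records which stages ran so the order can be inspected.
function runPipeline(html: string, stages: Stage[]): { html: string; trace: string[] } {
  const trace: string[] = [];
  let current = html;
  for (const stage of stages) {
    current = stage.run(current);
    trace.push(stage.name);
  }
  return { html: current, trace };
}

// Stand-in stages: only the names match the pipeline; the bodies are placeholders.
const stages: Stage[] = [
  { name: "structurePages", run: (html) => html },
  { name: "optimizeDeterministic", run: (html) => html },
  {
    name: "enhanceAccessibility",
    // Placeholder for one of the listed fixes: make sure a DOCTYPE is present.
    run: (html) => (html.indexOf("<!DOCTYPE") === 0 ? html : "<!DOCTYPE html>\n" + html),
  },
  { name: "validateAndFix", run: (html) => html },
];

const result = runPipeline("<h1>Report</h1>", stages);
```

Keeping every stage behind the same signature is what lets DOCX output reuse the steps written for PDF output: once mammoth has produced HTML, nothing downstream needs to know where it came from.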

Why mammoth.js

Mammoth converts based on semantic meaning, not visual formatting. It reads the document’s named styles (Heading 1, Quote, List, etc.) and maps each one to the corresponding HTML element (<h1>, <blockquote>, <ul>, etc.). This produces much cleaner HTML than a PDF-to-HTML conversion of the same document, because DOCX files retain the original document structure that PDF flattens away.

Image Handling

Embedded images are extracted from the DOCX file and routed through the standard pipeline:

  1. Mammoth extracts each image and assigns it a filename (image-1.png, image-2.jpeg, etc.)
  2. Images are returned as ConvertedImage[] alongside the HTML
  3. If enhanceImages is enabled and an Anthropic API key is set, Claude Vision generates alt text
  4. Images are stored in R2 and embedded as base64 data URIs in the final HTML
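
As a rough illustration of steps 1, 2, and 4, the sketch below names an extracted image from its position and content type and builds the data URI that ends up in the HTML. The field names on `ConvertedImage` and both helper functions are assumptions for illustration; the real type in the codebase may look different. It also assumes the image bytes are already available as a base64 string.

```typescript
// Hypothetical shape; the real ConvertedImage type may carry other fields.
interface ConvertedImage {
  filename: string;
  contentType: string;
  base64: string;
}

// Derives a file extension from an image content type,
// e.g. "image/png" gives "png". Anything unrecognized falls back to "bin".
function extensionFor(contentType: string): string {
  const match = /^image\/([a-z0-9.+-]+)$/i.exec(contentType);
  return match ? match[1].toLowerCase() : "bin";
}

// Step 1 and 2: assign a sequential filename and return the image record.
function toConvertedImage(index: number, contentType: string, base64: string): ConvertedImage {
  return {
    filename: "image-" + index + "." + extensionFor(contentType),
    contentType,
    base64,
  };
}

// Step 4: build the data URI used as the <img src> in the final HTML.
function toDataUri(image: ConvertedImage): string {
  return "data:" + image.contentType + ";base64," + image.base64;
}

const first = toConvertedImage(1, "image/png", "iVBORw0KGgo=");
const second = toConvertedImage(2, "image/jpeg", "AAAA");
```

Here `first.filename` is `image-1.png` and `second.filename` is `image-2.jpeg`, matching the naming pattern in step 1.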

Custom Style Mappings

Mammoth ships with default mappings for headings (1-6), lists (ordered/unordered, 5 levels deep), bold (<strong>), italic (<em>), strikethrough (<s>), and basic paragraph styles.

We add custom mappings in docx-converter.ts to cover common Word styles that have clear semantic HTML equivalents but aren’t in mammoth’s defaults:

| Word Style | HTML Output | Why |
| --- | --- | --- |
| Quote | <blockquote><p> | Semantic quotation markup |
| Intense Quote | <blockquote><p> | Same as Quote (Word has two quote styles) |
| Block Text | <blockquote><p> | Another common quote variant |
| Caption | <figcaption> | Associates captions with figures/tables |
| Title | <h1> | Document title should be the primary heading |
| Subtitle | <p class="doc-subtitle"> | Visually distinct but not a heading |
| Code / Code Block / HTML Code | <code> or <pre> | Preserves code semantics for screen readers |
| TOC Heading | <h2 class="toc-heading"> | Heading for table of contents section |
| toc 1 / toc 2 / toc 3 | <p class="toc-entry toc-N"> | Structured TOC entries |
| Emphasis | <em> | Character-level emphasis |
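
To make the table concrete, here is a sketch of how a few of these rows could be written as mammoth style-map strings. The `StyleRow` type and `toMapping` helper are hypothetical, and the exact strings in docx-converter.ts are not shown in this document, so treat the targets below (including the use of `:fresh`) as plausible guesses rather than the real entries.

```typescript
// p = paragraph style, r = run (character) style.
type StyleKind = "p" | "r";

// One table row: a Word style name and the HTML path it should map to.
interface StyleRow {
  kind: StyleKind;
  wordStyle: string;
  target: string;
}

// Formats a row in mammoth's "selector => html-path" style-map syntax.
function toMapping(row: StyleRow): string {
  return row.kind + "[style-name='" + row.wordStyle + "'] => " + row.target;
}

// Assumed targets for a subset of the table; the real array may differ.
const rows: StyleRow[] = [
  { kind: "p", wordStyle: "Quote", target: "blockquote > p:fresh" },
  { kind: "p", wordStyle: "Caption", target: "figcaption:fresh" },
  { kind: "p", wordStyle: "Title", target: "h1:fresh" },
  { kind: "p", wordStyle: "Subtitle", target: "p.doc-subtitle:fresh" },
  { kind: "r", wordStyle: "Emphasis", target: "em" },
];

const mappings = rows.map(toMapping);
```

Paragraph-level styles use the `p[...]` selector and character-level styles such as Emphasis use `r[...]`, which is why the table's last row maps to an inline element.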

Adding More Custom Mappings

To add a mapping, edit the ACCESSIBILITY_STYLE_MAP array in workers/api/src/services/docx-converter.ts. The syntax is:

"selector => html-element:modifier"

Selectors:

  • p[style-name='...'] — paragraph style (by display name)
  • r[style-name='...'] — run/character style (inline)
  • p.StyleId — paragraph style (by internal ID)
  • b / i / u / strike — formatting
  • p:ordered-list(N) / p:unordered-list(N) — list at nesting level N

HTML targets:

  • Any HTML element: h1, p, blockquote, pre, code, em, strong, etc.
  • With classes: p.my-class
  • Nested: blockquote > p:fresh
  • :fresh modifier — always creates a new parent element (prevents merging siblings)
  • ! — ignore/suppress the content entirely

Example: To map a custom Word style called “Legal Note” to an <aside>:

"p[style-name='Legal Note'] => aside > p:fresh"

Per-conversion custom mappings can also be passed via the DocxConverterConfig.styleMap option, which takes precedence over the built-in accessibility mappings.
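
One way to picture that precedence rule is a merge that puts per-conversion mappings first and drops any built-in mapping with the same selector. This is a sketch under assumptions: `BUILT_IN` is a two-entry stand-in for ACCESSIBILITY_STYLE_MAP, `mergeStyleMaps` and `selectorOf` are hypothetical helpers, and the ordering relies on mammoth using the first mapping that matches, which is how it layers user mappings over its own defaults.

```typescript
// Stand-in for the built-in accessibility mappings (illustrative entries only).
const BUILT_IN: string[] = [
  "p[style-name='Quote'] => blockquote > p:fresh",
  "p[style-name='Title'] => h1:fresh",
];

// Returns the selector part of a mapping, i.e. everything left of "=>".
function selectorOf(mapping: string): string {
  return mapping.split("=>")[0].trim();
}

// Custom mappings come first; built-in mappings they override are removed,
// so the result does not depend on how ties between equal selectors resolve.
function mergeStyleMaps(custom: string[], builtIn: string[]): string[] {
  const customSelectors = custom.map(selectorOf);
  const kept = builtIn.filter((m) => customSelectors.indexOf(selectorOf(m)) === -1);
  return custom.concat(kept);
}

const merged = mergeStyleMaps(
  [
    "p[style-name='Legal Note'] => aside > p:fresh",
    "p[style-name='Title'] => h2:fresh", // overrides the built-in Title mapping
  ],
  BUILT_IN
);
```

The merged array could then be handed to mammoth as its `styleMap` option. In this example the built-in Title mapping is dropped and the Quote mapping is kept after the two custom entries.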

Usage

Upload & Convert

Upload and conversion work identically to PDF files — the frontend auto-detects the file type from the MIME type. The converter options (parser selection) are ignored for DOCX files since mammoth is the only conversion backend.

API

POST /api/files/upload
{ fileName: "report.docx", fileType: "application/vnd.openxmlformats-officedocument.wordprocessingml.document", fileSize: 12345 }
PUT /api/files/:fileId/upload-data
[binary .docx data]
POST /api/convert/:fileId
{} (parser options are ignored for DOCX)

Limitations

Format Support

  • Only .docx (Office Open XML) is supported — not .doc (legacy binary format), .odt, or .rtf.
  • Documents must be well-formed OOXML. Corrupted or partially-written files will fail.
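
A cheap pre-check can catch the most common unsupported uploads before they reach mammoth. The sketch below inspects the leading bytes of a file: .docx is a ZIP container and starts with `PK\x03\x04`, legacy .doc uses the OLE compound file signature, and .rtf starts with `{\rtf`. The `sniffFormat` function and its return labels are hypothetical and are not part of the converter. Note that a ZIP signature alone does not prove the file is a .docx, since .odt and .xlsx are ZIP containers too.

```typescript
type SniffedFormat = "zip-container" | "legacy-ole" | "rtf" | "unknown";

// True when the byte array begins with the given signature.
function hasPrefix(bytes: Uint8Array, signature: number[]): boolean {
  if (bytes.length < signature.length) return false;
  return signature.every((value, i) => bytes[i] === value);
}

// Classifies a file by its magic number only; it never parses the content.
function sniffFormat(bytes: Uint8Array): SniffedFormat {
  // "PK\x03\x04": ZIP local file header (.docx, but also .odt, .xlsx, ...).
  if (hasPrefix(bytes, [0x50, 0x4b, 0x03, 0x04])) return "zip-container";
  // OLE compound file signature used by legacy .doc.
  if (hasPrefix(bytes, [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1])) return "legacy-ole";
  // "{\rtf" in ASCII.
  if (hasPrefix(bytes, [0x7b, 0x5c, 0x72, 0x74, 0x66])) return "rtf";
  return "unknown";
}
```

A result other than `zip-container` can be rejected immediately with a clear error message. A `zip-container` result still needs the real conversion to confirm the file is well-formed OOXML.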

Structural Fidelity

Mammoth converts based on semantic styles, not visual layout. This means:

  • Direct formatting without styles is partially preserved — bold, italic, and strikethrough applied directly (not via a named style) are converted. Underline is intentionally ignored because underlined text is easily confused with hyperlinks, which is an accessibility problem.
  • Visual-only formatting is lost — font sizes, colors, margins, custom spacing, and decorative borders are ignored. This is by design: the output relies on our UX optimizer CSS for consistent, accessible styling.
  • Headers and footers are not included — mammoth does not extract running headers/footers from the DOCX document sections.
  • Complex page layouts (multi-column, text wrapping around images, absolute positioning) are flattened to linear flow. DOCX documents with complex visual layouts will produce correct content but in a single-column reading order.
  • Table formatting (borders, cell colors, column widths) is discarded. The table structure (rows, cells) and text content are preserved. Our UX optimizer CSS applies consistent table styling.
  • Embedded objects (charts, SmartArt, ActiveX controls, embedded Excel sheets) are not converted. Only raster images (PNG, JPEG, GIF, TIFF) and EMF/WMF images are extracted.
  • Equations — Word’s native equation editor (OMML) is not supported by mammoth. If the document contains MathType or OMML equations, they will appear as images (if embedded) or be missing. For math-heavy documents, PDF conversion via Mathpix or the hybrid pipeline remains the better choice.
  • Comments and tracked changes — Comments are ignored by default. Tracked changes (revision marks) are accepted as-is; the “final” version of the text is what gets converted.
  • Form fields (checkboxes, dropdowns, text inputs) are not converted to HTML form elements.

Security

Mammoth does not sanitize the source document. Our downstream pipeline (WCAG validator, UX optimizer) handles HTML cleanup, but a maliciously crafted DOCX could still carry hostile content into the output, for example through style names, text content, or link targets such as javascript: URLs. The existing enhanceAccessibility and validateAndFix steps mitigate most of this risk, but they are not a dedicated sanitizer, so treat uploads from untrusted sources with care.
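
As one example of an extra defensive layer, link targets in the converted HTML could be checked against a scheme allowlist. This is an illustration only and not something the current pipeline is documented to do; `isSafeHref` and `ALLOWED_SCHEMES` are hypothetical names, and a real deployment would more likely use an established HTML sanitizer.

```typescript
// Schemes allowed in links; everything else with an explicit scheme is rejected.
const ALLOWED_SCHEMES = ["http", "https", "mailto"];

// Returns true for relative URLs, fragments, and allowlisted schemes.
function isSafeHref(href: string): boolean {
  // Strip whitespace and control characters first, since browsers tolerate
  // them inside a scheme (e.g. "java\tscript:").
  const cleaned = href.replace(/[\u0000-\u0020]/g, "");
  const match = /^([a-zA-Z][a-zA-Z0-9+.-]*):/.exec(cleaned);
  if (!match) return true; // no scheme: relative path or #fragment
  return ALLOWED_SCHEMES.indexOf(match[1].toLowerCase()) !== -1;
}
```

An allowlist is used rather than a blocklist so that unfamiliar schemes are rejected by default instead of having to be anticipated one by one.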

When to Use PDF Conversion Instead

DOCX conversion works best for text-heavy documents with proper style usage (headings, lists, tables). Prefer PDF conversion for:

  • Scanned documents (need OCR)
  • Math-heavy papers (need MathML via Mathpix)
  • Documents where visual layout fidelity matters
  • Documents originally authored as PDFs (forms, brochures, slide decks exported to PDF)