Prompt Optimizer
Iteratively refines the PDF-to-HTML conversion prompt using Gemini Flash. Each iteration converts a PDF page, scores the output, then asks Gemini to analyze what went wrong and write an improved prompt. Repeats until quality is sufficient or improvement plateaus.
Cost: ~$0.005 per iteration. A full 10-iteration run costs ~$0.05 per file.
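In pseudocode terms, each file goes through a convert → score → refine loop with two early-stop conditions. The TypeScript sketch below is illustrative only: the hook names (`convertPage`, `scoreHtml`, `refinePrompt`) are not the tool's actual API, but the stop conditions mirror the `--max-iterations`, `--target-score`, and `--patience` flags documented under Options.

```ts
// Illustrative sketch of the per-file refinement loop (not the tool's actual code).
interface OptimizerHooks {
  convertPage(pdfPath: string, prompt: string): Promise<string>; // PDF page -> HTML via Gemini Flash
  scoreHtml(html: string): number;                               // 0-100 quality score
  refinePrompt(prompt: string, html: string, score: number): Promise<string>; // ask Gemini for a better prompt
}

interface BestResult {
  prompt: string;
  score: number;
}

async function optimizePrompt(
  pdfPath: string,
  initialPrompt: string,
  hooks: OptimizerHooks,
  opts = { maxIterations: 10, targetScore: 90, patience: 3 },
): Promise<BestResult> {
  let prompt = initialPrompt;
  let best: BestResult = { prompt, score: -1 };
  let stale = 0; // iterations since the best score last improved

  for (let i = 0; i < opts.maxIterations; i++) {
    const html = await hooks.convertPage(pdfPath, prompt);
    const score = hooks.scoreHtml(html);

    if (score > best.score) {
      best = { prompt, score };
      stale = 0;
    } else {
      stale++;
    }

    if (best.score >= opts.targetScore) break; // stop reason: target-reached
    if (stale >= opts.patience) break;         // stop reason: plateau
    prompt = await hooks.refinePrompt(prompt, html, score);
  }
  return best; // the best prompt is kept even if later iterations regress
}
```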
Prerequisites
Set your Gemini API key in tools/benchmark-cli/.env:
```
GEMINI_API_KEY=your-key-here
```

Usage
All commands run from tools/benchmark-cli/.
Single file
```sh
npx tsx src/optimize-prompt.ts --file ~/Documents/sample.pdf
```

Folder of PDFs (recommended)
Chains the best prompt from each file into the next. This produces a prompt that generalizes across document types rather than overfitting to one.
```sh
npx tsx src/optimize-prompt.ts --dir ~/Documents/pdfs/ --output optimized-prompt.txt
```
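The chaining itself is simple: the best prompt found for one file becomes the starting prompt for the next. A rough sketch, reusing the illustrative `optimizePrompt` and `OptimizerHooks` from the loop sketch above (again, not the tool's actual code):

```ts
// Sketch of chained (folder) mode: each file's best prompt seeds the next run.
async function optimizeFolder(
  pdfPaths: string[],
  initialPrompt: string,
  hooks: OptimizerHooks,
): Promise<string> {
  let prompt = initialPrompt;
  for (const pdf of pdfPaths) {
    const best = await optimizePrompt(pdf, prompt, hooks);
    prompt = best.prompt; // carry the refined prompt forward to the next file
  }
  return prompt; // a prompt that has been refined against every document type
}
```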
Test a refined prompt against a different set of files

```sh
# Round 1: optimize against training set
npx tsx src/optimize-prompt.ts --dir ~/pdfs/training/ --output round1.txt

# Round 2: test against a different set, starting from round 1's prompt
npx tsx src/optimize-prompt.ts --dir ~/pdfs/validation/ --prompt "$(cat round1.txt | grep -v '^#')" --output round2.txt
```

Options
| Flag | Default | Description |
|---|---|---|
| `--file <path>` | (none) | Single PDF file to optimize against |
| `--dir <path>` | (none) | Folder of PDFs (chains the prompt across files) |
| `--page <n>` | 1 | Page number to use from each file |
| `--max-iterations <n>` | 10 | Max refinement iterations per file |
| `--target-score <n>` | 90 | Stop when the score reaches this threshold |
| `--patience <n>` | 3 | Stop after N iterations without improvement |
| `--model <name>` | gemini-3-flash-preview | Gemini model to use |
| `--prompt "<text>"` | built-in default | Custom initial prompt |
| `--output <path>` | (none) | Save the best prompt to a text file |
Output
Console output
The optimizer prints iteration-by-iteration progress:
```
=== Prompt Optimizer ===
Files: 3 PDF(s)
Page: 1
Model: gemini-3-flash-preview
Max iterations: 10 per file
Target score: 90
Mode: chained (best prompt carries forward to next file)

[1/3] chemistry.pdf
------------------------------------------------------------
Prompt Optimizer: iteration 1/10 → converting...
Prompt Optimizer: iteration 1 → score 65 (2340ms)
Prompt Optimizer: iteration 1 → refining prompt...
Prompt Optimizer: iteration 2/10 → converting...
Prompt Optimizer: iteration 2 → score 78 (1890ms)
...
  1. score: 65        Low semantic ratio (45%), too many divs. Added instructions to use semantic HTML tags.
  2. score: 78 (+13)  Missing table headers. Added explicit <th> instructions for tabular data.
  3. score: 85 (+7)   Heading hierarchy skipped h2. Added sequential heading level requirement.
  4. score: 91 (+6)   <-- best
Result: 65 -> 91 (+26) | target-reached | 32s | $0.0234

[2/3] biology.pdf
------------------------------------------------------------
  1. score: 88        <-- best
Result: 88 -> 88 (+0) | target-reached | 4s | $0.0048
...
```

What to look for
Good signs:
- Score increases across iterations (65 → 78 → 85 → 91)
- `target-reached` stop reason → the prompt achieved the target score
- Later files in a chain start with high scores → the prompt generalizes well
- Refinement reasoning identifies specific, actionable issues
Warning signs:
- `plateau` stop reason with a low score → the prompt can't improve further on this document type. Try adding more diverse files to the training set.
- Score drops after peaking → normal, the optimizer tracks the best and returns it
- First file in a chain starts low but later files start high → the prompt is learning
- All files plateau at the same score → the scoring ceiling may be the bottleneck, not the prompt
Summary table (folder mode)
At the end of a folder run, you get a per-file summary:
```
=== Summary ===

Per-file results:
  chemistry.pdf   65 -> 91 (+26)  target-reached
  biology.pdf     88 -> 88 (+0)   target-reached
  history.pdf     82 -> 90 (+8)   target-reached

Overall improvement: 65 -> 90 (+25 on final file)
Average best score: 90
Total tokens: 45230 in / 12450 out
Total cost: $0.0602
Total time: 1m42s
```

Saved files
Every run produces two outputs:
- JSON results (always saved): `tools/benchmark-cli/results/prompt-optimizer/3-files_2026-02-11T17-08-30.json`
  Contains the full iteration history, per-file scores, token usage, and the best prompt. Use this to compare runs.
- Prompt text file (when `--output` is specified): `optimized-prompt.txt`
  Contains just the refined prompt text (with metadata comments). Feed this back in with `--prompt` to test against other files.
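The exact schema isn't documented here, but the saved JSON roughly follows a shape like the sketch below. Field names are illustrative; inspect a real file under `results/prompt-optimizer/` for the actual structure.

```ts
// Rough sketch of the saved JSON results (illustrative field names only).
interface FileResult {
  file: string;
  iterations: { iteration: number; score: number; reasoning: string }[];
  bestScore: number;
  bestPrompt: string;
  stopReason: 'target-reached' | 'plateau' | 'max-iterations'; // assumed labels based on the console output
  tokens: { input: number; output: number };
  costUsd: number;
}

interface OptimizerRun {
  model: string;
  page: number;
  files: FileResult[];
  bestPrompt: string; // the prompt carried out of the final file in a chain
}
```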
Scoring
The score (0-100) is based on the same criteria as the smart cascade quality gate:
| Component | Points | What it measures |
|---|---|---|
| Base | 40 | Starting score for any non-empty output |
| Semantic ratio | 0-30 | % of elements using semantic HTML (p, h1-h6, table, ul, ol, li) vs non-semantic (div, span) |
| Heading hierarchy | 10 | Headings don't skip levels (h1 → h2 → h3, not h1 → h3) |
| Lang attribute | 5 | Output wrapping includes lang="en" |
| Title | 5 | Output wrapping includes a <title> |
| Content presence | 10 | Output has >50 characters of content |
| WCAG violations | -10 each | Up to -40 penalty for accessibility violations |
A score of 90+ indicates high-quality semantic HTML with valid heading hierarchy and no WCAG violations.
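To make the weighting concrete, here is a sketch of how those components could combine into the final number. It assumes the individual metrics (semantic ratio, heading check, WCAG violation count) are computed elsewhere by an HTML parser and accessibility checker; it is not the quality gate's actual code.

```ts
// Sketch of the 0-100 score using the weights from the table above.
interface QualityMetrics {
  semanticRatio: number;       // 0..1 share of semantic elements (p, h1-h6, table, ul, ol, li)
  headingsSequential: boolean; // true if no heading level is skipped
  hasLangAttribute: boolean;   // wrapper includes lang="en"
  hasTitle: boolean;           // wrapper includes a <title>
  contentLength: number;       // characters of text content
  wcagViolations: number;      // count of accessibility violations
}

function scoreOutput(m: QualityMetrics): number {
  if (m.contentLength === 0) return 0;           // empty output scores nothing

  let score = 40;                                 // base score for non-empty output
  score += Math.round(30 * m.semanticRatio);      // semantic ratio, 0-30
  if (m.headingsSequential) score += 10;          // heading hierarchy
  if (m.hasLangAttribute) score += 5;             // lang attribute
  if (m.hasTitle) score += 5;                     // <title>
  if (m.contentLength > 50) score += 10;          // content presence
  score -= Math.min(40, 10 * m.wcagViolations);   // -10 each, capped at -40

  return Math.max(0, Math.min(100, score));
}
```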
Workflow
Recommended workflow for finding the best global prompt:
- Collect 5-10 representative PDFs covering your document types (text-heavy, tables, images, equations, multi-column)
- Split into training (70%) and validation (30%) sets
- Run optimizer against training set:
  ```sh
  npx tsx src/optimize-prompt.ts --dir ~/pdfs/training/ --output trained-prompt.txt
  ```
- Test against validation set:
  ```sh
  npx tsx src/optimize-prompt.ts --dir ~/pdfs/validation/ --prompt "$(cat trained-prompt.txt | grep -v '^#')"
  ```
- If validation scores are significantly lower than training scores, add more diverse files to the training set and re-run
- Once satisfied, update `PAGE_VISION_PROMPT` in `smart-cascade-converter.ts` with the optimized prompt (see the sketch below)
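For that last step, assuming `PAGE_VISION_PROMPT` is a plain exported string constant (check the actual file; this is only an illustration), the update amounts to replacing its value:

```ts
// smart-cascade-converter.ts (illustration only; the real file may structure this differently).
// Paste the refined prompt from optimized-prompt.txt, with the leading '#' metadata comments removed.
export const PAGE_VISION_PROMPT = `
...refined prompt text from optimized-prompt.txt goes here...
`;
```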