Prompt Optimizer
Iteratively refines the PDF-to-HTML conversion prompt using Gemini Flash. Each iteration converts a PDF page, scores the output, then asks Gemini to analyze what went wrong and write an improved prompt. Repeats until quality is sufficient or improvement plateaus.
Cost: ~$0.005 per iteration. A full 10-iteration run costs ~$0.05 per file.
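In pseudocode terms, each file goes through a convert → score → refine loop with two early-stop conditions. The TypeScript sketch below is illustrative only: the hook names (`convertPage`, `scoreHtml`, `refinePrompt`) are not the tool's actual API, but the stop conditions mirror the `--max-iterations`, `--target-score`, and `--patience` flags documented under Options.

```ts
// Illustrative sketch of the per-file refinement loop (not the tool's actual code).
interface OptimizerHooks {
  convertPage(pdfPath: string, prompt: string): Promise<string>; // PDF page -> HTML via Gemini Flash
  scoreHtml(html: string): number;                               // 0-100 quality score
  refinePrompt(prompt: string, html: string, score: number): Promise<string>; // ask Gemini for a better prompt
}

interface BestResult {
  prompt: string;
  score: number;
}

async function optimizePrompt(
  pdfPath: string,
  initialPrompt: string,
  hooks: OptimizerHooks,
  opts = { maxIterations: 10, targetScore: 90, patience: 3 },
): Promise<BestResult> {
  let prompt = initialPrompt;
  let best: BestResult = { prompt, score: -1 };
  let stale = 0; // iterations since the best score last improved

  for (let i = 0; i < opts.maxIterations; i++) {
    const html = await hooks.convertPage(pdfPath, prompt);
    const score = hooks.scoreHtml(html);

    if (score > best.score) {
      best = { prompt, score };
      stale = 0;
    } else {
      stale++;
    }

    if (best.score >= opts.targetScore) break; // stop reason: target-reached
    if (stale >= opts.patience) break;         // stop reason: plateau
    prompt = await hooks.refinePrompt(prompt, html, score);
  }
  return best; // the best prompt is kept even if later iterations regress
}
```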
Prerequisites
Set your Gemini API key in tools/benchmark-cli/.env:
```
GEMINI_API_KEY=your-key-here
```

Usage
All commands run from tools/benchmark-cli/.
Single file
```sh
npx tsx src/optimize-prompt.ts --file ~/Documents/sample.pdf
```

Folder of PDFs (recommended)
Chains the best prompt from each file into the next. This produces a prompt that generalizes across document types rather than overfitting to one.
```sh
npx tsx src/optimize-prompt.ts --dir ~/Documents/pdfs/ --output optimized-prompt.txt
```
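The chaining itself is simple: the best prompt found for one file becomes the starting prompt for the next. A rough sketch, reusing the illustrative `optimizePrompt` and `OptimizerHooks` from the loop sketch above (again, not the tool's actual code):

```ts
// Sketch of chained (folder) mode: each file's best prompt seeds the next run.
async function optimizeFolder(
  pdfPaths: string[],
  initialPrompt: string,
  hooks: OptimizerHooks,
): Promise<string> {
  let prompt = initialPrompt;
  for (const pdf of pdfPaths) {
    const best = await optimizePrompt(pdf, prompt, hooks);
    prompt = best.prompt; // carry the refined prompt forward to the next file
  }
  return prompt; // a prompt that has been refined against every document type
}
```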
Test a refined prompt against a different set of files

```sh
# Round 1: optimize against training set
npx tsx src/optimize-prompt.ts --dir ~/pdfs/training/ --output round1.txt

# Round 2: test against a different set, starting from round 1's prompt
npx tsx src/optimize-prompt.ts --dir ~/pdfs/validation/ --prompt "$(cat round1.txt | grep -v '^#')" --output round2.txt
```

Options
| Flag | Default | Description |
|---|---|---|
| `--file <path>` | (none) | Single PDF file to optimize against |
| `--dir <path>` | (none) | Folder of PDFs (chains the prompt across files) |
| `--page <n>` | 1 | Page number to use from each file |
| `--max-iterations <n>` | 10 | Max refinement iterations per file |
| `--target-score <n>` | 90 | Stop when the score reaches this threshold |
| `--patience <n>` | 3 | Stop after N iterations without improvement |
| `--model <name>` | gemini-3-flash-preview | Gemini model to use |
| `--prompt "<text>"` | built-in default | Custom initial prompt |
| `--output <path>` | (none) | Save the best prompt to a text file |
Output
Console output
The optimizer prints iteration-by-iteration progress:
```
=== Prompt Optimizer ===
Files: 3 PDF(s)
Page: 1
Model: gemini-3-flash-preview
Max iterations: 10 per file
Target score: 90
Mode: chained (best prompt carries forward to next file)

[1/3] chemistry.pdf
------------------------------------------------------------
Prompt Optimizer: iteration 1/10 → converting...
Prompt Optimizer: iteration 1 → score 65 (2340ms)
Prompt Optimizer: iteration 1 → refining prompt...
Prompt Optimizer: iteration 2/10 → converting...
Prompt Optimizer: iteration 2 → score 78 (1890ms)
...
  1. score: 65        Low semantic ratio (45%), too many divs. Added instructions to use semantic HTML tags.
  2. score: 78 (+13)  Missing table headers. Added explicit <th> instructions for tabular data.
  3. score: 85 (+7)   Heading hierarchy skipped h2. Added sequential heading level requirement.
  4. score: 91 (+6)   <-- best
Result: 65 -> 91 (+26) | target-reached | 32s | $0.0234

[2/3] biology.pdf
------------------------------------------------------------
  1. score: 88        <-- best
Result: 88 -> 88 (+0) | target-reached | 4s | $0.0048
...
```

What to look for
Good signs:
- Score increases across iterations (65 → 78 → 85 → 91)
- `target-reached` stop reason → the prompt achieved the target score
- Later files in a chain start with high scores → the prompt generalizes well
- Refinement reasoning identifies specific, actionable issues
Warning signs:
- `plateau` stop reason with a low score → the prompt can't improve further on this document type. Try adding more diverse files to the training set.
- Score drops after peaking → normal, the optimizer tracks the best and returns it
- First file in a chain starts low but later files start high → the prompt is learning
- All files plateau at the same score → the scoring ceiling may be the bottleneck, not the prompt
Summary table (folder mode)
At the end of a folder run, you get a per-file summary:
```
=== Summary ===

Per-file results:
  chemistry.pdf   65 -> 91 (+26)  target-reached
  biology.pdf     88 -> 88 (+0)   target-reached
  history.pdf     82 -> 90 (+8)   target-reached

Overall improvement: 65 -> 90 (+25 on final file)
Average best score: 90
Total tokens: 45230 in / 12450 out
Total cost: $0.0602
Total time: 1m42s
```

Saved files
Every run produces two outputs:
- JSON results (always saved): `tools/benchmark-cli/results/prompt-optimizer/3-files_2026-02-11T17-08-30.json`
  Contains the full iteration history, per-file scores, token usage, and the best prompt. Use this to compare runs.
- Prompt text file (when `--output` is specified): `optimized-prompt.txt`
  Contains just the refined prompt text (with metadata comments). Feed this back in with `--prompt` to test against other files.
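The exact schema isn't documented here, but the saved JSON roughly follows a shape like the sketch below. Field names are illustrative; inspect a real file under `results/prompt-optimizer/` for the actual structure.

```ts
// Rough sketch of the saved JSON results (illustrative field names only).
interface FileResult {
  file: string;
  iterations: { iteration: number; score: number; reasoning: string }[];
  bestScore: number;
  bestPrompt: string;
  stopReason: 'target-reached' | 'plateau' | 'max-iterations'; // assumed labels based on the console output
  tokens: { input: number; output: number };
  costUsd: number;
}

interface OptimizerRun {
  model: string;
  page: number;
  files: FileResult[];
  bestPrompt: string; // the prompt carried out of the final file in a chain
}
```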
Scoring
The score (0-100) is based on the same criteria as the smart cascade quality gate:
| Component | Points | What it measures |
|---|---|---|
| Base | 40 | Starting score for any non-empty output |
| Semantic ratio | 0-30 | % of elements using semantic HTML (p, h1-h6, table, ul, ol, li) vs non-semantic (div, span) |
| Heading hierarchy | 10 | Headings don't skip levels (h1 → h2 → h3, not h1 → h3) |
| Lang attribute | 5 | Output wrapping includes lang="en" |
| Title | 5 | Output wrapping includes a <title> |
| Content presence | 10 | Output has >50 characters of content |
| WCAG violations | -10 each | Up to -40 penalty for accessibility violations |
A score of 90+ indicates high-quality semantic HTML with valid heading hierarchy and no WCAG violations.
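To make the weighting concrete, here is a sketch of how those components could combine into the final number. It assumes the individual metrics (semantic ratio, heading check, WCAG violation count) are computed elsewhere by an HTML parser and accessibility checker; it is not the quality gate's actual code.

```ts
// Sketch of the 0-100 score using the weights from the table above.
interface QualityMetrics {
  semanticRatio: number;       // 0..1 share of semantic elements (p, h1-h6, table, ul, ol, li)
  headingsSequential: boolean; // true if no heading level is skipped
  hasLangAttribute: boolean;   // wrapper includes lang="en"
  hasTitle: boolean;           // wrapper includes a <title>
  contentLength: number;       // characters of text content
  wcagViolations: number;      // count of accessibility violations
}

function scoreOutput(m: QualityMetrics): number {
  if (m.contentLength === 0) return 0;           // empty output scores nothing

  let score = 40;                                 // base score for non-empty output
  score += Math.round(30 * m.semanticRatio);      // semantic ratio, 0-30
  if (m.headingsSequential) score += 10;          // heading hierarchy
  if (m.hasLangAttribute) score += 5;             // lang attribute
  if (m.hasTitle) score += 5;                     // <title>
  if (m.contentLength > 50) score += 10;          // content presence
  score -= Math.min(40, 10 * m.wcagViolations);   // -10 each, capped at -40

  return Math.max(0, Math.min(100, score));
}
```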
Workflow
Recommended workflow for finding the best global prompt:
- Collect 5-10 representative PDFs covering your document types (text-heavy, tables, images, equations, multi-column)
- Split into training (70%) and validation (30%) sets
- Run optimizer against training set:
  ```sh
  npx tsx src/optimize-prompt.ts --dir ~/pdfs/training/ --output trained-prompt.txt
  ```
- Test against validation set:
  ```sh
  npx tsx src/optimize-prompt.ts --dir ~/pdfs/validation/ --prompt "$(cat trained-prompt.txt | grep -v '^#')"
  ```
- If validation scores are significantly lower than training scores, add more diverse files to the training set and re-run
- Once satisfied, update `PAGE_VISION_PROMPT` in `smart-cascade-converter.ts` with the optimized prompt (see the sketch below)
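For that last step, assuming `PAGE_VISION_PROMPT` is a plain exported string constant (check the actual file; this is only an illustration), the update amounts to replacing its value:

```ts
// smart-cascade-converter.ts (illustration only; the real file may structure this differently).
// Paste the refined prompt from optimized-prompt.txt, with the leading '#' metadata comments removed.
export const PAGE_VISION_PROMPT = `
...refined prompt text from optimized-prompt.txt goes here...
`;
```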