
Prompt Optimizer

Iteratively refines the PDF-to-HTML conversion prompt using Gemini Flash. Each iteration converts a PDF page, scores the output, then asks Gemini to analyze what went wrong and write an improved prompt. Repeats until quality is sufficient or improvement plateaus.

Cost: ~$0.005 per iteration. A full 10-iteration run costs ~$0.05 per file.
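
The refinement loop looks roughly like the sketch below. This is an illustration only, not the tool's actual code: convertPage, scoreHtml, and refinePrompt are hypothetical stand-ins for the Gemini conversion, the quality-gate scoring, and the prompt-rewriting call.

// Sketch of the per-file refinement loop (hypothetical helper names).
declare function convertPage(pdfPath: string, prompt: string): Promise<string>;
declare function scoreHtml(html: string): number;
declare function refinePrompt(prompt: string, html: string, score: number): Promise<string>;

async function optimizeFile(pdfPath: string, initialPrompt: string) {
  const maxIterations = 10, targetScore = 90, patience = 3;   // same defaults as the CLI flags
  let prompt = initialPrompt;
  let best = { prompt, score: -1 };
  let stale = 0;
  for (let i = 1; i <= maxIterations; i++) {
    const html = await convertPage(pdfPath, prompt);          // convert one PDF page to HTML
    const score = scoreHtml(html);                            // 0-100 quality score (see Scoring)
    if (score > best.score) { best = { prompt, score }; stale = 0; } else { stale++; }
    if (best.score >= targetScore) break;                     // stop reason: target-reached
    if (stale >= patience) break;                             // stop reason: plateau
    prompt = await refinePrompt(prompt, html, score);         // Gemini analyzes the output and rewrites the prompt
  }
  return best;                                                // the best prompt wins, even if later scores dipped
}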

Prerequisites

Set your Gemini API key in tools/benchmark-cli/.env:

GEMINI_API_KEY=your-key-here

Usage

All commands run from tools/benchmark-cli/.

Single file

npx tsx src/optimize-prompt.ts --file ~/Documents/sample.pdf

Folder of PDFs

Chains the best prompt from each file into the next. This produces a prompt that generalizes across document types rather than overfitting to a single one.

npx tsx src/optimize-prompt.ts --dir ~/Documents/pdfs/ --output optimized-prompt.txt

Test a refined prompt against a different set of files

# Round 1: optimize against training set
npx tsx src/optimize-prompt.ts --dir ~/pdfs/training/ --output round1.txt
# Round 2: test against a different set, starting from round 1's prompt
npx tsx src/optimize-prompt.ts --dir ~/pdfs/validation/ --prompt "$(cat round1.txt | grep -v '^#')" --output round2.txt

Options

Flag                     Default                   Description
--file <path>            —                         Single PDF file to optimize against
--dir <path>             —                         Folder of PDFs (chains the prompt across files)
--page <n>               1                         Page number to use from each file
--max-iterations <n>     10                        Max refinement iterations per file
--target-score <n>       90                        Stop when the score reaches this threshold
--patience <n>           3                         Stop after N iterations without improvement
--model <name>           gemini-3-flash-preview    Gemini model to use
--prompt "<text>"        built-in default          Custom initial prompt
--output <path>          —                         Save the best prompt to a text file
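
For example, a folder run that uses page 2 of each PDF, allows more iterations, and aims for a higher target score (the flag values here are illustrative):

npx tsx src/optimize-prompt.ts --dir ~/Documents/pdfs/ --page 2 --max-iterations 15 --target-score 95 --output optimized-prompt.txt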

Output

Console output

The optimizer prints iteration-by-iteration progress:

=== Prompt Optimizer ===
Files: 3 PDF(s)
Page: 1
Model: gemini-3-flash-preview
Max iterations: 10 per file
Target score: 90
Mode: chained (best prompt carries forward to next file)
[1/3] chemistry.pdf
------------------------------------------------------------
Prompt Optimizer: iteration 1/10 — converting...
Prompt Optimizer: iteration 1 — score 65 (2340ms)
Prompt Optimizer: iteration 1 — refining prompt...
Prompt Optimizer: iteration 2/10 — converting...
Prompt Optimizer: iteration 2 — score 78 (1890ms)
...
1. score: 65
Low semantic ratio (45%), too many divs. Added instructions to use semantic HTML tags.
2. score: 78 (+13)
Missing table headers. Added explicit <th> instructions for tabular data.
3. score: 85 (+7)
Heading hierarchy skipped h2. Added sequential heading level requirement.
4. score: 91 (+6) <-- best
Result: 65 -> 91 (+26) | target-reached | 32s | $0.0234
[2/3] biology.pdf
------------------------------------------------------------
1. score: 88 <-- best
Result: 88 -> 88 (+0) | target-reached | 4s | $0.0048
...

What to look for

Good signs:

  • Score increases across iterations (65 → 78 → 85 → 91)
  • A target-reached stop reason: the prompt achieved the target score
  • Later files in a chain start with high scores: the prompt generalizes well
  • Refinement reasoning identifies specific, actionable issues

Warning signs:

  • A plateau stop reason with a low score: the prompt can't improve further on this document type. Try adding more diverse files to the training set.
  • Score drops after peaking: this is normal; the optimizer tracks the best result and returns it
  • First file in a chain starts low but later files start high: the prompt is learning
  • All files plateau at the same score: the scoring ceiling may be the bottleneck, not the prompt

Summary table (folder mode)

At the end of a folder run, you get a per-file summary:

=== Summary ===
Per-file results:
chemistry.pdf 65 -> 91 (+26) target-reached
biology.pdf 88 -> 88 (+0) target-reached
history.pdf 82 -> 90 (+8) target-reached
Overall improvement: 65 -> 90 (+25 on final file)
Average best score: 90
Total tokens: 45230 in / 12450 out
Total cost: $0.0602
Total time: 1m42s

Saved files

Every run produces two outputs:

  1. JSON results (always saved):

    tools/benchmark-cli/results/prompt-optimizer/3-files_2026-02-11T17-08-30.json

    Contains the full iteration history, per-file scores, token usage, and the best prompt. Use this to compare runs.

  2. Prompt text file (when --output is specified):

    optimized-prompt.txt

    Contains the refined prompt text plus metadata comments (lines starting with #). Feed it back in with --prompt to test against other files.

Scoring

The score (0-100) is based on the same criteria as the smart cascade quality gate:

Component            Points      What it measures
Base                 40          Starting score for any non-empty output
Semantic ratio       0-30        % of elements using semantic HTML (p, h1-h6, table, ul, ol, li) vs non-semantic (div, span)
Heading hierarchy    10          Headings don't skip levels (h1 → h2 → h3, not h1 → h3)
Lang attribute       5           Output wrapping includes lang="en"
Title                5           Output wrapping includes a <title>
Content presence     10          Output has >50 characters of content
WCAG violations      -10 each    Up to -40 penalty for accessibility violations

A score of 90+ indicates high-quality semantic HTML with valid heading hierarchy and no WCAG violations.
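
The rubric above translates into code roughly as follows. The input fields are hypothetical names for illustration; this mirrors the table, not the converter's actual implementation.

interface ScoreInput {
  semanticElements: number;      // count of p, h1-h6, table, ul, ol, li
  nonSemanticElements: number;   // count of div, span
  headingLevelSkipped: boolean;  // e.g. an h1 followed directly by an h3
  hasLang: boolean;              // output wrapper carries lang="en"
  hasTitle: boolean;             // output wrapper includes a <title>
  contentLength: number;         // characters of text content
  wcagViolations: number;
}

function scoreOutput(x: ScoreInput): number {
  if (x.contentLength === 0) return 0;                                    // empty output scores nothing
  let score = 40;                                                         // base for any non-empty output
  const total = x.semanticElements + x.nonSemanticElements;
  if (total > 0) score += Math.round((30 * x.semanticElements) / total);  // semantic ratio, up to 30
  if (!x.headingLevelSkipped) score += 10;                                // sequential heading levels
  if (x.hasLang) score += 5;
  if (x.hasTitle) score += 5;
  if (x.contentLength > 50) score += 10;                                  // content presence
  score -= Math.min(40, 10 * x.wcagViolations);                           // WCAG penalty, capped at -40
  return Math.max(0, Math.min(100, score));
}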

Workflow

Recommended workflow for finding the best global prompt:

  1. Collect 5-10 representative PDFs covering your document types (text-heavy, tables, images, equations, multi-column)
  2. Split into training (70%) and validation (30%) sets
  3. Run optimizer against training set:
    npx tsx src/optimize-prompt.ts --dir ~/pdfs/training/ --output trained-prompt.txt
  4. Test against validation set:
    npx tsx src/optimize-prompt.ts --dir ~/pdfs/validation/ --prompt "$(cat trained-prompt.txt | grep -v '^#')"
  5. If validation scores are significantly lower than training scores, add more diverse files to the training set and re-run
  6. Once satisfied, update PAGE_VISION_PROMPT in smart-cascade-converter.ts with the optimized prompt
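
For the last step, a minimal sketch of the update, assuming PAGE_VISION_PROMPT is a plain exported string constant (check smart-cascade-converter.ts for its actual shape):

// smart-cascade-converter.ts (sketch; the real constant may be structured differently)
export const PAGE_VISION_PROMPT = `
...paste the contents of optimized-prompt.txt here, minus the # metadata comments...
`;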