Prompt Caching in the Agentic Vision Converter
Overview
The agentic vision pipeline sends the same PDF document to Claude on every iteration: once for the initial conversion and again on each refinement pass. Without caching, the API re-processes the full PDF input tokens every time.
Anthropic's prompt caching lets us mark the PDF content block as cacheable. The API stores the processed input on the first call and serves it from cache on subsequent calls within the cache's 5-minute lifetime. On refinement iterations, this cuts the input token cost of the PDF block by 90%.
How It Works
What Gets Cached
The cache_control: { type: "ephemeral" } flag is applied to the PDF document content block in ClaudeVisionStrategy.process():
workers/api/src/services/agentic-vision-converter.ts

    {
      type: 'document',
      source: {
        type: 'base64',
        media_type: 'application/pdf',
        data: params.pdfBase64,
      },
      cache_control: { type: 'ephemeral' },
    }

The flag is always applied, even on single-pass conversions. The 25% write premium on a one-off call is negligible ($0.75/MTok on what is typically a small document), and unconditional application keeps the strategy stateless.
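Caching covers the request prefix up to and including the marked block, so the PDF block needs to come before any content that changes between iterations. The following is a minimal sketch of assembling the per-iteration content under that constraint. The helper name `buildIterationContent`, its parameters, the local block types, and the placeholder base64 strings are illustrative assumptions, not code from the converter.

```typescript
// Minimal local types for the three content blocks used in this sketch.
type DocumentBlock = {
  type: 'document';
  source: { type: 'base64'; media_type: 'application/pdf'; data: string };
  cache_control?: { type: 'ephemeral' };
};
type ImageBlock = {
  type: 'image';
  source: { type: 'base64'; media_type: 'image/png'; data: string };
};
type TextBlock = { type: 'text'; text: string };
type ContentBlock = DocumentBlock | ImageBlock | TextBlock;

// Hypothetical helper (not from the converter source): the stable PDF block
// goes first and carries cache_control; the changing blocks follow it.
function buildIterationContent(
  pdfBase64: string,
  promptText: string,
  screenshotBase64?: string,
): ContentBlock[] {
  const blocks: ContentBlock[] = [
    {
      type: 'document',
      source: { type: 'base64', media_type: 'application/pdf', data: pdfBase64 },
      cache_control: { type: 'ephemeral' },
    },
  ];
  // The screenshot only exists on refinement passes and differs each time.
  if (screenshotBase64 !== undefined) {
    blocks.push({
      type: 'image',
      source: { type: 'base64', media_type: 'image/png', data: screenshotBase64 },
    });
  }
  blocks.push({ type: 'text', text: promptText });
  return blocks;
}

// Placeholder payloads, for illustration only.
const initialContent = buildIterationContent('JVBERi0x', 'Convert this PDF.');
const refinementContent = buildIterationContent(
  'JVBERi0x',
  'Fix the differences visible in the screenshot.',
  'iVBORw0KGgo',
);
console.log(initialContent.map((b) => b.type));
console.log(refinementContent.map((b) => b.type));
```

Keeping the PDF block byte-identical and first in the content array is what allows the later calls to hit the cache entry written by the first one.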
Token Flow Across Iterations
| Iteration | What Happens | Token Type |
|---|---|---|
| 1 (initial) | PDF processed and written to cache | cache_creation_input_tokens |
| 2+ (refinement) | PDF served from cache | cache_read_input_tokens |
The screenshot and prompt text change on every iteration, so only the PDF block benefits from caching.
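The two token types in the table are reported in the `usage` object of each Messages API response. A minimal sketch of summing them across iterations follows. The `CacheTotals` shape, the `accumulateCacheUsage` helper, and the sample numbers are assumptions for illustration, and the converter may track these values differently.

```typescript
// Subset of the `usage` object on a Messages API response.
// The cache fields can be absent or null when no caching took place.
interface ApiUsage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

// Hypothetical running totals for one conversion.
interface CacheTotals {
  cacheWriteTokens: number;
  cacheReadTokens: number;
}

// Adds one response's cache counters to the running totals.
function accumulateCacheUsage(totals: CacheTotals, usage: ApiUsage): CacheTotals {
  return {
    cacheWriteTokens: totals.cacheWriteTokens + (usage.cache_creation_input_tokens ?? 0),
    cacheReadTokens: totals.cacheReadTokens + (usage.cache_read_input_tokens ?? 0),
  };
}

// Illustrative values: iteration 1 writes the PDF, iterations 2 and 3 read it.
const perIterationUsage: ApiUsage[] = [
  { input_tokens: 400, output_tokens: 1700, cache_creation_input_tokens: 10000, cache_read_input_tokens: 0 },
  { input_tokens: 900, output_tokens: 1600, cache_creation_input_tokens: 0, cache_read_input_tokens: 10000 },
  { input_tokens: 900, output_tokens: 1500, cache_creation_input_tokens: 0, cache_read_input_tokens: 10000 },
];

const startingTotals: CacheTotals = { cacheWriteTokens: 0, cacheReadTokens: 0 };
const cacheTotals = perIterationUsage.reduce(accumulateCacheUsage, startingTotals);
console.log(cacheTotals);
```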
Cache Lifetime
Anthropic's ephemeral cache has a 5-minute TTL, refreshed on each cache hit. Since refinement iterations happen seconds apart, the cache stays warm for the entire conversion.
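The "stays warm" reasoning can be expressed as a check on the gaps between consecutive calls. This is a sketch under the assumption that each call (the initial write or a later hit) restarts the 5-minute window. The function name `cacheStaysWarm` and the sample timestamps are hypothetical.

```typescript
const EPHEMERAL_TTL_MS = 5 * 60 * 1000;

// Returns true when no gap between consecutive calls reaches the TTL,
// i.e. the cache entry should not have expired at any point in the loop.
function cacheStaysWarm(callTimestampsMs: number[], ttlMs: number = EPHEMERAL_TTL_MS): boolean {
  let previous: number | undefined;
  for (const current of callTimestampsMs) {
    if (previous !== undefined && current - previous >= ttlMs) {
      return false;
    }
    previous = current;
  }
  return true;
}

// Iterations 20 seconds apart: every call lands well inside the window.
const quickLoop = [0, 20000, 40000, 60000];
// A 6-minute stall before the last call: the entry would have expired.
const stalledLoop = [0, 20000, 20000 + 6 * 60 * 1000];

console.log(cacheStaysWarm(quickLoop));
console.log(cacheStaysWarm(stalledLoop));
```

If a refinement pass ever stalls longer than the TTL, the next call pays the write premium again instead of reading from cache.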
Cost Model
Rates are for Claude Sonnet 4.6 (claude-sonnet-4-6):
| Token Type | Rate | vs Normal Input |
|---|---|---|
| Normal input | $3.00 / MTok | baseline |
| Cache write | $3.75 / MTok | +25% surcharge |
| Cache read | $0.30 / MTok | -90% discount |
Savings Formula
    savings    = cacheReadTokens × ($3.00 - $0.30) / 1,000,000
    writeCost  = cacheWriteTokens × $0.75 / 1,000,000
    netSavings = savings - writeCost

Example: Typical 5-Page PDF
Assume the PDF produces ~10,000 input tokens per call:
| | Tokens | Cost |
|---|---|---|
| Iteration 1 (cache write) | 10,000 write | $0.0075 extra |
| Iterations 2-5 (cache reads) | 40,000 read | $0.0120 instead of $0.1200 |
| Net savings | | $0.1005 |
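That's a ~84% reduction relative to the $0.1200 that iterations 2-5 would cost for PDF input without caching. Measured against the uncached PDF input cost of all five iterations ($0.1500), the reduction is 67%.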
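The savings formula and the worked example can be checked with a few lines of code. This is a minimal sketch: the rates are copied from the Cost Model table, while the constant names and the `netCacheSavingsUsd` function are assumptions made for illustration.

```typescript
// USD per million tokens, from the Cost Model table.
const NORMAL_INPUT_RATE = 3.0;
const CACHE_WRITE_RATE = 3.75;
const CACHE_READ_RATE = 0.3;

// Net savings = discount earned on cache reads minus premium paid on cache writes.
function netCacheSavingsUsd(cacheReadTokens: number, cacheWriteTokens: number): number {
  const savings = (cacheReadTokens * (NORMAL_INPUT_RATE - CACHE_READ_RATE)) / 1_000_000;
  const writeCost = (cacheWriteTokens * (CACHE_WRITE_RATE - NORMAL_INPUT_RATE)) / 1_000_000;
  return savings - writeCost;
}

// Worked example: one 10,000-token write, then four 10,000-token reads.
const exampleNetSavings = netCacheSavingsUsd(40_000, 10_000);
console.log(exampleNetSavings.toFixed(4)); // prints "0.1005"
```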
Scaling Impact
For high-fidelity conversions (6+ iterations), savings increase with each additional iteration: the write premium is paid once, while every further pass earns the full read discount.
| Iterations | Cache Writes | Cache Reads | Net Savings | % Saved vs. Uncached Refinement Passes |
|---|---|---|---|---|
| 1 (single-pass) | 10,000 | 0 | -$0.0075 | n/a (write premium adds 25% to the single call) |
| 2 | 10,000 | 10,000 | $0.0195 | 65% |
| 4 | 10,000 | 30,000 | $0.0735 | 82% |
| 6 | 10,000 | 50,000 | $0.1275 | 85% |
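The percentages for two or more iterations match net savings divided by what the refinement passes (iterations 2+) would cost without caching. The sketch below reproduces the rows under that reading, assuming a constant 10,000 PDF tokens per call. `ScalingRow` and `scalingRow` are hypothetical names introduced for this example.

```typescript
interface ScalingRow {
  iterations: number;
  cacheWriteTokens: number;
  cacheReadTokens: number;
  netSavingsUsd: number;
  // Net savings as a share of the uncached cost of iterations 2+;
  // null for a single pass, where there are no refinement passes.
  pctOfUncachedRefinementCost: number | null;
}

function scalingRow(iterations: number, pdfTokens: number): ScalingRow {
  // USD per million tokens, from the Cost Model table.
  const perMTok = { normal: 3.0, write: 3.75, read: 0.3 };
  const cacheReadTokens = pdfTokens * (iterations - 1);
  const readDiscount = (cacheReadTokens * (perMTok.normal - perMTok.read)) / 1e6;
  const writePremium = (pdfTokens * (perMTok.write - perMTok.normal)) / 1e6;
  const netSavingsUsd = readDiscount - writePremium;
  const uncachedRefinementCost = (cacheReadTokens * perMTok.normal) / 1e6;
  return {
    iterations,
    cacheWriteTokens: pdfTokens,
    cacheReadTokens,
    netSavingsUsd,
    pctOfUncachedRefinementCost:
      uncachedRefinementCost > 0
        ? Math.round((netSavingsUsd / uncachedRefinementCost) * 100)
        : null,
  };
}

for (const iterationCount of [1, 2, 4, 6]) {
  const row = scalingRow(iterationCount, 10000);
  console.log(row.iterations, row.netSavingsUsd.toFixed(4), row.pctOfUncachedRefinementCost);
}
```

The write premium is a fixed cost, so the share it takes out of the read discount shrinks as the iteration count grows. The percentage approaches but never reaches the 90% read discount.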
Tracking and Logging
Log Output
After the iteration loop completes, cache stats are logged when any caching occurred:
    Agentic Vision: Completed in 5 iterations. Tokens: 62,000 in / 8,400 out. Cost: $0.2120
    Cache: 10,000 write tokens, 40,000 read tokens. Net savings: $0.1005

TokenUsage Fields
Three optional fields are included in the returned TokenUsage object (defined in packages/shared/src/benchmark-types.ts):
    interface TokenUsage {
      inputTokens: number;
      outputTokens: number;
      model: string;
      estimatedCostUsd: number;
      cacheReadTokens?: number;     // Total tokens served from cache
      cacheWriteTokens?: number;    // Total tokens written to cache
      netCacheSavingsUsd?: number;  // Net dollar savings (reads discount minus write premium)
    }

These fields are populated by convertWithAgenticVision() and propagate through to API responses, enabling cost dashboards to report cache efficiency.
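As an illustration of how a consumer might use the optional fields, here is a sketch that renders the cache line of the log output from a `TokenUsage` value and emits nothing when no caching occurred. The `formatCacheLogLine` helper is hypothetical, and the sample object reuses the figures from the log example; the converter's actual logging code may differ.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  model: string;
  estimatedCostUsd: number;
  cacheReadTokens?: number;
  cacheWriteTokens?: number;
  netCacheSavingsUsd?: number;
}

// Hypothetical formatter: returns null when there is nothing cache-related to report.
function formatCacheLogLine(usage: TokenUsage): string | null {
  const writeTokens = usage.cacheWriteTokens ?? 0;
  const readTokens = usage.cacheReadTokens ?? 0;
  if (writeTokens === 0 && readTokens === 0) {
    return null;
  }
  const netSavings = usage.netCacheSavingsUsd ?? 0;
  return (
    `Cache: ${writeTokens.toLocaleString('en-US')} write tokens, ` +
    `${readTokens.toLocaleString('en-US')} read tokens. ` +
    `Net savings: $${netSavings.toFixed(4)}`
  );
}

const sampleUsage: TokenUsage = {
  inputTokens: 62000,
  outputTokens: 8400,
  model: 'claude-sonnet-4-6',
  estimatedCostUsd: 0.212,
  cacheReadTokens: 40000,
  cacheWriteTokens: 10000,
  netCacheSavingsUsd: 0.1005,
};

console.log(formatCacheLogLine(sampleUsage));
```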
Scope
- Enabled for: ClaudeVisionStrategy (all Claude models)
- Not applicable to: GeminiVisionStrategy (Google has a separate caching mechanism)
- Chunked conversions: Each page runs its own agentic loop, so each page gets independent caching. A 10-page document benefits from caching within each page's iteration loop, not across pages.