Prompt Caching in the Agentic Vision Converter

Overview

The agentic vision pipeline sends the same PDF document to Claude on every iteration: once for the initial conversion and again on each refinement pass. Without caching, the API re-processes and bills the full PDF input on every call.

Anthropic’s prompt caching lets us mark the PDF content block as cacheable. The API stores the processed input on the first call and serves it from cache on subsequent calls made within the cache lifetime (5 minutes). Cache reads are billed at 10% of the normal input rate, which cuts the PDF input token cost on refinement iterations by 90%.

How It Works

What Gets Cached

The cache_control: { type: "ephemeral" } flag is applied to the PDF document content block in ClaudeVisionStrategy.process():

workers/api/src/services/agentic-vision-converter.ts

{
  type: 'document',
  source: {
    type: 'base64',
    media_type: 'application/pdf',
    data: params.pdfBase64,
  },
  cache_control: { type: 'ephemeral' },
}

The flag is always applied, even on single-pass conversions. The 25% write premium on a one-off call is negligible ($0.75/MTok on what is typically a small document), and unconditional application keeps the strategy stateless.
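As a minimal sketch of how the block might sit inside a request, the helper below builds a user-message content array with the cacheable PDF block first and the per-iteration parts after it. The names buildUserContent, prompt, and screenshotBase64 are hypothetical, the simplified types are local stand-ins for the SDK's, and the PNG media type is an assumption; only the document block shape comes from the strategy code above.

```typescript
// Local, simplified types for illustration; the real SDK types are richer.
type DocumentBlock = {
  type: 'document';
  source: { type: 'base64'; media_type: 'application/pdf'; data: string };
  cache_control?: { type: 'ephemeral' };
};

type ImageBlock = {
  type: 'image';
  source: { type: 'base64'; media_type: 'image/png'; data: string };
};

type TextBlock = { type: 'text'; text: string };

type ContentBlock = DocumentBlock | ImageBlock | TextBlock;

// Hypothetical helper: the PDF block is always marked cacheable and placed
// first, so the stable part of the request precedes the parts that change.
function buildUserContent(
  pdfBase64: string,
  prompt: string,
  screenshotBase64?: string,
): ContentBlock[] {
  const content: ContentBlock[] = [
    {
      type: 'document',
      source: { type: 'base64', media_type: 'application/pdf', data: pdfBase64 },
      cache_control: { type: 'ephemeral' },
    },
  ];
  // Refinement passes add a screenshot (assumed PNG here); the first pass has none.
  if (screenshotBase64 !== undefined) {
    content.push({
      type: 'image',
      source: { type: 'base64', media_type: 'image/png', data: screenshotBase64 },
    });
  }
  content.push({ type: 'text', text: prompt });
  return content;
}
```

Because the helper takes no iteration counter, the same call shape works for single-pass and multi-pass conversions, which is the statelessness the unconditional flag is meant to preserve.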

Token Flow Across Iterations

Iteration	What Happens	Token Type
1 (initial)	PDF processed and written to cache	`cache_creation_input_tokens`
2+ (refinement)	PDF served from cache	`cache_read_input_tokens`

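The screenshot and prompt text change on every iteration, so only the PDF block benefits from caching. Because caching matches on the request prefix up to the marked block, the PDF block has to come before that changing content for later iterations to hit the cache.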
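The two token types in the table arrive in the usage object of each API response. A rough sketch of totalling them across iterations is shown below; the ApiUsage type is a trimmed local stand-in for the response's usage object, and sumCacheTokens is a hypothetical helper, not the converter's actual code.

```typescript
// Trimmed stand-in for the usage object returned with each response.
type ApiUsage = {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

type CacheTotals = { cacheWriteTokens: number; cacheReadTokens: number };

// Hypothetical helper: sums cache writes and reads over all iterations,
// treating missing or null counters as zero.
function sumCacheTokens(usages: ApiUsage[]): CacheTotals {
  let cacheWriteTokens = 0;
  let cacheReadTokens = 0;
  for (const usage of usages) {
    cacheWriteTokens += usage.cache_creation_input_tokens ?? 0;
    cacheReadTokens += usage.cache_read_input_tokens ?? 0;
  }
  return { cacheWriteTokens, cacheReadTokens };
}

// Illustrative numbers only: one write on the initial call, then two cache hits.
const totals = sumCacheTokens([
  { input_tokens: 500, output_tokens: 1500, cache_creation_input_tokens: 10_000, cache_read_input_tokens: 0 },
  { input_tokens: 900, output_tokens: 1200, cache_creation_input_tokens: 0, cache_read_input_tokens: 10_000 },
  { input_tokens: 900, output_tokens: 1100, cache_read_input_tokens: 10_000 },
]);
```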

Cache Lifetime

Anthropic’s ephemeral cache has a 5-minute TTL, refreshed on each cache hit. Since refinement iterations happen seconds apart, the cache stays warm for the entire conversion.
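To make the refresh behaviour concrete, here is a small local model of the TTL, not a call into the API. Given request timestamps, it labels each request as a cache write or a cache read, assuming every request restarts the 5-minute window. The function names and the exact boundary handling are assumptions made for illustration.

```typescript
const CACHE_TTL_MS = 5 * 60 * 1000;

// Hypothetical helper: would a request at nowMs still fall inside the TTL
// window started by the previous request at lastTouchMs?
function isCacheLikelyWarm(lastTouchMs: number, nowMs: number): boolean {
  return nowMs - lastTouchMs < CACHE_TTL_MS;
}

// Labels each request as a cache write or read under this simplified model.
function classifyRequests(timestampsMs: number[]): ('write' | 'read')[] {
  const labels: ('write' | 'read')[] = [];
  let lastTouchMs: number | undefined;
  for (const t of timestampsMs) {
    const warm = lastTouchMs !== undefined && isCacheLikelyWarm(lastTouchMs, t);
    labels.push(warm ? 'read' : 'write');
    lastTouchMs = t; // both writes and reads restart the TTL window
  }
  return labels;
}

// Iterations 20 seconds apart stay warm; a 6-minute gap forces a new write.
const labels = classifyRequests([0, 20_000, 40_000, 400_000]);
```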

Cost Model

Rates are for Claude Sonnet 4.6 (claude-sonnet-4-6):

Token Type	Rate	vs Normal Input
Normal input	$3.00 / MTok	baseline
Cache write	$3.75 / MTok	+25% surcharge
Cache read	$0.30 / MTok	-90% discount

Savings Formula

savings      = cacheReadTokens  x ($3.00 - $0.30) / 1,000,000
writeCost    = cacheWriteTokens x $0.75 / 1,000,000
netSavings   = savings - writeCost
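The formula translates directly into code. The sketch below hard-codes the Sonnet rates from the table above; the constant and function names are hypothetical and may not match the converter's own.

```typescript
// USD per million tokens, taken from the rate table above.
const INPUT_USD_PER_MTOK = 3.0;
const CACHE_READ_USD_PER_MTOK = 0.3;
const CACHE_WRITE_PREMIUM_USD_PER_MTOK = 0.75; // $3.75 write rate minus $3.00 base rate

// Hypothetical helper: read discount minus write premium, in USD.
function netCacheSavingsUsd(cacheReadTokens: number, cacheWriteTokens: number): number {
  const savings =
    (cacheReadTokens * (INPUT_USD_PER_MTOK - CACHE_READ_USD_PER_MTOK)) / 1_000_000;
  const writeCost = (cacheWriteTokens * CACHE_WRITE_PREMIUM_USD_PER_MTOK) / 1_000_000;
  return savings - writeCost;
}
```

For 40,000 read tokens and 10,000 write tokens, netCacheSavingsUsd(40_000, 10_000) comes to roughly $0.1005 (subject to floating-point rounding). A call with no reads returns a negative value, which is the write premium paid on a single-pass conversion.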

Example: Typical 5-Page PDF

Assume the PDF produces ~10,000 input tokens per call:

	Tokens	Cost
Iteration 1 (cache write)	10,000 write	$0.0075 extra
Iterations 2-5 (cache reads)	40,000 read	$0.0120 instead of $0.1200
Net savings		$0.1005

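That’s a ~67% reduction in total PDF input cost for the conversion ($0.0495 instead of $0.1500 across all five calls), or ~84% measured against the four refinement calls alone.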

Scaling Impact

For high-fidelity conversions (6+ iterations), savings grow with each additional iteration, since the write premium is paid once while the read discount applies to every later call:

Iterations	Cache Writes	Cache Reads	Net Savings	% Saved vs Uncached PDF Input
1 (single-pass)	10,000	0	-$0.0075	-25% (write premium)
2	10,000	10,000	$0.0195	33%
4	10,000	30,000	$0.0735	61%
6	10,000	50,000	$0.1275	71%
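The Net Savings column can be reproduced by comparing the PDF input cost with and without caching for a given iteration count. This is a sketch under the table's assumptions: the same PDF token count is sent on every call, the first call is a cache write, and every later call is a cache hit. The names are hypothetical.

```typescript
// USD per million tokens, from the rate table in the Cost Model section.
const BASE_RATE = 3.0;
const WRITE_RATE = 3.75;
const READ_RATE = 0.3;

type PdfInputCost = { uncachedUsd: number; cachedUsd: number; netSavingsUsd: number };

// Hypothetical helper: PDF input cost over a whole conversion, with and without caching.
function pdfInputCost(pdfTokens: number, iterations: number): PdfInputCost {
  if (!Number.isInteger(iterations) || iterations < 1) {
    throw new RangeError('iterations must be a positive integer');
  }
  const mTok = pdfTokens / 1_000_000;
  // Without caching, every call pays the base rate for the full PDF.
  const uncachedUsd = iterations * mTok * BASE_RATE;
  // With caching, the first call pays the write rate and the rest pay the read rate.
  const cachedUsd = mTok * WRITE_RATE + (iterations - 1) * mTok * READ_RATE;
  return { uncachedUsd, cachedUsd, netSavingsUsd: uncachedUsd - cachedUsd };
}
```

With 10,000 PDF tokens, pdfInputCost(10_000, 6).netSavingsUsd is about 0.1275 and pdfInputCost(10_000, 1).netSavingsUsd is about -0.0075, matching the first and last rows. Under these rates caching is already ahead by the second iteration.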

Tracking and Logging

Log Output

After the iteration loop completes, cache stats are logged when any caching occurred:

Agentic Vision: Completed in 5 iterations. Tokens: 62,000 in / 8,400 out. Cost: $0.2120
Cache: 10,000 write tokens, 40,000 read tokens. Net savings: $0.1005
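One way the second log line could be produced is sketched below. The function name formatCacheLog is hypothetical, and returning null when nothing was cached is an assumption based on the "when any caching occurred" condition above.

```typescript
// Hypothetical helper: builds the cache summary line, or null if no caching occurred.
function formatCacheLog(
  cacheWriteTokens: number,
  cacheReadTokens: number,
  netSavingsUsd: number,
): string | null {
  if (cacheWriteTokens === 0 && cacheReadTokens === 0) {
    return null;
  }
  // en-US locale gives the comma thousands separators seen in the sample log.
  const fmt = (n: number): string => n.toLocaleString('en-US');
  return (
    `Cache: ${fmt(cacheWriteTokens)} write tokens, ` +
    `${fmt(cacheReadTokens)} read tokens. ` +
    `Net savings: $${netSavingsUsd.toFixed(4)}`
  );
}

const line = formatCacheLog(10_000, 40_000, 0.1005);
if (line !== null) {
  console.log(line);
}
```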

TokenUsage Fields

Three optional fields are included in the returned TokenUsage object (defined in packages/shared/src/benchmark-types.ts):

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  model: string;
  estimatedCostUsd: number;
  cacheReadTokens?: number;      // Total tokens served from cache
  cacheWriteTokens?: number;     // Total tokens written to cache
  netCacheSavingsUsd?: number;   // Net dollar savings (reads discount minus write premium)
}

These fields are populated by convertWithAgenticVision() and propagate through to API responses, enabling cost dashboards to report cache efficiency.
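A minimal sketch of attaching the three fields is shown below. The interface is repeated so the example is self-contained; withCacheStats is a hypothetical helper, and leaving the fields unset when no caching occurred is an assumption about how the optional fields are meant to be used.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  model: string;
  estimatedCostUsd: number;
  cacheReadTokens?: number;
  cacheWriteTokens?: number;
  netCacheSavingsUsd?: number;
}

// Hypothetical helper: returns a copy of `base` with cache stats attached.
// Rates are the Sonnet figures from the Cost Model section, in USD per MTok.
function withCacheStats(
  base: TokenUsage,
  cacheReadTokens: number,
  cacheWriteTokens: number,
): TokenUsage {
  if (cacheReadTokens === 0 && cacheWriteTokens === 0) {
    return base; // nothing cached: leave the optional fields unset
  }
  const readDiscountUsd = (cacheReadTokens * (3.0 - 0.3)) / 1_000_000;
  const writePremiumUsd = (cacheWriteTokens * 0.75) / 1_000_000;
  return {
    ...base,
    cacheReadTokens,
    cacheWriteTokens,
    netCacheSavingsUsd: readDiscountUsd - writePremiumUsd,
  };
}

// Values taken from the sample log output above.
const usage = withCacheStats(
  { inputTokens: 62_000, outputTokens: 8_400, model: 'claude-sonnet-4-6', estimatedCostUsd: 0.212 },
  40_000,
  10_000,
);
```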

Scope

Enabled for: ClaudeVisionStrategy (all Claude models)
Not applicable to: GeminiVisionStrategy (Google has a separate caching mechanism)
Chunked conversions: Each page runs its own agentic loop, so each page gets independent caching. A 10-page document benefits from caching within each page’s iteration loop, not across pages.