Prompt caching

Provider prompt caching stores the key-value representations of a prompt prefix so subsequent requests with the same prefix skip recomputation.

This complements Article I: immutable metering, which requires cached tokens as a distinct ledger line item.

Expected impact

A PwC study across OpenAI, Anthropic, and Google (2026) found prompt caching delivered:

Metric	Range
Cost reduction	41–80%
Time-to-first-token improvement	13–31%

Results depend on cache hit rate and prefix stability. Target 70%+ hit rate for meaningful savings.
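To make the dependence on hit rate concrete, the arithmetic can be sketched as follows. The multipliers are illustrative assumptions, not figures from the study: a cache read is billed at 0.10x the normal input price and a cache write at 1.25x. Substitute the values from your provider's price sheet.

```python
def relative_prefix_cost(hit_rate: float,
                         read_multiplier: float = 0.10,
                         write_multiplier: float = 1.25) -> float:
    """Expected prefix cost relative to uncached input (1.0 = no change).

    Assumes every request either reads the prefix from cache (a hit) or
    writes it to the cache (a miss). Both multipliers are hypothetical.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * read_multiplier + (1.0 - hit_rate) * write_multiplier


for rate in (0.3, 0.7, 0.9):
    cost = relative_prefix_cost(rate)
    print(f"hit rate {rate:.0%}: prefix costs {cost:.1%} of the uncached price")
```

Under these assumed multipliers, a 30% hit rate leaves the prefix at about 90% of its uncached price, while 70% brings it down to about 45%. The figure covers the prefix only; the dynamic suffix is billed at the normal rate either way.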

Prompt structure: stable prefix, dynamic suffix

Caching works on prefix matching. Content at the start of your prompt must be byte-identical across requests. Variable content belongs at the end.

┌─────────────────────────────────────┐
│ STABLE PREFIX (cached)              │
│  • System prompt                    │
│  • Tool definitions                 │
│  • Reference documents              │
│  • Few-shot examples (if fixed)     │
├─────────────────────────────────────┤
│ DYNAMIC SUFFIX (not cached)         │
│  • User query                       │
│  • Session-specific context         │
│  • Conversation history             │
│  • Tool results from this session   │
└─────────────────────────────────────┘

If you have been mixing static and dynamic content throughout your prompt, separating them is often the highest-ROI engineering task of the week.
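A minimal sketch of this separation, assuming the prompt is assembled as a single string. All names and content here (`SYSTEM_PROMPT`, `TOOL_DEFINITIONS`, `REFERENCE_DOCS`, `build_prompt`) are hypothetical; real provider APIs usually take structured messages, but the ordering principle is the same.

```python
import json

# Hypothetical stable content: must serialize to identical bytes on every request.
SYSTEM_PROMPT = "You are a support assistant. Answer concisely."
TOOL_DEFINITIONS = [
    {"name": "lookup_order", "parameters": {"order_id": "string"}},
]
REFERENCE_DOCS = "Refund policy: refunds are accepted within 30 days."


def build_stable_prefix() -> str:
    """Serialize the cacheable part deterministically."""
    # sort_keys and fixed separators keep the JSON identical across runs.
    tools = json.dumps(TOOL_DEFINITIONS, sort_keys=True, separators=(",", ":"))
    return "\n\n".join([SYSTEM_PROMPT, tools, REFERENCE_DOCS])


def build_prompt(user_query: str, session_context: str = "") -> str:
    """Stable prefix first, everything request-specific after it."""
    suffix_parts = [part for part in (session_context, user_query) if part]
    return build_stable_prefix() + "\n\n" + "\n\n".join(suffix_parts)


first = build_prompt("Where is my order?")
second = build_prompt("I want a refund.", session_context="Customer tier: gold")
# Both prompts begin with exactly the same prefix.
print(first.startswith(build_stable_prefix()), second.startswith(build_stable_prefix()))
```

Keeping the prefix behind a single function makes it harder for request-specific values to leak into the cached portion by accident.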

Provider-agnostic rules

Rule	Detail
Minimum prefix size	Often ~1,024 tokens for effective caching (provider-dependent)
TTL	Standard is ~5 minutes; longer TTL may cost more to write
Cache isolation	Caches are typically workspace-scoped, not org-wide
Hit rate target	70%+ for meaningful savings
Monitor `cache_read_tokens`	Track separately from standard input tokens

Cache-busters to avoid

These invalidate your prefix and force full recomputation:

Timestamps or dates in the system prompt (e.g. "Today's date is...")
Shuffled few-shot example order between requests
Dynamic tool results placed before the user query
Per-request unique identifiers in the system prompt
Model or version strings that change on deploy without a cache invalidation strategy
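One way to catch cache-busters like these before they reach production is to fingerprint the prefix on every request and count how often it changes. The sketch below is one possible approach, not a provider feature; `prefix_fingerprint` and `PrefixDriftDetector` are hypothetical names.

```python
import hashlib
from typing import Optional


def prefix_fingerprint(prefix: str) -> str:
    """SHA-256 of the prefix's UTF-8 bytes; any byte change alters it."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


class PrefixDriftDetector:
    """Counts how often the prefix fingerprint changes between requests."""

    def __init__(self) -> None:
        self.last: Optional[str] = None
        self.changes = 0

    def observe(self, prefix: str) -> bool:
        """Return True if this prefix differs from the previous one."""
        current = prefix_fingerprint(prefix)
        changed = self.last is not None and current != self.last
        if changed:
            self.changes += 1
        self.last = current
        return changed


detector = PrefixDriftDetector()
for day in ("2024-01-01", "2024-01-02", "2024-01-03"):
    # A date embedded in the system prompt changes the prefix on every request.
    detector.observe(f"Today's date is {day}. You are a helpful assistant.")
print(detector.changes)
```

A stable prefix should produce zero changes between deploys. A count that grows with request volume points to a timestamp, identifier, or unstable ordering somewhere in the prefix.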

When caching hurts

Naive full-context caching can increase latency when dynamic content is included in the cached prefix: each request then writes a new cache entry that no later request reads. The winning strategy is targeted:

Cache: stable system prompt, tool definitions, reference docs
Do not cache: tool results, conversation history, per-user context

If your cache hit rate is below 30%, you are paying write costs without meaningful read savings. Fix prompt structure before investing in longer TTLs.
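The point at which write costs outweigh read savings can be estimated with a short calculation. This is a sketch under assumed pricing, with cache reads billed at 0.10x and cache writes at 1.25x the normal input price; both multipliers are illustrative and provider-dependent. With hit rate h, the expected prefix cost is h * read + (1 - h) * write, and caching breaks even when that equals 1.0.

```python
def break_even_hit_rate(read_multiplier: float = 0.10,
                        write_multiplier: float = 1.25) -> float:
    """Hit rate at which h*read + (1 - h)*write equals the uncached price 1.0.

    Both multipliers are hypothetical defaults; use your provider's pricing.
    """
    if not read_multiplier < 1.0 <= write_multiplier:
        raise ValueError("expected read_multiplier < 1.0 <= write_multiplier")
    return (write_multiplier - 1.0) / (write_multiplier - read_multiplier)


print(f"break-even hit rate: {break_even_hit_rate():.1%}")
```

With these assumed multipliers the break-even point is roughly 22%, so a hit rate just under 30% recovers only a small fraction of the prefix cost. Higher write premiums, such as those sometimes charged for longer TTLs, push the break-even point up.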

Measuring success

Track these metrics per feature:

Metric	Healthy range
Cache hit rate	>70%
Cached input as % of total input	Rising over time
Cost per request (cached vs uncached)	Cached significantly lower
P50 latency	Improved or flat
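The first two metrics can be derived from per-request usage records, as in the sketch below. The `Usage` type and the `input_tokens` field name are assumptions, and "hit rate" is defined here as the share of requests that read anything from the cache; a token-weighted definition is equally reasonable.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Usage:
    """Per-request token counts (hypothetical record shape)."""
    input_tokens: int        # uncached input tokens (assumed field name)
    cache_read_tokens: int   # tokens served from the cache
    cache_write_tokens: int  # tokens written to the cache


def cache_metrics(records: List[Usage]) -> Dict[str, float]:
    """Compute request-level hit rate and cached share of total input."""
    if not records:
        return {"hit_rate": 0.0, "cached_input_share": 0.0}
    hits = sum(1 for r in records if r.cache_read_tokens > 0)
    cached = sum(r.cache_read_tokens for r in records)
    total_input = sum(
        r.input_tokens + r.cache_read_tokens + r.cache_write_tokens
        for r in records
    )
    return {
        "hit_rate": hits / len(records),
        "cached_input_share": cached / total_input if total_input else 0.0,
    }


# One cache write followed by three cache reads of a 900-token prefix.
records = [Usage(100, 0, 900)] + [Usage(100, 900, 0)] * 3
print(cache_metrics(records))
```

Computing these per feature, as the table suggests, means grouping the records by a feature tag before calling `cache_metrics`.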

Use Narev's pricing API to calculate cost with cache_read_tokens and cache_write_tokens as separate line items.

Relationship to semantic caching

Provider prompt caching discounts repeated prefix processing within a session or short TTL window. Semantic caching is a separate layer that can bypass the LLM entirely for semantically similar queries. Use both:

Provider caching for stable prefixes (Tier 1)
Semantic caching for repetitive user queries (Tier 3)
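How the two layers fit together can be sketched as a lookup in front of the model call. This is a toy illustration: word-count cosine similarity stands in for a real embedding model, the 0.8 threshold is arbitrary, and `SemanticCache`, `answer`, and `call_llm` are hypothetical names.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Optional, Tuple


def _vectorize(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[word] for word, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a stored answer when a new query is similar enough to an old one."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def get(self, query: str) -> Optional[str]:
        vec = _vectorize(query)
        best = max(self.entries, key=lambda e: _cosine(vec, e[0]), default=None)
        if best is not None and _cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, answer_text: str) -> None:
        self.entries.append((_vectorize(query), answer_text))


def answer(query: str, cache: SemanticCache, call_llm: Callable[[str], str]) -> str:
    cached = cache.get(query)
    if cached is not None:
        return cached          # semantic hit: the LLM is bypassed entirely
    result = call_llm(query)   # miss: provider prefix caching applies inside this call
    cache.put(query, result)
    return result
```

Only the misses reach the provider, where the stable prefix is still served from the provider's cache, so the two layers reduce cost at different points in the request path.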
