Prompt caching

Provider prompt caching stores the key-value representations of a prompt prefix so subsequent requests with the same prefix skip recomputation.

This complements Article I: immutable metering, which requires cached tokens as a distinct ledger line item.

Expected impact

A PwC study across OpenAI, Anthropic, and Google (2026) found prompt caching delivered:

Metric                            Range
Cost reduction                    41–80%
Time-to-first-token improvement   13–31%

Results depend on cache hit rate and prefix stability. Target a hit rate of 70% or higher for meaningful savings.

Prompt structure: stable prefix, dynamic suffix

Caching works on prefix matching. Content at the start of your prompt must be byte-identical across requests. Variable content belongs at the end.

┌─────────────────────────────────────┐
│ STABLE PREFIX (cached)              │
│  • System prompt                    │
│  • Tool definitions                 │
│  • Reference documents              │
│  • Few-shot examples (if fixed)     │
├─────────────────────────────────────┤
│ DYNAMIC SUFFIX (not cached)         │
│  • User query                       │
│  • Session-specific context         │
│  • Conversation history             │
│  • Tool results from this session   │
└─────────────────────────────────────┘

If you have been mixing static and dynamic content throughout your prompt, separating them is often the highest-ROI engineering task of the week.
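As a minimal sketch of the layout above, the following keeps every stable part in one frozen object and appends the dynamic parts last. The names `StablePrefix` and `build_prompt` are hypothetical and not part of any provider SDK; real APIs usually take structured messages rather than one string, so treat this as an illustration of ordering only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class StablePrefix:
    """Content that must be byte-identical across requests (hypothetical helper)."""
    system_prompt: str
    tool_definitions: tuple[str, ...] = ()
    reference_docs: tuple[str, ...] = ()
    few_shot_examples: tuple[str, ...] = ()

    def render(self) -> str:
        # Fixed ordering: any reordering would change the bytes and miss the cache.
        parts = [
            self.system_prompt,
            *self.tool_definitions,
            *self.reference_docs,
            *self.few_shot_examples,
        ]
        return "\n\n".join(parts)


def build_prompt(
    prefix: StablePrefix,
    user_query: str,
    session_context: str = "",
    history: Sequence[str] = (),
    tool_results: Sequence[str] = (),
) -> str:
    """Place all per-request content after the stable prefix."""
    suffix_parts = [session_context, *history, *tool_results, user_query]
    suffix = "\n\n".join(part for part in suffix_parts if part)
    return prefix.render() + "\n\n" + suffix


prefix = StablePrefix(
    system_prompt="You are a support assistant.",
    tool_definitions=("tool: search_orders(order_id)",),
)
first = build_prompt(prefix, user_query="Where is my order?")
second = build_prompt(prefix, user_query="Cancel it.", history=("user: Where is my order?",))
# Both prompts start with the same rendered prefix, so a prefix cache can match.
print(first.startswith(prefix.render()) and second.startswith(prefix.render()))
```

Freezing the dataclass is a design choice to make accidental per-request mutation of the prefix harder.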

Provider-agnostic rules

Rule                        Detail
Minimum prefix size         Often ~1,024 tokens for effective caching (provider-dependent)
TTL                         Standard is ~5 minutes; longer TTL may cost more to write
Cache isolation             Caches are typically workspace-scoped, not org-wide
Hit rate target             70%+ for optimum savings
Monitor cache_read_tokens   Track separately from standard input tokens
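To get a rough feel for how the TTL rule interacts with traffic patterns, the sketch below estimates how many requests would find a warm cache given their arrival times. It assumes a ~5 minute TTL and that every request (hit or miss) refreshes or rewrites the cache entry; actual refresh behavior is provider-dependent, and `estimate_warm_hits` is a hypothetical helper.

```python
def estimate_warm_hits(request_times: list[float], ttl_seconds: float = 300.0) -> int:
    """Count requests arriving within ttl_seconds of the previous request.

    Assumption: each request refreshes or rewrites the cached prefix,
    so only the gap to the immediately preceding request matters.
    """
    hits = 0
    previous = None
    for t in sorted(request_times):
        if previous is not None and t - previous <= ttl_seconds:
            hits += 1
        previous = t
    return hits


# Arrival times in seconds: a burst, a long pause, then one more request.
arrivals = [0, 60, 120, 600, 660]
hits = estimate_warm_hits(arrivals)
print(f"{hits} of {len(arrivals)} requests would hit a warm cache")
```

If most gaps in your traffic exceed the TTL, a longer TTL or request batching may matter more than further prompt restructuring.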

Cache-busters to avoid

These invalidate your prefix and force full recomputation:

  • Timestamps or dates in the system prompt (Today's date is...)
  • Shuffled few-shot example order between requests
  • Dynamic tool results placed before the user query
  • Per-request unique identifiers in the system prompt
  • Model or version strings that change on deploy without a cache invalidation strategy
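One way to catch the cache-busters listed above before they reach production is to fingerprint the prefix of each outgoing request and count distinct values. This is a minimal sketch using a SHA-256 hash; `prefix_fingerprint` and `count_distinct_prefixes` are hypothetical helper names, and the prefix strings are invented for illustration.

```python
import hashlib
from typing import Iterable


def prefix_fingerprint(prefix: str) -> str:
    """Hash the exact bytes of the prefix; any byte change yields a new fingerprint."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def count_distinct_prefixes(prefixes: Iterable[str]) -> int:
    """A stable prefix should produce exactly one fingerprint across requests."""
    return len({prefix_fingerprint(p) for p in prefixes})


system_prompt = "You are a support assistant."

# Cache-buster: a date embedded in the system prompt changes the prefix daily.
busted = [f"{system_prompt} Today's date is 2025-01-0{day}." for day in (1, 2, 3)]

# Fix: keep the system prompt fixed and move the date into the dynamic suffix.
fixed = [system_prompt for _ in (1, 2, 3)]

print(count_distinct_prefixes(busted), count_distinct_prefixes(fixed))
```

Logging the fingerprint alongside each request makes it easy to spot a deploy or code path that silently changes the prefix.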

When caching hurts

Naive full-context caching can increase latency when dynamic content is included in the cached prefix. The winning strategy is targeted:

  • Cache: stable system prompt, tool definitions, reference docs
  • Do not cache: tool results, conversation history, per-user context

If your cache hit rate is below 30%, you are paying write costs without meaningful read savings. Fix prompt structure before investing in longer TTLs.
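The trade-off between write costs and read savings can be sketched with simple arithmetic. The multipliers below (cache writes at 1.25x and cache reads at 0.10x the standard input price) are illustrative assumptions, not quoted prices, and the model assumes every miss pays for a cache write. Check your provider's pricing before relying on the numbers.

```python
def relative_prefix_cost(
    hit_rate: float,
    write_multiplier: float = 1.25,  # assumed price of a cache write vs. standard input
    read_multiplier: float = 0.10,   # assumed price of a cache read vs. standard input
) -> float:
    """Expected cost of prefix tokens relative to no caching (1.0 = same cost)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * read_multiplier + (1.0 - hit_rate) * write_multiplier


def break_even_hit_rate(write_multiplier: float = 1.25, read_multiplier: float = 0.10) -> float:
    """Hit rate at which caching costs the same as not caching."""
    if write_multiplier <= read_multiplier:
        raise ValueError("write_multiplier must exceed read_multiplier")
    return (write_multiplier - 1.0) / (write_multiplier - read_multiplier)


for rate in (0.10, 0.30, 0.70):
    print(f"hit rate {rate:.0%}: relative prefix cost {relative_prefix_cost(rate):.3f}")
print(f"break-even hit rate: {break_even_hit_rate():.1%}")
```

Under these assumed multipliers, low hit rates leave the prefix costing about as much as (or more than) uncached input, which is why prompt structure should be fixed before paying for longer TTLs.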

Measuring success

Track these metrics per feature:

Metric                                  Healthy range
Cache hit rate                          >70%
Cached input as % of total input        Rising over time
Cost per request (cached vs uncached)   Cached significantly lower
P50 latency                             Improved or flat

Use Narev's pricing API to calculate cost with cache_read_tokens and cache_write_tokens as separate line items.
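The metrics above can be derived from per-request usage records. The sketch below is a local calculation, not a call to any pricing API. It assumes `input_tokens` counts only uncached input (providers differ on this), defines hit rate as the share of requests with any cache read, and uses placeholder per-token prices. The `Usage` record and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Usage:
    input_tokens: int        # standard (uncached) input tokens; assumed to exclude cache tokens
    cache_read_tokens: int   # tokens served from the cache
    cache_write_tokens: int  # tokens written to the cache on a miss


def cache_hit_rate(records: Sequence[Usage]) -> float:
    """Share of requests that read at least one token from the cache."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.cache_read_tokens > 0) / len(records)


def cached_input_share(records: Sequence[Usage]) -> float:
    """Cached input as a fraction of all input tokens."""
    total = sum(r.input_tokens + r.cache_read_tokens + r.cache_write_tokens for r in records)
    if total == 0:
        return 0.0
    return sum(r.cache_read_tokens for r in records) / total


def input_cost(r: Usage, price_input: float, price_read: float, price_write: float) -> float:
    """Cost of one request with cache reads and writes as separate line items."""
    return (
        r.input_tokens * price_input
        + r.cache_read_tokens * price_read
        + r.cache_write_tokens * price_write
    )


records = [
    Usage(input_tokens=200, cache_read_tokens=0, cache_write_tokens=2000),  # first request: miss
    Usage(input_tokens=200, cache_read_tokens=2000, cache_write_tokens=0),
    Usage(input_tokens=200, cache_read_tokens=2000, cache_write_tokens=0),
    Usage(input_tokens=200, cache_read_tokens=2000, cache_write_tokens=0),
]
print(f"hit rate: {cache_hit_rate(records):.0%}")
print(f"cached input share: {cached_input_share(records):.1%}")
```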

Relationship to semantic caching

Provider prompt caching discounts repeated prefix processing within a session or short TTL window. Semantic caching is a separate layer that can bypass the LLM entirely for semantically similar queries. Use both:

  1. Provider caching for stable prefixes (Tier 1)
  2. Semantic caching for repetitive user queries (Tier 3)
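The two layers can be combined roughly as follows: check a semantic cache first, and only call the model (where provider prefix caching applies) on a miss. This is a toy sketch. Word-overlap (Jaccard) similarity stands in for the embedding similarity a real semantic cache would use, and `SemanticCache`, `answer`, and `call_llm` are hypothetical names.

```python
from typing import Callable, Optional


def jaccard(a: str, b: str) -> float:
    """Toy similarity: overlap of lowercase word sets (stand-in for embeddings)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self._entries: list[tuple[str, str]] = []  # (query, answer) pairs

    def get(self, query: str) -> Optional[str]:
        """Return the stored answer for the most similar query, if similar enough."""
        best = max(self._entries, key=lambda e: jaccard(query, e[0]), default=None)
        if best is not None and jaccard(query, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, answer: str) -> None:
        self._entries.append((query, answer))


def answer(query: str, cache: SemanticCache, call_llm: Callable[[str], str]) -> str:
    """Semantic cache first; fall through to the model on a miss."""
    cached = cache.get(query)
    if cached is not None:
        return cached  # the LLM is bypassed entirely
    result = call_llm(query)  # provider prefix caching would apply inside this call
    cache.put(query, result)
    return result


calls: list[str] = []

def fake_llm(query: str) -> str:
    """Placeholder for a real model call; records each invocation."""
    calls.append(query)
    return f"answer to: {query}"

cache = SemanticCache()
answer("How do I reset my password", cache, fake_llm)
answer("how do I reset my password", cache, fake_llm)  # same words, served from cache
print(len(calls))
```

The similarity threshold controls the trade-off between reuse and the risk of returning an answer to a different question, so it is worth tuning per feature.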

Tokenminning · Built by Narev