Prompt caching
Provider prompt caching stores the key-value representations of a prompt prefix so subsequent requests with the same prefix skip recomputation.
This complements Article I: immutable metering, which requires cached tokens as a distinct ledger line item.
Expected impact
A PwC study across OpenAI, Anthropic, and Google (2026) found prompt caching delivered:
| Metric | Range |
|---|---|
| Cost reduction | 41–80% |
| Time-to-first-token improvement | 13–31% |
Results depend on cache hit rate and prefix stability. Target 70%+ hit rate for meaningful savings.
Prompt structure: stable prefix, dynamic suffix
Caching works on prefix matching. Content at the start of your prompt must be byte-identical across requests. Variable content belongs at the end.
┌─────────────────────────────────────┐
│ STABLE PREFIX (cached) │
│ • System prompt │
│ • Tool definitions │
│ • Reference documents │
│ • Few-shot examples (if fixed) │
├─────────────────────────────────────┤
│ DYNAMIC SUFFIX (not cached) │
│ • User query │
│ • Session-specific context │
│ • Conversation history │
│ • Tool results from this session │
└─────────────────────────────────────┘

If you have been mixing static and dynamic content throughout your prompt, separating them is often the highest-ROI engineering task of the week.
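A minimal sketch of this layout in Python follows. The names `StablePrefix` and `build_prompt` are hypothetical and not tied to any provider SDK; the point is only that the prefix is rendered from fixed parts in a fixed order, while everything that varies per request is appended afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StablePrefix:
    """Hypothetical container for the parts of the prompt that never change."""
    system_prompt: str
    tool_definitions: str
    reference_docs: str

    def render(self) -> str:
        # Fixed order and fixed separators keep the rendered prefix
        # byte-identical from one request to the next.
        return "\n\n".join(
            [self.system_prompt, self.tool_definitions, self.reference_docs]
        )


def build_prompt(prefix: StablePrefix, history: list[str], user_query: str) -> str:
    # Stable content first, per-request content last.
    dynamic_suffix = "\n".join(history + [user_query])
    return prefix.render() + "\n\n" + dynamic_suffix
```

Two requests built from the same `StablePrefix` share an identical leading byte sequence, which is what prefix matching relies on. How the cached region is actually marked is provider-specific and not shown here.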
Provider-agnostic rules
| Rule | Detail |
|---|---|
| Minimum prefix size | Often ~1,024 tokens for effective caching (provider-dependent) |
| TTL | Standard is ~5 minutes; longer TTL may cost more to write |
| Cache isolation | Caches are typically workspace-scoped, not org-wide |
| Hit rate target | 70%+ for meaningful savings |
| Monitor cache_read_tokens | Track separately from standard input tokens |
Cache-busters to avoid
These invalidate your prefix and force full recomputation:
- Timestamps or dates in the system prompt (Today's date is...)
- Shuffled few-shot example order between requests
- Dynamic tool results placed before the user query
- Per-request unique identifiers in the system prompt
- Model or version strings that change on deploy without a cache invalidation strategy
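Some of these cache-busters can be caught before a prompt ships. The sketch below is one possible lint, not an exhaustive check: the regex patterns and the function names `find_cache_busters` and `prefix_fingerprint` are illustrative assumptions. The fingerprint can be logged per request; if it changes between requests that should share a prefix, something in the prefix is drifting.

```python
import hashlib
import re

# Illustrative patterns only (assumptions); extend them for your own prompts.
CACHE_BUSTER_PATTERNS = {
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "clock_time": re.compile(r"\b\d{1,2}:\d{2}(:\d{2})?\b"),
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
}


def find_cache_busters(system_prompt: str) -> list[str]:
    """Return the names of the patterns that match the system prompt."""
    return [
        name
        for name, pattern in CACHE_BUSTER_PATTERNS.items()
        if pattern.search(system_prompt)
    ]


def prefix_fingerprint(prefix: str) -> str:
    """Hash the exact bytes of the prefix so drift shows up in logs."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()
```

A regex lint cannot detect shuffled few-shot examples or reordered tool definitions; comparing fingerprints across requests covers those cases because any byte-level change produces a different hash.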
When caching hurts
Naive full-context caching can increase latency when dynamic content is included in the cached prefix. The winning strategy is targeted:
- Cache: stable system prompt, tool definitions, reference docs
- Do not cache: tool results, conversation history, per-user context
If your cache hit rate is below 30%, you are paying write costs without meaningful read savings. Fix prompt structure before investing in longer TTLs.
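The trade-off between write costs and read savings can be made concrete with a simplified cost model. The multipliers below (a cache write at 1.25x the normal input price, a cache read at 0.10x) are illustrative assumptions, not figures from this guide; substitute your provider's actual pricing. The model also assumes every miss pays the write price and considers only the cached prefix, not the dynamic suffix.

```python
def relative_prefix_cost(
    hit_rate: float,
    write_multiplier: float = 1.25,  # assumed price of a cache write vs. normal input
    read_multiplier: float = 0.10,   # assumed price of a cache read vs. normal input
) -> float:
    """Expected prefix cost per request, relative to no caching (1.0)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Hits pay the read price; misses pay the write price.
    return hit_rate * read_multiplier + (1.0 - hit_rate) * write_multiplier


def break_even_hit_rate(
    write_multiplier: float = 1.25,
    read_multiplier: float = 0.10,
) -> float:
    """Hit rate at which caching costs the same as not caching."""
    if write_multiplier <= read_multiplier:
        raise ValueError("write_multiplier must exceed read_multiplier")
    # Solve h * read + (1 - h) * write = 1 for h.
    return (write_multiplier - 1.0) / (write_multiplier - read_multiplier)
```

Under these assumed multipliers the break-even hit rate is roughly 22%, and a 30% hit rate trims the prefix cost by less than 10%, while a 70% hit rate cuts it by more than half.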
Measuring success
Track these metrics per feature:
| Metric | Healthy range |
|---|---|
| Cache hit rate | >70% |
| Cached input as % of total input | Rising over time |
| Cost per request (cached vs uncached) | Cached significantly lower |
| P50 latency | Improved or flat |
Use Narev's pricing API to calculate cost with cache_read_tokens and cache_write_tokens as separate line items.
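As a rough sketch, the first two metrics can be derived from per-request usage records. The `Usage` record and `cache_metrics` function are hypothetical; only `cache_read_tokens` and `cache_write_tokens` are names used in this guide. The sketch assumes `input_tokens` counts uncached input only, which varies by provider, and it defines hit rate at the request level (a request counts as a hit if it read any cached tokens), which is one of several reasonable definitions.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical per-request usage record."""
    input_tokens: int        # assumed to exclude cached tokens
    cache_read_tokens: int
    cache_write_tokens: int


def cache_metrics(records: list[Usage]) -> dict[str, float]:
    """Compute request-level hit rate and cached share of total input."""
    total_requests = len(records)
    hits = sum(1 for r in records if r.cache_read_tokens > 0)
    read_tokens = sum(r.cache_read_tokens for r in records)
    total_input = sum(
        r.input_tokens + r.cache_read_tokens + r.cache_write_tokens
        for r in records
    )
    return {
        "hit_rate": hits / total_requests if total_requests else 0.0,
        "cached_input_share": read_tokens / total_input if total_input else 0.0,
    }
```

Grouping the records by feature before calling `cache_metrics` gives the per-feature view described above.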
Relationship to semantic caching
Provider prompt caching discounts repeated prefix processing within a session or short TTL window. Semantic caching is a separate layer that can bypass the LLM entirely for semantically similar queries. Use both:
- Provider caching for stable prefixes (Tier 1)
- Semantic caching for repetitive user queries (Tier 3)
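The layering can be sketched as a lookup that consults a semantic cache first and falls through to the model only on a miss. This is a toy illustration under stated assumptions: `SemanticCache`, `answer`, and the `call_llm` callable are hypothetical names, and word-set Jaccard similarity stands in for the embedding similarity a real semantic cache would use.

```python
from typing import Callable, Optional


class SemanticCache:
    """Toy semantic cache: word-overlap similarity as a stand-in for embeddings."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self._entries: list[tuple[frozenset[str], str]] = []

    @staticmethod
    def _tokens(text: str) -> frozenset[str]:
        return frozenset(text.lower().split())

    def get(self, query: str) -> Optional[str]:
        q = self._tokens(query)
        best_score, best_answer = 0.0, None
        for tokens, cached_answer in self._entries:
            union = q | tokens
            score = len(q & tokens) / len(union) if union else 0.0
            if score > best_score:
                best_score, best_answer = score, cached_answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self._entries.append((self._tokens(query), response))


def answer(query: str, cache: SemanticCache, call_llm: Callable[[str], str]) -> str:
    # Semantic layer: a hit bypasses the model entirely.
    cached = cache.get(query)
    if cached is not None:
        return cached
    # Miss: call the model; provider prompt caching still applies to the
    # stable prefix inside call_llm (not shown here).
    response = call_llm(query)
    cache.put(query, response)
    return response
```

The two layers do not compete: the semantic cache removes whole requests, and provider caching lowers the cost of the requests that remain.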