Semantic caching

Semantic caching stores previous LLM responses indexed by query meaning, typically as an embedding vector. When a new query is similar enough to a cached one, the cached answer is returned without calling the model.

This is a Tier 3 optimization. Implement metering, prompt hygiene, and provider caching first.

Expected impact

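Semantic caching eliminates the full inference call on cache hits, saving 100% of the LLM input and output tokens for matched queries. The lookup itself (embedding the query and searching the cache) still has a small cost on every request. ROI depends entirely on query repetition: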

Workload	Cache hit potential
FAQ / support bot	High (60–80% hit rate possible)
Classification with limited categories	Moderate
Code generation	Low (queries are highly variable)
Unique generative content	Near zero
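
The dependence of ROI on hit rate can be made concrete with a little arithmetic. The sketch below assumes the lookup cost (embedding plus vector search) is paid on every request and that a hit avoids the whole inference cost. The function names and all dollar figures are hypothetical and chosen only for illustration.

```python
def net_savings_per_request(hit_rate: float, llm_cost: float, lookup_cost: float) -> float:
    """Average saving per request: avoided inference on hits minus lookup overhead.

    Assumes lookup_cost is paid on every request, hit or miss.
    """
    return hit_rate * llm_cost - lookup_cost


def break_even_hit_rate(llm_cost: float, lookup_cost: float) -> float:
    """Hit rate at which the cache pays for its own lookup overhead."""
    return lookup_cost / llm_cost


# Hypothetical costs: $0.01 per LLM call, $0.0005 per cache lookup.
faq_bot = net_savings_per_request(hit_rate=0.70, llm_cost=0.01, lookup_cost=0.0005)
creative = net_savings_per_request(hit_rate=0.02, llm_cost=0.01, lookup_cost=0.0005)
print(f"FAQ bot: {faq_bot:+.4f} USD/request, creative: {creative:+.4f} USD/request")
```

Under these assumed numbers the cache only pays off above a 5% hit rate, which is why workloads with near-zero repetition lose money on the lookup overhead.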

When to use semantic caching

Good fit:

High-traffic endpoints with repetitive queries
Support bots answering common questions
Classification tasks with finite input patterns
Tool-result lookups that change infrequently

Poor fit:

Unique creative generation per request
Real-time data queries (prices, inventory, news)
Tasks where a wrong cached answer is worse than no answer
Low-traffic endpoints (cache overhead exceeds savings)

Architecture: multi-tier caching

A layered approach covers the most ground:

Request
  ↓
┌─────────────────────┐
│ Tier 1: Exact match │  ← identical query string (sub-ms)
└─────────┬───────────┘
          ↓ miss
┌─────────────────────┐
│ Tier 2: Semantic    │  ← vector similarity (low-ms)
└─────────┬───────────┘
          ↓ miss
┌─────────────────────┐
│ Tier 3: Provider    │  ← prefix cache (see prompt caching guide)
│         prefix cache│
└─────────┬───────────┘
          ↓ miss
┌─────────────────────┐
│ Tier 4: Full LLM    │  ← inference call
│         inference   │
└─────────────────────┘

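Each tier catches queries the previous tier missed. After a full miss, store the LLM response in the exact-match and semantic caches (Tiers 1 and 2) so that future identical or similar queries are served without inference.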
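
The first two tiers and the write-back on a miss can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: `TieredCache`, `answer`, the `embed` callable, and the `call_llm` callable are hypothetical names, the semantic tier is a linear scan rather than a vector index, and the provider prefix cache (Tier 3) is assumed to be handled inside `call_llm`.

```python
import math
from typing import Callable, Optional

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TieredCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.92):
        self.embed = embed          # assumed: any text -> vector function
        self.threshold = threshold  # minimum similarity for a semantic hit
        self.exact: dict[str, str] = {}               # Tier 1
        self.semantic: list[tuple[Vector, str]] = []  # Tier 2

    def lookup(self, query: str) -> tuple[Optional[str], str]:
        # Tier 1: identical query string.
        if query in self.exact:
            return self.exact[query], "exact"
        # Tier 2: best cosine match above the threshold (linear scan).
        vec = self.embed(query)
        best_score, best_response = -1.0, None
        for cached_vec, response in self.semantic:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return best_response, "semantic"
        return None, "miss"

    def store(self, query: str, response: str) -> None:
        # Write back to both tiers so later identical or similar queries hit.
        self.exact[query] = response
        self.semantic.append((self.embed(query), response))


def answer(query: str, cache: TieredCache, call_llm: Callable[[str], str]) -> tuple[str, str]:
    """Return (response, tier) where tier is 'exact', 'semantic', or 'llm'."""
    cached, tier = cache.lookup(query)
    if cached is not None:
        return cached, tier
    response = call_llm(query)  # Tiers 3 and 4 happen behind this call
    cache.store(query, response)
    return response, "llm"
```

In a real deployment the linear scan would normally be replaced by a vector index, and the exact-match tier by a key-value store, but the control flow stays the same.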

Similarity thresholds

Set the similarity threshold (typically cosine similarity between query embeddings, where 1.0 means identical) based on the cost of a wrong answer:

Application	Threshold	Rationale
Customer-facing support	0.92–0.95	Wrong answer damages trust
Internal tooling	0.85–0.90	Lower risk, higher hit rate
Code queries	0.90–0.95	Semantically adjacent ≠ functionally equivalent

Below 0.85, you risk returning cached responses to queries that are semantically adjacent but factually different. A wrong cached answer is worse than no cache.
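
One way to apply the table is to keep the thresholds in per-application configuration and check each candidate match against it. The sketch below is illustrative only: the dictionary keys, `threshold_for`, and `accept_match` are hypothetical names, and defaulting to the strict end of each range is an assumption, not a recommendation from the table.

```python
# Threshold ranges copied from the table above: (permissive, strict).
THRESHOLD_RANGES = {
    "customer_support": (0.92, 0.95),
    "internal_tooling": (0.85, 0.90),
    "code_queries": (0.90, 0.95),
}


def threshold_for(application: str, strict: bool = True) -> float:
    """Pick the strict or permissive end of the range for an application."""
    permissive, strict_value = THRESHOLD_RANGES[application]
    return strict_value if strict else permissive


def accept_match(similarity: float, application: str, strict: bool = True) -> bool:
    """Serve the cached response only if similarity clears the threshold."""
    return similarity >= threshold_for(application, strict)


# The same similarity score is a hit for one application and a miss for another.
print(accept_match(0.91, "internal_tooling"))   # clears the 0.90 threshold
print(accept_match(0.91, "customer_support"))   # below the 0.95 threshold
```

Tuning usually means starting at the strict end and lowering the threshold only while the observed wrong-answer rate stays acceptable.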

TTL and staleness

Cached responses do not know when their source data has changed. Set a short, aggressive TTL for dynamic data:

Data type	TTL guidance
Static documentation	Hours to days
Product FAQs	Hours
Prices, inventory, news	Minutes or no cache
User-specific data	Per-session or no cache

For data that changes frequently, semantic caching may cause more harm than benefit. Provider prompt caching is a better fit for stable prefixes with dynamic suffixes.
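
A per-data-type TTL policy can be sketched as follows. The specific durations are assumptions picked from within the ranges in the table, `None` stands for "do not cache", and `TTLCache` and the data-type keys are hypothetical names. The clock is injectable so expiry can be tested without waiting.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative TTLs in seconds (assumed values); None means "do not cache".
TTL_SECONDS: dict[str, Optional[int]] = {
    "static_docs": 24 * 3600,   # hours to days
    "product_faq": 4 * 3600,    # hours
    "live_prices": 120,         # minutes
    "user_specific": None,      # no shared cache
}


@dataclass
class Entry:
    response: str
    expires_at: float


class TTLCache:
    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._entries: dict[str, Entry] = {}

    def put(self, key: str, response: str, data_type: str) -> bool:
        """Store a response; returns False if this data type must not be cached."""
        ttl = TTL_SECONDS[data_type]
        if ttl is None:
            return False
        self._entries[key] = Entry(response, self._clock() + ttl)
        return True

    def get(self, key: str) -> Optional[str]:
        """Return the cached response, or None if missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        if self._clock() >= entry.expires_at:
            del self._entries[key]  # evict the stale entry on read
            return None
        return entry.response
```

TTL only bounds how long a stale answer can survive. If the source data has a change signal, explicit invalidation on update is a tighter guarantee than any fixed duration.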

Measuring success

Metric	Healthy range
Semantic cache hit rate	30–60% for repetitive workloads
Cost per request (cached vs uncached)	Cached ≈ $0 inference cost
Stale-response rate	Less than 0.1% of cache hits
P50 latency (cache hit)	Sub-100ms
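
These metrics can be computed from per-request logs. The sketch below assumes each request is logged with a hit flag, a staleness flag (for example from spot checks against fresh answers), latency, and inference cost. `RequestLog`, `summarize`, and the field names are hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class RequestLog:
    cache_hit: bool
    stale: bool                 # hit that returned outdated content
    latency_ms: float
    inference_cost_usd: float   # roughly 0 for cache hits


def summarize(logs: list[RequestLog]) -> dict[str, Optional[float]]:
    """Compute hit rate, stale-response rate, P50 hit latency, and average cost."""
    hits = [r for r in logs if r.cache_hit]
    total = len(logs)
    return {
        "hit_rate": len(hits) / total if total else 0.0,
        # Stale rate is measured against cache hits, not all requests.
        "stale_rate": sum(r.stale for r in hits) / len(hits) if hits else 0.0,
        "p50_hit_latency_ms": median([r.latency_ms for r in hits]) if hits else None,
        "avg_cost_usd": sum(r.inference_cost_usd for r in logs) / total if total else 0.0,
    }


def within_targets(summary: dict[str, Optional[float]]) -> bool:
    """Check the stale-rate and latency targets from the table above."""
    p50 = summary["p50_hit_latency_ms"]
    return summary["stale_rate"] < 0.001 and (p50 is None or p50 < 100.0)
```

Comparing `avg_cost_usd` with and without the cache enabled gives the cached versus uncached cost per request.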

Relationship to provider prompt caching

These are complementary, not competing:

Layer	What it caches	What it saves
Provider prompt caching	Prefix KV representations	Input token processing cost
Semantic caching	Full LLM responses	Entire inference call

Use provider caching for stable prefixes within a session. Use semantic caching to skip inference entirely for repeated question patterns across sessions.
