Semantic caching
Semantic caching stores previous LLM responses indexed by query meaning and returns a cached answer, without calling the model, when a new query is similar enough to one already answered.
This is a Tier 3 optimization. Implement metering, prompt hygiene, and provider caching first.
Expected impact
Semantic caching eliminates the full inference call on cache hits, saving 100% of the LLM tokens for matched queries; only the embedding and lookup overhead remains. ROI depends entirely on query repetition:
| Workload | Cache hit potential |
|---|---|
| FAQ / support bot | High (60–80% hit rate possible) |
| Classification with limited categories | Moderate |
| Code generation | Low (queries are highly variable) |
| Unique generative content | Near zero |
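To make the dependence on repetition concrete, here is a minimal sketch of the break-even arithmetic. The function name, the per-request prices, and the sample hit rates are illustrative assumptions, not measured values; the model simply assumes every request pays the cache overhead and only hits avoid the LLM call.

```python
def estimated_monthly_savings(
    requests_per_month: int,
    hit_rate: float,
    llm_cost_per_request: float,
    cache_cost_per_request: float,
) -> float:
    """Net savings from a semantic cache, in the same currency as the inputs.

    Every request pays the cache overhead (embedding plus vector lookup);
    only cache hits avoid the LLM call.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    avoided = requests_per_month * hit_rate * llm_cost_per_request
    overhead = requests_per_month * cache_cost_per_request
    return avoided - overhead


# Made-up prices: 0.01 per LLM call, 0.0005 cache overhead per request.
for label, hit_rate in [("FAQ bot", 0.70), ("Code generation", 0.03)]:
    savings = estimated_monthly_savings(100_000, hit_rate, 0.01, 0.0005)
    print(f"{label}: {savings:+.2f} per month")
```

With these assumed numbers the FAQ bot comes out clearly positive while the code-generation workload comes out negative, because the cache overhead on every request outweighs the few avoided calls.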
When to use semantic caching
Good fit:
- High-traffic endpoints with repetitive queries
- Support bots answering common questions
- Classification tasks with finite input patterns
- Tool-result lookups that change infrequently
Poor fit:
- Unique creative generation per request
- Real-time data queries (prices, inventory, news)
- Tasks where a wrong cached answer is worse than no answer
- Low-traffic endpoints (cache overhead exceeds savings)
Architecture: multi-tier caching
A layered approach covers the most ground:
       Request
          ↓
┌─────────────────────┐
│ Tier 1: Exact match │ ← identical query string (sub-ms)
└─────────┬───────────┘
          ↓ miss
┌─────────────────────┐
│ Tier 2: Semantic    │ ← vector similarity (low-ms)
└─────────┬───────────┘
          ↓ miss
┌─────────────────────┐
│ Tier 3: Provider    │ ← prefix cache (see prompt caching guide)
│         prefix cache│
└─────────┬───────────┘
          ↓ miss
┌─────────────────────┐
│ Tier 4: Full LLM    │ ← inference call
│         inference   │
└─────────────────────┘
Each tier catches queries the previous tier missed. Cache the LLM response at Tier 2 for future semantic matches.
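The lookup order can be sketched roughly as follows. This is a toy illustration, not a production design: `TieredCache`, `toy_embed`, and `fake_llm` are hypothetical names, the bag-of-words "embedding" stands in for a real embedding model, and the linear scan stands in for a vector index. The provider prefix cache (Tier 3) is assumed to be applied by the provider inside the LLM call, so it does not appear as separate code.

```python
import math
from collections import Counter
from typing import Callable, Optional


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts. Swap in a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TieredCache:
    def __init__(self, call_llm: Callable[[str], str], threshold: float = 0.9):
        self.call_llm = call_llm
        self.threshold = threshold
        self.exact: dict[str, str] = {}                # Tier 1: identical query string
        self.semantic: list[tuple[Counter, str]] = []  # Tier 2: linear scan for illustration

    def _semantic_lookup(self, vector: Counter) -> Optional[str]:
        best_score, best_response = 0.0, None
        for cached_vector, response in self.semantic:
            score = cosine(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def get(self, query: str) -> tuple[str, str]:
        """Return (response, name of the tier that answered)."""
        if query in self.exact:
            return self.exact[query], "exact"
        vector = toy_embed(query)
        cached = self._semantic_lookup(vector)
        if cached is not None:
            return cached, "semantic"
        # Tiers 3 and 4: the provider applies its prefix cache inside this call.
        response = self.call_llm(query)
        self.exact[query] = response
        self.semantic.append((vector, response))
        return response, "llm"


def fake_llm(query: str) -> str:
    return f"answer to: {query}"


cache = TieredCache(fake_llm, threshold=0.9)
for q in [
    "How do I reset my password?",
    "How do I reset my password?",
    "how do i reset my password?",
    "What is your refund policy?",
]:
    print(cache.get(q)[1], "<-", q)
```

In this demo the four lookups are answered by the llm, exact, semantic, and llm tiers respectively. The sketch also stores each new response in the exact-match tier, which is an added assumption beyond the text above.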
Similarity thresholds
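Set the similarity threshold (typically cosine similarity between query embeddings) based on the cost of a wrong answer. Treat the ranges below as starting points and tune them against your own embedding model and traffic: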
| Application | Threshold | Rationale |
|---|---|---|
| Customer-facing support | 0.92–0.95 | Wrong answer damages trust |
| Internal tooling | 0.85–0.90 | Lower risk, higher hit rate |
| Code queries | 0.90–0.95 | Semantically adjacent ≠ functionally equivalent |
Below 0.85, you risk returning cached responses to queries that are semantically adjacent but factually different. A wrong cached answer is worse than no cache.
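One way to choose a threshold is to score a labeled sample of query pairs and see how the hit rate trades off against wrong hits. The sketch below assumes you have such a sample; the `labeled` data is invented for illustration and `evaluate_threshold` is a hypothetical helper, not part of any library.

```python
from typing import Iterable


def evaluate_threshold(pairs: Iterable[tuple[float, bool]], threshold: float) -> dict[str, float]:
    """pairs: (similarity, cached_answer_would_be_correct) for labeled query pairs."""
    pairs = list(pairs)
    hits = [ok for sim, ok in pairs if sim >= threshold]
    false_hits = sum(1 for ok in hits if not ok)
    return {
        "threshold": threshold,
        "hit_rate": len(hits) / len(pairs) if pairs else 0.0,
        "false_hit_rate": false_hits / len(hits) if hits else 0.0,
    }


# Invented sample: similarity score, and whether reusing the cached answer would be correct.
labeled = [
    (0.97, True), (0.94, True), (0.93, True), (0.91, False),
    (0.88, True), (0.86, False), (0.82, False), (0.78, False),
]

for threshold in (0.80, 0.85, 0.90, 0.95):
    print(evaluate_threshold(labeled, threshold))
```

Raising the threshold lowers both the hit rate and the share of hits that return a wrong answer; the right trade-off depends on how costly a wrong answer is for the application.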
TTL and staleness
Cached responses do not know when their source data has changed. Set aggressive (short) TTLs for dynamic data:
| Data type | TTL guidance |
|---|---|
| Static documentation | Hours to days |
| Product FAQs | Hours |
| Prices, inventory, news | Minutes or no cache |
| User-specific data | Per-session or no cache |
For data that changes frequently, semantic caching may cause more harm than benefit. Provider prompt caching is a better fit for stable prefixes with dynamic suffixes.
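A per-data-type TTL policy could look something like the sketch below. The `TTL_SECONDS` values are illustrative picks within the guidance above, not recommendations, and `TTLCache` is a hypothetical in-memory class; a real deployment would more likely rely on the expiry features of its cache store.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative TTLs in seconds. None means "do not cache".
TTL_SECONDS: dict[str, Optional[float]] = {
    "static_docs": 24 * 3600,
    "product_faq": 4 * 3600,
    "prices": 60,
    "user_specific": None,
}


@dataclass
class Entry:
    response: str
    expires_at: float


class TTLCache:
    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, Entry] = {}

    def put(self, key: str, response: str, data_type: str) -> bool:
        ttl = TTL_SECONDS.get(data_type)
        if ttl is None:  # unknown or uncacheable data types are never stored
            return False
        self._entries[key] = Entry(response, self._clock() + ttl)
        return True

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        if self._clock() >= entry.expires_at:
            del self._entries[key]  # expired: evict so the caller falls through to the LLM
            return None
        return entry.response
```

Treating unknown data types as uncacheable is a deliberately conservative choice in this sketch: a missed cache hit costs one inference call, while a stale answer may cost user trust.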
Measuring success
| Metric | Healthy range |
|---|---|
| Semantic cache hit rate | 30–60% for repetitive workloads |
| Cost per request (cached vs uncached) | Cached ≈ $0 inference cost |
| Stale-response rate | Less than 0.1% of cache hits |
| P50 latency (cache hit) | Sub-100ms |
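These metrics can be computed from ordinary request logs. The sketch below assumes a hypothetical `RequestLog` record; in particular, the `stale` flag is assumed to come from some offline audit that compares cached answers against fresh ones, which this example does not implement.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestLog:
    cache_hit: bool
    latency_ms: float
    stale: bool = False  # hypothetical field, set by an offline audit of cache hits


def summarize(logs: list[RequestLog]) -> dict[str, float]:
    hits = [log for log in logs if log.cache_hit]
    return {
        "hit_rate": len(hits) / len(logs) if logs else 0.0,
        "stale_rate": sum(log.stale for log in hits) / len(hits) if hits else 0.0,
        "p50_hit_latency_ms": median(log.latency_ms for log in hits) if hits else 0.0,
    }


# Invented sample data for illustration.
sample = [
    RequestLog(True, 12.0),
    RequestLog(True, 18.0),
    RequestLog(False, 900.0),
    RequestLog(True, 25.0, stale=True),
    RequestLog(False, 1100.0),
]
print(summarize(sample))
```

The stale rate is computed over cache hits only, matching the table's definition, so a low overall hit rate cannot hide a high share of stale answers.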
Relationship to provider prompt caching
These are complementary, not competing:
| Layer | What it caches | What it saves |
|---|---|---|
| Provider prompt caching | Prefix KV representations | Input token processing cost |
| Semantic caching | Full LLM responses | Entire inference call |
Use provider caching for stable prefixes within a session. Use semantic caching to skip inference entirely for repeated question patterns across sessions.