Semantic caching

Semantic caching stores previous LLM responses indexed by the meaning of the query. When a new query is similar enough to a stored one, the cached answer is returned without calling the model.

This is a Tier 3 optimization. Implement metering, prompt hygiene, and provider caching first.

Expected impact

On a cache hit, semantic caching skips the inference call entirely, so matched queries consume no model tokens (the cache lookup itself still carries a small overhead). ROI depends entirely on how often queries repeat:

Workload                               | Cache hit potential
FAQ / support bot                      | High (60–80% hit rate possible)
Classification with limited categories | Moderate
Unique generative content              | Near zero
Code generation                        | Low (queries are highly variable)
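
To make the dependence on repetition concrete, here is a minimal sketch of the arithmetic. It assumes every request pays a fixed cache overhead (embedding plus lookup) and only hits avoid the inference cost. The function names and all figures are hypothetical; substitute your own measurements.

```python
def expected_savings_per_request(hit_rate: float,
                                 inference_cost: float,
                                 cache_overhead: float) -> float:
    """Expected net saving per request, in the same unit as the inputs.

    Assumption: every request pays cache_overhead, and only cache hits
    avoid inference_cost.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * inference_cost - cache_overhead


def break_even_hit_rate(inference_cost: float, cache_overhead: float) -> float:
    """Hit rate at which the cache exactly pays for itself."""
    if inference_cost <= 0:
        raise ValueError("inference_cost must be positive")
    return cache_overhead / inference_cost


# Hypothetical figures: $0.01 per inference call, $0.0005 overhead per lookup.
print(expected_savings_per_request(0.70, 0.01, 0.0005))  # support-bot-like workload
print(expected_savings_per_request(0.02, 0.01, 0.0005))  # mostly unique content
print(break_even_hit_rate(0.01, 0.0005))
```

With these assumed numbers, a workload whose hit rate sits below the break-even point loses money on every request, which is why unique generative content is a poor fit.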

When to use semantic caching

Good fit:

  • High-traffic endpoints with repetitive queries
  • Support bots answering common questions
  • Classification tasks with finite input patterns
  • Tool-result lookups that change infrequently

Poor fit:

  • Unique creative generation per request
  • Real-time data queries (prices, inventory, news)
  • Tasks where a wrong cached answer is worse than no answer
  • Low-traffic endpoints (cache overhead exceeds savings)

Architecture: multi-tier caching

A layered approach covers the most ground:

Request
          ↓
┌─────────────────────┐
│ Tier 1: Exact match │  ← identical query string (sub-ms)
└─────────┬───────────┘
          ↓ miss
┌─────────────────────┐
│ Tier 2: Semantic    │  ← vector similarity (low-ms)
└─────────┬───────────┘
          ↓ miss
┌─────────────────────┐
│ Tier 3: Provider    │  ← prefix cache (see prompt caching guide)
│         prefix cache│
└─────────┬───────────┘
          ↓ miss
┌─────────────────────┐
│ Tier 4: Full LLM    │  ← inference call
│         inference   │
└─────────────────────┘

Each tier catches queries the previous tier missed. When a query falls through to full inference, store the response in the Tier 2 semantic cache so that future similar queries can be served from it.
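
The lookup order above can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: `toy_embed` is a bag-of-words stand-in for a real embedding model, the semantic tier is a linear scan rather than a vector index, and `TieredCache` and `call_llm` are hypothetical names. Writing the response back to the exact-match tier as well is an added assumption.

```python
import hashlib
import math
from collections import Counter
from typing import Callable


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class TieredCache:
    def __init__(self, call_llm: Callable[[str], str], threshold: float = 0.92):
        self.call_llm = call_llm          # hypothetical inference function
        self.threshold = threshold
        self.exact: dict[str, str] = {}                # Tier 1
        self.semantic: list[tuple[Counter, str]] = []  # Tier 2

    def answer(self, query: str) -> tuple[str, str]:
        """Return (response, tier), where tier names the layer that answered."""
        # Tier 1: identical query string.
        key = hashlib.sha256(query.encode("utf-8")).hexdigest()
        if key in self.exact:
            return self.exact[key], "exact"

        # Tier 2: closest stored query, accepted only above the threshold.
        vec = toy_embed(query)
        best = max(self.semantic, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1], "semantic"

        # Tiers 3 and 4: the provider's prefix cache, if any, applies
        # inside this call; we only see the final response.
        response = self.call_llm(query)
        self.exact[key] = response
        self.semantic.append((vec, response))
        return response, "llm"


cache = TieredCache(lambda q: f"answer to: {q}", threshold=0.8)
print(cache.answer("how do I reset my password"))         # falls through to the LLM
print(cache.answer("how do I reset my password"))         # exact match
print(cache.answer("how do I reset my password please"))  # semantic match
```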

Similarity thresholds

Set thresholds based on the cost of a wrong answer:

Application             | Threshold | Rationale
Customer-facing support | 0.92–0.95 | Wrong answer damages trust
Internal tooling        | 0.85–0.90 | Lower risk, higher hit rate
Code queries            | 0.90–0.95 | Semantically adjacent ≠ functionally equivalent

Below 0.85, you risk returning cached responses to queries that are semantically adjacent but factually different. A wrong cached answer is worse than no cache.
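
One way to apply per-application thresholds is a small lookup with a strict fallback. The sketch below assumes the lower bound of each range in the table above and treats unknown applications as the strictest case. The dictionary keys and function name are hypothetical.

```python
# Lower bound of each range from the table above (an assumption;
# tune within the range based on observed wrong-answer rates).
THRESHOLDS = {
    "customer_support": 0.92,
    "internal_tooling": 0.85,
    "code_queries": 0.90,
}


def should_serve_cached(similarity: float, application: str) -> bool:
    """Accept a cached response only if similarity meets the app's threshold."""
    # Unknown applications fall back to the strictest configured threshold.
    threshold = THRESHOLDS.get(application, max(THRESHOLDS.values()))
    return similarity >= threshold


# The same similarity score is a hit for one application and a miss for another.
print(should_serve_cached(0.88, "internal_tooling"))
print(should_serve_cached(0.88, "customer_support"))
```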

TTL and staleness

Cached responses do not know when their source data has changed. Set short TTLs for dynamic data:

Data type               | TTL guidance
Static documentation    | Hours to days
Product FAQs            | Hours
Prices, inventory, news | Minutes or no cache
User-specific data      | Per-session or no cache

For data that changes frequently, semantic caching may cause more harm than benefit. Provider prompt caching is a better fit for stable prefixes with dynamic suffixes.
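
A minimal sketch of per-data-type expiry follows, assuming each entry is tagged with a data type when it is written. The TTL values are hypothetical picks from within the ranges above, and `TTLCache` is an illustrative name. The clock is injectable so expiry can be tested without waiting.

```python
import time
from typing import Callable, Optional

# Hypothetical TTLs in seconds, chosen from within the guidance above.
TTL_SECONDS = {
    "static_docs": 24 * 3600,
    "product_faq": 3600,
    "prices": 60,
}


class TTLCache:
    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        # key -> (response, expiry time on the injected clock)
        self._entries: dict[str, tuple[str, float]] = {}

    def put(self, key: str, response: str, data_type: str) -> None:
        """Store a response that expires after the TTL for its data type."""
        self._entries[key] = (response, self._clock() + TTL_SECONDS[data_type])

    def get(self, key: str) -> Optional[str]:
        """Return the cached response, or None if missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        response, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # evict so stale data is never served
            return None
        return response
```

Expiry on read keeps the sketch short; it bounds staleness by the TTL but does not react to source changes inside that window.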

Measuring success

Metric                                | Healthy range
Semantic cache hit rate               | 30–60% for repetitive workloads
Cost per request (cached vs uncached) | Cached ≈ $0 inference cost
Stale-response rate                   | Less than 0.1% of cache hits
P50 latency (cache hit)               | Sub-100ms
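
These metrics can be derived from per-request logs. The sketch below assumes a hypothetical `RequestLog` record in which `stale` is set by some later audit (for example, comparing a sample of cached answers against fresh ones). How staleness is detected is outside this example.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestLog:
    cache_hit: bool
    latency_ms: float
    stale: bool = False  # hypothetical flag, set by a later audit


def summarize(logs: list[RequestLog]) -> dict[str, float]:
    """Compute hit rate, stale rate among hits, and P50 latency of hits."""
    hits = [r for r in logs if r.cache_hit]
    return {
        "hit_rate": len(hits) / len(logs) if logs else 0.0,
        "stale_rate": sum(r.stale for r in hits) / len(hits) if hits else 0.0,
        "p50_hit_latency_ms": median(r.latency_ms for r in hits) if hits else 0.0,
    }
```

Note that the stale rate is measured against cache hits, not all requests, to match the target in the table.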

Relationship to provider prompt caching

These are complementary, not competing:

Layer                   | What it caches            | What it saves
Provider prompt caching | Prefix KV representations | Input token processing cost
Semantic caching        | Full LLM responses        | Entire inference call

Use provider caching for stable prefixes within a session. Use semantic caching to skip inference entirely for repeated question patterns across sessions.


Tokenminning · Built by Narev