Where to start
Optimize the right layer first: when your context is 10× larger than your completions, a 30% output trim saves less than a 30% input trim, as long as output tokens are priced at less than 10× the input rate.
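To make the input-versus-output comparison concrete, here is a small worked calculation. The token counts and per-million-token prices are hypothetical placeholders, not any provider's actual rates; substitute your own metered numbers.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """USD cost of one request; prices are USD per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def saving_from_trim(input_tokens, output_tokens, input_price, output_price,
                     trim_input=0.0, trim_output=0.0):
    """Fraction of the request cost saved by trimming input and/or output tokens."""
    before = request_cost(input_tokens, output_tokens, input_price, output_price)
    after = request_cost(input_tokens * (1 - trim_input),
                         output_tokens * (1 - trim_output),
                         input_price, output_price)
    return (before - after) / before


# Hypothetical workload: context is 10x the completion, output priced 4x input.
workload = dict(input_tokens=10_000, output_tokens=1_000,
                input_price=2.50, output_price=10.00)

input_saving = saving_from_trim(**workload, trim_input=0.30)
output_saving = saving_from_trim(**workload, trim_output=0.30)
print(f"30% input trim saves {input_saving:.1%}; 30% output trim saves {output_saving:.1%}")
```

Under these assumed prices the input trim saves roughly 21% of the request cost and the output trim roughly 9%. The ordering flips only when output spend (tokens × price) exceeds input spend, which is why the ratio is worth checking against your own data.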
Prerequisite: measure before you optimize
Before changing prompts, models, or caching, you need a baseline. See Article I: immutable metering for the architectural requirements.
Without this baseline, you cannot tell whether prompt changes, model swaps, or context compression moved the needle.
Narev provides live model pricing and cost calculation if you need normalized USD comparisons across providers.
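As a rough illustration of what a baseline can look like, the sketch below aggregates per-request usage records into cost per prompt template. The record fields, model names, and prices are all hypothetical; it is not the metering design described in Article I, only a minimal way to see where spend concentrates.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a usage record should not change after it is written
class UsageRecord:
    template: str        # hypothetical attribution key (prompt template name)
    model: str
    input_tokens: int
    output_tokens: int


# Hypothetical prices in USD per million tokens: (input, output).
PRICES = {"large-model": (3.00, 15.00), "small-model": (0.25, 1.25)}


def cost_usd(record):
    """Cost of a single request under the assumed price table."""
    input_price, output_price = PRICES[record.model]
    return (record.input_tokens * input_price
            + record.output_tokens * output_price) / 1_000_000


def baseline_by_template(records):
    """Total cost per template, most expensive first."""
    totals = defaultdict(float)
    for record in records:
        totals[record.template] += cost_usd(record)
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


records = [
    UsageRecord("classify", "small-model", 2_000_000, 0),
    UsageRecord("summarize", "large-model", 1_000_000, 100_000),
]
baseline = baseline_by_template(records)
```

Re-running the same aggregation after each change gives a before-and-after comparison per template instead of a single blended number.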
Optimization sequence
Work through these steps in order. Each layer compounds on the previous one.
| Step | Technique | Expected impact | Guide |
|---|---|---|---|
| 1 | Metering + attribution | Prerequisite | Article I |
| 2 | Prompt hygiene | ~20–25% on high-volume templates | Prompt hygiene |
| 3 | Model routing | 60–95% depending on task mix | Model routing |
| 4 | Provider prompt caching | 41–80% on cached input | Prompt caching |
| 5 | Context hygiene | 40–60% in agentic workloads | Context hygiene |
| 6 | Output control + RAG discipline | 15–40% on verbose or over-fetched workloads | Output and RAG |
| 7 | Semantic caching | Eliminates inference on cache hits | Semantic caching |
| 8 | Advanced compression | Variable — benchmark required | — |
Why this order matters:
- Prompt hygiene is free and fast. Concise prompts with schema-enforced outputs often cut costs 20%+ with no infrastructure changes.
- Model routing is the biggest lever — but only after you know which tasks actually need frontier models.
- Caching requires stable prefixes. Fix prompt structure before expecting cache hits.
- Context hygiene matters most in agents. Single-turn chat apps may see little benefit from this step.
- Semantic caching is workload-dependent. High-repetition FAQs benefit; unique generative tasks do not.
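Because each layer compounds on the previous one, the percentages in the table do not simply add up. The sketch below assumes, as a simplification, that every step removes its stated fraction of whatever spend is left. In practice each technique touches only part of the bill, so treat the result as an upper bound rather than a forecast.

```python
from math import prod


def combined_saving(step_savings):
    """Fraction of the original bill saved when each step trims what remains."""
    return 1 - prod(1 - s for s in step_savings)


# Illustrative inputs taken from the low end of the table:
# 20% from prompt hygiene, then 60% from model routing.
total = combined_saving([0.20, 0.60])
print(f"Combined saving: {total:.0%}")
```

With these inputs the combined saving is 68% rather than 80%, since routing acts on the 80% of spend that remains after prompt hygiene.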
Anti-patterns
Avoid these common mistakes:
| Anti-pattern | Why it fails |
|---|---|
| Optimizing prompts before metering | You cannot prove savings or find the real bottleneck |
| Swapping models without benchmarks | Quality drops; teams roll back and assume cheaper models don't work |
| Caching dynamic tool output in the prefix | Invalidates cache blocks; can increase latency |
| Trimming outputs when input dominates | Input bloat in agent loops often accounts for 80%+ of spend |
| Using the largest context window by default | Encourages sending more data "just in case" |
| Algorithmic compression on short prompts | Overhead exceeds savings; risks quality degradation |
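The prefix-related anti-pattern can be checked without calling any provider. The sketch below keeps static content first and dynamic content last, then fingerprints the prefix to confirm it is byte-identical across requests. The prompt text, tool definition, and function names are hypothetical, and how a provider actually matches cached prefixes is outside what this code shows.

```python
import hashlib
import json

# Hypothetical static content: identical on every request.
SYSTEM_PROMPT = "You are a support assistant."
TOOL_DEFINITIONS = [{"name": "lookup_order", "parameters": {"order_id": "string"}}]


def build_prompt(tool_output, user_message):
    """Return (stable_prefix, dynamic_suffix) with the dynamic content kept last."""
    # sort_keys makes the serialized tool definitions deterministic.
    prefix = SYSTEM_PROMPT + "\n" + json.dumps(TOOL_DEFINITIONS, sort_keys=True)
    suffix = f"\nTool result: {tool_output}\nUser: {user_message}"
    return prefix, suffix


def prefix_fingerprint(prefix):
    """Hash of the prefix; a changing value means the prefix is not stable."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


prefix_a, suffix_a = build_prompt("order 17 shipped", "Where is my order?")
prefix_b, suffix_b = build_prompt("order 42 delayed", "Any update?")
stable = prefix_fingerprint(prefix_a) == prefix_fingerprint(prefix_b)
```

Logging the fingerprint alongside each request is a cheap way to notice when a timestamp, tool result, or reordered field has crept into the prefix.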
Quick wins this week
If you need immediate impact with minimal infrastructure:
- Audit your top 10 most expensive prompt templates for scaffolding waste
- Add `max_tokens` everywhere it is missing
- Move static system prompts and tool definitions to the start of your prompt
- Enable provider-side prefix caching for stable content
- Set per-session token budgets with graceful degradation at 90%
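The per-session budget in the last item can be sketched as a small counter with a soft threshold. The class name, the status labels, and what "degraded" triggers (for example, a smaller model or a lower output cap) are assumptions for illustration; only the 90% threshold comes from the list above.

```python
from dataclasses import dataclass


@dataclass
class SessionBudget:
    limit: int                 # total tokens allowed for the session
    used: int = 0
    soft_ratio: float = 0.9    # degrade once 90% of the budget is spent

    def record(self, tokens):
        """Add the tokens consumed by one request."""
        self.used += tokens

    def status(self):
        """Return 'ok', 'degraded' (past the soft threshold), or 'exhausted'."""
        if self.used >= self.limit:
            return "exhausted"
        if self.used >= self.soft_ratio * self.limit:
            return "degraded"
        return "ok"


budget = SessionBudget(limit=1_000)
budget.record(500)
first = budget.status()    # 500 of 1,000 used
budget.record(450)
second = budget.status()   # 950 of 1,000 used
budget.record(100)
third = budget.status()    # 1,050 of 1,000 used
```

Checking `status()` before each call lets the application switch behavior at the soft threshold instead of failing abruptly at the hard limit.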
Next steps
- Ready to trim prompts? Continue to Prompt hygiene
- Running agents? Jump to Context hygiene