Where to start

Optimize the right layer first: when your context is 10× larger than your completions, a 30% output trim saves less than a 30% input trim, unless output tokens cost more than 10× as much as input tokens.
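A rough sketch of that arithmetic follows. The per-token prices are hypothetical placeholders, not any provider's actual rates; what matters is the ratio between input volume and output volume. The function names are illustrative.

```python
# Hypothetical prices in USD per million tokens; substitute your provider's rates.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request under the assumed prices."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


def savings_from_trim(input_tokens: int, output_tokens: int,
                      input_trim: float = 0.0, output_trim: float = 0.0) -> float:
    """USD saved per request when trimming a fraction of input and/or output tokens."""
    before = request_cost(input_tokens, output_tokens)
    after = request_cost(
        round(input_tokens * (1 - input_trim)),
        round(output_tokens * (1 - output_trim)),
    )
    return before - after


# Context is 10x the completion: 10,000 input tokens, 1,000 output tokens.
input_saving = savings_from_trim(10_000, 1_000, input_trim=0.30)
output_saving = savings_from_trim(10_000, 1_000, output_trim=0.30)
print(f"30% input trim saves  ${input_saving:.4f} per request")
print(f"30% output trim saves ${output_saving:.4f} per request")
```

Under these assumed prices, the input trim saves twice as much as the output trim ($0.0090 versus $0.0045 per request), even though output tokens are priced 5× higher.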

Prerequisite: measure before you optimize

Before changing prompts, models, or caching, you need a baseline. See Article I: immutable metering for the architectural requirements.

Without this baseline, you cannot tell whether prompt changes, model swaps, or context compression moved the needle.
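As a minimal sketch of what a baseline can look like, the example below aggregates metered calls per feature. The record fields, feature names, and model names are illustrative assumptions, not the schema that Article I requires.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    """One metered LLM call. Field names are illustrative, not a required schema."""
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


def baseline_by_feature(records):
    """Aggregate call count, tokens, and spend per feature."""
    totals = defaultdict(lambda: {"calls": 0, "input_tokens": 0,
                                  "output_tokens": 0, "cost_usd": 0.0})
    for r in records:
        t = totals[r.feature]
        t["calls"] += 1
        t["input_tokens"] += r.input_tokens
        t["output_tokens"] += r.output_tokens
        t["cost_usd"] += r.cost_usd
    return dict(totals)


# Made-up sample data for illustration only.
records = [
    UsageRecord("support-bot", "model-large", 12_000, 800, 0.048),
    UsageRecord("support-bot", "model-large", 9_000, 600, 0.036),
    UsageRecord("summarizer", "model-small", 4_000, 300, 0.002),
]
baseline = baseline_by_feature(records)
# Print features from most to least expensive.
for feature, t in sorted(baseline.items(), key=lambda kv: -kv[1]["cost_usd"]):
    print(feature, t)
```

Capturing the same aggregate before and after a change is what lets you attribute a cost difference to that change rather than to traffic shifts.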

Narev provides live model pricing and cost calculation if you need normalized USD comparisons across providers.

Work through these steps in order. Each layer compounds on the previous one.

Step	Technique	Expected impact	Guide
1	Metering + attribution	Prerequisite	Article I
2	Prompt hygiene	~20–25% on high-volume templates	Prompt hygiene
3	Model routing	60–95% depending on task mix	Model routing
4	Provider prompt caching	41–80% on cached input	Prompt caching
5	Context hygiene	40–60% in agentic workloads	Context hygiene
6	Output control + RAG discipline	15–40% on verbose or over-fetched workloads	Output and RAG
7	Semantic caching	Eliminates inference on cache hits	Semantic caching
8	Advanced compression	Variable — benchmark required	—
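One way to read "compounds" is multiplicative: each layer reduces whatever bill is left after the previous layers, so the percentages do not simply add up. The sketch below assumes, as a simplification, that every reduction is measured against total remaining spend. The ranges in the table apply to specific slices (cached input, agentic workloads, and so on), so the numbers here are illustrative and not taken from the table.

```python
from math import prod


def remaining_fraction(reductions):
    """Fraction of the original bill left after applying each reduction in turn."""
    return prod(1.0 - r for r in reductions)


# Illustrative reductions, each expressed as a share of the remaining total spend.
reductions = [0.20, 0.50, 0.30]
left = remaining_fraction(reductions)
print(f"remaining: {left:.2%}, combined saving: {1 - left:.2%}")
```

With these example values, 28% of the original bill remains, a combined saving of 72% rather than the 100% that naive addition would suggest.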

Why this order matters:

Prompt hygiene is free and fast. Concise prompts with schema-enforced outputs often cut costs 20%+ with no infrastructure changes.
Model routing is the biggest lever — but only after you know which tasks actually need frontier models.
Caching requires stable prefixes. Fix prompt structure before expecting cache hits.
Context hygiene matters most in agents. Single-turn chat apps may see little benefit from this step.
Semantic caching is workload-dependent. High-repetition FAQs benefit; unique generative tasks do not.
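The point about stable prefixes can be sketched as a prompt builder that keeps static content first and per-request content last. This is a hypothetical illustration, not any provider's caching API: the prompt text, the tool schema, and the helper names are assumptions, and providers differ in how they match cached prefixes.

```python
import hashlib

# Static content goes first so the prefix is byte-identical across requests.
SYSTEM_PROMPT = "You are a support assistant. Answer concisely."
TOOL_SCHEMA = '{"name": "lookup_order", "parameters": {"order_id": "string"}}'


def build_prompt(user_message: str, tool_output: str = "") -> tuple[str, str]:
    """Return (stable_prefix, dynamic_suffix). Only the suffix varies per request."""
    stable_prefix = f"{SYSTEM_PROMPT}\n\nTools:\n{TOOL_SCHEMA}\n"
    dynamic_suffix = f"\nTool output:\n{tool_output}\n\nUser: {user_message}\n"
    return stable_prefix, dynamic_suffix


def prefix_fingerprint(prefix: str) -> str:
    """Short hash of the prefix; if it changes between requests, cache hits are unlikely."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


p1, s1 = build_prompt("Where is my order?", tool_output="order 123: shipped")
p2, s2 = build_prompt("Cancel my order", tool_output="order 456: pending")
# The prefixes match even though the dynamic parts differ.
print(prefix_fingerprint(p1) == prefix_fingerprint(p2))
```

Logging the prefix fingerprint alongside each request is a cheap way to spot templates whose prefix drifts, for example because a timestamp or tool result was placed near the top.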

Avoid these common mistakes:

Anti-pattern	Why it fails
Optimizing prompts before metering	You cannot prove savings or find the real bottleneck
Swapping models without benchmarks	Quality drops; teams roll back and assume cheaper models don't work
Caching dynamic tool output in the prefix	Invalidates cache blocks; can increase latency
Trimming outputs when input dominates	Input bloat in agent loops often accounts for 80%+ of spend
Using the largest context window by default	Encourages sending more data "just in case"
Algorithmic compression on short prompts	Overhead exceeds savings; risks quality degradation
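For the "swapping models without benchmarks" row, a minimal sketch of a benchmark gate might look like the following. It assumes you already score both models on the same evaluation set with scores in [0, 1]. The function name, the tolerance, and the sample scores are hypothetical.

```python
def should_switch(baseline_scores, candidate_scores, max_quality_drop=0.02):
    """Accept the cheaper model only if its mean eval score drops by at most max_quality_drop."""
    if not baseline_scores or len(baseline_scores) != len(candidate_scores):
        raise ValueError("score both models on the same non-empty eval set")
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    return (baseline_mean - candidate_mean) <= max_quality_drop


# Made-up per-example scores for the current model and two cheaper candidates.
frontier = [0.90, 0.85, 0.95, 0.80]
cheap_ok = [0.90, 0.84, 0.93, 0.79]
cheap_bad = [0.70, 0.60, 0.80, 0.50]
print(should_switch(frontier, cheap_ok))
print(should_switch(frontier, cheap_bad))
```

A mean-score threshold is the simplest possible gate. Per-task thresholds or a paired comparison may suit workloads where a few critical tasks dominate quality.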

If you need immediate impact with minimal infrastructure: