Where to start

Optimize the right layer first: when your context is 10× larger than your completions, a 30% input trim saves more than a 30% output trim, unless output tokens cost more than 10× as much as input tokens.
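To make that arithmetic concrete, here is a rough sketch. The token counts and per-million-token prices are hypothetical placeholders, not real provider rates, and the function name is illustrative; substitute your own numbers.

```python
# Hypothetical prices in USD per million tokens; replace with your provider's rates.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 6.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request under the assumed linear per-token prices."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# A request whose context is 10x larger than its completion.
baseline = request_cost(input_tokens=10_000, output_tokens=1_000)
trimmed_output = request_cost(input_tokens=10_000, output_tokens=700)  # 30% output trim
trimmed_input = request_cost(input_tokens=7_000, output_tokens=1_000)  # 30% input trim

print(f"baseline:        ${baseline:.4f}")
print(f"30% output trim: ${trimmed_output:.4f} (saves {1 - trimmed_output / baseline:.1%})")
print(f"30% input trim:  ${trimmed_input:.4f} (saves {1 - trimmed_input / baseline:.1%})")
```

Under these assumed prices the output trim saves about 5% and the input trim about 25%. With a 10:1 input-to-output ratio, the two trims only break even when output tokens are priced at 10× input tokens.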

Prerequisite: measure before you optimize

Before changing prompts, models, or caching, you need a baseline. See Article I: immutable metering for the architectural requirements.

Without this baseline, you cannot tell whether prompt changes, model swaps, or context compression moved the needle.
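As a minimal sketch of what such a baseline could look like, the snippet below aggregates metered requests by prompt template. The `UsageRecord` fields, template names, model names, and costs are all illustrative assumptions, not a schema prescribed by Article I.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    """One metered request. Field names are illustrative, not a standard schema."""
    template: str        # which prompt template produced the request
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


def spend_by_template(records: list[UsageRecord]) -> dict[str, float]:
    """Total spend per template, most expensive first."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r.template] += r.cost_usd
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Made-up sample data standing in for real metering output.
records = [
    UsageRecord("support_reply", "large-model", 12_000, 400, 0.040),
    UsageRecord("summarize", "small-model", 3_000, 300, 0.002),
    UsageRecord("support_reply", "large-model", 15_000, 500, 0.050),
]
baseline_spend = spend_by_template(records)
print(baseline_spend)
```

Re-running the same aggregation after a prompt or model change gives a like-for-like comparison, provided the records are attributed the same way before and after.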

Narev provides live model pricing and cost calculation if you need normalized USD comparisons across providers.

Optimization sequence

Work through these steps in order. Each layer compounds on the previous one.

| Step | Technique | Expected impact | Guide |
| --- | --- | --- | --- |
| 1 | Metering + attribution | Prerequisite | Article I |
| 2 | Prompt hygiene | ~20–25% on high-volume templates | Prompt hygiene |
| 3 | Model routing | 60–95% depending on task mix | Model routing |
| 4 | Provider prompt caching | 41–80% on cached input | Prompt caching |
| 5 | Context hygiene | 40–60% in agentic workloads | Context hygiene |
| 6 | Output control + RAG discipline | 15–40% on verbose or over-fetched workloads | Output and RAG |
| 7 | Semantic caching | Eliminates inference on cache hits | Semantic caching |
| 8 | Advanced compression | Variable; benchmark required | |

Why this order matters:

  1. Prompt hygiene is free and fast. Concise prompts with schema-enforced outputs often cut costs 20%+ with no infrastructure changes.
  2. Model routing is the biggest lever, but only after you know which tasks actually need frontier models.
  3. Caching requires stable prefixes. Fix prompt structure before expecting cache hits.
  4. Context hygiene matters most in agents. Single-turn chat apps may see little benefit from step 5.
  5. Semantic caching is workload-dependent. High-repetition FAQs benefit; unique generative tasks do not.
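The point about stable prefixes can be sketched as follows. This is a minimal illustration, not any provider's API: the prompt text, the tool definition, and the helper names are hypothetical, and real caching rules (minimum prefix length, how the cache is keyed) vary by provider.

```python
import hashlib

# Hypothetical static content; in practice this is your system prompt and tool schema.
SYSTEM_PROMPT = "You are a support assistant. Answer concisely."
TOOL_DEFINITIONS = '[{"name": "lookup_order", "parameters": {"order_id": "string"}}]'


def build_prompt(user_message: str, tool_output: str = "") -> tuple[str, str]:
    """Return (static_prefix, dynamic_suffix); static content always comes first."""
    prefix = f"{SYSTEM_PROMPT}\n\nTools:\n{TOOL_DEFINITIONS}\n"
    suffix = f"\nTool output:\n{tool_output}\n\nUser:\n{user_message}\n"
    return prefix, suffix


def prefix_fingerprint(prefix: str) -> str:
    """Short hash of the prefix; if it changes between requests, expect cache misses."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


p1, s1 = build_prompt("Where is my order?", tool_output="order 123: shipped")
p2, s2 = build_prompt("Cancel my order.", tool_output="order 456: pending")

# The prefix is byte-identical across requests even though the suffixes differ.
assert prefix_fingerprint(p1) == prefix_fingerprint(p2)
assert s1 != s2
```

Logging the fingerprint alongside each request is one cheap way to notice when a template change, timestamp, or tool output has leaked into the prefix.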

Anti-patterns

Avoid these common mistakes:

| Anti-pattern | Why it fails |
| --- | --- |
| Optimizing prompts before metering | You cannot prove savings or find the real bottleneck |
| Swapping models without benchmarks | Quality drops; teams roll back and assume cheaper models don't work |
| Caching dynamic tool output in the prefix | Invalidates cache blocks; can increase latency |
| Trimming outputs when input dominates | Input bloat in agent loops often accounts for 80%+ of spend |
| Using the largest context window by default | Encourages sending more data "just in case" |
| Algorithmic compression on short prompts | Overhead exceeds savings; risks quality degradation |

Quick wins this week

If you need immediate impact with minimal infrastructure:

  1. Audit your top 10 most expensive prompt templates for scaffolding waste
  2. Add max_tokens everywhere it is missing
  3. Move static system prompts and tool definitions to the start of your prompt
  4. Enable provider-side prefix caching for stable content
  5. Set per-session token budgets with graceful degradation at 90%
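The last item, a per-session budget with graceful degradation at 90%, might look something like the sketch below. The `SessionBudget` class, its mode names, and the output caps are hypothetical choices for illustration. What "degraded" means in practice (shorter answers, a cheaper model, fewer retrieved documents) is up to your application.

```python
from dataclasses import dataclass


@dataclass
class SessionBudget:
    """Hypothetical per-session token budget with a soft limit at 90%."""
    limit_tokens: int
    used_tokens: int = 0
    soft_ratio: float = 0.9

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Add one request's usage to the session total."""
        self.used_tokens += input_tokens + output_tokens

    def mode(self) -> str:
        """'normal' below the soft limit, 'degraded' from 90%, 'stop' at 100%."""
        if self.used_tokens >= self.limit_tokens:
            return "stop"
        if self.used_tokens >= self.soft_ratio * self.limit_tokens:
            return "degraded"
        return "normal"

    def max_output_tokens(self, default: int = 1024, degraded: int = 256) -> int:
        """Output cap to pass as the request's max_tokens-style parameter."""
        remaining = max(self.limit_tokens - self.used_tokens, 0)
        cap = degraded if self.mode() == "degraded" else default
        return min(cap, remaining)


budget = SessionBudget(limit_tokens=100_000)
budget.record(input_tokens=85_000, output_tokens=4_000)   # 89% used
print(budget.mode(), budget.max_output_tokens())
budget.record(input_tokens=1_500, output_tokens=500)      # 91% used
print(budget.mode(), budget.max_output_tokens())
```

Deriving the output cap from the budget also covers item 2: every request gets an explicit limit instead of relying on the provider default.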
