Context inflation

Context inflation is the gradual increase in tokens consumed per AI request over time. It drives costs up even when per-token prices stay flat.

A team that sends the same number of API calls each month can still see costs multiply because each call carries more system prompts, conversation history, retrieved documents, tool outputs, and reasoning tokens than it did six months ago.

Why it happens

Modern models and workflows consume more tokens per task through several compounding trends:

  • Longer context windows: Limits of 128K → 200K → 1M tokens encourage sending more data "just in case"
  • Test-time compute: Models that reason or "think" longer generate additional internal tokens
  • Agentic workflows: Multi-step agents iterate, verify, and refine, multiplying token usage per user task
  • RAG over-fetching: Retrieval systems pull large document chunks instead of the minimum relevant context
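The agentic trend compounds fastest when each step re-sends everything produced so far. The sketch below is a rough model of that effect, not a measurement of any particular provider or framework. The function name, the fixed per-step output size, and the sample numbers are all illustrative assumptions.

```python
def agent_input_tokens(system_tokens: int, step_output_tokens: int, steps: int) -> int:
    """Total input tokens sent across an agent run.

    Assumption: every step re-sends the system prompt plus the full
    history of earlier step outputs, and each step adds a fixed
    number of tokens to that history.
    """
    total = 0
    history = 0
    for _ in range(steps):
        total += system_tokens + history  # input sent for this step
        history += step_output_tokens     # this step's output joins the history
    return total


# Hypothetical numbers: 1,000-token system prompt, 500 tokens added per step.
single_call = agent_input_tokens(1_000, 500, steps=1)
ten_steps = agent_input_tokens(1_000, 500, steps=10)
print(single_call, ten_steps)
```

Under these assumptions, input tokens grow faster than the step count because the history is paid for again on every step.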

What it costs

The math is simple but easy to overlook. A request that once consumed about 500 tokens might now consume 50,000 tokens as the model plans, researches, writes, and refines. Even if per-token pricing stays flat, you are burning 100× more tokens for the same user action.

Scenario                      Tokens per request   Relative cost
Simple chat (2023)            ~500                 1× (baseline)
Agent workflow (today)        ~50,000              100×
Long-context RAG + history    ~200,000+            400×+
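The multipliers above follow from dividing each scenario's tokens by the 500-token baseline. Here is a minimal sketch of that arithmetic. The price of 1.00 per million tokens and the monthly request count are hypothetical placeholders, not real provider rates.

```python
BASELINE_TOKENS = 500  # the "simple chat" scenario

SCENARIOS = {
    "Simple chat (2023)": 500,
    "Agent workflow (today)": 50_000,
    "Long-context RAG + history": 200_000,
}


def relative_cost(tokens: int, baseline: int = BASELINE_TOKENS) -> float:
    """Cost multiplier versus the baseline, assuming a flat per-token price."""
    return tokens / baseline


def monthly_cost(requests: int, tokens_per_request: int, price_per_million: float) -> float:
    """Monthly spend for a fixed request volume (price is a hypothetical input)."""
    return requests * tokens_per_request / 1_000_000 * price_per_million


for name, tokens in SCENARIOS.items():
    # Same 100,000 requests per month in every scenario; only tokens per request change.
    spend = monthly_cost(100_000, tokens, price_per_million=1.00)
    print(f"{name}: {relative_cost(tokens):.0f}x, {spend:,.2f} per month")
```

The request count never changes in this example, which is the point: spend scales with tokens per request alone.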

How to prevent it

  1. Set context budgets: Cap the number of tokens sent per request at the application layer
  2. Summarize history: Compress older conversation turns instead of re-sending them verbatim
  3. Retrieve selectively: Fetch only the document chunks relevant to the query
  4. Use caching: Many providers offer discounted rates for cached input tokens
  5. Measure per feature: Attribute token growth to specific product features, not just "AI costs went up"
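Steps 1 and 5 can be sketched at the application layer as follows. This is an illustration under stated assumptions, not a reference implementation: `estimate_tokens` uses a rough four-characters-per-token heuristic in place of a real tokenizer, and `fit_to_budget`, `record_usage`, and the feature names are hypothetical.

```python
from collections import Counter


def estimate_tokens(text: str) -> int:
    """Rough size estimate (assumption: about 4 characters per token).

    A real application would use its provider's tokenizer instead.
    """
    return max(1, len(text) // 4)


def fit_to_budget(system_prompt: str, history: list[str], budget: int) -> list[str]:
    """Return the most recent history turns that fit within the token budget.

    The system prompt is always counted first; older turns are dropped
    once the remaining budget is exhausted.
    """
    remaining = budget - estimate_tokens(system_prompt)
    kept: list[str] = []
    for turn in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return kept


# Per-feature accounting: token totals keyed by a hypothetical feature name.
usage_by_feature: Counter[str] = Counter()


def record_usage(feature: str, tokens: int) -> None:
    """Attribute tokens from one request to the product feature that sent it."""
    usage_by_feature[feature] += tokens


system = "You are a helpful assistant." * 2
turns = ["first question " * 10, "first answer " * 10, "latest question " * 10]
trimmed = fit_to_budget(system, turns, budget=100)
record_usage("support_chat", estimate_tokens(system) + sum(map(estimate_tokens, trimmed)))
print(len(trimmed), dict(usage_by_feature))
```

Dropping the oldest turns is the simplest policy. Summarizing them instead (step 2) would keep more information for the same budget, at the cost of an extra model call.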

See also: Input and output tokens · Context hygiene


Tokenminning · Built by Narev