Context inflation

Context inflation is the gradual increase in tokens consumed per AI request over time, even when per-token prices stay flat.

A team that sends the same number of API calls each month can still see costs multiply because each call carries a longer system prompt and more conversation history, retrieved documents, tool outputs, and reasoning tokens than it did six months ago.

Why it happens

Modern models and workflows consume more tokens per task through several compounding trends:

Longer context windows: As limits grow from 128K to 200K to 1M tokens, it becomes tempting to send more data "just in case"
Test-time compute: Models that reason or "think" longer generate additional internal tokens
Agentic workflows: Multi-step agents iterate, verify, and refine — multiplying token usage per user task
RAG over-fetching: Retrieval systems pull large document chunks instead of the minimum relevant context
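
To make the agentic trend concrete, here is a minimal sketch of how input tokens accumulate when an agent re-sends its whole context on every step. It assumes a simplified model in which each step appends a fixed number of tokens (tool output plus model reply); the function name and the numbers are illustrative, not measurements.

```python
def cumulative_input_tokens(base_context: int, added_per_step: int, steps: int) -> int:
    """Total input tokens across an agent run that re-sends its full context each step.

    Assumption: every step appends `added_per_step` tokens to the context,
    and the whole context is sent again on the next step.
    """
    total = 0
    context = base_context
    for _ in range(steps):
        total += context           # the full context is billed on this step
        context += added_per_step  # tool output and model reply are appended
    return total


# Hypothetical numbers: a 2,000-token starting context, 1,500 tokens added per step.
single_call = cumulative_input_tokens(2_000, 1_500, steps=1)
ten_steps = cumulative_input_tokens(2_000, 1_500, steps=10)
growth = ten_steps / single_call
```

Under these assumptions, ten steps cost far more than ten times a single call, because every later step pays again for everything that came before it.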

What it costs

The math is simple but easy to overlook. A request that once used about 500 tokens might now consume 50,000 as the model plans, researches, writes, and refines. Even if per-token pricing stays flat, that is 100× more tokens, and therefore 100× the cost, for the same user action.

Scenario	Tokens per request	Relative cost
Simple chat (2023)	~500	1×
Agent workflow (today)	~50,000	100×
Long-context RAG + history	~200,000+	400×+
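
The multipliers in the table can be reproduced with a few lines of arithmetic. The sketch below uses the table's token counts; the request volume and the price of 1.00 per million tokens are placeholder assumptions chosen only to make the numbers easy to read, not real provider pricing.

```python
BASELINE_TOKENS = 500  # the "Simple chat" row from the table

scenarios = {
    "Simple chat (2023)": 500,
    "Agent workflow (today)": 50_000,
    "Long-context RAG + history": 200_000,
}


def relative_cost(tokens: int, baseline: int = BASELINE_TOKENS) -> float:
    """Cost multiplier versus the baseline, assuming a flat per-token price."""
    return tokens / baseline


def monthly_cost(requests: int, tokens_per_request: int, price_per_million: float) -> float:
    """Monthly spend for a fixed request volume at a flat per-token price."""
    return requests * tokens_per_request * price_per_million / 1_000_000


multipliers = {name: relative_cost(tokens) for name, tokens in scenarios.items()}

# Hypothetical volume and price: 100,000 requests per month at 1.00 per million tokens.
chat_bill = monthly_cost(100_000, 500, price_per_million=1.0)
agent_bill = monthly_cost(100_000, 50_000, price_per_million=1.0)
```

With the request count held constant, the bill scales directly with tokens per request, which is why flat pricing does not protect a budget from context inflation.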

How to prevent it

Set context budgets: Cap the number of tokens sent per request at the application layer
Summarize history: Compress older conversation turns instead of re-sending them verbatim
Retrieve selectively: Fetch only the document chunks relevant to the query
Use caching: Many providers offer discounted rates for cached input tokens
Measure per feature: Attribute token growth to specific product features, not just "AI costs went up"
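
As one possible way to combine a context budget with per-feature measurement, the sketch below caps each request at a token budget and tallies usage by feature. All names here (`estimate_tokens`, `fit_to_budget`, `record_usage`) are hypothetical, and the token count is a rough four-characters-per-token heuristic rather than a real tokenizer. For simplicity it drops the oldest turns instead of summarizing them; a summarization step could replace that part.

```python
from collections import Counter


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    # Swap in your provider's tokenizer for real counts.
    return max(1, len(text) // 4)


def fit_to_budget(system_prompt: str, history: list[str], budget: int) -> list[str]:
    """Keep the system prompt plus the most recent turns that fit in `budget` tokens."""
    remaining = budget - estimate_tokens(system_prompt)
    kept: list[str] = []
    for turn in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(turn)
        if cost > remaining:
            break  # this turn and everything older is dropped
        kept.append(turn)
        remaining -= cost
    return [system_prompt] + kept[::-1]  # restore chronological order


usage_by_feature = Counter()


def record_usage(feature: str, messages: list[str]) -> None:
    """Attribute the estimated tokens of one request to a product feature."""
    usage_by_feature[feature] += sum(estimate_tokens(m) for m in messages)


# Usage with made-up data: three turns of roughly 100, 50, and 25 tokens.
system = "You are a helpful assistant."
history = ["a" * 400, "b" * 200, "c" * 100]
messages = fit_to_budget(system, history, budget=100)
record_usage("chat", messages)
```

Enforcing the cap at the application layer keeps the limit independent of whichever model or context window is in use, and the per-feature tally shows which part of the product is driving growth.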
