Context hygiene
In agent workflows, keep what you send to the model lean, current, and task-relevant. Unchecked context growth is how token costs multiply.
This implements Article III: context window sovereignty. For the underlying concept, see "What is context inflation?"
Expected impact
In production agentic workloads, disciplined context hygiene typically delivers a 40–60% reduction in input tokens. The savings come from not re-sending data the model does not need for the current turn.
State over history
An agent turn does not need the full transcript of everything that ever happened. It needs the current state of the task.
Valid state summary:
Research complete. Found 3 relevant sources.
Key finding: API pricing changed March 2025, cache tokens now 50% off.
Next action: Draft comparison table.
Tokens used this session: 12,400 / 50,000 budget.

Invalid context append:
[Full 8,000-token web page content pasted here]
[Full JSON response from previous tool call]
[Complete conversation history from turn 1]

Appending raw API responses, full web page scrapes, or complete file contents to context is state pollution.
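As a minimal sketch of the valid pattern, the state summary can be generated from a small typed object rather than assembled by hand. The `TaskState` class and its field names are hypothetical, chosen only to mirror the example summary above.

```python
from dataclasses import dataclass, field


@dataclass
class TaskState:
    """Hypothetical compact task state, sent each turn instead of the transcript."""
    status: str
    key_findings: list[str] = field(default_factory=list)
    next_action: str = ""
    tokens_used: int = 0
    token_budget: int = 50_000

    def render(self) -> str:
        # Emit a few short lines; raw pages and tool payloads never appear here.
        lines = [self.status]
        lines += [f"Key finding: {finding}" for finding in self.key_findings]
        if self.next_action:
            lines.append(f"Next action: {self.next_action}")
        lines.append(
            f"Tokens used this session: {self.tokens_used:,} / "
            f"{self.token_budget:,} budget."
        )
        return "\n".join(lines)


state = TaskState(
    status="Research complete. Found 3 relevant sources.",
    key_findings=["API pricing changed March 2025, cache tokens now 50% off."],
    next_action="Draft comparison table.",
    tokens_used=12_400,
)
print(state.render())
```

Because the object has fixed fields, there is no natural place to paste a full page scrape or JSON payload, which is the point of the design.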
Rolling summarization
Instead of re-sending full conversation history on every turn, compress older turns into summaries.
| Strategy | When to use | Cadence |
|---|---|---|
| Sliding window | Short sessions, recent context matters most | Keep last 3–5 turns verbatim |
| Rolling summary | Medium sessions, 10–50 turns | Summarize oldest N turns every 10–20 turns |
| Full compaction | Long sessions approaching budget | Replace entire history with structured state |
Use a cheap model for summarization. One summarization call that saves 20,000 input tokens on subsequent turns pays for itself immediately.
Trigger compression at ~80% of your session budget, not reactively when the context window is full. Proactive compression avoids user-facing latency spikes.
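The proactive trigger can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `estimate_tokens` uses a rough 4-characters-per-token heuristic, `compact_history` and its parameters are hypothetical names, and `fake_summarize` stands in for a call to a cheap summarization model.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def compact_history(
    turns: list[str],
    summarize: Callable[[list[str]], str],
    budget: int,
    keep_recent: int = 4,
    trigger_ratio: float = 0.8,
) -> list[str]:
    """Fold older turns into one summary once usage reaches trigger_ratio of budget."""
    used = sum(estimate_tokens(turn) for turn in turns)
    split = len(turns) - keep_recent
    if used < trigger_ratio * budget or split <= 0:
        return turns  # under the threshold, or nothing old enough to fold
    old, recent = turns[:split], turns[split:]
    summary = f"[Summary of {len(old)} earlier turns] {summarize(old)}"
    return [summary] + recent


def fake_summarize(old_turns: list[str]) -> str:
    # Stand-in for a cheap-model summarization call.
    return "; ".join(turn[:20] for turn in old_turns)


history = [f"turn {i}: " + "x" * 400 for i in range(10)]
compacted = compact_history(history, fake_summarize, budget=1_200)
print(len(history), "->", len(compacted))
```

The most recent turns stay verbatim, so this combines the sliding-window and rolling-summary rows of the table; full compaction would replace `recent` as well.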
Working memory vs episodic memory
Separate what the model needs now from what happened historically.
| Memory type | Storage | In context? |
|---|---|---|
| Working memory | Compressed state summary | Yes, strictly budgeted |
| Episodic memory | Database, vector store, event log | No — retrieved selectively |
Conflating the two — dumping database logs into context because "the model might need it" — is how 500-token tasks become 50,000-token tasks. Episodic memory is queried, not injected.
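To illustrate "queried, not injected", the sketch below keeps an event log in an in-memory SQLite database and pulls only a few rows that match the current turn's topic. The `events` schema, the `recall` helper, and the sample rows are all hypothetical.

```python
import sqlite3

# Episodic memory lives outside the prompt, here in an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (turn INTEGER, topic TEXT, detail TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "pricing", "Pricing page fetched"),
        (2, "auth", "OAuth token refreshed"),
        (3, "pricing", "Cache token discount noted"),
    ],
)


def recall(topic: str, limit: int = 2) -> list[str]:
    """Query the episodic store for a few rows relevant to the current turn."""
    rows = conn.execute(
        "SELECT detail FROM events WHERE topic = ? ORDER BY turn DESC LIMIT ?",
        (topic, limit),
    ).fetchall()
    return [detail for (detail,) in rows]


# Working memory is always present; episodic rows are added only on demand.
working_memory = "Next action: draft comparison table."
prompt = "\n".join([working_memory, *recall("pricing")])
print(prompt)
```

The `limit` argument is what keeps the retrieved slice budgeted; unrelated rows such as the auth event never reach the prompt.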
Structured memory extraction
Instead of accumulating raw conversation, maintain a structured memory object:
{
"user_preferences": { "language": "en", "format": "concise" },
"task_state": { "step": "drafting", "sources_found": 3 },
"key_facts": ["pricing changed March 2025", "cache tokens 50% off"]
}

Inject only the fields relevant to the current turn.
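One way to select fields per turn is a small lookup from turn type to the memory keys it needs. This is a sketch: the `FIELDS_BY_TURN` mapping and the turn-type names are hypothetical, and a real system might choose fields dynamically instead.

```python
import json

memory = {
    "user_preferences": {"language": "en", "format": "concise"},
    "task_state": {"step": "drafting", "sources_found": 3},
    "key_facts": ["pricing changed March 2025", "cache tokens 50% off"],
}

# Hypothetical mapping from turn type to the memory fields that turn needs.
FIELDS_BY_TURN = {
    "draft": ["task_state", "key_facts"],
    "format_reply": ["user_preferences"],
}


def context_for(turn_type: str) -> str:
    """Serialize only the memory fields mapped to this turn type."""
    selected = {key: memory[key] for key in FIELDS_BY_TURN.get(turn_type, [])}
    # Compact separators avoid spending tokens on whitespace.
    return json.dumps(selected, separators=(",", ":"))


print(context_for("format_reply"))
```

An unknown turn type yields an empty object, so the default is to inject nothing rather than everything.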
Tool-output truncation gate
Tool outputs must pass through a truncation or summarization gate before entering context.
Enforcement rules:
- Raw tool output is never appended directly to context
- Outputs exceeding a defined token threshold are summarized automatically
- Context size is monitored per turn; a turn that grows context by more than a set threshold raises a review alert
- "Send everything" retrieval strategies require explicit budget allocation
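The first three rules above could be enforced with a gate along these lines. It is a sketch under assumptions: the thresholds are arbitrary, `estimate_tokens` is a rough 4-characters-per-token heuristic, and `gate_tool_output` and `growth_alert` are hypothetical helper names.

```python
from typing import Callable, Optional


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def gate_tool_output(
    raw: str,
    max_tokens: int = 500,
    summarize: Optional[Callable[[str], str]] = None,
) -> str:
    """Every tool output passes through here before it may enter context."""
    if estimate_tokens(raw) <= max_tokens:
        return raw  # small enough to pass unchanged
    if summarize is not None:
        return summarize(raw)  # e.g. a cheap-model summary
    kept = raw[: max_tokens * 4]
    return f"{kept}\n[truncated {len(raw) - len(kept)} of {len(raw)} characters]"


def growth_alert(previous_tokens: int, current_tokens: int, max_growth: int = 2_000) -> bool:
    """Flag a turn whose context grew by more than max_growth tokens."""
    return current_tokens - previous_tokens > max_growth


big = gate_tool_output("a" * 10_000, max_tokens=100)
print(big[-40:])
```

The truncation marker tells the model that content was cut, so it can request a targeted follow-up instead of assuming the output was complete.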
Just-in-time retrieval
Fetch only what the current turn needs. Do not dump entire files, repositories, or document collections into context preemptively.
| Pattern | Token cost |
|---|---|
| Dump full repo into context | Very high, mostly irrelevant |
| Retrieve top-3 relevant chunks per query | Proportional to task |
| Structured memory + selective retrieval | Lowest, scales with task complexity |
See Output and RAG for retrieval-specific discipline.
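The top-k row of the table can be illustrated with a toy retriever. The word-overlap `score` function is an assumption made to keep the example self-contained; a production system would more likely use embeddings or a search index. All names and sample chunks are hypothetical.

```python
def score(query: str, chunk: str) -> int:
    # Toy relevance score (assumption): number of shared lowercase words.
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return at most k chunks, best match first, skipping zero-score chunks."""
    ranked = sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)
    return [chunk for chunk in ranked[:k] if score(query, chunk) > 0]


chunks = [
    "cache tokens are billed at a discount",
    "the login flow uses oauth",
    "pricing changed for input tokens",
    "release notes for the mobile app",
]
hits = top_k("pricing for cache tokens", chunks, k=2)
print(hits)
```

Only the selected chunks are added to the prompt, so token cost tracks `k` and chunk size rather than the size of the corpus.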
Sub-agents for context isolation
For complex multi-step tasks, isolate work into sub-agents with focused context:
- Parent agent holds task state and orchestration
- Sub-agent receives only the inputs needed for its subtask
- Sub-agent returns a compressed result, not raw tool output
- Parent context stays bounded regardless of sub-agent depth
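The isolation pattern above can be sketched as a function boundary: the sub-agent builds its own context from the inputs it is handed and returns a bounded result. `run_subagent` and `fake_worker` are hypothetical, and slicing to `max_chars` is a crude stand-in for real compression such as summarization.

```python
from typing import Callable


def run_subagent(
    task: str,
    inputs: list[str],
    work: Callable[[str, list[str]], str],
    max_chars: int = 200,
) -> str:
    """Run a subtask in its own context and return only a bounded result."""
    context = [task, *inputs]  # built fresh; nothing from the parent is included
    result = work(task, context)
    return result[:max_chars]  # crude compression stand-in (assumption)


def fake_worker(task: str, context: list[str]) -> str:
    # Stand-in for a model call that produces verbose output.
    return f"{task}: done. " + "detail " * 500


# The parent holds only task state plus one bounded result per subtask.
parent_state = ["Goal: compare API pricing"]
for subtask, inputs in [("Fetch pricing", ["url A"]), ("Extract discounts", ["page text"])]:
    parent_state.append(run_subagent(subtask, inputs, fake_worker))

print([len(entry) for entry in parent_state])
```

Each subtask adds at most `max_chars` characters to the parent, so parent context grows linearly with the number of subtasks and not with how much work each sub-agent did.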
When context hygiene is not enough
If costs remain high after context discipline:
- Static prefix is large and repeated → Prompt caching
- Frontier model used for simple subtasks → Model routing
- Outputs are verbose despite constraints → Output and RAG