Context hygiene

In agent workflows, keep what you send to the model lean, current, and task-relevant. Unchecked context growth is how token costs multiply.

This implements Article III: context window sovereignty. For the underlying concept, see "What is context inflation?"

Expected impact

In production agentic workloads, disciplined context hygiene typically delivers a 40–60% reduction in input tokens. The savings come from not re-sending data the model does not need for the current turn.

State over history

An agent turn does not need the full transcript of everything that has happened so far. It needs the current state of the task.

Valid state summary:

Research complete. Found 3 relevant sources.
Key finding: API pricing changed March 2025, cache tokens now 50% off.
Next action: Draft comparison table.
Tokens used this session: 12,400 / 50,000 budget.

Invalid context append:

[Full 8,000-token web page content pasted here]
[Full JSON response from previous tool call]
[Complete conversation history from turn 1]

Appending raw API responses, full web page scrapes, or complete file contents to context is state pollution.
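
One way to keep state compact is to hold it in a small object and render it on each turn, rather than replaying the transcript. The sketch below is illustrative only: the `TaskState` class and its field names are hypothetical, chosen to mirror the valid state summary shown earlier.

```python
from dataclasses import dataclass, field


@dataclass
class TaskState:
    """Hypothetical container for the current state of a task."""
    status: str
    key_findings: list[str] = field(default_factory=list)
    next_action: str = ""
    tokens_used: int = 0
    token_budget: int = 0

    def render(self) -> str:
        """Render a compact summary to send instead of the raw transcript."""
        lines = [self.status]
        lines += [f"Key finding: {finding}" for finding in self.key_findings]
        if self.next_action:
            lines.append(f"Next action: {self.next_action}")
        lines.append(
            f"Tokens used this session: {self.tokens_used:,} / "
            f"{self.token_budget:,} budget."
        )
        return "\n".join(lines)


state = TaskState(
    status="Research complete. Found 3 relevant sources.",
    key_findings=["API pricing changed March 2025, cache tokens now 50% off."],
    next_action="Draft comparison table.",
    tokens_used=12_400,
    token_budget=50_000,
)
summary = state.render()
```

Here `summary` is a four-line string, and it is the only part of the session history that needs to travel with the next request.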

Rolling summarization

Instead of re-sending full conversation history on every turn, compress older turns into summaries.

Strategy	When to use	Cadence
Sliding window	Short sessions, recent context matters most	Keep last 3–5 turns verbatim
Rolling summary	Medium sessions, 10–50 turns	Summarize oldest N turns every 10–20 turns
Full compaction	Long sessions approaching budget	Replace entire history with structured state

Use a cheap model for summarization. One summarization call that saves 20,000 input tokens on subsequent turns pays for itself immediately.
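
The break-even claim can be checked with simple arithmetic. The prices below are hypothetical placeholders, not quotes from any provider; substitute your own rates. The sketch assumes a 21,000-token history compressed to a 1,000-token summary, which saves 20,000 input tokens per subsequent turn.

```python
import math

# Hypothetical prices in USD per million tokens (assumptions, not real quotes).
CHEAP_INPUT_PER_M = 0.25   # summarization model, input
CHEAP_OUTPUT_PER_M = 1.25  # summarization model, output
MAIN_INPUT_PER_M = 3.00    # main agent model, input


def summarization_cost(history_tokens: int, summary_tokens: int) -> float:
    """One-off cost of compressing the history with the cheap model."""
    return (
        history_tokens * CHEAP_INPUT_PER_M + summary_tokens * CHEAP_OUTPUT_PER_M
    ) / 1_000_000


def saving_per_turn(tokens_saved: int) -> float:
    """Input cost avoided on each later turn of the main model."""
    return tokens_saved * MAIN_INPUT_PER_M / 1_000_000


cost = summarization_cost(history_tokens=21_000, summary_tokens=1_000)
saving = saving_per_turn(tokens_saved=20_000)
break_even_turns = math.ceil(cost / saving)
```

Under these assumed prices the summarization call is recovered within the first subsequent turn. A narrower price gap between the two models would lengthen the break-even point.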

Trigger compression at ~80% of your session budget, not reactively when the context window is full. Proactive compression avoids user-facing latency spikes.
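
A minimal sketch of a proactive trigger follows. It makes several assumptions: `estimate_tokens` uses a rough 4-characters-per-token heuristic rather than a real tokenizer, the budget is compared against the size of the current history, and `stub_summarize` stands in for a call to a cheap summarization model. All function names are hypothetical.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def compact_history(
    turns: list[str],
    summarize: Callable[[list[str]], str],
    session_budget: int,
    keep_recent: int = 4,
    trigger_ratio: float = 0.8,
) -> list[str]:
    """Replace older turns with one summary once usage nears the budget."""
    used = sum(estimate_tokens(turn) for turn in turns)
    if used < trigger_ratio * session_budget or len(turns) <= keep_recent:
        return turns  # below the threshold: leave history untouched
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(older)] + recent


def stub_summarize(older: list[str]) -> str:
    """Placeholder for a cheap-model summarization call."""
    return f"[Summary of {len(older)} earlier turns]"


history = [f"turn {i}: " + "x" * 400 for i in range(10)]
compacted = compact_history(history, stub_summarize, session_budget=1_000)
```

With the sample data, the ten turns exceed 80% of the 1,000-token budget, so the six oldest turns collapse into one summary line and the last four stay verbatim.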

Working memory vs episodic memory

Separate what the model needs now from what happened historically.

Memory type	Storage	In context?
Working memory	Compressed state summary	Yes, strictly budgeted
Episodic memory	Database, vector store, event log	No — retrieved selectively

Conflating the two — dumping database logs into context because "the model might need it" — is how 500-token tasks become 50,000-token tasks. Episodic memory is queried, not injected.
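
The distinction can be sketched with an in-memory SQLite table standing in for the event log. The schema, the sample rows, and the `recall` helper are all hypothetical; the point is only that context receives a small query result, never the log itself.

```python
import sqlite3

# Episodic memory: an event log that lives outside the context window.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (turn INTEGER, kind TEXT, detail TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "search", "queried pricing pages"),
        (2, "fetch", "downloaded vendor changelog"),
        (3, "note", "cache tokens now 50% off"),
        (4, "fetch", "downloaded competitor page"),
    ],
)


def recall(kind: str, limit: int = 2) -> list[str]:
    """Query the log selectively: most recent events of one kind."""
    rows = conn.execute(
        "SELECT detail FROM events WHERE kind = ? ORDER BY turn DESC LIMIT ?",
        (kind, limit),
    ).fetchall()
    return [detail for (detail,) in rows]


# Working memory: the compressed state that is always in context.
working_memory = "Step: drafting. Sources found: 3."
context = "\n".join([working_memory, *recall("note")])
```

Only the single matching note joins the working memory in `context`; the fetch and search events stay in the database until a turn asks for them.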

Structured memory extraction

Instead of accumulating raw conversation, maintain a structured memory object:

{
  "user_preferences": { "language": "en", "format": "concise" },
  "task_state": { "step": "drafting", "sources_found": 3 },
  "key_facts": ["pricing changed March 2025", "cache tokens 50% off"]
}

Inject only the fields relevant to the current turn.
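
Field selection might look like the sketch below. The `FIELDS_BY_TURN` mapping and the turn-type names are assumptions made for illustration; a real agent would decide relevance in whatever way suits its task.

```python
import json

memory = {
    "user_preferences": {"language": "en", "format": "concise"},
    "task_state": {"step": "drafting", "sources_found": 3},
    "key_facts": ["pricing changed March 2025", "cache tokens 50% off"],
}

# Hypothetical mapping from the kind of turn to the fields it needs.
FIELDS_BY_TURN = {
    "draft": ["task_state", "key_facts"],
    "format": ["user_preferences"],
}


def select_memory(memory: dict, turn_type: str) -> str:
    """Serialize only the fields relevant to this turn, compactly."""
    fields = FIELDS_BY_TURN.get(turn_type, ["task_state"])
    subset = {name: memory[name] for name in fields if name in memory}
    return json.dumps(subset, separators=(",", ":"))


payload = select_memory(memory, "format")
```

For a formatting turn, `payload` carries the user preferences alone; the task state and key facts are left out of the request.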

Tool-output truncation gate

Tool outputs must pass through a truncation or summarization gate before entering context.
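
A gate of this kind could be sketched as follows. The names `gate_tool_output` and `context_growth_alert`, the default thresholds, and the 4-characters-per-token estimate are all assumptions for illustration, not part of any specific framework.

```python
from typing import Callable, Optional


def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def gate_tool_output(
    raw: str,
    max_tokens: int = 200,
    summarize: Optional[Callable[[str], str]] = None,
) -> str:
    """Pass small outputs through; summarize or truncate large ones."""
    total = estimate_tokens(raw)
    if total <= max_tokens:
        return raw
    if summarize is not None:
        return summarize(raw)  # e.g. a call to a cheap model
    kept = raw[: max_tokens * 4]
    return f"{kept}\n[truncated: ~{total - max_tokens} tokens omitted]"


def context_growth_alert(before: int, after: int, max_growth: int = 2_000) -> bool:
    """Flag a turn whose context grew by more than the allowed amount."""
    return (after - before) > max_growth


big = "a" * 4_000
gated = gate_tool_output(big)
```

In this example the 4,000-character output is cut to roughly 200 tokens plus a marker noting how much was dropped, so the model knows the result is partial.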

Enforcement rules:

Raw tool output is never appended directly to context
Outputs exceeding a defined token threshold are summarized automatically
Context size is monitored per turn; turns that grow context by more than a threshold trigger review alerts
"Send everything" retrieval strategies require explicit budget allocation

Just-in-time retrieval

Fetch only what the current turn needs. Do not dump entire files, repositories, or document collections into context preemptively.

Pattern	Token cost
Dump full repo into context	Very high, mostly irrelevant
Retrieve top-3 relevant chunks per query	Proportional to task
Structured memory + selective retrieval	Lowest, scales with task complexity
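
The top-k pattern can be illustrated with a toy retriever. This sketch scores chunks by keyword overlap, which is an assumption made to keep the example dependency-free; a production system would more likely use embeddings or a search index. The `retrieve_top_k` name and the sample chunks are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric words as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return up to k chunks sharing the most words with the query."""
    query_words = tokenize(query)
    scored = [
        (len(query_words & tokenize(chunk)), index, chunk)
        for index, chunk in enumerate(chunks)
    ]
    relevant = [item for item in scored if item[0] > 0]
    # Highest overlap first; ties keep original document order.
    relevant.sort(key=lambda item: (-item[0], item[1]))
    return [chunk for _, _, chunk in relevant[:k]]


chunks = [
    "Cache tokens are billed at a discount.",
    "The office is closed on public holidays.",
    "Pricing for input tokens changed in March 2025.",
    "Output tokens are billed separately from input tokens.",
]
top = retrieve_top_k("How did pricing for cache tokens change?", chunks, k=2)
```

Only the two best-matching chunks enter context; the unrelated chunk about office hours scores zero and is never a candidate.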

See Output and RAG for retrieval-specific discipline.

Sub-agents for context isolation

For complex multi-step tasks, isolate work into sub-agents with focused context:

Parent agent holds task state and orchestration
Sub-agent receives only the inputs needed for its subtask
Sub-agent returns a compressed result, not raw tool output
Parent context stays bounded regardless of sub-agent depth
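
The division of responsibility above can be sketched as follows. `ParentAgent`, `delegate`, and `stub_sub_agent` are hypothetical names, and the character cap on results is an assumed stand-in for a token limit.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ParentAgent:
    """Holds task state and compressed results, never raw tool output."""
    task_state: dict = field(default_factory=dict)
    results: list[str] = field(default_factory=list)

    def delegate(
        self,
        subtask: str,
        inputs: dict,
        run_sub_agent: Callable[[str, dict], str],
        max_result_chars: int = 200,
    ) -> str:
        # The sub-agent sees only its subtask and inputs, not parent state.
        result = run_sub_agent(subtask, inputs)[:max_result_chars]
        self.results.append(f"{subtask}: {result}")
        return result


def stub_sub_agent(subtask: str, inputs: dict) -> str:
    """Placeholder sub-agent: does bulky work, returns a short result."""
    raw_tool_output = "x" * 10_000  # stays inside the sub-agent
    return (
        f"Processed {len(inputs)} input(s); "
        f"raw output of {len(raw_tool_output)} chars discarded."
    )


parent = ParentAgent(task_state={"step": "research"})
result = parent.delegate(
    "summarize pricing page",
    {"url": "https://example.com/pricing"},
    stub_sub_agent,
)
```

The parent records one short line per subtask, so its context grows by a bounded amount however much raw data the sub-agent handled.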

When context hygiene is not enough

If costs remain high after context discipline:

Static prefix is large and repeated → Prompt caching
Frontier model used for simple subtasks → Model routing
Outputs are verbose despite constraints → Output and RAG
