Context hygiene

In agent workflows, keep what you send to the model lean, current, and task-relevant. Unchecked context growth is how token costs multiply.

This implements Article III: context window sovereignty. For the underlying concept, see "What is context inflation?"

Expected impact

In production agentic workloads, disciplined context hygiene typically delivers a 40–60% reduction in input tokens. The savings come from not re-sending data the model does not need for the current turn.

State over history

An agent turn does not need the full transcript of everything that has happened so far. It needs the current state of the task.

Valid state summary:

Research complete. Found 3 relevant sources.
Key finding: API pricing changed March 2025, cache tokens now 50% off.
Next action: Draft comparison table.
Tokens used this session: 12,400 / 50,000 budget.

Invalid context append:

[Full 8,000-token web page content pasted here]
[Full JSON response from previous tool call]
[Complete conversation history from turn 1]

Appending raw API responses, full web page scrapes, or complete file contents to context is state pollution.
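
One way to make the state summary a habit is to build it from a small typed object rather than by hand. The sketch below is illustrative only: `TaskState` and its field names are hypothetical, chosen to mirror the valid summary shown above.

```python
from dataclasses import dataclass, field


@dataclass
class TaskState:
    """Compact task state sent to the model instead of the full transcript."""

    status: str
    key_findings: list[str] = field(default_factory=list)
    next_action: str = ""
    tokens_used: int = 0
    token_budget: int = 0

    def render(self) -> str:
        """Render the state as a few short lines of plain text."""
        lines = [self.status]
        lines += [f"Key finding: {finding}" for finding in self.key_findings]
        if self.next_action:
            lines.append(f"Next action: {self.next_action}")
        lines.append(
            f"Tokens used this session: {self.tokens_used:,} / {self.token_budget:,} budget."
        )
        return "\n".join(lines)


state = TaskState(
    status="Research complete. Found 3 relevant sources.",
    key_findings=["API pricing changed March 2025, cache tokens now 50% off."],
    next_action="Draft comparison table.",
    tokens_used=12_400,
    token_budget=50_000,
)
print(state.render())
```

Because the object has no field for raw pages or tool responses, there is no obvious place to paste them.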

Rolling summarization

Instead of re-sending full conversation history on every turn, compress older turns into summaries.

| Strategy | When to use | Cadence |
| --- | --- | --- |
| Sliding window | Short sessions, recent context matters most | Keep last 3–5 turns verbatim |
| Rolling summary | Medium sessions, 10–50 turns | Summarize oldest N turns every 10–20 turns |
| Full compaction | Long sessions approaching budget | Replace entire history with structured state |

Use a cheap model for summarization. One summarization call that saves 20,000 input tokens on subsequent turns pays for itself immediately.
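
The break-even arithmetic can be sketched as a small function. The prices in the example call are placeholders, not real rates for any provider, and the function name and parameters are hypothetical. Output-token cost of the summary is ignored to keep the sketch short.

```python
def summarization_net_savings(
    tokens_summarized: int,
    summary_tokens: int,
    remaining_turns: int,
    main_price_per_mtok: float,
    cheap_price_per_mtok: float,
) -> float:
    """Net input-cost savings from one summarization call, in the prices' currency."""
    # One-off cost: the cheap model reads the turns being summarized.
    cost = tokens_summarized * cheap_price_per_mtok / 1_000_000
    # Recurring saving: each later turn sends the summary instead of the raw turns.
    saved_per_turn = (
        (tokens_summarized - summary_tokens) * main_price_per_mtok / 1_000_000
    )
    return saved_per_turn * remaining_turns - cost


# Placeholder prices: 3.00 per million input tokens on the main model,
# 0.25 per million on the summarizer. 21,000 tokens shrink to 1,000.
net = summarization_net_savings(21_000, 1_000, 1, 3.00, 0.25)
print(f"net savings after one turn: {net:.5f}")
```

Under these assumed prices the call is already net positive after a single subsequent turn, and the saving repeats on every turn after that.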

Trigger compression at ~80% of your session budget, not reactively when the context window is full. Proactive compression avoids user-facing latency spikes.
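
A minimal sketch of the proactive trigger, combining a sliding window with a rolling summary. Everything here is an assumption for illustration: `estimate_tokens` is a rough character-based heuristic rather than a real tokenizer, and `stub_summarizer` stands in for a call to a cheap model.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def compact_history(
    turns: list[str],
    session_budget: int,
    summarize: Callable[[list[str]], str],
    keep_recent: int = 4,
    trigger_ratio: float = 0.8,
) -> list[str]:
    """Fold older turns into one summary once usage crosses the trigger ratio."""
    used = sum(estimate_tokens(turn) for turn in turns)
    if used < trigger_ratio * session_budget or len(turns) <= keep_recent:
        return turns  # under the threshold, or nothing old enough to compress
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = f"[Summary of {len(older)} earlier turns] {summarize(older)}"
    return [summary] + recent


def stub_summarizer(older: list[str]) -> str:
    """Placeholder for a call to a cheap summarization model."""
    return "; ".join(turn[:20] for turn in older)


history = [f"turn {i}: " + "x" * 400 for i in range(12)]
compacted = compact_history(history, session_budget=1_500, summarize=stub_summarizer)
print(len(history), "->", len(compacted), "entries")
```

The most recent turns stay verbatim, so the model keeps exact wording where it matters most, while everything older collapses into a single entry.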

Working memory vs episodic memory

Separate what the model needs now from what happened historically.

| Memory type | Storage | In context? |
| --- | --- | --- |
| Working memory | Compressed state summary | Yes, strictly budgeted |
| Episodic memory | Database, vector store, event log | No, retrieved selectively |

Conflating the two — dumping database logs into context because "the model might need it" — is how 500-token tasks become 50,000-token tasks. Episodic memory is queried, not injected.
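
The split can be sketched with an in-memory store standing in for the database or vector store. `EpisodicStore`, `build_context`, and the keyword-overlap scoring are all hypothetical simplifications; a real system would more likely use embeddings or a proper query layer.

```python
class EpisodicStore:
    """In-memory stand-in for a database, vector store, or event log."""

    def __init__(self) -> None:
        self._events: list[str] = []

    def record(self, event: str) -> None:
        self._events.append(event)

    def query(self, keywords: set[str], limit: int = 2) -> list[str]:
        """Return up to `limit` events sharing the most words with the keywords."""
        scored = [
            (len(keywords & set(event.lower().split())), event)
            for event in self._events
        ]
        hits = [(score, event) for score, event in scored if score > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [event for _, event in hits[:limit]]


def build_context(working_memory: str, store: EpisodicStore, keywords: set[str]) -> str:
    """Working memory always goes in; episodic memory only when a query matches."""
    retrieved = store.query(keywords)
    return "\n".join([working_memory] + [f"Recalled: {event}" for event in retrieved])


store = EpisodicStore()
store.record("user asked about cache pricing on monday")
store.record("tool call fetched weather data")
store.record("user confirmed pricing table format")
context = build_context("Next action: draft comparison table.", store, {"pricing"})
print(context)
```

The unrelated weather event stays in storage and never reaches the prompt.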

Structured memory extraction

Instead of accumulating raw conversation, maintain a structured memory object:

{
  "user_preferences": { "language": "en", "format": "concise" },
  "task_state": { "step": "drafting", "sources_found": 3 },
  "key_facts": ["pricing changed March 2025", "cache tokens 50% off"]
}

Inject only the fields relevant to the current turn.
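
Selective injection can be as simple as a lookup from turn type to field names. The `FIELDS_BY_TURN` mapping and the turn types in it are hypothetical; the memory object is the one shown above.

```python
import json

memory = {
    "user_preferences": {"language": "en", "format": "concise"},
    "task_state": {"step": "drafting", "sources_found": 3},
    "key_facts": ["pricing changed March 2025", "cache tokens 50% off"],
}

# Hypothetical mapping from turn type to the memory fields that turn needs.
FIELDS_BY_TURN = {
    "draft": ["task_state", "key_facts", "user_preferences"],
    "search": ["task_state"],
}


def memory_for_turn(memory: dict, turn_type: str) -> str:
    """Serialize only the fields mapped to this turn type, without padding."""
    wanted = FIELDS_BY_TURN.get(turn_type, ["task_state"])
    subset = {name: memory[name] for name in wanted if name in memory}
    return json.dumps(subset, separators=(",", ":"))


print(memory_for_turn(memory, "search"))
# {"task_state":{"step":"drafting","sources_found":3}}
```

Unknown turn types fall back to `task_state` alone, so a missing mapping fails toward a smaller context rather than a larger one.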

Tool-output truncation gate

Tool outputs must pass through a truncation or summarization gate before entering context.

Enforcement rules:

  • Raw tool output is never appended directly to context
  • Outputs exceeding a defined token threshold are summarized automatically
  • Context size is monitored per turn; turns that grow context by more than a threshold trigger review alerts
  • "Send everything" retrieval strategies require explicit budget allocation
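
The first three rules above might be enforced with a gate like the following sketch. The function names, the character-based token estimate, and the head-truncation fallback are assumptions; a summarization call to a cheap model could replace `head_truncate`.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def head_truncate(raw: str, max_tokens: int) -> str:
    """Keep the head of the output and note how much was dropped."""
    omitted = estimate_tokens(raw) - max_tokens
    return raw[: max_tokens * 4] + f"\n[truncated: ~{omitted} tokens omitted]"


def gate_tool_output(
    raw: str,
    max_tokens: int,
    reduce: Callable[[str, int], str] = head_truncate,
) -> str:
    """Pass small outputs through; reduce anything over the threshold."""
    if estimate_tokens(raw) <= max_tokens:
        return raw
    return reduce(raw, max_tokens)


def context_grew_too_much(before_tokens: int, after_tokens: int, max_growth: int) -> bool:
    """True when a single turn grew the context by more than the alert threshold."""
    return after_tokens - before_tokens > max_growth


raw_output = "a" * 4_000  # roughly 1,000 tokens under the heuristic
gated = gate_tool_output(raw_output, max_tokens=200)
print(estimate_tokens(raw_output), "->", estimate_tokens(gated))
```

The truncation marker adds a few tokens on top of the threshold, which is a deliberate trade: the model can see that data was omitted instead of assuming the output was complete.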

Just-in-time retrieval

Fetch only what the current turn needs. Do not dump entire files, repositories, or document collections into context preemptively.

| Pattern | Token cost |
| --- | --- |
| Dump full repo into context | Very high, mostly irrelevant |
| Retrieve top-3 relevant chunks per query | Proportional to task |
| Structured memory + selective retrieval | Lowest, scales with task complexity |
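
The top-3 retrieval pattern can be sketched as follows. Word overlap is used here purely as a stand-in for embedding similarity, and `top_k_chunks` plus the sample chunks are hypothetical.

```python
def top_k_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by word overlap with the query and keep at most k matches."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(chunk.lower().split())), chunk) for chunk in chunks
    ]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Drop zero-score chunks so an unrelated chunk never fills a free slot.
    return [chunk for score, chunk in ranked[:k] if score > 0]


chunks = [
    "def parse_config(path): load yaml settings",
    "retry logic for http client with backoff",
    "http client timeout settings and defaults",
    "readme license and contribution notes",
    "unit tests for the yaml parser",
]
selected = top_k_chunks("http client timeout", chunks)
print(selected)
```

Only two chunks match the query, so only two are returned even though `k` is 3. The context cost follows relevance, not the size of the repository.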

See Output and RAG for retrieval-specific discipline.

Sub-agents for context isolation

For complex multi-step tasks, isolate work into sub-agents with focused context:

  • Parent agent holds task state and orchestration
  • Sub-agent receives only the inputs needed for its subtask
  • Sub-agent returns a compressed result, not raw tool output
  • Parent context stays bounded regardless of sub-agent depth
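
A minimal sketch of the isolation boundary, assuming a hypothetical `ParentAgent` class and a `stub_worker` in place of a real model call. The hard character cap is a crude backstop, not a substitute for the sub-agent producing a proper summary.

```python
from typing import Callable

Worker = Callable[[str, list[str]], str]


def run_subagent(task: str, inputs: list[str], work: Worker, max_result_chars: int = 200) -> str:
    """Run a subtask with only its own inputs and return a bounded result."""
    result = work(task, inputs)  # the sub-agent sees the task and inputs, nothing else
    return result[:max_result_chars]  # hard cap on what flows back to the parent


class ParentAgent:
    """Holds task state and orchestration; never stores raw sub-agent inputs."""

    def __init__(self) -> None:
        self.state: dict[str, str] = {}

    def delegate(self, name: str, task: str, inputs: list[str], work: Worker) -> None:
        self.state[name] = run_subagent(task, inputs, work)


def stub_worker(task: str, inputs: list[str]) -> str:
    """Placeholder for a model call made with the sub-agent's own context."""
    total_chars = sum(len(doc) for doc in inputs)
    return f"{task}: reviewed {len(inputs)} inputs ({total_chars} chars)"


parent = ParentAgent()
parent.delegate("pricing", "Summarize pricing pages", ["a" * 5_000, "b" * 7_000], stub_worker)
print(parent.state)
```

The sub-agent processed 12,000 characters, but the parent's state grew by one short line.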

When context hygiene is not enough

If costs remain high after context discipline:


Tokenminning · Built by Narev