Output and RAG
Control both sides of the inference call. Output tokens are often priced 3–5× higher than input, and RAG systems can silently inflate input costs by retrieving far more context than the model needs.
This implements Article V: prompt schema standards (output control) and Article III: context window sovereignty (retrieval discipline).
Expected impact
| Technique | Typical savings |
|---|---|
| max_tokens + structured output | 15–30% on verbose endpoints |
| Minimum viable RAG context | 20–40% on over-fetching pipelines |
| Schema length constraints | Prevents runaway generation |
Output control
Set max_tokens everywhere
If max_tokens is missing, the model decides how long to generate. For structured tasks, that makes cost unpredictable.
{
"model": "gpt-4o-mini",
"max_tokens": 256,
"response_format": { "type": "json_schema", "json_schema": { ... } }
}
Set max_tokens based on your longest acceptable response, not the model's maximum.
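One way to make the cap hard to forget is to route every request through a small guard. The following is a minimal sketch, not any SDK's API: `with_token_cap` and `DEFAULT_CAPS` are hypothetical names, the per-task caps are illustrative values, and the payload is assumed to be a plain dict shaped like the request above.

```python
# Hypothetical per-task caps; tune each to your longest acceptable response.
DEFAULT_CAPS = {"classify": 16, "extract": 256, "summarize": 512}


def with_token_cap(payload: dict, task: str) -> dict:
    """Return a copy of payload that always carries a max_tokens value."""
    if task not in DEFAULT_CAPS:
        raise ValueError(f"no token cap defined for task {task!r}")
    capped = dict(payload)
    cap = DEFAULT_CAPS[task]
    # Keep a caller-supplied value only if it is at or below the task cap.
    capped["max_tokens"] = min(payload.get("max_tokens", cap), cap)
    return capped
```

Failing loudly on an unknown task is a deliberate choice here: a new endpoint has to declare its cap before it can send traffic.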
Structured output over free text
Free-text responses are verbose by default. Schema-enforced outputs constrain length and format:
| Approach | Output behavior |
|---|---|
| "Summarize this document" | 200–2,000 tokens, unpredictable |
| JSON Schema with maxLength on summary field | Bounded, validatable |
| Function calling with typed parameters | Minimal, machine-parseable |
Length constraints in schema
Enforce output bounds at the schema level, not in prose:
{
"properties": {
"summary": { "type": "string", "maxLength": 500 },
"tags": { "type": "array", "maxItems": 5, "items": { "type": "string" } }
}
}
This is more reliable than "keep your response concise" in the system prompt, and it saves the tokens those instructions would cost.
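Support for these keywords during generation can vary by provider, so checking the response against the same bounds is a cheap backstop. The sketch below hand-rolls that check for the two keywords used above. It is an illustration only, not a full JSON Schema validator, and `length_violations` is a hypothetical helper name. Note that maxLength counts characters, not tokens.

```python
SCHEMA = {
    "properties": {
        "summary": {"type": "string", "maxLength": 500},
        "tags": {"type": "array", "maxItems": 5, "items": {"type": "string"}},
    }
}


def length_violations(output: dict, schema: dict) -> list[str]:
    """List the fields in output that exceed maxLength or maxItems."""
    problems = []
    for name, rules in schema.get("properties", {}).items():
        value = output.get(name)
        # Strings are checked against maxLength (a character count).
        if isinstance(value, str) and len(value) > rules.get("maxLength", float("inf")):
            problems.append(f"{name}: {len(value)} chars > maxLength {rules['maxLength']}")
        # Arrays are checked against maxItems.
        if isinstance(value, list) and len(value) > rules.get("maxItems", float("inf")):
            problems.append(f"{name}: {len(value)} items > maxItems {rules['maxItems']}")
    return problems
```

An empty list means the response stayed within bounds; anything else can be logged or retried with a tighter max_tokens.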
RAG discipline
Retrieval-augmented generation is a common source of context inflation. The failure mode: a query that once needed 500 tokens of context now pulls 50,000 tokens of "relevant" documents.
Minimum viable context
Retrieve the smallest set of chunks that answers the query, not the largest set that might help.
| Pattern | Tokens sent | Quality |
|---|---|---|
| Top-20 chunks, no re-ranking | High, noisy | Variable |
| Top-5 chunks after re-ranking | Moderate | Better precision |
| Top-3 chunks, relevance threshold | Low | Sufficient for most queries |
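The last row of the table can be sketched as a selection step that applies a top-k limit, a relevance threshold, and a token budget together. Everything here is an assumption for illustration: the `Chunk` type, the `select_context` name, and the default values for `k`, `min_score`, and `token_budget`. Scores and token counts are assumed to come from your own retriever and tokenizer.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    text: str
    score: float  # relevance score from the retriever or re-ranker, higher is better
    tokens: int   # token count, precomputed with whatever tokenizer you use


def select_context(chunks, k=3, min_score=0.75, token_budget=1500):
    """Keep at most k chunks scoring at least min_score without exceeding token_budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if len(selected) == k or chunk.score < min_score:
            break  # sorted by score, so nothing later can qualify
        if used + chunk.tokens > token_budget:
            continue  # skip an oversized chunk; a smaller one may still fit
        selected.append(chunk)
        used += chunk.tokens
    return selected
```

The threshold means a query with no good matches sends no context at all, rather than padding the prompt with the least-bad chunks.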
Chunking strategy
How you split documents affects total tokens retrieved:
- Fixed-size chunks (e.g., 512 tokens): simple but may split concepts across boundaries
- Semantic chunking: splits on meaning, preserving complete semantic units — fewer total chunks needed for the same answer quality
- Overlapping chunks: increases redundancy; use sparingly
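To see what overlap costs, the sketch below implements fixed-size chunking with an optional overlap and counts the tokens that end up stored. It assumes the document is already tokenized into a list; `fixed_size_chunks` and `stored_tokens` are hypothetical helper names, and semantic chunking is not shown because it depends on the splitter you choose.

```python
def fixed_size_chunks(tokens, size=512, overlap=0):
    """Split a token list into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def stored_tokens(chunks):
    """Total tokens across all chunks, counting overlapped tokens each time they appear."""
    return sum(len(chunk) for chunk in chunks)
```

For a 2,048-token document, a 128-token overlap yields 5 chunks holding 2,560 tokens instead of 4 chunks holding 2,048, which is 25% more material that can be retrieved and sent.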
Re-ranking
Initial vector retrieval casts a wide net. A re-ranker narrows to the most relevant results before sending to the LLM:
Query → Vector search (top 20) → Re-ranker (top 3) → LLM
This typically reduces input tokens 3–5× with equal or better answer quality.
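The second stage of that pipeline can be sketched as a function that re-scores the vector-search candidates and keeps only the best few. This is a sketch under stated assumptions: `rerank` is a hypothetical name, `score_fn` stands in for a real re-ranking model, and `term_overlap` is a toy scorer included only so the example runs without external dependencies.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-score candidates against the query and keep the top_n best."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]


def term_overlap(query, doc):
    """Toy relevance score: fraction of query terms that appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0
```

In a real pipeline, `candidates` would be the top-20 results from the vector store and `score_fn` would call the re-ranker; only the returned subset is placed in the prompt.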
The 500-to-50,000 problem
A common RAG failure pattern:
| Stage | Tokens |
|---|---|
| Original simple query | ~500 |
| + Full document retrieval (10 chunks × 500 tokens) | ~5,500 |
| + Conversation history (agent loop, 10 turns) | ~25,000 |
| + Tool outputs appended raw | ~50,000+ |
Each layer compounds. Fix retrieval scope and context hygiene together:
- RAG discipline reduces what you retrieve
- Context hygiene reduces what you retain across turns
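The compounding in the table can be made explicit with a running total, which also shows which layer first breaks a given input budget. In this sketch the per-layer figures are the differences between consecutive table rows (the last one is a lower bound, since the table says 50,000+), and `running_totals`, `first_over_budget`, and the budget in the usage note are hypothetical.

```python
from itertools import accumulate

# Per-layer additions implied by the table above (each row minus the previous one).
LAYERS = [
    ("query", 500),
    ("retrieved chunks (10 x 500)", 5_000),
    ("conversation history", 19_500),
    ("raw tool outputs", 25_000),  # lower bound
]


def running_totals(layers):
    """Pair each layer name with the cumulative input tokens after adding it."""
    totals = accumulate(tokens for _, tokens in layers)
    return [(name, total) for (name, _), total in zip(layers, totals)]


def first_over_budget(layers, budget):
    """Return the name of the first layer that pushes the input past budget, or None."""
    for name, total in running_totals(layers):
        if total > budget:
            return name
    return None
```

With an 8,000-token input budget, `first_over_budget(LAYERS, 8_000)` points at the conversation history, which is where context hygiene would need to start.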
When this is not enough
- Input dominated by system prompt + tools → Prompt hygiene and Prompt caching
- Wrong model for the task → Model routing
- Repetitive queries hitting the LLM every time → Semantic caching