Output and RAG

Control both sides of the inference call. Output tokens are often priced 3–5× higher than input, and RAG systems can silently inflate input costs by retrieving far more context than the model needs.
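
To make the pricing asymmetry concrete, here is a minimal cost sketch. The prices are hypothetical placeholders (output set to 4× input), not any provider's actual rates, and `call_cost` is an illustrative helper rather than part of any SDK.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the cost of one call given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical prices per million tokens: output priced 4x input.
IN_PRICE, OUT_PRICE = 1.00, 4.00

# Same 2,000-token prompt, with an unbounded versus a bounded response.
verbose = call_cost(2_000, 1_500, IN_PRICE, OUT_PRICE)
bounded = call_cost(2_000, 300, IN_PRICE, OUT_PRICE)
```

Under these assumed prices the verbose call costs 2.5× the bounded one, even though the input is identical.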

This implements Article V: prompt schema standards (output control) and Article III: context window sovereignty (retrieval discipline).

Expected impact

Technique	Typical savings
`max_tokens` + structured output	15–30% on verbose endpoints
Minimum viable RAG context	20–40% on over-fetching pipelines
Schema length constraints	Prevents runaway generation

Output control

Set max_tokens everywhere

If max_tokens is missing, the model decides how long to generate, up to its own output limit. For structured tasks, that makes cost unpredictable.

{
  "model": "gpt-4o-mini",
  "max_tokens": 256,
  "response_format": { "type": "json_schema", "json_schema": { ... } }
}

Set max_tokens based on your longest acceptable response, not the model's maximum.
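
One way to make sure no call goes out without a cap is to build request bodies through a single helper. The sketch below assumes a per-task budget table (`OUTPUT_BUDGETS`, with made-up task names and values) and assumes the API reports a `finish_reason` of `"length"` when the cap is hit, as OpenAI-style chat completion responses do.

```python
from typing import Any

# Hypothetical per-task output budgets, in tokens.
OUTPUT_BUDGETS = {"classify": 16, "extract": 256, "summarize": 512}
DEFAULT_BUDGET = 256


def build_request(task: str, messages: list[dict[str, str]],
                  model: str = "gpt-4o-mini") -> dict[str, Any]:
    """Build a request body that always carries an explicit max_tokens."""
    return {
        "model": model,
        "max_tokens": OUTPUT_BUDGETS.get(task, DEFAULT_BUDGET),
        "messages": messages,
    }


def was_truncated(finish_reason: str) -> bool:
    """True when generation stopped because it hit the token cap."""
    return finish_reason == "length"
```

Tracking how often `was_truncated` fires per task is a simple signal for whether a budget is set too low.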

Structured output over free text

Free-text responses are verbose by default. Schema-enforced outputs constrain length and format:

Approach	Output behavior
"Summarize this document"	200–2,000 tokens, unpredictable
JSON Schema with `maxLength` on summary field	Bounded, validatable
Function calling with typed parameters	Minimal, machine-parseable

Length constraints in schema

Enforce output bounds at the schema level, not in prose:

{
  "properties": {
    "summary": { "type": "string", "maxLength": 500 },
    "tags": { "type": "array", "maxItems": 5, "items": { "type": "string" } }
  }
}

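This is more reliable than "keep your response concise" in the system prompt, and it saves the tokens those instructions would cost. Note that maxLength counts characters, not tokens, so size the bound accordingly. Some providers' structured-output modes do not enforce every JSON Schema keyword during generation, so validate the parsed response against the schema as well.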
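
Because schema keywords may not be enforced at generation time by every provider, a client-side check is a cheap backstop. The sketch below uses only the standard library and handles just the two keywords from the schema above; `bound_violations` is a hypothetical helper, and a full validator library would be the better choice in production.

```python
import json

SCHEMA = {
    "properties": {
        "summary": {"type": "string", "maxLength": 500},
        "tags": {"type": "array", "maxItems": 5, "items": {"type": "string"}},
    }
}


def bound_violations(raw_output: str, schema: dict = SCHEMA) -> list[str]:
    """Return the names of fields that exceed maxLength or maxItems."""
    data = json.loads(raw_output)
    violations = []
    for name, rules in schema["properties"].items():
        value = data.get(name)
        # maxLength is measured in characters, which len() matches for str.
        if isinstance(value, str) and len(value) > rules.get("maxLength", float("inf")):
            violations.append(name)
        if isinstance(value, list) and len(value) > rules.get("maxItems", float("inf")):
            violations.append(name)
    return violations
```

An empty list means the response is within bounds; anything else can be rejected or retried before it reaches downstream code.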

RAG discipline

Retrieval-augmented generation is a common source of context inflation. The failure mode: a query that once needed 500 tokens of context now pulls 50,000 tokens of "relevant" documents.

Minimum viable context

Retrieve the smallest set of chunks that answers the query, not the largest set that might help.

Pattern	Tokens sent	Quality
Top-20 chunks, no re-ranking	High, noisy	Variable
Top-5 chunks after re-ranking	Moderate	Better precision
Top-3 chunks, relevance threshold	Low	Sufficient for most queries
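
The last row of the table can be sketched as a small selection function. The `Chunk` type, the 0.75 threshold, and the 1,500-token budget are illustrative assumptions; the right values depend on the retriever's score scale and the task.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float   # relevance score from the retriever, higher is better
    tokens: int    # token count of the chunk


def minimum_viable_context(chunks: list[Chunk], top_k: int = 3,
                           min_score: float = 0.75,
                           token_budget: int = 1500) -> list[Chunk]:
    """Keep at most top_k chunks above min_score that fit the token budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if chunk.score < min_score or len(selected) == top_k:
            break  # everything after this point is less relevant
        if used + chunk.tokens > token_budget:
            continue  # skip a chunk that would overflow, try smaller ones
        selected.append(chunk)
        used += chunk.tokens
    return selected
```

The threshold matters as much as the count: when nothing clears `min_score`, the function returns an empty list instead of padding the prompt with weak matches.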

Chunking strategy

How you split documents affects total tokens retrieved:

Fixed-size chunks (e.g., 512 tokens): simple but may split concepts across boundaries
Semantic chunking: splits on meaning and preserves complete semantic units, so fewer total chunks are needed for the same answer quality
Overlapping chunks: increase redundancy and therefore total tokens retrieved; use sparingly
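
To see how overlap adds redundancy, the sketch below splits a token list into fixed-size windows and counts the total tokens stored. It treats list items as tokens; a real pipeline would use the model's tokenizer, which is not shown here.

```python
def fixed_size_chunks(tokens: list[str], size: int = 512,
                      overlap: int = 0) -> list[list[str]]:
    """Split a token list into windows of `size`, sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


def stored_tokens(chunks: list[list[str]]) -> int:
    """Total tokens across all chunks, counting overlapped tokens twice."""
    return sum(len(c) for c in chunks)
```

For a 2,048-token document, 512-token chunks with no overlap store 2,048 tokens in 4 chunks, while a 128-token overlap stores 2,560 tokens in 5 chunks, a 25% increase in what retrieval can pull back.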

Re-ranking

Initial vector retrieval casts a wide net. A re-ranker narrows to the most relevant results before sending to the LLM:

Query → Vector search (top 20) → Re-ranker (top 3) → LLM

This typically reduces input tokens 3–5× compared with sending every first-stage result, with equal or better answer quality.
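
The second stage of the pipeline can be sketched with a pluggable scorer. The `lexical_overlap` function here is a toy stand-in, included only so the example runs on its own; a production re-ranker would typically be a cross-encoder or a hosted re-ranking model.

```python
from typing import Callable


def lexical_overlap(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words found in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def rerank(query: str, candidates: list[str], keep: int = 3,
           scorer: Callable[[str, str], float] = lexical_overlap) -> list[str]:
    """Re-score first-stage candidates and keep only the best `keep`."""
    ranked = sorted(candidates, key=lambda c: scorer(query, c), reverse=True)
    return ranked[:keep]
```

Only the output of `rerank` is placed in the prompt, so the wide first-stage search costs retrieval time but not input tokens.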

The 500-to-50,000 problem

A common RAG failure pattern:

Stage	Cumulative tokens
Original simple query	~500
+ Full document retrieval (10 chunks × 500 tokens)	~5,500
+ Conversation history (agent loop, 10 turns)	~25,000
+ Tool outputs appended raw	~50,000+

Each layer compounds. Fix retrieval scope and context hygiene together:

RAG discipline reduces what you retrieve
Context hygiene reduces what you retain across turns
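
The retention side can be sketched with two small helpers: one that keeps only the most recent turns that fit a token budget, and one that caps raw tool output before it is appended. Both function names, the budget, and the character cap are illustrative assumptions, and each turn is represented as a (text, token count) pair for simplicity.

```python
def trim_history(turns: list[tuple[str, int]], budget: int) -> list[tuple[str, int]]:
    """Keep the most recent turns whose token counts fit within budget."""
    kept, used = [], 0
    for text, tokens in reversed(turns):
        if used + tokens > budget:
            break  # older turns are dropped once the budget is reached
        kept.append((text, tokens))
        used += tokens
    return list(reversed(kept))


def truncate_tool_output(output: str, max_chars: int = 2000) -> str:
    """Cap a raw tool output before appending it to the context."""
    if len(output) <= max_chars:
        return output
    return output[:max_chars] + "\n[truncated]"
```

With ten turns of roughly 2,000 tokens each and a 5,000-token budget, `trim_history` keeps only the last two turns. Summarizing the dropped turns rather than discarding them is a common refinement that this sketch leaves out.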

When this is not enough

Input dominated by system prompt + tools → Prompt hygiene and Prompt caching
Wrong model for the task → Model routing
Repetitive queries hitting the LLM every time → Semantic caching
