Output and RAG

Output and RAG

Control both sides of the inference call. Output tokens are often priced 3–5× higher than input, and RAG systems can silently inflate input costs by retrieving far more context than the model needs.

This implements Article V: prompt schema standards (output control) and Article III: context window sovereignty (retrieval discipline).

Expected impact

TechniqueTypical savings
max_tokens + structured output15–30% on verbose endpoints
Minimum viable RAG context20–40% on over-fetching pipelines
Schema length constraintsPrevents runaway generation

Output control

Set max_tokens everywhere

If max_tokens is missing, the model decides how long to generate. For structured tasks, this is unpredictable cost.

{
  "model": "gpt-4o-mini",
  "max_tokens": 256,
  "response_format": { "type": "json_schema", "json_schema": { ... } }
}

Set max_tokens based on your longest acceptable response, not the model's maximum.

Structured output over free text

Free-text responses are verbose by default. Schema-enforced outputs constrain length and format:

ApproachOutput behavior
"Summarize this document"200–2,000 tokens, unpredictable
JSON Schema with maxLength on summary fieldBounded, validatable
Function calling with typed parametersMinimal, machine-parseable

Length constraints in schema

Enforce output bounds at the schema level, not in prose:

{
  "properties": {
    "summary": { "type": "string", "maxLength": 500 },
    "tags": { "type": "array", "maxItems": 5, "items": { "type": "string" } }
  }
}

This is more reliable than "keep your response concise" in the system prompt — and it saves the tokens those instructions would cost.

RAG discipline

Retrieval-augmented generation is a common source of context inflation. The failure mode: a query that once needed 500 tokens of context now pulls 50,000 tokens of "relevant" documents.

Minimum viable context

Retrieve the smallest set of chunks that answers the query. Not the largest set that might help.

PatternTokens sentQuality
Top-20 chunks, no re-rankingHigh, noisyVariable
Top-5 chunks after re-rankingModerateBetter precision
Top-3 chunks, relevance thresholdLowSufficient for most queries

Chunking strategy

How you split documents affects total tokens retrieved:

  • Fixed-size chunks (e.g., 512 tokens): simple but may split concepts across boundaries
  • Semantic chunking: splits on meaning, preserving complete semantic units — fewer total chunks needed for the same answer quality
  • Overlapping chunks: increases redundancy; use sparingly

Re-ranking

Initial vector retrieval casts a wide net. A re-ranker narrows to the most relevant results before sending to the LLM:

Query → Vector search (top 20) → Re-ranker (top 3) → LLM

This typically reduces input tokens 3–5× with equal or better answer quality.

The 500-to-50,000 problem

A common RAG failure pattern:

StageTokens
Original simple query~500
+ Full document retrieval (10 chunks × 500 tokens)~5,500
+ Conversation history (agent loop, 10 turns)~25,000
+ Tool outputs appended raw~50,000+

Each layer compounds. Fix retrieval scope and context hygiene together:

  • RAG discipline reduces what you retrieve
  • Context hygiene reduces what you retain across turns

When this is not enough


Tokenminning · Built by Narev