Model routing

Send each request to the cheapest model that meets your quality bar — escalate to more capable models only when cheaper options fail.

This implements Article II: the routing mandate. For the conceptual overview, see "What is model selection?"

Expected impact

Approach	Typical savings
Switch from frontier to mid-tier model	60–90%
Use a smaller model for simple tasks	80–95%
Route by task complexity (blended)	40–70%

Savings depend on your task mix. Classification, extraction, and formatting tasks rarely need frontier models.
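To see how a blended figure arises from a task mix, here is a small worked calculation. The traffic shares and relative costs are hypothetical placeholders, not measured data; substitute your own numbers.

```typescript
// Each entry: what fraction of requests goes to a tier, and what a request
// on that tier costs relative to the frontier model (frontier = 1).
// All numbers below are illustrative assumptions.
interface RouteShare {
  tier: string;
  share: number;        // fraction of total requests, 0..1
  relativeCost: number; // cost per request relative to frontier
}

// Savings versus sending everything to the frontier model.
function blendedSavings(mix: RouteShare[]): number {
  const totalShare = mix.reduce((sum, m) => sum + m.share, 0);
  if (Math.abs(totalShare - 1) > 1e-9) {
    throw new Error("traffic shares must sum to 1");
  }
  const blendedCost = mix.reduce((sum, m) => sum + m.share * m.relativeCost, 0);
  return 1 - blendedCost;
}

const exampleMix: RouteShare[] = [
  { tier: "small", share: 0.5, relativeCost: 0.05 },
  { tier: "mid", share: 0.2, relativeCost: 0.2 },
  { tier: "frontier", share: 0.3, relativeCost: 1 },
];

// 0.5 * 0.05 + 0.2 * 0.2 + 0.3 * 1 = 0.365 blended cost, so about 63.5% savings.
const savings = blendedSavings(exampleMix);
```

The share of traffic that still needs the frontier model dominates the result, which is why savings vary so much with task mix.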

Default to the smallest capable model

All tasks default to the smallest capable model. The routing layer — not application code — owns initial model selection.

Application code requests a capability (e.g., classify-intent, generate-summary), and the router maps it to the cheapest model that meets the SLA.

Prohibited:

const response = await openai.chat.completions.create({
  model: "gpt-4",  // hardcoded frontier model
  messages: [...],
});

Required:

const response = await router.complete({
  capability: "classify-intent",
  messages: [...],
});
// router maps classify-intent → cheapest model passing quality SLA
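A minimal sketch of how a router might resolve a capability to a model. The registry shape, the model names (`small-model`, `mid-model`, `frontier-model`), the prices, and the quality scores are hypothetical placeholders, not a real library API or real pricing.

```typescript
// Hypothetical registry of models evaluated for each capability.
// Names, prices, and scores are illustrative placeholders.
interface ModelEntry {
  model: string;
  costPer1kTokens: number; // USD, assumed
  qualityScore: number;    // from your own evals, 0..1
}

interface CapabilitySpec {
  minQuality: number; // the quality SLA for this capability
  candidates: ModelEntry[];
}

const registry: Record<string, CapabilitySpec> = {
  "classify-intent": {
    minQuality: 0.9,
    candidates: [
      { model: "frontier-model", costPer1kTokens: 0.03, qualityScore: 0.98 },
      { model: "small-model", costPer1kTokens: 0.0005, qualityScore: 0.93 },
      { model: "mid-model", costPer1kTokens: 0.003, qualityScore: 0.96 },
    ],
  },
  "generate-summary": {
    minQuality: 0.95,
    candidates: [
      { model: "small-model", costPer1kTokens: 0.0005, qualityScore: 0.88 },
      { model: "mid-model", costPer1kTokens: 0.003, qualityScore: 0.96 },
    ],
  },
};

// Return the cheapest model whose measured quality meets the capability's SLA.
function resolveModel(capability: string): string {
  const spec = registry[capability];
  if (!spec) {
    throw new Error(`Unknown capability: ${capability}`);
  }
  const passing = spec.candidates
    .filter((c) => c.qualityScore >= spec.minQuality)
    .sort((a, b) => a.costPer1kTokens - b.costPer1kTokens);
  if (passing.length === 0) {
    throw new Error(`No model meets the SLA for ${capability}`);
  }
  return passing[0].model;
}
```

Under these assumed numbers, `resolveModel("classify-intent")` picks `small-model`, while `resolveModel("generate-summary")` picks `mid-model` because the small model falls below that capability's quality bar.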

Cascade routing

Try the cheap model first. Escalate only if quality checks fail. Never default upward.

Request → Small model
            ↓ quality check passes → return
            ↓ quality check fails → Mid-tier model
                                    ↓ passes → return (log escalation)
                                    ↓ fails → Frontier model (log justification)

Every escalation must be logged with:

The cheaper model attempted
Why the request escalated (quality score, classifier signal, user tier)
The model escalated to and its cost delta
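The cascade and its logging requirements can be sketched as follows. The `Tier` and `EscalationLog` shapes, the flat per-call cost, and the injected `check` function are assumptions made for illustration; a real quality check and cost model will differ.

```typescript
// Hypothetical tier: wraps a model call with an assumed flat cost per call.
interface Tier {
  model: string;
  costPerCall: number; // USD, assumed for illustration
  call: (prompt: string) => Promise<string>;
}

// One record per escalation: what was tried, why it failed, what came next.
interface EscalationLog {
  attemptedModel: string;
  failureReason: string;
  escalatedTo: string;
  costDelta: number;
}

interface CascadeResult {
  model: string;
  output: string;
  escalations: EscalationLog[];
}

// Try tiers from cheapest to most expensive and stop at the first output
// that passes the quality check. If every tier fails, the last tier's
// output is returned along with the full escalation trail.
async function cascade(
  prompt: string,
  tiers: Tier[], // ordered cheapest first
  check: (output: string) => { ok: boolean; reason: string }
): Promise<CascadeResult> {
  const escalations: EscalationLog[] = [];
  for (let i = 0; i < tiers.length; i++) {
    const tier = tiers[i];
    const output = await tier.call(prompt);
    const verdict = check(output);
    const isLast = i === tiers.length - 1;
    if (verdict.ok || isLast) {
      return { model: tier.model, output, escalations };
    }
    const next = tiers[i + 1];
    escalations.push({
      attemptedModel: tier.model,
      failureReason: verdict.reason,
      escalatedTo: next.model,
      costDelta: next.costPerCall - tier.costPerCall,
    });
  }
  throw new Error("cascade requires at least one tier");
}
```

Because the tiers are passed in cheapest first and the loop only moves forward on a failed check, the sketch cannot default upward: a more expensive model is reached only through a logged escalation.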

Frontier model justification

Routing to a frontier model requires programmatic justification. Acceptable signals:

A complexity classifier returned a score above the configured threshold
A cheaper model attempt failed quality checks (with the failure logged)
The user explicitly selected a "high quality" tier (with corresponding billing)

Unacceptable justifications:

Developer preference
"The prompt is long"
Absence of a routing layer entirely

Every frontier model call must be auditable. If you cannot produce the justification log for a given inference, the call should not have happened.
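One way to make the justification programmatic is to require a typed signal before any frontier call and to record it in an audit log. The sketch below is an assumption about how that gate could look: the `FrontierSignal` union, the `"high-quality"` tier name, and the in-memory `auditLog` are all hypothetical.

```typescript
// Hypothetical signal types mirroring the three acceptable justifications.
type FrontierSignal =
  | { kind: "complexity"; score: number; threshold: number }
  | { kind: "cheaper-model-failed"; attemptedModel: string; failureLogId: string }
  | { kind: "user-tier"; tier: string; billed: boolean };

interface Justification {
  requestId: string;
  signal: FrontierSignal;
  recordedAt: string; // ISO 8601 timestamp
}

// In-memory stand-in for an audit log; a real system would persist this.
const auditLog: Justification[] = [];

function isAcceptable(signal: FrontierSignal): boolean {
  switch (signal.kind) {
    case "complexity":
      return signal.score > signal.threshold;
    case "cheaper-model-failed":
      // The cheaper model's failure must itself have been logged.
      return signal.failureLogId.length > 0;
    case "user-tier":
      // Assumed tier name; the selection must come with corresponding billing.
      return signal.tier === "high-quality" && signal.billed;
  }
}

// Record the justification before the call, or refuse the frontier call.
function authorizeFrontierCall(requestId: string, signal: FrontierSignal): Justification {
  if (!isAcceptable(signal)) {
    throw new Error(`Frontier call ${requestId} refused: no acceptable justification`);
  }
  const entry: Justification = {
    requestId,
    signal,
    recordedAt: new Date().toISOString(),
  };
  auditLog.push(entry);
  return entry;
}
```

Since developer preference or prompt length cannot be expressed as a `FrontierSignal`, those justifications are rejected at the type level rather than by convention.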

Benchmark before you switch

Do not swap models based on savings claims from blog posts. Benchmark on your data:

Collect 1,000 real inputs from your highest-volume use case
Run parallel tests across candidate models
Compare quality, cost, latency, consistency, and edge-case behavior
Choose the cheapest model that meets your quality bar

Dimension	What to measure
Quality	Manual review or automated eval against your rubric
Cost	Actual tokens consumed × model pricing
Latency	P50, P95, P99 response times
Consistency	Output variance across identical inputs
Edge cases	Behavior on unusual or malformed inputs
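The cost and latency rows, and the final selection step, can be computed along these lines. This is a sketch under stated assumptions: the `BenchmarkRun` shape is hypothetical, prices are taken per million tokens, and percentiles use the nearest-rank method. Quality scores are assumed to come from your own review or eval.

```typescript
// Hypothetical summary of one model's run over the benchmark inputs.
interface BenchmarkRun {
  model: string;
  qualityScore: number;       // fraction of outputs passing your rubric, 0..1
  inputTokens: number;        // total across the benchmark set
  outputTokens: number;
  inputPricePerMTok: number;  // USD per million tokens, assumed unit
  outputPricePerMTok: number;
  latenciesMs: number[];
}

// Cost = actual tokens consumed × model pricing.
function totalCost(run: BenchmarkRun): number {
  return (
    (run.inputTokens / 1_000_000) * run.inputPricePerMTok +
    (run.outputTokens / 1_000_000) * run.outputPricePerMTok
  );
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error("no latency samples");
  }
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Choose the cheapest model that meets both the quality bar and a P95 latency budget.
function cheapestPassing(
  runs: BenchmarkRun[],
  minQuality: number,
  maxP95Ms: number
): BenchmarkRun | undefined {
  const passing = runs.filter(
    (r) => r.qualityScore >= minQuality && percentile(r.latenciesMs, 95) <= maxP95Ms
  );
  passing.sort((a, b) => totalCost(a) - totalCost(b));
  return passing[0];
}
```

Consistency and edge-case behavior are not modeled here; they would need repeated runs on identical inputs and a dedicated set of malformed inputs.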

Narev provides model pricing data, cost calculation, and A/B testing to compare models on real workloads.

Routing strategies

Strategy	Best for	Implementation
Task-based routing	Known task types with different complexity	Map capabilities to model tiers
Cascade routing	Unknown complexity, quality-critical	Cheap first, escalate on failure
User-tier routing	Freemium or tiered products	Free → cheap model; paid → premium

Combine strategies: task-based routing for known patterns, cascade for ambiguous inputs.
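A sketch of that combination: known capabilities resolve directly to a tier, and anything else enters the cascade at the cheapest tier. The capability names in `knownTasks` (including `extract-fields`) and the tier labels are hypothetical.

```typescript
type ModelTier = "small" | "mid" | "frontier";

type RoutePlan =
  | { strategy: "task-based"; tier: ModelTier }
  | { strategy: "cascade"; startAt: ModelTier };

// Hypothetical mapping of known capabilities to tiers, derived from your benchmarks.
const knownTasks: Record<string, ModelTier> = {
  "classify-intent": "small",
  "extract-fields": "small",
  "generate-summary": "mid",
};

// Known task types use their mapped tier; ambiguous inputs start a cascade
// at the cheapest tier and escalate only on failure.
function planRoute(capability: string): RoutePlan {
  if (Object.prototype.hasOwnProperty.call(knownTasks, capability)) {
    return { strategy: "task-based", tier: knownTasks[capability] };
  }
  return { strategy: "cascade", startAt: "small" };
}
```

The own-property check keeps inherited keys such as `toString` from being mistaken for a known capability.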

When routing is not enough

If you have routed to the cheapest capable model and costs are still high:

Input context is bloated → Context hygiene
Same prefix reprocessed every request → Prompt caching
System prompts are verbose → Prompt hygiene
