Model routing
Send each request to the cheapest model that meets your quality bar — escalate to more capable models only when cheaper options fail.
This implements Article II: the routing mandate. For the conceptual overview, see "What is model selection?"
Expected impact
| Approach | Typical savings |
|---|---|
| Switch from frontier to mid-tier model | 60–90% |
| Use a smaller model for simple tasks | 80–95% |
| Route by task complexity (blended) | 40–70% |
Savings depend on your task mix. Classification, extraction, and formatting tasks rarely need frontier models.
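To make the dependence on task mix concrete, here is a small worked sketch. The traffic shares and per-tier savings are hypothetical illustration values, not measurements, and `blendedSavings` is a name introduced here for the example. It assumes every request would cost roughly the same at the frontier baseline.

```typescript
// Hypothetical traffic mix: share of requests per tier and the savings each
// tier achieves relative to sending everything to a frontier model.
interface TierShare {
  tier: string;
  share: number;   // fraction of total requests; shares must sum to 1
  savings: number; // fractional cost reduction vs. frontier (0 = none)
}

// Blended savings is the share-weighted average of the per-tier savings.
function blendedSavings(mix: TierShare[]): number {
  const totalShare = mix.reduce((sum, t) => sum + t.share, 0);
  if (Math.abs(totalShare - 1) > 1e-9) {
    throw new Error(`shares must sum to 1, got ${totalShare}`);
  }
  return mix.reduce((sum, t) => sum + t.share * t.savings, 0);
}

// Assumed mix: half the traffic is simple, 30% still needs a frontier model.
const exampleMix: TierShare[] = [
  { tier: "small", share: 0.5, savings: 0.85 },
  { tier: "mid", share: 0.2, savings: 0.6 },
  { tier: "frontier", share: 0.3, savings: 0 },
];

const blended = blendedSavings(exampleMix); // 0.5*0.85 + 0.2*0.6 + 0.3*0
```

With these assumed numbers the result is about 0.545, roughly 54.5% blended savings. Moving more traffic to the small tier raises the figure; a mix dominated by frontier-only tasks lowers it.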
Default to the smallest capable model
All tasks default to the smallest capable model. The routing layer — not application code — owns initial model selection.
Application code requests a capability (e.g., classify-intent, generate-summary), and the router maps it to the cheapest model that meets the SLA.
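One way the capability-to-model mapping could look is sketched below. Everything in it is an assumption made for illustration: the model names, prices, quality scores, SLA thresholds, and the `selectModel` function are hypothetical, and a real router would load them from benchmark results rather than hardcode them.

```typescript
// Capabilities the application may request (taken from the examples above).
type Capability = "classify-intent" | "generate-summary";

// Hypothetical catalog: names, prices, and quality scores are placeholders,
// not real pricing or benchmark data.
interface ModelEntry {
  model: string;
  costPerMillionTokens: number;        // USD, illustrative
  quality: Record<Capability, number>; // benchmark score per capability, 0..1
}

const catalog: ModelEntry[] = [
  { model: "small-model", costPerMillionTokens: 0.2, quality: { "classify-intent": 0.96, "generate-summary": 0.81 } },
  { model: "mid-model", costPerMillionTokens: 2, quality: { "classify-intent": 0.97, "generate-summary": 0.92 } },
  { model: "frontier-model", costPerMillionTokens: 15, quality: { "classify-intent": 0.98, "generate-summary": 0.95 } },
];

// Minimum acceptable quality per capability (the SLA), also assumed.
const sla: Record<Capability, number> = {
  "classify-intent": 0.95,
  "generate-summary": 0.9,
};

// Return the cheapest model whose benchmarked quality meets the SLA.
function selectModel(capability: Capability): string {
  const eligible = catalog
    .filter((m) => m.quality[capability] >= sla[capability])
    .sort((a, b) => a.costPerMillionTokens - b.costPerMillionTokens);
  const cheapest = eligible[0];
  if (cheapest === undefined) {
    throw new Error(`no model meets the SLA for ${capability}`);
  }
  return cheapest.model;
}
```

Under these assumed scores, `selectModel("classify-intent")` picks the small model, while `selectModel("generate-summary")` picks the mid-tier model because the small model falls below the summary SLA. Keeping this table in the router means a model swap is a config change rather than an application change.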
Prohibited:
```typescript
const response = await openai.chat.completions.create({
  model: "gpt-4", // hardcoded frontier model
  messages: [...],
});
```
Required:
```typescript
const response = await router.complete({
  capability: "classify-intent",
  messages: [...],
});
// router maps classify-intent → cheapest model passing quality SLA
```
Cascade routing
Try the cheap model first. Escalate only if quality checks fail. Never default upward.
```text
Request → Small model
  ↓ quality check passes → return
  ↓ quality check fails → Mid-tier model
      ↓ passes → return (log escalation)
      ↓ fails → Frontier model (log justification)
```
Every escalation must be logged with:
- The cheaper model attempted
- Why it failed (quality score, classifier signal, user tier)
- The more capable model selected and its cost delta
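The cascade and its logging requirement can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the `Tier` and `EscalationLog` shapes, the `cascade` function, and the idea of a numeric quality score compared against `minScore` are all hypothetical. Model calls are injected as functions so the sketch does not depend on any provider SDK.

```typescript
// One rung of the cascade; `call` stands in for a real model invocation.
interface Tier {
  model: string;
  costPerCall: number; // illustrative USD estimate
  call: (prompt: string) => Promise<string>;
}

// The three fields every escalation must record.
interface EscalationLog {
  attemptedModel: string;
  failureReason: string;
  selectedModel: string;
  costDelta: number;
}

// Try tiers from cheapest to most capable. Escalate only when the quality
// score falls below minScore, and log each escalation as it happens.
async function cascade(
  prompt: string,
  tiers: Tier[], // ordered cheapest first
  score: (output: string) => number,
  minScore: number,
  log: EscalationLog[],
): Promise<{ model: string; output: string; passed: boolean }> {
  let previous: { tier: Tier; reason: string } | undefined;
  let last: { model: string; output: string; passed: boolean } | undefined;
  for (const tier of tiers) {
    if (previous !== undefined) {
      log.push({
        attemptedModel: previous.tier.model,
        failureReason: previous.reason,
        selectedModel: tier.model,
        costDelta: tier.costPerCall - previous.tier.costPerCall,
      });
    }
    const output = await tier.call(prompt);
    const s = score(output);
    last = { model: tier.model, output, passed: s >= minScore };
    if (last.passed) return last;
    previous = { tier, reason: `quality score ${s} below ${minScore}` };
  }
  if (last === undefined) throw new Error("cascade requires at least one tier");
  return last; // most capable tier's output, flagged as not passing
}
```

If the most capable tier also fails, this sketch returns its output with `passed: false` and leaves the decision to the caller; throwing instead would be an equally reasonable design.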
Frontier model justification
Routing to a frontier model requires programmatic justification. Acceptable signals:
- A complexity classifier returned score > threshold
- A cheaper model attempt failed quality checks (with the failure logged)
- The user explicitly selected a "high quality" tier (with corresponding billing)
Unacceptable justification:
- Developer preference
- "The prompt is long"
- Absence of a routing layer entirely
Every frontier model call must be auditable. If you cannot produce the justification log for a given inference, the call should not have happened.
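One possible way to enforce this in code is to make a justification a required argument of any frontier call. The sketch below models the three acceptable signals as a discriminated union so that "developer preference" cannot be expressed at all. The type names, field names (`failureLogId`, `billingPlanId`), and the in-memory `auditLog` are assumptions for illustration; a real system would write to durable storage.

```typescript
// The three acceptable signals, and nothing else.
type FrontierJustification =
  | { kind: "complexity-classifier"; score: number; threshold: number }
  | { kind: "cheaper-model-failed"; attemptedModel: string; failureLogId: string }
  | { kind: "user-tier"; tier: "high-quality"; billingPlanId: string };

interface AuditRecord {
  requestId: string;
  justification: FrontierJustification;
  timestamp: string; // ISO 8601
}

// In-memory stand-in for an audit store (hypothetical).
const auditLog: AuditRecord[] = [];

// Call this before any frontier inference. It throws if the classifier
// signal does not actually exceed its threshold, and otherwise records the
// justification so the call can be audited later.
function authorizeFrontierCall(
  requestId: string,
  justification: FrontierJustification,
): AuditRecord {
  if (
    justification.kind === "complexity-classifier" &&
    !(justification.score > justification.threshold)
  ) {
    throw new Error("complexity score does not exceed threshold");
  }
  const record: AuditRecord = {
    requestId,
    justification,
    timestamp: new Date().toISOString(),
  };
  auditLog.push(record);
  return record;
}
```

A frontier client wrapper could then accept only an `AuditRecord`, which makes an unjustified call a compile-time error rather than a policy violation discovered later.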
Benchmark before you switch
Do not swap models based on blog post savings claims. Benchmark on your data:
- Collect 1,000 real inputs from your highest-volume use case
- Run parallel tests across candidate models
- Compare quality, cost, latency, consistency, and edge-case behavior
- Choose the cheapest model that meets your quality bar
| Dimension | What to measure |
|---|---|
| Quality | Manual review or automated eval against your rubric |
| Cost | Actual tokens consumed × model pricing |
| Latency | P50, P95, P99 response times |
| Consistency | Output variance across identical inputs |
| Edge cases | Behavior on unusual or malformed inputs |
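The cost and latency rows can be computed from raw benchmark samples along these lines. This is a sketch under assumptions: the `Sample` and `Pricing` shapes are hypothetical, prices are expressed per million tokens, and percentiles use the nearest-rank method (other definitions interpolate and give slightly different values).

```typescript
interface Sample {
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
}

interface Pricing {
  inputPerMillion: number;  // USD per 1M input tokens (placeholder values)
  outputPerMillion: number; // USD per 1M output tokens
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  if (p < 0 || p > 100) throw new Error("p must be between 0 and 100");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Cost = actual tokens consumed × model pricing, summed over all samples.
function summarize(samples: Sample[], pricing: Pricing) {
  const latencies = samples.map((s) => s.latencyMs);
  const totalCost = samples.reduce(
    (sum, s) =>
      sum +
      (s.inputTokens * pricing.inputPerMillion +
        s.outputTokens * pricing.outputPerMillion) / 1e6,
    0,
  );
  return {
    p50: percentile(latencies, 50),
    p95: percentile(latencies, 95),
    p99: percentile(latencies, 99),
    totalCost,
    costPerRequest: totalCost / samples.length,
  };
}
```

Running `summarize` once per candidate model over the same inputs gives directly comparable rows; quality and consistency still need a rubric or eval harness, which this sketch does not cover.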
Narev provides model pricing data, cost calculation, and A/B testing to compare models on real workloads.
Routing strategies
| Strategy | Best for | Implementation |
|---|---|---|
| Task-based routing | Known task types with different complexity | Map capabilities to model tiers |
| Cascade routing | Unknown complexity, quality-critical | Cheap first, escalate on failure |
| User-tier routing | Freemium or tiered products | Free → cheap model; paid → premium |
Combine strategies: task-based routing for known patterns, cascade for ambiguous inputs.
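A combined dispatcher might look like the sketch below: known capabilities resolve through a static task map, and anything else enters the cascade at the cheapest tier. The map contents, model names, and the `planRoute` function are hypothetical and exist only to show the shape of the decision.

```typescript
// Outcome of the routing decision.
type RoutePlan =
  | { strategy: "task-based"; model: string }
  | { strategy: "cascade"; startModel: string };

// Known task types mapped to the tier that benchmarks showed is sufficient.
// Entries are placeholders, not recommendations.
const taskTiers = new Map<string, string>([
  ["classify-intent", "small-model"],
  ["generate-summary", "mid-model"],
]);

// Task-based routing for known patterns, cascade for ambiguous inputs.
function planRoute(capability: string | undefined): RoutePlan {
  const mapped = capability === undefined ? undefined : taskTiers.get(capability);
  if (mapped !== undefined) {
    return { strategy: "task-based", model: mapped };
  }
  // Unknown or missing capability: start cheap and escalate on failure.
  return { strategy: "cascade", startModel: "small-model" };
}
```

User-tier routing could be layered on top by choosing a different task map per plan, but that is left out here to keep the sketch short.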
When routing is not enough
If you have routed to the cheapest capable model and costs are still high:
- Input context is bloated → Context hygiene
- Same prefix reprocessed every request → Prompt caching
- System prompts are verbose → Prompt hygiene