Model selection
Model selection is choosing the right large language model (LLM) for each task to balance output quality, inference cost, and response latency.
Not every feature needs your most capable (and most expensive) model. A product description generator, a classification task, and a complex reasoning agent have very different quality requirements. Routing each to an appropriate model can cut spend dramatically without hurting user experience.
Why it matters for cost
Frontier models cost significantly more per token than mid-tier and small models, yet benchmarks on your own data often show that a cheaper model still meets your quality bar. Typical savings by approach:
| Approach | Typical savings |
|---|---|
| Switch from frontier to mid-tier model | 60–90% |
| Use a smaller model for simple tasks | 80–95% |
| Route by task complexity | 40–70% (blended) |
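To see where a blended figure like the one in the last row can come from, here is a minimal sketch of the arithmetic. The prices and the 60% traffic share are hypothetical values chosen for illustration, not real vendor pricing, and `blended_savings` is a name introduced here.

```python
# Hypothetical prices in USD per million tokens (illustrative only).
FRONTIER_PRICE = 10.00
SMALL_PRICE = 1.00


def blended_savings(simple_share: float, frontier_price: float, small_price: float) -> float:
    """Fraction of spend saved when `simple_share` of tokens go to the small model.

    The baseline is sending every token to the frontier model.
    """
    if not 0.0 <= simple_share <= 1.0:
        raise ValueError("simple_share must be between 0 and 1")
    # Weighted average price per token after routing.
    blended_price = simple_share * small_price + (1.0 - simple_share) * frontier_price
    return 1.0 - blended_price / frontier_price


if __name__ == "__main__":
    # Assumption: 60% of token volume is simple enough for the small model.
    print(f"Blended savings: {blended_savings(0.6, FRONTIER_PRICE, SMALL_PRICE):.0%}")
```

With these assumed numbers the blended saving is about 54%: the small model is 90% cheaper, but only the routed share of traffic benefits from it.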
How to evaluate models
For your highest-volume use case, run each candidate model on the same set of 1,000 real inputs and compare:
- Quality: Manual review or automated evaluation against your rubric
- Cost: Actual tokens consumed × model pricing
- Latency: P50, P95, and P99 response times
- Consistency: Do outputs vary wildly, or are they stable?
- Edge cases: How does the model handle unusual or broken inputs?
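The quality, cost, and latency comparisons above could be summarized per model along these lines. This is a sketch under stated assumptions: `RunRecord`, `summarize`, and the per-million-token price parameters are hypothetical names, and the rubric verdict is assumed to come from your own manual or automated review. Consistency and edge-case checks are not covered here.

```python
import math
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One model call on one test input (hypothetical record shape)."""
    input_tokens: int
    output_tokens: int
    latency_s: float
    passed_rubric: bool  # verdict from manual or automated review


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records: list[RunRecord], input_price_per_m: float, output_price_per_m: float) -> dict:
    """Aggregate quality, cost, and latency for one model's test run."""
    # Cost = actual tokens consumed x model pricing (prices are per million tokens).
    cost = sum(
        r.input_tokens * input_price_per_m + r.output_tokens * output_price_per_m
        for r in records
    ) / 1_000_000
    latencies = [r.latency_s for r in records]
    return {
        "pass_rate": sum(r.passed_rubric for r in records) / len(records),
        "cost_usd": cost,
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "p99_s": percentile(latencies, 99),
    }
```

Running `summarize` once per candidate model on the same inputs gives rows that can be compared side by side.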
Common strategies
Instead of one model for everything, production teams typically use one of these patterns:
- Task-based routing: Simple queries → cheap model; complex queries → capable model
- Cascade routing: Try the cheap model first; escalate only if quality checks fail
- User-tier routing: Free users get a faster, cheaper model; paid users get premium models
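Cascade routing, for instance, might be sketched as follows. The models and the quality check are passed in as plain callables, so nothing here depends on a specific provider API. The stub functions are placeholders for illustration only, and a real quality check would be specific to your rubric.

```python
from typing import Callable

Model = Callable[[str], str]              # prompt -> completion
QualityCheck = Callable[[str, str], bool]  # (prompt, completion) -> acceptable?


def cascade(prompt: str, cheap_model: Model, capable_model: Model,
            passes_check: QualityCheck) -> tuple[str, str]:
    """Try the cheap model first; escalate only if the quality check fails.

    Returns the completion and the tier ("cheap" or "capable") that produced it.
    """
    draft = cheap_model(prompt)
    if passes_check(prompt, draft):
        return draft, "cheap"
    return capable_model(prompt), "capable"


# Placeholder stubs standing in for real model calls.
def cheap_stub(prompt: str) -> str:
    # Pretend the cheap model gives up on long prompts.
    return "" if len(prompt) > 20 else f"cheap: {prompt}"


def capable_stub(prompt: str) -> str:
    return f"capable: {prompt}"


def non_empty(prompt: str, completion: str) -> bool:
    # Trivial stand-in for a real quality check.
    return bool(completion.strip())


if __name__ == "__main__":
    print(cascade("short question", cheap_stub, capable_stub, non_empty))
    print(cascade("a much longer and more involved question", cheap_stub, capable_stub, non_empty))
```

One trade-off of this pattern is that escalated requests pay for both calls and incur both latencies, so it tends to pay off only when the cheap model passes the check most of the time.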
Tools for model selection
Narev provides model pricing data, cost calculation, and A/B testing to compare models on real workloads before you commit to a switch.
See also: Model routing · Input and output tokens