Model selection

Model selection is choosing the right large language model (LLM) for each task to balance output quality, inference cost, and response latency.

Not every feature needs your most capable (and most expensive) model. A product description generator, a classification task, and a complex reasoning agent have very different quality requirements. Routing each to an appropriate model can cut spend dramatically without hurting user experience.

Why it matters for cost

Frontier models cost significantly more per token than mid-tier and small models, and benchmarks on your own data often show that a cheaper model already meets your quality bar. Typical savings by approach:

Approach                                  Typical savings
Switch from frontier to mid-tier model    60–90%
Use a smaller model for simple tasks      80–95%
Route by task complexity                  40–70% (blended)
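
To make the "blended" figure concrete, here is a minimal sketch of the arithmetic. The function name and the prices are hypothetical, and it assumes requests sent to either model use roughly the same number of tokens, which may not hold for your workload.

```python
def blended_savings(cheap_share: float, cheap_price: float, frontier_price: float) -> float:
    """Return the fraction of spend saved versus sending everything to the frontier model.

    cheap_share: fraction of token volume (0.0 to 1.0) routed to the cheaper model.
    Prices can be per token or per million tokens; only their ratio matters.
    """
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    if frontier_price <= 0:
        raise ValueError("frontier_price must be positive")
    blended_price = cheap_share * cheap_price + (1.0 - cheap_share) * frontier_price
    return 1.0 - blended_price / frontier_price


# Hypothetical prices: frontier at 10.00 and small model at 1.00 per million tokens.
savings = blended_savings(cheap_share=0.6, cheap_price=1.00, frontier_price=10.00)
print(f"Blended savings: {savings:.0%}")
```

With these made-up numbers, routing 60% of volume to the small model saves about 54% overall, even though the small model itself is 90% cheaper. The blended figure depends as much on how much traffic you can route as on the price gap.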

How to evaluate models

Start with your highest-volume use case: run each candidate model on the same sample of about 1,000 real inputs and compare:

  1. Quality: Manual review or automated evaluation against your rubric
  2. Cost: Actual tokens consumed × model pricing
  3. Latency: P50, P95, and P99 response times
  4. Consistency: Do outputs vary wildly, or are they stable?
  5. Edge cases: How does the model handle unusual or broken inputs?
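
The cost and latency items above can be computed from logged test runs. The sketch below is one possible way to do it, assuming you have recorded token counts and latency per request; the `RunRecord` type, the function names, and the prices in the usage example are all hypothetical, and quality scoring is left out because it depends on your rubric.

```python
import statistics
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunRecord:
    """One logged request from a test run (hypothetical structure)."""
    input_tokens: int
    output_tokens: int
    latency_s: float


def total_cost(records: List[RunRecord], input_price_per_mtok: float,
               output_price_per_mtok: float) -> float:
    """Cost = actual tokens consumed x model pricing, with prices per million tokens."""
    input_tokens = sum(r.input_tokens for r in records)
    output_tokens = sum(r.output_tokens for r in records)
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


def latency_percentiles(records: List[RunRecord]) -> Dict[str, float]:
    """Return P50, P95, and P99 latency; needs at least two records."""
    latencies = [r.latency_s for r in records]
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic example: 101 requests with latencies of 0, 1, ..., 100 seconds.
records = [RunRecord(input_tokens=1000, output_tokens=500, latency_s=float(i))
           for i in range(101)]
cost = total_cost(records, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
percentiles = latency_percentiles(records)
```

Running the same functions over each candidate model's records gives directly comparable cost and latency numbers to set beside the quality review.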

Common strategies

Instead of one model for everything, production teams typically use one of these patterns:

  • Task-based routing: Simple queries → cheap model; complex queries → capable model
  • Cascade routing: Try the cheap model first; escalate only if quality checks fail
  • User-tier routing: Free users get a faster, cheaper model; paid users get premium models
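
As an illustration of cascade routing, the sketch below tries a cheap model first and escalates only when a quality check fails. Everything here is hypothetical: the models are stub functions standing in for real API calls, and the quality check is a trivial non-empty test that a real system would replace with its own validation.

```python
from typing import Callable, Tuple

Model = Callable[[str], str]               # prompt -> answer
QualityCheck = Callable[[str, str], bool]  # (prompt, answer) -> acceptable?


def cascade(prompt: str, cheap_model: Model, capable_model: Model,
            passes_check: QualityCheck) -> Tuple[str, str]:
    """Return (answer, tier_used), escalating only if the cheap answer fails the check."""
    draft = cheap_model(prompt)
    if passes_check(prompt, draft):
        return draft, "cheap"
    return capable_model(prompt), "capable"


# Stub models for demonstration only; replace with real model calls.
def cheap_stub(prompt: str) -> str:
    # Pretend the cheap model only handles obviously positive reviews.
    return "positive" if "great" in prompt else ""


def capable_stub(prompt: str) -> str:
    return "negative"


def non_empty(prompt: str, answer: str) -> bool:
    # Placeholder check: accept any non-blank answer.
    return bool(answer.strip())


easy = cascade("great product", cheap_stub, capable_stub, non_empty)
hard = cascade("arrived late and broken", cheap_stub, capable_stub, non_empty)
```

Note that an escalated request pays for both model calls, so a cascade only saves money when the cheap model passes the check often enough to offset that overhead.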

Tools for model selection

Narev provides model pricing data, cost calculation, and A/B testing to compare models on real workloads before you commit to a switch.

See also: Model routing · Input and output tokens


Tokenminning · Built by Narev