When not to use an LLM

The cheapest token is the one you never send. Before routing to a smaller model or trimming prompts, ask whether the feature needs inference at all.

This guide is the Yes branch from the homepage decision tree: you can stop using AI for some workloads without losing shipped outcomes.

When the LLM is optional

Many production paths already have a cheaper answer:

Pattern | Replace with | Example
Fixed FAQ | Lookup table or CMS snippet | “What are your hours?”
Rule-based triage | Regex, decision tree, or workflow engine | Route tickets by keyword and priority
Repeated queries | Semantic caching | Support macros that hit 80% of volume
Structured extraction | Parser, schema validator, or dedicated ML model | JSON from a form, not from GPT
Search | Keyword or vector search without generation | Show chunks; generate only when the user asks

If the output is always the same for the same input, you do not need a generative model.
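
A minimal sketch of the first two rows of the table, with a deterministic path checked before any model call. The FAQ entry, the triage patterns, and the `call_llm` callable are hypothetical placeholders, not part of any specific product:

```python
import re
from typing import Callable, Optional

# Hypothetical FAQ entries; in practice these would come from a CMS or config.
FAQ = {
    "what are your hours": "We are open 9:00-17:00, Monday to Friday.",
}

# Hypothetical keyword rules for ticket triage: (pattern, queue).
TRIAGE_RULES = [
    (re.compile(r"\b(refund|chargeback)\b", re.I), "billing"),
    (re.compile(r"\b(outage|down|500)\b", re.I), "incident"),
]


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so trivial variants map to one key."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).strip()


def answer_faq(question: str) -> Optional[str]:
    """Return the stored answer, or None when the question is not a known FAQ."""
    return FAQ.get(normalize(question))


def triage(ticket: str) -> Optional[str]:
    """Return the first matching queue, or None when no rule applies."""
    for pattern, queue in TRIAGE_RULES:
        if pattern.search(ticket):
            return queue
    return None


def handle(question: str, call_llm: Callable[[str], str]) -> str:
    """Try the deterministic path first; call the model only on a miss."""
    stored = answer_faq(question)
    if stored is not None:
        return stored
    return call_llm(question)
```

Calls that hit `FAQ` or `TRIAGE_RULES` cost no tokens, and the miss rate of `handle` gives a rough measure of how much of the feature actually needs a model.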

Narrow the surface area

Removing an LLM-powered feature is often easier than optimizing it:

  1. Audit call sites — list every route, cron job, and IDE integration that hits an API. Tag each with feature and session_id per Article I.
  2. Rank by outcome — which calls correlate with resolved tickets, merged PRs, or revenue events? Which ones only exist because “we added AI”?
  3. Downgrade before delete — switch chat to inline completion, shrink context, or route to a rules engine. Measure again.
  4. Delete the route — remove the API handler, feature flag, or IDE rule once traffic and outcomes justify it.
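
Steps 1 and 2 can be sketched as a small aggregation over tagged call records. The `CallRecord` shape and its field names are assumptions for illustration; the real fields depend on how your calls are instrumented:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    feature: str      # tag applied at the call site
    cost_usd: float   # metered cost of this call
    outcome: bool     # True if linked to a resolved ticket, merged PR, etc.


def rank_by_cost_per_outcome(records):
    """Return (feature, total_cost, outcomes, cost_per_outcome), worst first."""
    cost = defaultdict(float)
    outcomes = defaultdict(int)
    for r in records:
        cost[r.feature] += r.cost_usd
        outcomes[r.feature] += int(r.outcome)

    rows = []
    for feature, total in cost.items():
        n = outcomes[feature]
        # Spend with no attributed outcome ranks worst.
        per_outcome = total / n if n else float("inf")
        rows.append((feature, total, n, per_outcome))
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

Features at the top of the ranking are the first candidates for downgrade or deletion.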

For coding assistants, GitHub Copilot illustrates the split: inline suggestions typically consume fewer tokens than agent chat for routine edits. Default to the lighter mode until a task's complexity calls for more.

Semantic caching as elimination

Semantic caching does not shrink prompts — it removes entire inference calls when a similar question was answered before. High-repetition workloads (support macros, internal FAQs, status lookups) can see most traffic served from cache with zero new tokens.

This is not prompt optimization. It is skipping the model.
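
A minimal sketch of the idea, under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector index, `call_llm` is a hypothetical model client, and the 0.8 threshold is illustrative only.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, call_llm: Callable[[str], str], threshold: float = 0.8):
        self.call_llm = call_llm      # hypothetical model client
        self.threshold = threshold    # illustrative; tune against an eval set
        self.entries: List[Tuple[Counter, str]] = []
        self.hits = 0
        self.misses = 0

    def ask(self, question: str) -> str:
        query = embed(question)
        best_score, best_answer = 0.0, None
        for vector, answer in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            self.hits += 1
            return best_answer        # served from cache: no inference call
        self.misses += 1
        answer = self.call_llm(question)
        self.entries.append((query, answer))
        return answer
```

The `hits` and `misses` counters show how much traffic skips the model. A threshold set too low returns wrong answers to merely similar questions, so it should be checked against a quality eval before rollout.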

Feature removal checklist

Before you decommission an LLM integration:

  • Metered baseline exists — you know cost per feature, not just total bill
  • A non-LLM path handles the happy case (lookup, cache, or heuristic)
  • Quality eval exists for the remaining LLM calls — you will not regress silently
  • Stakeholders agree on the outcome metric (not token volume)
  • Rollback plan documented — feature flag or dual-write during transition

Decommission path

Work in this order:

  1. Instrument — spans with feature, user_id, and token counts (Article I)
  2. Prove zero value — show features where the spend produces no measurable outcome, or a worse one
  3. Route traffic away — cache hits, rules engine, or human workflow
  4. Remove the integration — API keys, MCP servers, system prompts, and IDE rules
  5. Verify — cost drops on the attributed feature slice, not just globally
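
The verification in step 5 can be sketched as a per-feature comparison of two time windows. The `(feature, cost_usd)` span shape is an assumed export format, not a specific tracing backend's schema:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each span is (feature, cost_usd); a hypothetical export from a tracing backend.
Span = Tuple[str, float]


def cost_by_feature(spans: Iterable[Span]) -> Dict[str, float]:
    """Sum metered cost per feature tag."""
    totals: Dict[str, float] = defaultdict(float)
    for feature, cost_usd in spans:
        totals[feature] += cost_usd
    return dict(totals)


def cost_delta(before: Iterable[Span], after: Iterable[Span]) -> Dict[str, float]:
    """Per-feature change in cost (after minus before); negative means savings."""
    b, a = cost_by_feature(before), cost_by_feature(after)
    return {f: a.get(f, 0.0) - b.get(f, 0.0) for f in sorted(set(a) | set(b))}


before = [("faq_bot", 40.0), ("code_review", 10.0)]
after = [("code_review", 25.0)]
delta = cost_delta(before, after)
# delta == {"code_review": 15.0, "faq_bot": -40.0}
```

In this made-up data the global bill falls by 25, but only the per-feature view shows that the removed `faq_bot` slice saved 40 while `code_review` grew by 15.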

When you still need the model

If you answered No on the homepage tree, you pay for tokens via subscription or API. Continue with Where to start for the optimization sequence, or open IDE guides if most spend is in a coding assistant.
