Skip to Content
PracticeLocal inference

Local inference

Run models on hardware you control instead of paying per-token API rates. This is the last lever on the API per-token branch of the homepage decision tree — after metering, model routing, and caching are in place.

Local inference trades cloud token bills for GPU capex, ops time, and usually lower model quality on the same hardware budget.

Expected impact

SetupWhat you saveWhat you spend instead
Dev machine + Ollama100% of cloud API tokens for that workloadGPU/CPU time, electricity, engineer setup
Production vLLM clusterAPI margin at scaleGPU nodes, load balancing, model updates
Hybrid (local for drafts, cloud for quality)40–80% of API tokens on routed tasksTwo code paths to maintain

Benchmark wall-clock and output quality against your cloud baseline — cheaper per token, but a cheap local model that fails quality checks costs more than one frontier API call.

When local inference makes sense

Good fit:

  • High-volume, repetitive tasks where a 7B–14B model is good enough (classification, extraction, formatting)
  • Dev and CI workloads where latency and privacy matter more than frontier quality
  • Teams already paying for idle GPU capacity
  • Workloads where API spend exceeds the fully loaded cost of a dedicated inference node

Poor fit:

  • Tasks that genuinely need frontier reasoning and fail on smaller local models
  • Low-traffic endpoints where GPU idle time dominates
  • Teams without someone to patch models, monitor VRAM, and handle failover

Common stacks

Pick by deployment shape — not by IDE.

ToolBest forAPI shape
Ollama Single-machine dev, quick experimentsOpenAI-compatible /v1 on localhost
vLLM Production serving, high concurrencyOpenAI-compatible server
llama.cpp CPU or Apple Silicon, minimal depsHTTP server or embedded
Text Generation Inference (TGI) Hugging Face models at scaleOpenAI-compatible

Most application code needs only an OpenAI-compatible base URL change — your model routing layer can treat local/small and openai/gpt-4o-mini as tiers in the same cascade.

const local = openai({ baseURL: "http://localhost:11434/v1", apiKey: "ollama", // required but ignored by Ollama }); const response = await router.complete({ capability: "classify-intent", messages, // router tries local first, escalates to cloud on quality failure });

Metering still applies

Local inference removes the API meter — it does not remove the need to measure cost.

Track at minimum:

  • Tokens in / out — vLLM and Ollama expose usage in responses; log them the same way as cloud calls
  • Wall-clock latency — slow local inference can cost more engineer time than API tokens
  • GPU utilization — idle VRAM is wasted capex; right-size models to your hardware

See Article I: immutable metering. Cost attribution should include provider: local and model tags so dashboards compare cloud vs self-hosted spend fairly.

Integration with coding tools

Local backends are not IDE-specific. Any tool that accepts a custom OpenAI-compatible endpoint can point at Ollama or vLLM:

  • Aider — set provider to Ollama or a local OpenAI-compatible URL in .aider.conf.yml
  • Cline — configure a local endpoint in provider settings
  • Zed — Ollama or custom provider in Assistant settings

For a shared API key across agents, OpenRouter is cloud-only — use local routing in application code instead.

Anti-patterns

Anti-patternWhy it fails
Self-hosting before cloud optimizationsYou replicate the same bloated prompts on expensive hardware
Running frontier-class models on consumer GPUsOOM errors, 30s+ latency, team reverts to cloud
No quality gate in the routing cascadeLocal model silently degrades output; users escalate manually
Treating local inference as freeGPU lease, power, and on-call time are real COGS

Next steps

  • Not ready for self-hosted? Stay on the cloud path — Semantic caching is the prior step in the sequence.
  • Running agents against local models? Apply Context hygiene — input bloat hurts local latency even more than cloud cost.
  • Need normalized cost math across cloud and local? Narev  supports comparing provider rates; add your own GPU hourly rate for local rows.
Last updated on