Local inference
Run models on hardware you control instead of paying per-token API rates. This is the last lever on the API per-token branch of the homepage decision tree — after metering, model routing, and caching are in place.
Local inference trades cloud token bills for GPU capex, ops time, and, at a comparable budget, usually lower model quality.
Expected impact
| Setup | What you save | What you spend instead |
|---|---|---|
| Dev machine + Ollama | 100% of cloud API tokens for that workload | GPU/CPU time, electricity, engineer setup |
| Production vLLM cluster | API margin at scale | GPU nodes, load balancing, model updates |
| Hybrid (local for drafts, cloud for quality) | 40–80% of API tokens on routed tasks | Two code paths to maintain |
Benchmark wall-clock latency and output quality against your cloud baseline. Local inference is usually cheaper per token, but a cheap local model that fails quality checks and then has to be retried or escalated costs more than a single frontier API call.
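To make that trade-off concrete, the expected cost per task can be estimated when every failed local attempt escalates to the cloud. This is a minimal sketch: the function names and every figure passed to them are hypothetical placeholders, not measurements.

```typescript
// Expected cost per accepted output when local failures escalate to cloud.
// All values passed in are illustrative assumptions; measure your own.
interface CascadeCostInput {
  localCostPerCall: number; // USD: amortized GPU + power for one local call
  cloudCostPerCall: number; // USD: one frontier API call
  localFailureRate: number; // fraction of local outputs failing the quality check (0..1)
}

function expectedCostPerTask(input: CascadeCostInput): number {
  const { localCostPerCall, cloudCostPerCall, localFailureRate } = input;
  if (localFailureRate < 0 || localFailureRate > 1) {
    throw new RangeError("localFailureRate must be between 0 and 1");
  }
  // Every task pays for the local attempt; failed ones also pay for the cloud call.
  return localCostPerCall + localFailureRate * cloudCostPerCall;
}

// Local-first is worthwhile only while the cascade costs less than going
// straight to the cloud.
function localIsCheaper(input: CascadeCostInput): boolean {
  return expectedCostPerTask(input) < input.cloudCostPerCall;
}
```

Under these assumptions, a local call at $0.002 and a cloud call at $0.01 still favor local-first at a 50% failure rate, but no longer do at 90%, because nearly every task then pays for both calls.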
When local inference makes sense
Good fit:
- High-volume, repetitive tasks where a 7B–14B model is good enough (classification, extraction, formatting)
- Dev and CI workloads where latency and privacy matter more than frontier quality
- Teams already paying for idle GPU capacity
- Workloads where API spend exceeds the fully loaded cost of a dedicated inference node
Poor fit:
- Tasks that genuinely need frontier reasoning and fail on smaller local models
- Low-traffic endpoints where GPU idle time dominates
- Teams without someone to patch models, monitor VRAM, and handle failover
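The "API spend exceeds the fully loaded cost" test from the good-fit list can be written down as simple arithmetic. The sketch below assumes a hypothetical cost breakdown (lease, power, ops time); the field names and any numbers you plug in are your own, not figures from this guide.

```typescript
// Compare monthly API spend with the fully loaded cost of one inference node.
// Field names are hypothetical; substitute your own cost categories.
interface NodeCost {
  gpuLeaseUsdPerMonth: number;
  powerUsdPerMonth: number;
  opsHoursPerMonth: number; // patching models, monitoring VRAM, handling failover
  engineerUsdPerHour: number;
}

function fullyLoadedNodeCost(c: NodeCost): number {
  return (
    c.gpuLeaseUsdPerMonth +
    c.powerUsdPerMonth +
    c.opsHoursPerMonth * c.engineerUsdPerHour
  );
}

// Positive result: self-hosting saves money for this workload.
// Negative result: the node costs more than the API bill it replaces.
function monthlySavings(apiSpendUsdPerMonth: number, c: NodeCost): number {
  return apiSpendUsdPerMonth - fullyLoadedNodeCost(c);
}
```

For example, a node assumed to cost $1,200 to lease, $100 in power, and 10 ops hours at $100 per hour comes to $2,300 per month, so it pays off against a $5,000 API bill but not against an $800 one.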
Common stacks
Pick by deployment shape — not by IDE.
| Tool | Best for | API shape |
|---|---|---|
| Ollama | Single-machine dev, quick experiments | OpenAI-compatible /v1 on localhost |
| vLLM | Production serving, high concurrency | OpenAI-compatible server |
| llama.cpp | CPU or Apple Silicon, minimal deps | HTTP server or embedded |
| Text Generation Inference (TGI) | Hugging Face models at scale | OpenAI-compatible |
Most application code needs only an OpenAI-compatible base URL change — your model routing layer can treat local/small and openai/gpt-4o-mini as tiers in the same cascade.
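```typescript
import OpenAI from "openai";

// Point the OpenAI SDK at Ollama's OpenAI-compatible endpoint.
const local = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // required but ignored by Ollama
});

// `router` and `messages` come from your own routing layer (not shown here).
// The router tries local first and escalates to cloud on quality failure.
const response = await router.complete({
  capability: "classify-intent",
  messages,
});
```

Metering still applies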
Local inference removes the API meter — it does not remove the need to measure cost.
Track at minimum:
- Tokens in / out: vLLM and Ollama expose usage in responses; log them the same way as cloud calls
- Wall-clock latency: slow local inference can cost more in engineer time than it saves in API tokens
- GPU utilization: idle VRAM is wasted capex; right-size models to your hardware
See Article I: immutable metering. Cost attribution should include provider: local and model tags so dashboards compare cloud vs self-hosted spend fairly.
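A metering record for a local call might be built as follows. The `usage` field names match the OpenAI-compatible chat completion response shape; the `MeterRecord` type, the `toMeterRecord` helper, and the model name in the usage note are assumptions made for this sketch.

```typescript
// Shape of the `usage` object in OpenAI-compatible chat completion responses.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

// Hypothetical metering record; adapt the fields to your own pipeline.
interface MeterRecord {
  provider: "local" | "cloud";
  model: string;
  tokensIn: number;
  tokensOut: number;
  latencyMs: number;
  timestamp: string; // ISO 8601, taken from the finish time
}

function toMeterRecord(
  provider: "local" | "cloud",
  model: string,
  usage: Usage,
  startedAtMs: number,
  finishedAtMs: number,
): MeterRecord {
  return {
    provider,
    model,
    tokensIn: usage.prompt_tokens,
    tokensOut: usage.completion_tokens,
    latencyMs: finishedAtMs - startedAtMs,
    timestamp: new Date(finishedAtMs).toISOString(),
  };
}
```

Calling `toMeterRecord("local", "llama3.1:8b", response.usage, start, end)` around a local request would yield a row that can sit in the same table as cloud calls, with `provider` and `model` available as dashboard dimensions.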
Integration with coding tools
Local backends are not IDE-specific. Any tool that accepts a custom OpenAI-compatible endpoint can point at Ollama or vLLM:
- Aider: set provider to Ollama or a local OpenAI-compatible URL in .aider.conf.yml
- Cline: configure a local endpoint in provider settings
- Zed: Ollama or custom provider in Assistant settings
If you rely on OpenRouter for a shared API key across agents, note that it is cloud-only; route to local models in application code instead.
Anti-patterns
| Anti-pattern | Why it fails |
|---|---|
| Self-hosting before cloud optimizations | You replicate the same bloated prompts on expensive hardware |
| Running frontier-class models on consumer GPUs | OOM errors, 30s+ latency, team reverts to cloud |
| No quality gate in the routing cascade | Local model silently degrades output; users escalate manually |
| Treating local inference as free | GPU lease, power, and on-call time are real COGS |
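The missing quality gate is usually the easiest of these to fix. The sketch below shows one possible shape for a gated cascade: tiers are tried cheapest first, and an output is accepted only if it passes a check. The `Tier` type, the `cascade` function, and the intent labels are hypothetical and do not correspond to any particular router's API.

```typescript
// A tier is any async function that turns a prompt into text. In practice it
// would wrap a local or cloud OpenAI-compatible client.
type Tier = { name: string; complete: (prompt: string) => Promise<string> };

interface CascadeResult {
  tier: string; // which tier produced the accepted output
  output: string;
  attempts: number; // how many tiers were tried
}

// Try tiers in order and return the first output that passes the quality
// gate. Throws if every tier fails, so degradation is never silent.
async function cascade(
  tiers: Tier[],
  prompt: string,
  passesGate: (output: string) => boolean,
): Promise<CascadeResult> {
  let attempts = 0;
  for (const tier of tiers) {
    attempts += 1;
    const output = await tier.complete(prompt);
    if (passesGate(output)) {
      return { tier: tier.name, output, attempts };
    }
  }
  throw new Error(`No tier passed the quality gate after ${attempts} attempts`);
}

// Example gate for an intent classifier: accept only known labels.
const knownIntents = new Set(["billing", "support", "sales"]);
const isKnownIntent = (output: string): boolean =>
  knownIntents.has(output.trim().toLowerCase());
```

A deterministic gate like this (label membership, schema validation, a regex) suits classification and extraction; free-form generation typically needs a different check, which is outside the scope of this sketch. Logging `attempts` alongside the metering data shows how often the local tier is actually good enough.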
Next steps
- Not ready for self-hosted? Stay on the cloud path — Semantic caching is the prior step in the sequence.
- Running agents against local models? Apply Context hygiene — input bloat hurts local latency even more than cloud cost.
- Need normalized cost math across cloud and local? Narev supports comparing provider rates; add your own GPU hourly rate for local rows.
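One way to turn a GPU hourly rate into a number comparable with cloud per-token pricing is to divide it by the tokens actually served in that hour. The helper below is a rough sketch under that assumption; the function name is hypothetical, and throughput and utilization are values you would need to measure on your own hardware.

```typescript
// Convert a GPU hourly rate into an effective price per million tokens so a
// local row can sit next to cloud provider rates.
function localUsdPerMillionTokens(
  gpuUsdPerHour: number,
  tokensPerSecond: number, // sustained throughput while the GPU is busy
  utilization: number, // fraction of each hour spent serving requests, in (0, 1]
): number {
  if (tokensPerSecond <= 0 || utilization <= 0 || utilization > 1) {
    throw new RangeError("tokensPerSecond must be > 0 and utilization in (0, 1]");
  }
  // Idle time still costs money, so only busy seconds produce tokens.
  const tokensPerHour = tokensPerSecond * 3600 * utilization;
  return (gpuUsdPerHour / tokensPerHour) * 1e6;
}
```

Because utilization sits in the denominator, halving it doubles the effective rate, which is why low-traffic endpoints rarely justify a dedicated node.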