Local inference – Tokenminning

Run models on hardware you control instead of paying per-token API rates. This is the last lever on the API per-token branch of the homepage decision tree — after metering, model routing, and caching are in place.

Local inference trades cloud token bills for GPU capex, ops time, and usually lower model quality on the same hardware budget.

Expected impact

Setup	What you save	What you spend instead
Dev machine + Ollama	100% of cloud API tokens for that workload	GPU/CPU time, electricity, engineer setup
Production vLLM cluster	API margin at scale	GPU nodes, load balancing, model updates
Hybrid (local for drafts, cloud for quality)	40–80% of API tokens on routed tasks	Two code paths to maintain

Benchmark wall-clock and output quality against your cloud baseline — cheaper per token, but a cheap local model that fails quality checks costs more than one frontier API call.

When local inference makes sense

Good fit:

High-volume, repetitive tasks where a 7B–14B model is good enough (classification, extraction, formatting)
Dev and CI workloads where latency and privacy matter more than frontier quality
Teams already paying for idle GPU capacity
Workloads where API spend exceeds the fully loaded cost of a dedicated inference node

Poor fit:

Tasks that genuinely need frontier reasoning and fail on smaller local models
Low-traffic endpoints where GPU idle time dominates
Teams without someone to patch models, monitor VRAM, and handle failover

Common stacks

Pick by deployment shape — not by IDE.

Tool	Best for	API shape
Ollama	Single-machine dev, quick experiments	OpenAI-compatible `/v1` on localhost
vLLM	Production serving, high concurrency	OpenAI-compatible server
llama.cpp	CPU or Apple Silicon, minimal deps	HTTP server or embedded
Text Generation Inference (TGI)	Hugging Face models at scale	OpenAI-compatible

Most application code needs only an OpenAI-compatible base URL change — your model routing layer can treat local/small and openai/gpt-4o-mini as tiers in the same cascade.


const local = openai({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // required but ignored by Ollama
});
 
const response = await router.complete({
  capability: "classify-intent",
  messages,
  // router tries local first, escalates to cloud on quality failure
});

Metering still applies

Local inference removes the API meter — it does not remove the need to measure cost.

Track at minimum:

Tokens in / out — vLLM and Ollama expose usage in responses; log them the same way as cloud calls
Wall-clock latency — slow local inference can cost more engineer time than API tokens
GPU utilization — idle VRAM is wasted capex; right-size models to your hardware

See Article I: immutable metering. Cost attribution should include provider: local and model tags so dashboards compare cloud vs self-hosted spend fairly.

Integration with coding tools

Local backends are not IDE-specific. Any tool that accepts a custom OpenAI-compatible endpoint can point at Ollama or vLLM:

Aider — set provider to Ollama or a local OpenAI-compatible URL in .aider.conf.yml
Cline — configure a local endpoint in provider settings
Zed — Ollama or custom provider in Assistant settings

For a shared API key across agents, OpenRouter is cloud-only — use local routing in application code instead.

Anti-patterns

Anti-pattern	Why it fails
Self-hosting before cloud optimizations	You replicate the same bloated prompts on expensive hardware
Running frontier-class models on consumer GPUs	OOM errors, 30s+ latency, team reverts to cloud
No quality gate in the routing cascade	Local model silently degrades output; users escalate manually
Treating local inference as free	GPU lease, power, and on-call time are real COGS

Next steps

Not ready for self-hosted? Stay on the cloud path — Semantic caching is the prior step in the sequence.
Running agents against local models? Apply Context hygiene — input bloat hurts local latency even more than cloud cost.
Need normalized cost math across cloud and local? Narev supports comparing provider rates; add your own GPU hourly rate for local rows.