Self-hosting
Self-hosting means running open-weight models on hardware you control instead of paying per-token API rates. The trade is fixed GPU cost and ops work for predictable unit economics at high volume — not a free lunch.
Self-hosting is the last lever in the optimization stack. Start with the Local inference practice guide, then use the deep dives below for stack-specific setup, metering, and guardrails.
When self-hosting wins
| Signal | What it means |
|---|---|
| Steady high QPS on a narrow task mix | Fixed GPU amortizes faster than API tokens |
| Data cannot leave your network | Compliance or IP constraints block cloud APIs |
| Predictable batch windows | Overnight jobs, eval pipelines, internal tools |
| You already own GPU capacity | Idle cluster time is cheaper than marginal API spend |
When cloud APIs still win
| Signal | What it means |
|---|---|
| Spiky or exploratory traffic | Pay-per-token absorbs variance; GPUs sit idle |
| You need frontier quality | Open weights lag closed models on hard reasoning |
| Small team, no inference SRE | Serving, upgrades, and incident response are real work |
| Low monthly inference spend | Hardware depreciation exceeds API bills |
Cost model: tokens vs GPU time
Cloud APIs bill input + output tokens. Self-hosted stacks bill GPU-seconds, VRAM headroom, and engineer time.
Compare fairly:
- Normalize to cost per outcome — resolved tickets, classified rows, merged PRs — not raw token volume.
- Meter wall-clock and GPU utilization — a 7B model at 30% GPU load is not “free” because there is no API invoice.
- Include idle capacity — a dedicated dev GPU that runs Ollama two hours a day still depreciates 24/7 unless shared.
- Add routing overhead — hybrid stacks (local for draft, cloud for commit) need explicit escalation logs per Article II.
Narev helps compare cloud token rates when you benchmark local quality against API baselines.
Workload fit
| Workload profile | Typical self-host fit | First stack to evaluate |
|---|---|---|
| Coding assistants | Exploration and completions on dev GPUs | Ollama |
| Batch / offline | High-volume classification, summarization, eval | vLLM or TGI |
| Edge / laptop | Air-gapped demos, travel, CI smoke tests | llama.cpp |
| OpenAI-compatible drop-in | Swap baseURL without rewriting SDK calls | LocalAI or Ollama’s API mode |
| Multi-tenant production | Shared serving with batching and autoscaling | vLLM or TGI behind a gateway |
Stack comparison
| Stack | Best for | Throughput | Ops complexity | OpenAI-compatible API |
|---|---|---|---|---|
| Ollama | Local dev, IDE integration, quick experiments | Low–medium (single user) | Low | Yes |
| vLLM | Production serving, continuous batching, high QPS | High | Medium–high | Yes |
| llama.cpp | CPU/GPU edge, minimal deps, GGUF models | Low–medium | Low | Yes (server mode) |
| TGI | Hugging Face models, enterprise K8s | High | Medium | Yes |
| LocalAI | Unified gateway over multiple backends | Depends on backend | Medium | Yes |
Decision flow
Progressive guardrails
Self-hosted inference still needs budgets — measured in GPU hours and queue depth, not FinOps chargeback theater.
| Tier | What to enforce |
|---|---|
| Soft | Log GPU time per feature tag; alert when local queue exceeds P95 |
| Medium | Route exploratory traffic to local models; escalate to cloud on quality failure |
| Hard | Per-session token ceilings; kill runaway agent loops; CI blocks unbounded max_tokens |
See Article IV: fiscal ceilings for runtime cap patterns that apply to both cloud and on-prem.
Deep dives
| Guide | One-line summary |
|---|---|
| Ollama | Fastest path from ollama pull to IDE assistant — dev GPU metering and hybrid routing |
| vLLM | Production throughput with PagedAttention and continuous batching |
| llama.cpp | GGUF inference on CPU, Apple Silicon, or modest GPUs |
| Text Generation Inference | Hugging Face–native serving for K8s and enterprise deployments |
| LocalAI | OpenAI-compatible gateway that fronts Ollama, llama.cpp, vLLM, and more |
Related
- Local inference — practice guide: when local beats cloud, metering, IDE wiring
- Where to start — optimization sequence; local inference is step 8
- Model routing — hybrid local/cloud escalation
- Tokenminning in Zed — Ollama integration in a GPU-rendered IDE
- Model selection — quality bars before you swap providers