Self-hosting – Tokenminning

Self-hosting means running open-weight models on hardware you control instead of paying per-token API rates. The trade is fixed GPU cost and ops work for predictable unit economics at high volume — not a free lunch.

Self-hosting is the last lever in the optimization stack. Start with the Local inference practice guide, then use the deep dives below for stack-specific setup, metering, and guardrails.

When self-hosting wins

Signal	What it means
Steady high QPS on a narrow task mix	Fixed GPU amortizes faster than API tokens
Data cannot leave your network	Compliance or IP constraints block cloud APIs
Predictable batch windows	Overnight jobs, eval pipelines, internal tools
You already own GPU capacity	Idle cluster time is cheaper than marginal API spend

When cloud APIs still win

Signal	What it means
Spiky or exploratory traffic	Pay-per-token absorbs variance; GPUs sit idle
You need frontier quality	Open weights lag closed models on hard reasoning
Small team, no inference SRE	Serving, upgrades, and incident response are real work
Low monthly inference spend	Hardware depreciation exceeds API bills

Cost model: tokens vs GPU time

Cloud APIs bill input + output tokens. Self-hosted stacks bill GPU-seconds, VRAM headroom, and engineer time.

Compare fairly:

Normalize to cost per outcome — resolved tickets, classified rows, merged PRs — not raw token volume.
Meter wall-clock and GPU utilization — a 7B model at 30% GPU load is not “free” because there is no API invoice.
Include idle capacity — a dedicated dev GPU that runs Ollama two hours a day still depreciates 24/7 unless shared.
Add routing overhead — hybrid stacks (local for draft, cloud for commit) need explicit escalation logs per Article II.

Narev helps compare cloud token rates when you benchmark local quality against API baselines.

Workload fit

Workload profile	Typical self-host fit	First stack to evaluate
Coding assistants	Exploration and completions on dev GPUs	Ollama
Batch / offline	High-volume classification, summarization, eval	vLLM or TGI
Edge / laptop	Air-gapped demos, travel, CI smoke tests	llama.cpp
OpenAI-compatible drop-in	Swap `baseURL` without rewriting SDK calls	LocalAI or Ollama’s API mode
Multi-tenant production	Shared serving with batching and autoscaling	vLLM or TGI behind a gateway

Stack comparison

Stack	Best for	Throughput	Ops complexity	OpenAI-compatible API
Ollama	Local dev, IDE integration, quick experiments	Low–medium (single user)	Low	Yes
vLLM	Production serving, continuous batching, high QPS	High	Medium–high	Yes
llama.cpp	CPU/GPU edge, minimal deps, GGUF models	Low–medium	Low	Yes (server mode)
TGI	Hugging Face models, enterprise K8s	High	Medium	Yes
LocalAI	Unified gateway over multiple backends	Depends on backend	Medium	Yes

Decision flow

Progressive guardrails

Self-hosted inference still needs budgets — measured in GPU hours and queue depth, not FinOps chargeback theater.

Tier	What to enforce
Soft	Log GPU time per feature tag; alert when local queue exceeds P95
Medium	Route exploratory traffic to local models; escalate to cloud on quality failure
Hard	Per-session token ceilings; kill runaway agent loops; CI blocks unbounded `max_tokens`

See Article IV: fiscal ceilings for runtime cap patterns that apply to both cloud and on-prem.

Deep dives

Guide	One-line summary
Ollama	Fastest path from `ollama pull` to IDE assistant — dev GPU metering and hybrid routing
vLLM	Production throughput with PagedAttention and continuous batching
llama.cpp	GGUF inference on CPU, Apple Silicon, or modest GPUs
Text Generation Inference	Hugging Face–native serving for K8s and enterprise deployments
LocalAI	OpenAI-compatible gateway that fronts Ollama, llama.cpp, vLLM, and more

Local inference — practice guide: when local beats cloud, metering, IDE wiring
Where to start — optimization sequence; local inference is step 8
Model routing — hybrid local/cloud escalation
Tokenminning in Zed — Ollama integration in a GPU-rendered IDE
Model selection — quality bars before you swap providers