Skip to Content
Self-hosting

Self-hosting

Self-hosting means running open-weight models on hardware you control instead of paying per-token API rates. The trade is fixed GPU cost and ops work for predictable unit economics at high volume — not a free lunch.

Self-hosting is the last lever in the optimization stack. Start with the Local inference practice guide, then use the deep dives below for stack-specific setup, metering, and guardrails.

When self-hosting wins

SignalWhat it means
Steady high QPS on a narrow task mixFixed GPU amortizes faster than API tokens
Data cannot leave your networkCompliance or IP constraints block cloud APIs
Predictable batch windowsOvernight jobs, eval pipelines, internal tools
You already own GPU capacityIdle cluster time is cheaper than marginal API spend

When cloud APIs still win

SignalWhat it means
Spiky or exploratory trafficPay-per-token absorbs variance; GPUs sit idle
You need frontier qualityOpen weights lag closed models on hard reasoning
Small team, no inference SREServing, upgrades, and incident response are real work
Low monthly inference spendHardware depreciation exceeds API bills

Cost model: tokens vs GPU time

Cloud APIs bill input + output tokens. Self-hosted stacks bill GPU-seconds, VRAM headroom, and engineer time.

Compare fairly:

  1. Normalize to cost per outcome — resolved tickets, classified rows, merged PRs — not raw token volume.
  2. Meter wall-clock and GPU utilization — a 7B model at 30% GPU load is not “free” because there is no API invoice.
  3. Include idle capacity — a dedicated dev GPU that runs Ollama two hours a day still depreciates 24/7 unless shared.
  4. Add routing overhead — hybrid stacks (local for draft, cloud for commit) need explicit escalation logs per Article II.

Narev  helps compare cloud token rates when you benchmark local quality against API baselines.

Workload fit

Workload profileTypical self-host fitFirst stack to evaluate
Coding assistantsExploration and completions on dev GPUsOllama
Batch / offlineHigh-volume classification, summarization, evalvLLM or TGI
Edge / laptopAir-gapped demos, travel, CI smoke testsllama.cpp
OpenAI-compatible drop-inSwap baseURL without rewriting SDK callsLocalAI or Ollama’s API mode
Multi-tenant productionShared serving with batching and autoscalingvLLM or TGI behind a gateway

Stack comparison

StackBest forThroughputOps complexityOpenAI-compatible API
OllamaLocal dev, IDE integration, quick experimentsLow–medium (single user)LowYes
vLLMProduction serving, continuous batching, high QPSHighMedium–highYes
llama.cppCPU/GPU edge, minimal deps, GGUF modelsLow–mediumLowYes (server mode)
TGIHugging Face models, enterprise K8sHighMediumYes
LocalAIUnified gateway over multiple backendsDepends on backendMediumYes

Decision flow

Progressive guardrails

Self-hosted inference still needs budgets — measured in GPU hours and queue depth, not FinOps chargeback theater.

TierWhat to enforce
SoftLog GPU time per feature tag; alert when local queue exceeds P95
MediumRoute exploratory traffic to local models; escalate to cloud on quality failure
HardPer-session token ceilings; kill runaway agent loops; CI blocks unbounded max_tokens

See Article IV: fiscal ceilings for runtime cap patterns that apply to both cloud and on-prem.

Deep dives

GuideOne-line summary
OllamaFastest path from ollama pull to IDE assistant — dev GPU metering and hybrid routing
vLLMProduction throughput with PagedAttention and continuous batching
llama.cppGGUF inference on CPU, Apple Silicon, or modest GPUs
Text Generation InferenceHugging Face–native serving for K8s and enterprise deployments
LocalAIOpenAI-compatible gateway that fronts Ollama, llama.cpp, vLLM, and more
Last updated on