vLLM
vLLM is a production inference engine built for throughput. Its PagedAttention memory manager and continuous batching let one GPU serve many concurrent requests efficiently — the stack teams reach for when Ollama’s single-user ergonomics stop scaling.
This guide covers when vLLM beats cloud APIs, how to size GPUs, and how to instrument serving for cost per outcome.
Expected impact
| Scenario | Typical outcome |
|---|---|
| High QPS on a fixed model (support bot, classification) | 70–95% lower cost than the equivalent cloud API at >60% GPU utilization |
| Low QPS with idle GPU | Negative — depreciation exceeds API spend |
| Replacing batch jobs that already use cloud batch APIs | Benchmark required; savings depend on job parallelism |
vLLM pays off when utilization is high and model choice is stable. It does not pay off for sporadic dev experiments — use Ollama instead.
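The utilization threshold can be estimated before buying hardware. The sketch below is a rough cost model, not a benchmark: `CostInputs` and both function names are hypothetical, and every price or throughput figure you feed it is your own measurement or quote.

```typescript
// Rough break-even model. All names here are illustrative assumptions.
interface CostInputs {
  gpuHourlyUsd: number;           // rented or amortized GPU cost per hour
  tokensPerSecondAtFull: number;  // measured throughput at 100% utilization
  utilization: number;            // fraction of wall-clock time spent serving, 0..1
  apiUsdPerMillionTokens: number; // blended cloud price for the same tokens
}

// The GPU bills every hour, but only the utilized fraction produces tokens.
function selfHostedUsdPerMillionTokens(c: CostInputs): number {
  const tokensPerHour = c.tokensPerSecondAtFull * 3600 * c.utilization;
  if (tokensPerHour <= 0) return Infinity; // an idle GPU has no finite unit cost
  return (c.gpuHourlyUsd / tokensPerHour) * 1e6;
}

// Utilization at which self-hosting costs the same per token as the API.
// Values above 1 mean the GPU cannot break even at any load.
function breakEvenUtilization(c: CostInputs): number {
  const fullLoadCost = selfHostedUsdPerMillionTokens({ ...c, utilization: 1 });
  return fullLoadCost / c.apiUsdPerMillionTokens;
}
```

If the break-even utilization lands well below what your traffic sustains, self-hosting is worth benchmarking; if it lands near or above 1, stay on the API.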
Architecture overview
Client (OpenAI SDK, LangChain, etc.)
↓ HTTP
vLLM OpenAI-compatible server (:8000)
↓
PagedAttention + continuous batching
↓
GPU(s): CUDA, ROCm, or TPU

Key mechanisms:
- Continuous batching — new requests join an in-flight batch without waiting for the entire batch to finish
- PagedAttention — KV cache stored in non-contiguous pages, reducing memory fragmentation
- Tensor parallelism — split large models across multiple GPUs
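The first two mechanisms can be pictured with a toy allocator. This is an illustration of the idea only, not vLLM's implementation: `BLOCK_SIZE`, `BlockAllocator`, and its methods are hypothetical names, and the block size of 16 tokens is an assumed value.

```typescript
// Toy model of paged KV-cache bookkeeping (hypothetical names throughout).
const BLOCK_SIZE = 16; // tokens per KV block (assumed)

class BlockAllocator {
  private free: number[];
  private tables = new Map<string, number[]>(); // sequence id -> physical block ids
  private lengths = new Map<string, number>();  // sequence id -> tokens stored

  constructor(totalBlocks: number) {
    this.free = Array.from({ length: totalBlocks }, (_, i) => i);
  }

  // Append one token; a new block is claimed only when the last one is full,
  // so a sequence never reserves memory for context it has not used yet.
  appendToken(seqId: string): boolean {
    const len = this.lengths.get(seqId) ?? 0;
    const table = this.tables.get(seqId) ?? [];
    if (len % BLOCK_SIZE === 0) {
      const block = this.free.pop();
      if (block === undefined) return false; // out of KV memory: caller queues or preempts
      table.push(block);
    }
    this.tables.set(seqId, table);
    this.lengths.set(seqId, len + 1);
    return true;
  }

  // A finished sequence returns its blocks at once, which is what lets a
  // waiting request join the running batch without a full-batch barrier.
  release(seqId: string): void {
    this.free.push(...(this.tables.get(seqId) ?? []));
    this.tables.delete(seqId);
    this.lengths.delete(seqId);
  }

  blockTable(seqId: string): number[] {
    return [...(this.tables.get(seqId) ?? [])];
  }

  freeBlocks(): number {
    return this.free.length;
  }
}
```

A sequence's block table can end up pointing at scattered physical blocks, which is the point: any free block is usable, so little memory is stranded between sequences.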
Quick start
pip install vllm

# OpenAI-compatible server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: "vllm",
});

const response = await client.chat.completions.create({
  model: "meta-llama/Llama-3.1-8B-Instruct",
  messages: [{ role: "user", content: "Classify intent: refund request" }],
  max_tokens: 64,
});

GPU sizing
Undersized GPUs cause OOM; oversized GPUs idle. Start from model weights + KV cache headroom:
| Model size | Minimum VRAM (FP16) | Practical note |
|---|---|---|
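| 7B–8B | 16–24 GB | Good single-GPU serving entry point; weights alone are about 14–16 GB, so 24 GB leaves usable KV cache headroom |
| 13B–14B | 32–40 GB | Weights are about 26–28 GB; quantization (AWQ, GPTQ) reduces the requirement |
| 70B | 2× 80 GB or 4× 40 GB+ | Weights are about 140 GB, so tensor parallelism is required; 1× 80 GB works only with 4-bit quantization |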
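The "weights + KV cache headroom" rule can be turned into arithmetic. The sketch below is a first-order estimate that ignores activation memory and runtime overhead; `ModelShape` and the function names are hypothetical, and the layer and head counts for the 8B model are assumed values to check against the model's own config.

```typescript
// First-order VRAM estimate (hypothetical helper names).
interface ModelShape {
  paramsBillions: number;
  layers: number;
  kvHeads: number; // key/value heads; fewer than query heads under grouped-query attention
  headDim: number;
}

const GIB = 1024 * 1024 * 1024;

// Weights: one value per parameter at the chosen precision (2 bytes for FP16).
function weightBytes(m: ModelShape, bytesPerParam = 2): number {
  return m.paramsBillions * 1e9 * bytesPerParam;
}

// KV cache per token: one key and one value vector per layer and KV head.
function kvBytesPerToken(m: ModelShape, bytesPerValue = 2): number {
  return 2 * m.layers * m.kvHeads * m.headDim * bytesPerValue;
}

// Tokens of KV cache that fit once the weights are loaded, shared across
// every sequence running at the same time.
function kvTokenCapacity(m: ModelShape, vramGiB: number, gpuMemoryUtilization = 0.9): number {
  const budget = vramGiB * GIB * gpuMemoryUtilization - weightBytes(m);
  return Math.max(0, Math.floor(budget / kvBytesPerToken(m)));
}

// Assumed shape for an 8B Llama-style model.
const llama8b: ModelShape = { paramsBillions: 8, layers: 32, kvHeads: 8, headDim: 128 };
```

With these assumed numbers, `kvTokenCapacity(llama8b, 24)` comes to roughly 55,000 tokens. Dividing that by your capped context length gives a ceiling on concurrent full-length sequences.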
Use quantization when quality benchmarks pass:
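# --quantization awq loads a checkpoint that was already quantized with AWQ;
# it does not quantize FP16 weights at load time.
vllm serve <awq-quantized-checkpoint> \
  --quantization awq \
  --dtype auto

Always benchmark quantized vs full precision on your eval set before cutting cloud traffic.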
Throughput tuning
| Knob | Effect |
|---|---|
| --max-num-seqs | Max concurrent sequences; raise until latency SLO breaks |
| --gpu-memory-utilization | Fraction of each GPU's VRAM that vLLM may claim for weights, activations, and KV cache (default 0.9) |
| --max-model-len | Caps context length, which lowers KV memory per sequence and improves throughput |
| --enable-prefix-caching | Reuse KV blocks for identical prefixes (like provider prompt caching) |
Prohibited: setting --max-model-len to 128K “just in case” on every deployment.
Required: cap context per workload profile; RAG pipelines trim retrieval before the model sees it.
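One way a RAG pipeline might enforce its cap is to drop low-scoring chunks before building the prompt. This is a minimal sketch: `Chunk`, `estimateTokens`, and `trimToBudget` are hypothetical names, and the four-characters-per-token estimate is a rough assumption for English text, not a tokenizer.

```typescript
interface Chunk {
  text: string;
  score: number; // retrieval relevance, higher is better
}

// Rough token estimate (assumed ratio). Use the model's real tokenizer
// when the budget is tight.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the highest-scoring chunks that fit inside the workload's budget.
function trimToBudget(chunks: Chunk[], budgetTokens: number): Chunk[] {
  const kept: Chunk[] = [];
  let used = 0;
  const ranked = [...chunks].sort((a, b) => b.score - a.score);
  for (const chunk of ranked) {
    const cost = estimateTokens(chunk.text);
    if (used + cost > budgetTokens) continue; // skip chunks that would overflow
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```

The budget passed in should be the workload's context cap minus the system prompt, the user message, and the `max_tokens` reserved for output.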
Metering and observability
vLLM exposes Prometheus metrics at /metrics. Track:
| Metric | Use |
|---|---|
| vllm:gpu_cache_usage_perc | KV cache pressure; scale or trim context when sustained high |
| vllm:num_requests_running | Concurrency; compare to SLO |
| vllm:time_to_first_token_seconds | TTFT for streaming UX |
| vllm:time_per_output_token_seconds | Generation speed |
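For a quick check without a full Prometheus stack, the text exposition format is simple enough to read directly. The sketch below is a deliberately small parser: `parseGauges` and `kvCachePressure` are hypothetical helpers, labels are discarded, and it assumes the cache gauge is reported as a 0–1 fraction.

```typescript
// Parse Prometheus text-format samples into name -> value.
// Simplification: labels are dropped and a later sample with the same name
// overwrites an earlier one, so this suits single-series gauges only.
function parseGauges(body: string): Map<string, number> {
  const out = new Map<string, number>();
  for (const line of body.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.charAt(0) === "#") continue; // HELP/TYPE comments
    const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)/.exec(trimmed);
    if (!match) continue;
    const name = match[1];
    const raw = match[3];
    if (name === undefined || raw === undefined) continue;
    const value = Number(raw);
    if (isNaN(value)) continue;
    out.set(name, value);
  }
  return out;
}

// True when KV cache usage exceeds the alert threshold (fraction, assumed 0..1).
function kvCachePressure(body: string, threshold = 0.85): boolean {
  const usage = parseGauges(body).get("vllm:gpu_cache_usage_perc");
  return usage !== undefined && usage > threshold;
}
```

Feed it the body of a `GET /metrics` response. Histograms such as the TTFT metric expose `_bucket`, `_sum`, and `_count` series and need a real Prometheus query to turn into percentiles.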
Attribute cost per feature with request tags (via OpenTelemetry or your API gateway):
{
  "feature": "support-intent-classifier",
  "inference.backend": "vllm",
  "model": "llama-3.1-8b",
  "gpu_seconds": 0.042,
  "input_tokens": 312,
  "output_tokens": 18
}

Convert GPU-seconds to dollars: (GPU hourly rate / 3600) × gpu_seconds. Compare to Narev cloud pricing for the same token counts.
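The conversion and the per-feature rollup fit in a few lines. The helper names below are hypothetical, the tag shape mirrors the example fields above, and all rates are placeholders you supply.

```typescript
interface RequestTag {
  feature: string;
  gpu_seconds: number;
  input_tokens: number;
  output_tokens: number;
}

// (GPU hourly rate / 3600) × gpu_seconds
function gpuCostUsd(gpuHourlyUsd: number, gpuSeconds: number): number {
  return (gpuHourlyUsd / 3600) * gpuSeconds;
}

// What the same tokens would cost at per-million-token cloud prices.
function cloudCostUsd(tag: RequestTag, inputUsdPerMillion: number, outputUsdPerMillion: number): number {
  return (tag.input_tokens * inputUsdPerMillion + tag.output_tokens * outputUsdPerMillion) / 1e6;
}

// Total self-hosted spend per feature tag.
function costByFeature(tags: RequestTag[], gpuHourlyUsd: number): Map<string, number> {
  const totals = new Map<string, number>();
  for (const tag of tags) {
    const prior = totals.get(tag.feature) ?? 0;
    totals.set(tag.feature, prior + gpuCostUsd(gpuHourlyUsd, tag.gpu_seconds));
  }
  return totals;
}
```

Because requests share a batch, per-request `gpu_seconds` is an allocation rather than a direct measurement. Reconcile the per-feature totals against the actual GPU bill for the same period.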
Production deployment patterns
Single-node serving
Fine for internal tools and moderate traffic. Put a reverse proxy (nginx, Envoy) in front for TLS, rate limits, and auth.
Kubernetes
Use the official Helm chart or custom deployments with:
- HPA on GPU metrics: scale replicas when queue depth or latency exceeds SLO
- Readiness probes: GET /health before routing traffic
- Model artifact caching: init containers or PVCs for large weights
Multi-model routing
Run separate vLLM deployments per model tier. Your router maps capabilities to endpoints — same pattern as Model routing, but endpoints are URLs instead of provider model IDs.
classify-intent → vllm-8b.internal:8000
complex-reason → cloud-mid-tier API (escalation logged)

Hybrid cloud fallback
vLLM does not eliminate cloud APIs — it reduces the volume that needs frontier quality.
Request → vLLM (8B instruct)
↓ confidence / eval score passes → return
↓ fails → cloud mid-tier or frontier (logged)

Log every escalation with the cheaper attempt, failure reason, and cost delta. Unlogged escalations hide whether self-hosting actually works.
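The fallback flow and its log record might look like the sketch below. Everything here is hypothetical scaffolding: `Backend` and `Judge` are injected functions standing in for the vLLM client, the cloud client, and whatever confidence or eval check you use.

```typescript
interface Attempt {
  text: string;
  costUsd: number;
}

interface Escalation {
  cheapAnswer: string;  // what the self-hosted model returned
  reason: string;       // why the check rejected it
  costDeltaUsd: number; // expensive attempt cost minus cheap attempt cost
}

type Backend = (prompt: string) => Promise<Attempt>;
type Judge = (answer: string) => string | null; // null = pass, otherwise the failure reason

async function answerWithFallback(
  prompt: string,
  cheap: Backend,
  expensive: Backend,
  judge: Judge,
  log: Escalation[],
): Promise<Attempt> {
  const first = await cheap(prompt);
  const reason = judge(first.text);
  if (reason === null) return first; // self-hosted answer is good enough

  const second = await expensive(prompt);
  // Record every escalation so the self-hosted hit rate stays measurable.
  log.push({
    cheapAnswer: first.text,
    reason,
    costDeltaUsd: second.costUsd - first.costUsd,
  });
  return second;
}
```

In production the `log` array would be a structured logger or trace span. The escalation rate per feature is the number to watch: a rate that climbs means the cheap tier is no longer passing its eval.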
Guardrails
| Tier | Enforcement |
|---|---|
| Soft | Prometheus alerts on P95 latency and GPU cache >85% |
| Medium | API gateway rate limits per API key / feature tag |
| Hard | max_tokens enforced at gateway; CI blocks unbounded agent configs |
Align with Article IV: fiscal ceilings — GPU-hour budgets are inference budgets.
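The hard tier can be as small as a clamp applied to every request body at the gateway. A minimal sketch follows, with hypothetical names and ceiling values; your gateway's plugin or middleware API will differ.

```typescript
interface ChatRequest {
  model: string;
  max_tokens?: number;
  [key: string]: unknown; // other OpenAI-compatible fields pass through untouched
}

// Hypothetical per-feature ceilings; tune per workload.
const MAX_TOKENS_BY_FEATURE: Record<string, number> = {
  "support-intent-classifier": 64,
};
const DEFAULT_MAX_TOKENS = 256;

// Return a copy of the request whose max_tokens never exceeds the ceiling.
// Requests that omit max_tokens get the ceiling instead of running unbounded.
function enforceMaxTokens(req: ChatRequest, feature: string): ChatRequest {
  const ceiling = MAX_TOKENS_BY_FEATURE[feature] ?? DEFAULT_MAX_TOKENS;
  const requested = req.max_tokens ?? ceiling;
  return { ...req, max_tokens: Math.max(1, Math.min(requested, ceiling)) };
}
```

Clamping silently keeps callers working; rejecting over-limit requests with a 4xx is the stricter alternative when you want teams to notice their configs.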
Anti-patterns
| Anti-pattern | Why it fails |
|---|---|
| Deploying vLLM for 10 requests/day | Ops overhead exceeds API cost |
| No quantization on memory-bound GPUs | OOM or tiny batch sizes |
| Serving 70B when 8B passes eval | 5–10× hardware for marginal quality gain |
| Ignoring prefix caching for RAG with stable system prompts | Recomputes KV cache every request |
| Skipping auth on the OpenAI endpoint | Anyone who finds the endpoint runs inference on your GPU at your expense |
Troubleshooting
OOM on startup — reduce --max-model-len, enable quantization, or add GPUs with tensor parallelism.
High TTFT, low throughput — batch size too small; increase --max-num-seqs until latency breaks SLO.
Quality regression vs cloud — eval set too small or wrong; expand benchmark before blaming the model.
Prefix cache not hitting — dynamic content in system prompt prefix; move static content first per Prompt caching.
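To find where two prompts diverge, compare their leading characters. This is a diagnostic proxy only: the helpers below are hypothetical, and they compare characters, whereas the cache matches on token blocks, so treat the result as an upper bound on reuse.

```typescript
// Length of the shared leading run of two prompts, in characters.
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a.charCodeAt(i) === b.charCodeAt(i)) i++;
  return i;
}

// Fraction of the shorter prompt that could be served from a cached prefix.
function prefixReuseRatio(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  return limit === 0 ? 0 : sharedPrefixLength(a, b) / limit;
}
```

Run it on two consecutive rendered prompts from the same feature. A ratio near zero usually means a timestamp, request ID, or user name sits ahead of the static instructions.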
vLLM vs alternatives
| Need | Better choice |
|---|---|
| Pull and chat in 60 seconds | Ollama |
| Hugging Face Hub + enterprise K8s | Text Generation Inference |
| CPU-only or Apple Silicon laptop | llama.cpp |
| One OpenAI URL over many backends | LocalAI |
| Maximum CUDA throughput, open weights | vLLM |
Related
- Self-hosting — cost model and workload fit
- Model routing — capability-based endpoint selection
- Prompt caching — prefix stability for KV reuse
- vLLM documentation