
vLLM

vLLM is a production inference engine built for throughput. Its PagedAttention memory manager and continuous batching let one GPU serve many concurrent requests efficiently. It is the stack teams reach for when Ollama’s single-user ergonomics stop scaling.

This guide covers when vLLM beats cloud APIs, how to size GPUs, and how to instrument serving for cost per outcome.

Expected impact

Scenario | Typical outcome
--- | ---
High QPS on a fixed model (support bot, classification) | 70–95% cost reduction vs. the equivalent cloud API at >60% GPU utilization
Low QPS with idle GPU | Negative: depreciation exceeds API spend
Replacing batch jobs that already use cloud batch APIs | Benchmark required; savings depend on job parallelism

vLLM pays off when utilization is high and model choice is stable. It does not pay off for sporadic dev experiments — use Ollama instead.
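The utilization threshold can be estimated before buying hardware. The sketch below is a rough model, not a pricing tool: it assumes cost scales linearly with utilization, and every name and number in it (`gpuHourlyUsd`, `peakTokensPerSecond`, the API price) is a hypothetical placeholder to replace with measured values.

```typescript
// Rough break-even model. All inputs are assumptions you must measure yourself.
interface SelfHostInputs {
  gpuHourlyUsd: number;        // amortized hardware or rental cost per GPU-hour
  peakTokensPerSecond: number; // measured throughput at full load
  utilization: number;         // fraction of each hour spent serving traffic, 0..1
}

// Self-hosted cost per million tokens at a given utilization.
function selfHostedUsdPerMillionTokens(inputs: SelfHostInputs): number {
  const tokensPerHour = inputs.peakTokensPerSecond * 3600 * inputs.utilization;
  if (tokensPerHour <= 0) return Infinity; // an idle GPU still costs money
  return (inputs.gpuHourlyUsd / tokensPerHour) * 1e6;
}

// Utilization at which self-hosting costs the same as the cloud API.
// A result above 1 means the GPU cannot beat the API even at full load.
function breakEvenUtilization(
  gpuHourlyUsd: number,
  peakTokensPerSecond: number,
  apiUsdPerMillionTokens: number,
): number {
  const tokensPerHourAtFullLoad = peakTokensPerSecond * 3600;
  return (gpuHourlyUsd * 1e6) / (tokensPerHourAtFullLoad * apiUsdPerMillionTokens);
}
```

If the break-even utilization comes out higher than what your traffic can sustain, that is the "low QPS with idle GPU" row in the table above.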

Architecture overview

Client (OpenAI SDK, LangChain, etc.)
  ↓ HTTP
vLLM OpenAI-compatible server (:8000)
  PagedAttention + continuous batching
GPU(s): CUDA, ROCm, or TPU

Key mechanisms:

  • Continuous batching — new requests join an in-flight batch without waiting for the entire batch to finish
  • PagedAttention — KV cache stored in non-contiguous pages, reducing memory fragmentation
  • Tensor parallelism — split large models across multiple GPUs

Quick start

    pip install vllm

    # OpenAI-compatible server
    vllm serve meta-llama/Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 8000 \
      --max-model-len 8192

    import OpenAI from "openai";

    // Point the standard OpenAI SDK at the local vLLM server.
    const client = new OpenAI({
      baseURL: "http://localhost:8000/v1",
      apiKey: "vllm", // placeholder; only checked if the server was started with --api-key
    });

    const response = await client.chat.completions.create({
      model: "meta-llama/Llama-3.1-8B-Instruct",
      messages: [{ role: "user", content: "Classify intent: refund request" }],
      max_tokens: 64,
    });

GPU sizing

Undersized GPUs cause OOM; oversized GPUs idle. Start from model weights + KV cache headroom:

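Model size | Minimum VRAM (FP16) | Practical note
--- | --- | ---
7B–8B | 16 GB | Good single-GPU serving entry point; FP16 weights alone take roughly 14–16 GB, so 24 GB leaves real KV cache headroom
13B–14B | 32 GB (24 GB only with quantization) | Quantization (AWQ, GPTQ) reduces the requirement
70B | About 140 GB for weights: 2× 80 GB or 4× 40 GB | Tensor parallelism required; 2× 40 GB or 1× 80 GB fits only a 4-bit quantized model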
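The "weights + KV cache headroom" rule can be turned into arithmetic. The following is a back-of-the-envelope sketch, not a profiler: it ignores activations, CUDA context, and allocator overhead, so real usage will be higher. The `ModelShape` type and function names are illustrative, and the Llama 3.1 8B shape values are taken from its published config as an assumption to verify against the model card.

```typescript
// Back-of-the-envelope VRAM estimate: weights plus KV cache only.
interface ModelShape {
  parameters: number;    // total parameter count
  layers: number;        // transformer layers
  kvHeads: number;       // key/value heads (fewer than query heads under GQA)
  headDim: number;       // dimension per attention head
  bytesPerValue: number; // 2 for FP16/BF16
}

const GIB = 1024 * 1024 * 1024;

function weightBytes(m: ModelShape): number {
  return m.parameters * m.bytesPerValue;
}

// One K and one V vector per KV head, per layer, per cached token.
function kvBytesPerToken(m: ModelShape): number {
  return 2 * m.layers * m.kvHeads * m.headDim * m.bytesPerValue;
}

// totalCachedTokens = concurrent sequences x tokens held per sequence.
function estimateVramGiB(m: ModelShape, totalCachedTokens: number): number {
  return (weightBytes(m) + kvBytesPerToken(m) * totalCachedTokens) / GIB;
}

// Assumed shape for Llama 3.1 8B; the parameter count is approximate.
const llama8b: ModelShape = {
  parameters: 8.03e9,
  layers: 32,
  kvHeads: 8,
  headDim: 128,
  bytesPerValue: 2,
};

// Example: 8 concurrent sequences at the 8192-token cap from the quick start.
const estimatedGiB = estimateVramGiB(llama8b, 8 * 8192);
console.log(`~${estimatedGiB.toFixed(1)} GiB before runtime overhead`);
```

With these assumed values each cached token costs 131072 bytes (128 KiB), which is why the context cap and concurrency matter as much as the weights.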

Use quantization when quality benchmarks pass:

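    # --quantization awq expects a checkpoint that was already quantized with AWQ;
    # point the model argument at an AWQ build of the model you evaluated.
    vllm serve meta-llama/Llama-3.1-8B-Instruct \
      --quantization awq \
      --dtype auto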

Always benchmark quantized vs full precision on your eval set before cutting cloud traffic.

Throughput tuning

Knob | Effect
--- | ---
--max-num-seqs | Max concurrent sequences; raise until the latency SLO breaks
--gpu-memory-utilization | Fraction of VRAM vLLM may use for weights, activations, and KV cache (default 0.9)
--max-model-len | Caps context length; lowers KV memory and improves throughput
--enable-prefix-caching | Reuses KV blocks for identical prefixes (like provider prompt caching)

Prohibited: setting --max-model-len to 128K “just in case” on every deployment.

Required: cap context per workload profile; RAG pipelines trim retrieval before the model sees it.
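One way to enforce a per-workload cap on the client side is to trim retrieved chunks to a token budget before building the prompt. This is a minimal sketch under loud assumptions: `approxTokens` is a 4-characters-per-token heuristic rather than a real tokenizer, and the budget table and workload names are hypothetical.

```typescript
// Heuristic only: roughly 4 characters per token for English text.
// Swap in your model's tokenizer for accurate counts.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface RetrievedChunk {
  text: string;
  score: number; // retrieval relevance, higher is better
}

// Hypothetical per-workload context budgets, in tokens.
const CONTEXT_BUDGETS: Record<string, number> = {
  "support-intent-classifier": 512,
  "rag-answer": 4096,
};

// Keep the highest-scoring chunks that fit inside the budget.
function trimToBudget(chunks: RetrievedChunk[], budgetTokens: number): RetrievedChunk[] {
  const ranked = chunks.slice().sort((a, b) => b.score - a.score);
  const kept: RetrievedChunk[] = [];
  let used = 0;
  for (const chunk of ranked) {
    const cost = approxTokens(chunk.text);
    if (used + cost > budgetTokens) continue; // skip chunks that would overflow
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```

Trimming before the request keeps the server-side `--max-model-len` small, since no workload ever needs the full window.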

Metering and observability

vLLM exposes Prometheus metrics at /metrics. Track:

Metric | Use
--- | ---
vllm:gpu_cache_usage_perc | KV cache pressure; scale or trim context when sustained high
vllm:num_requests_running | Concurrency; compare to SLO
vllm:time_to_first_token_seconds | TTFT for streaming UX
vllm:time_per_output_token_seconds | Generation speed
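For a quick check without a full Prometheus setup, the `/metrics` response can be parsed directly, since it uses the Prometheus text exposition format. The parser below is a simplified sketch: `readMetrics` is a hypothetical helper, it sums samples across label sets, and it assumes label values contain no whitespace. Histogram metrics such as TTFT are exposed as `_sum` and `_count` series, so request those names to compute a mean.

```typescript
// Sums the samples for each requested metric name in Prometheus text format.
// Simplification: assumes label values contain no whitespace.
function readMetrics(body: string, wanted: string[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const line of body.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.charAt(0) === "#") continue; // skip HELP/TYPE lines
    // Sample line shape: name{labels} value [timestamp]
    const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)/.exec(trimmed);
    if (match === null) continue;
    const metric = match[1] ?? "";
    const value = Number(match[3]);
    if (wanted.indexOf(metric) === -1 || Number.isNaN(value)) continue;
    totals.set(metric, (totals.get(metric) ?? 0) + value);
  }
  return totals;
}

// Illustrative payload, not captured from a real server.
const sampleMetricsBody = [
  "# TYPE vllm:num_requests_running gauge",
  'vllm:num_requests_running{model_name="demo"} 3',
  "# TYPE vllm:gpu_cache_usage_perc gauge",
  'vllm:gpu_cache_usage_perc{model_name="demo"} 0.42',
].join("\n");

const parsedMetrics = readMetrics(sampleMetricsBody, [
  "vllm:num_requests_running",
  "vllm:gpu_cache_usage_perc",
]);
console.log(parsedMetrics.get("vllm:gpu_cache_usage_perc"));

// Against a live server (assumes the quick-start host and port):
//   const body = await (await fetch("http://localhost:8000/metrics")).text();
```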

Attribute cost per feature with request tags (via OpenTelemetry or your API gateway):

    {
      "feature": "support-intent-classifier",
      "inference.backend": "vllm",
      "model": "llama-3.1-8b",
      "gpu_seconds": 0.042,
      "input_tokens": 312,
      "output_tokens": 18
    }

Convert GPU-seconds to dollars: (GPU hourly rate / 3600) × gpu_seconds. Compare to Narev cloud pricing for the same token counts.
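The conversion and the comparison can be written as two small functions. This is a sketch using the tag fields from the example above; the function names are illustrative, and the GPU hourly rate and per-million-token cloud prices in the usage lines are hypothetical placeholders, not quoted prices.

```typescript
// (GPU hourly rate / 3600) x gpu_seconds, as in the formula above.
function gpuCostUsd(gpuHourlyRateUsd: number, gpuSeconds: number): number {
  return (gpuHourlyRateUsd / 3600) * gpuSeconds;
}

// Cloud cost for the same token counts, priced per million tokens.
function cloudCostUsd(
  inputTokens: number,
  outputTokens: number,
  inputUsdPerMillion: number,
  outputUsdPerMillion: number,
): number {
  return (inputTokens * inputUsdPerMillion + outputTokens * outputUsdPerMillion) / 1e6;
}

// Values from the tagged request above; rates below are made up for illustration.
const selfHostedUsd = gpuCostUsd(2.5, 0.042);
const cloudUsd = cloudCostUsd(312, 18, 0.15, 0.6);
console.log({ selfHostedUsd, cloudUsd, savingsUsd: cloudUsd - selfHostedUsd });
```

Note that `gpu_seconds` only captures time attributed to the request; idle GPU time still has to be amortized across requests for the comparison to be fair.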

Production deployment patterns

Single-node serving

Fine for internal tools and moderate traffic. Put a reverse proxy (nginx, Envoy) in front for TLS, rate limits, and auth.

Kubernetes

Use the official Helm chart or custom deployments with:

  • HPA on GPU metrics: scale replicas when queue depth or latency exceeds SLO
  • Readiness probes: GET /health before routing traffic
  • Model artifact caching: init containers or PVCs for large weights

Multi-model routing

Run separate vLLM deployments per model tier. Your router maps capabilities to endpoints — same pattern as Model routing, but endpoints are URLs instead of provider model IDs.

classify-intent → vllm-8b.internal:8000
complex-reason → cloud-mid-tier API (escalation logged)
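In code, the routing table above might look like the following. This is an illustrative sketch: the `RouteTarget` shape, the cloud base URL, and the cloud model ID are hypothetical placeholders; only the capability names and the vLLM host come from the mapping above.

```typescript
type Capability = "classify-intent" | "complex-reason";

interface RouteTarget {
  kind: "vllm" | "cloud";
  baseURL: string;        // OpenAI-compatible base URL
  model: string;
  logEscalation: boolean; // true when the route leaves self-hosted capacity
}

const ROUTES: Record<Capability, RouteTarget> = {
  "classify-intent": {
    kind: "vllm",
    baseURL: "http://vllm-8b.internal:8000/v1",
    model: "meta-llama/Llama-3.1-8B-Instruct",
    logEscalation: false,
  },
  "complex-reason": {
    kind: "cloud",
    baseURL: "https://cloud-mid-tier.example/v1", // placeholder URL
    model: "cloud-mid-tier",                      // placeholder model ID
    logEscalation: true,
  },
};

// Callers ask for a capability, never for a specific endpoint.
function resolveRoute(capability: Capability): RouteTarget {
  return ROUTES[capability];
}
```

Keeping the table typed by `Capability` means adding a new capability without a route fails at compile time rather than at request time.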

Hybrid cloud fallback

vLLM does not eliminate cloud APIs — it reduces the volume that needs frontier quality.

Request → vLLM (8B instruct)
  ↓ confidence / eval score passes → return
  ↓ fails → cloud mid-tier or frontier (logged)

Log every escalation with the cheaper attempt, failure reason, and cost delta. Unlogged escalations hide whether self-hosting actually works.
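A log record carrying those three fields could be built as shown below. This is a sketch with hypothetical names (`AttemptResult`, `EscalationRecord`) and it assumes a single numeric score decides escalation; a real pipeline may combine several checks.

```typescript
interface AttemptResult {
  backend: string; // e.g. "vllm" or a cloud provider name
  model: string;
  score: number;   // confidence or eval score, higher is better
  costUsd: number;
}

interface EscalationRecord {
  cheaperAttempt: AttemptResult; // what the self-hosted model produced
  failureReason: string;
  escalatedTo: string;
  costDeltaUsd: number;          // extra spend caused by escalating
}

// Escalate only when the cheaper attempt misses the quality bar.
function shouldEscalate(local: AttemptResult, minScore: number): boolean {
  return local.score < minScore;
}

function buildEscalationRecord(
  local: AttemptResult,
  cloud: AttemptResult,
  minScore: number,
): EscalationRecord {
  return {
    cheaperAttempt: local,
    failureReason: `score ${local.score} below threshold ${minScore}`,
    escalatedTo: `${cloud.backend}/${cloud.model}`,
    costDeltaUsd: cloud.costUsd - local.costUsd,
  };
}
```

Summing `costDeltaUsd` per feature over a week shows directly whether the self-hosted tier is absorbing enough traffic to justify its GPUs.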

Guardrails

Tier | Enforcement
--- | ---
Soft | Prometheus alerts on P95 latency and GPU cache >85%
Medium | API gateway rate limits per API key / feature tag
Hard | max_tokens enforced at gateway; CI blocks unbounded agent configs
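The hard tier can be as simple as clamping `max_tokens` before the request is forwarded. The following is a minimal sketch of that gateway step; the ceiling table, default, and function name are hypothetical, and a real gateway would also handle auth and rate limits.

```typescript
interface ChatCompletionBody {
  model: string;
  messages: unknown[];
  max_tokens?: number;
}

// Hypothetical per-feature ceilings; unknown features fall back to the default.
const MAX_TOKENS_CEILING: Record<string, number> = {
  "support-intent-classifier": 64,
};
const DEFAULT_MAX_TOKENS_CEILING = 512;

// Returns a copy of the body with max_tokens set and never above the ceiling.
function enforceMaxTokens(body: ChatCompletionBody, featureTag: string): ChatCompletionBody {
  const ceiling = MAX_TOKENS_CEILING[featureTag] ?? DEFAULT_MAX_TOKENS_CEILING;
  const requested = body.max_tokens;
  const capped = requested === undefined ? ceiling : Math.min(requested, ceiling);
  return { ...body, max_tokens: capped };
}
```

Requests that omit `max_tokens` get the ceiling rather than the server default, so no request reaches vLLM unbounded.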

Align with Article IV: fiscal ceilings — GPU-hour budgets are inference budgets.

Anti-patterns

Anti-pattern | Why it fails
--- | ---
Deploying vLLM for 10 requests/day | Ops overhead exceeds API cost
No quantization on memory-bound GPUs | OOM or tiny batch sizes
Serving 70B when 8B passes eval | 5–10× hardware for marginal quality gain
Ignoring prefix caching for RAG with stable system prompts | Recomputes KV cache every request
Skipping auth on the OpenAI endpoint | Your GPU becomes a public mine

Troubleshooting

OOM on startup — reduce --max-model-len, enable quantization, or add GPUs with tensor parallelism.

High TTFT, low throughput — batch size too small; increase --max-num-seqs until latency breaks SLO.

Quality regression vs cloud — eval set too small or wrong; expand benchmark before blaming the model.

Prefix cache not hitting — dynamic content in system prompt prefix; move static content first per Prompt caching.
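The fix for a missing prefix cache is usually a matter of message ordering. The sketch below assumes a RAG-style request and uses hypothetical names; the point it illustrates is that everything that varies per request (IDs, retrieved context, the question) comes after a byte-identical system prompt.

```typescript
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Static text only: no timestamps, request IDs, or user names in here.
const STATIC_SYSTEM_PROMPT =
  "You are a support assistant. Answer only from the provided context.";

function buildMessages(
  retrievedContext: string,
  question: string,
  requestId: string,
): ChatMessage[] {
  return [
    // Identical across requests, so its KV blocks can be reused.
    { role: "system", content: STATIC_SYSTEM_PROMPT },
    // All per-request content goes after the shared prefix.
    {
      role: "user",
      content: `Request ${requestId}\nContext:\n${retrievedContext}\n\nQuestion: ${question}`,
    },
  ];
}
```

A single dynamic token near the start of the prompt changes every block after it, so even a small per-request header belongs in the user message.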

vLLM vs alternatives

Need | Better choice
--- | ---
Pull and chat in 60 seconds | Ollama
Hugging Face Hub + enterprise K8s | Text Generation Inference
CPU-only or Apple Silicon laptop | llama.cpp
One OpenAI URL over many backends | LocalAI
Maximum CUDA throughput, open weights | vLLM