vLLM – Tokenminning

vLLM is a production inference engine built for throughput. Its PagedAttention memory manager and continuous batching let one GPU serve many concurrent requests efficiently — the stack teams reach for when Ollama’s single-user ergonomics stop scaling.

This guide covers when vLLM beats cloud APIs, how to size GPUs, and how to instrument serving for cost per outcome.

Expected impact

Scenario	Typical outcome
High QPS on a fixed model (support bot, classification)	70–95% savings vs. an equivalent cloud API at >60% GPU utilization
Low QPS with idle GPU	Negative — depreciation exceeds API spend
Replacing batch jobs that already use cloud batch APIs	Benchmark required; savings depend on job parallelism

vLLM pays off when utilization is high and model choice is stable. It does not pay off for sporadic dev experiments — use Ollama instead.

Architecture overview


Client (OpenAI SDK, LangChain, etc.)
        ↓ HTTP
vLLM OpenAI-compatible server (:8000)
        ↓
PagedAttention + continuous batching
        ↓
Accelerator(s): NVIDIA GPUs (CUDA), AMD GPUs (ROCm), or TPUs

Key mechanisms:

Continuous batching — new requests join an in-flight batch without waiting for the entire batch to finish
PagedAttention — KV cache stored in non-contiguous pages, reducing memory fragmentation
Tensor parallelism — split large models across multiple GPUs
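
The first two mechanisms can be illustrated with a toy allocator. This is a minimal sketch of the bookkeeping idea only, not vLLM's implementation: the class name, method names, and the 16-token block size are all assumptions made for the example.

```typescript
// Toy model of paged KV-cache allocation. Illustrative only: this is not
// vLLM's code, and BLOCK_SIZE and all names here are assumptions.
const BLOCK_SIZE = 16; // tokens per KV block (assumed for the example)

class PagedKvCache {
  private freeBlocks: number[];
  private blockTables = new Map<string, number[]>(); // seqId -> physical block ids
  private tokenCounts = new Map<string, number>();

  constructor(totalBlocks: number) {
    this.freeBlocks = Array.from({ length: totalBlocks }, (_, i) => i);
  }

  // Append one token; take a new block only when the current one is full.
  // Blocks need not be contiguous, so no large up-front reservation is needed.
  appendToken(seqId: string): boolean {
    const count = this.tokenCounts.get(seqId) ?? 0;
    const table = this.blockTables.get(seqId) ?? [];
    if (count % BLOCK_SIZE === 0) {
      const block = this.freeBlocks.pop();
      if (block === undefined) return false; // cache full: caller must queue or preempt
      table.push(block);
      this.blockTables.set(seqId, table);
    }
    this.tokenCounts.set(seqId, count + 1);
    return true;
  }

  // A finished sequence returns its blocks right away, so a waiting request
  // can join the running batch without waiting for the other sequences.
  release(seqId: string): void {
    const table = this.blockTables.get(seqId) ?? [];
    this.freeBlocks.push(...table);
    this.blockTables.delete(seqId);
    this.tokenCounts.delete(seqId);
  }

  blocksUsedBy(seqId: string): number {
    return this.blockTables.get(seqId)?.length ?? 0;
  }

  get freeBlockCount(): number {
    return this.freeBlocks.length;
  }
}
```

In this sketch a 17-token sequence holds two blocks rather than a full-context reservation, which is the property that lets many sequences share one GPU.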

Quick start


pip install vllm
 
# OpenAI-compatible server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192


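import OpenAI from "openai";

// Point the standard OpenAI SDK at the local vLLM server.
const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: "vllm", // placeholder; only checked if the server was started with --api-key
});

const response = await client.chat.completions.create({
  model: "meta-llama/Llama-3.1-8B-Instruct", // must match the served model name
  messages: [{ role: "user", content: "Classify intent: refund request" }],
  max_tokens: 64, // bound output length on every request
});

console.log(response.choices[0].message.content);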

GPU sizing

Undersized GPUs cause OOM; oversized GPUs idle. Start from model weights + KV cache headroom:

Model size	Minimum VRAM (FP16)	Practical note
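7B–8B	16–24 GB	Good single-GPU serving entry point; 8B weights alone are about 16 GB, so 24 GB leaves room for KV cache
13B–14B	32 GB+	Weights are about 26–28 GB at FP16; quantization (AWQ, GPTQ) brings them within reach of 24 GB cards
70B	2× 80 GB+	Weights are about 140 GB at FP16, so tensor parallelism is required; 2× 40 GB or 1× 80 GB works only with 4-bit quantization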
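
The "weights + KV cache headroom" rule can be turned into rough arithmetic. The sketch below is a back-of-envelope estimate under stated assumptions: the interface and function names are hypothetical, the model shape is taken from the published Llama-3.1-8B configuration, and activations and runtime overhead are ignored, so real capacity will be lower.

```typescript
// Back-of-envelope VRAM estimate. All names are hypothetical; read layer and
// head counts from the model's config.json rather than trusting these values.
interface ModelShape {
  params: number;        // total parameter count
  layers: number;        // transformer layers
  kvHeads: number;       // key/value heads (fewer than attention heads under GQA)
  headDim: number;       // dimension per head
  bytesPerValue: number; // 2 for FP16/BF16
}

const GIB = 1024 ** 3;

function weightBytes(m: ModelShape): number {
  return m.params * m.bytesPerValue;
}

// K and V are each stored per layer, per KV head, per token.
function kvBytesPerToken(m: ModelShape): number {
  return 2 * m.layers * m.kvHeads * m.headDim * m.bytesPerValue;
}

// Upper bound on full-length sequences that fit in the memory left after weights.
function maxFullLengthSeqs(
  m: ModelShape,
  vramGiB: number,
  memUtil: number,     // same meaning as --gpu-memory-utilization
  maxModelLen: number, // same meaning as --max-model-len
): number {
  const budget = vramGiB * GIB * memUtil - weightBytes(m);
  if (budget <= 0) return 0; // weights alone do not fit
  return Math.floor(budget / (kvBytesPerToken(m) * maxModelLen));
}

// Llama-3.1-8B-style shape (assumed from the published config).
const llama8b: ModelShape = {
  params: 8.03e9,
  layers: 32,
  kvHeads: 8,
  headDim: 128,
  bytesPerValue: 2,
};
```

With these assumed numbers, each token costs 128 KiB of KV cache, and `maxFullLengthSeqs(llama8b, 24, 0.9, 8192)` comes out to 6. Halving `maxModelLen` doubles that bound, which is why capping context length is the cheapest throughput lever.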

Use quantization when quality benchmarks pass:


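# --quantization awq expects a checkpoint that was already quantized with AWQ;
# the model path below is a placeholder for one.
vllm serve <org>/<llama-3.1-8b-instruct-awq-checkpoint> \
  --quantization awq \
  --dtype auto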

Always benchmark quantized vs full precision on your eval set before cutting cloud traffic.

Throughput tuning

Knob	Effect
`--max-num-seqs`	Max concurrent sequences; raise until latency SLO breaks
`--gpu-memory-utilization`	Fraction of VRAM the server may use for weights, activations, and KV cache (default 0.9); the KV cache gets whatever remains after weights
`--max-model-len`	Cap context length — lowers KV memory, improves throughput
`--enable-prefix-caching`	Reuse KV blocks for identical prefixes (like provider prompt caching)

Prohibited: setting --max-model-len to 128K “just in case” on every deployment.

Required: cap context per workload profile; RAG pipelines trim retrieval before the model sees it.
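
One way a RAG pipeline might trim retrieval to a per-workload budget is sketched below. The names are hypothetical, and `approxTokens` uses a crude four-characters-per-token assumption; a real pipeline would count with the served model's tokenizer.

```typescript
// Sketch: keep the highest-scoring retrieved chunks that fit a token budget.
// approxTokens is a rough 4-characters-per-token assumption, not a tokenizer.
interface Chunk {
  text: string;
  score: number; // retrieval relevance, higher is better
}

function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function trimToBudget(chunks: Chunk[], budgetTokens: number): Chunk[] {
  const kept: Chunk[] = [];
  let used = 0;
  // Consider the most relevant chunks first; skip any that would overflow.
  for (const chunk of [...chunks].sort((a, b) => b.score - a.score)) {
    const cost = approxTokens(chunk.text);
    if (used + cost > budgetTokens) continue;
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```

Setting the budget a little below the deployment's `--max-model-len`, minus the system prompt and `max_tokens`, keeps requests from being rejected for exceeding the context cap.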

Metering and observability

vLLM exposes Prometheus metrics at /metrics. Metric names have changed between releases, so confirm them against your server's own output. Track:

Metric	Use
`vllm:gpu_cache_usage_perc`	KV cache pressure — scale or trim context when sustained high
`vllm:num_requests_running`	Concurrency — compare to SLO
`vllm:time_to_first_token_seconds`	TTFT for streaming UX
`vllm:time_per_output_token_seconds`	Generation speed
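
For quick checks without a full Prometheus stack, a gauge can be read straight from the text output. The helper below is a minimal sketch with a hypothetical name; it assumes samples carry no trailing timestamp and returns the first matching series only.

```typescript
// Sketch: pull one gauge value out of Prometheus text exposition output.
// readGauge is a hypothetical helper; metric names may differ by vLLM version.
function readGauge(metricsText: string, name: string): number | undefined {
  for (const line of metricsText.split("\n")) {
    if (line.startsWith("#")) continue; // skip HELP/TYPE comment lines
    // Match `name{labels} value` or `name value`, but not longer metric names.
    if (!(line.startsWith(name + "{") || line.startsWith(name + " "))) continue;
    // Assumes the value is the last whitespace-separated field (no timestamp).
    const value = Number(line.trim().split(/\s+/).pop());
    if (!Number.isNaN(value)) return value;
  }
  return undefined;
}
```

A typical call would pass the body of `GET /metrics` and a name such as `vllm:gpu_cache_usage_perc`, then alert when the value stays above the chosen threshold.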

Attribute cost per feature with request tags (via OpenTelemetry or your API gateway):


{
  "feature": "support-intent-classifier",
  "inference.backend": "vllm",
  "model": "llama-3.1-8b",
  "gpu_seconds": 0.042,
  "input_tokens": 312,
  "output_tokens": 18
}

Convert GPU-seconds to dollars: (GPU hourly rate / 3600) × gpu_seconds. Compare to Narev cloud pricing for the same token counts.
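
The conversion and comparison can be sketched as two small functions. The hourly rate and per-million-token prices passed in are placeholders to be replaced with real figures; the function and interface names are hypothetical.

```typescript
// Sketch of the cost comparison described above. All prices are caller-supplied
// placeholders; nothing here reflects a real provider's rates.
interface RequestTag {
  feature: string;
  gpu_seconds: number;
  input_tokens: number;
  output_tokens: number;
}

// (GPU hourly rate / 3600) × gpu_seconds
function gpuCostUsd(gpuSeconds: number, gpuHourlyRateUsd: number): number {
  return (gpuHourlyRateUsd / 3600) * gpuSeconds;
}

// What the same token counts would cost at per-million-token API prices.
function apiCostUsd(tag: RequestTag, inputUsdPerM: number, outputUsdPerM: number): number {
  return (tag.input_tokens * inputUsdPerM + tag.output_tokens * outputUsdPerM) / 1_000_000;
}

// Positive result means self-hosting was cheaper for this request.
function savingsUsd(
  tag: RequestTag,
  gpuHourlyRateUsd: number,
  inputUsdPerM: number,
  outputUsdPerM: number,
): number {
  return apiCostUsd(tag, inputUsdPerM, outputUsdPerM) - gpuCostUsd(tag.gpu_seconds, gpuHourlyRateUsd);
}
```

Note that `gpu_seconds` only reflects true cost when the GPU is busy; idle hours still accrue at the hourly rate, so aggregate per-feature totals against the full GPU bill before drawing conclusions.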

Production deployment patterns

Single-node serving

Fine for internal tools and moderate traffic. Put a reverse proxy (nginx, Envoy) in front for TLS, rate limits, and auth.

Kubernetes

Use the official Helm chart or custom deployments with:

HPA on GPU metrics — scale replicas when queue depth or latency exceeds SLO
Readiness probes — GET /health before routing traffic
Model artifact caching — init containers or PVCs for large weights

Multi-model routing

Run separate vLLM deployments per model tier. Your router maps capabilities to endpoints — same pattern as Model routing, but endpoints are URLs instead of provider model IDs.


classify-intent  → vllm-8b.internal:8000
complex-reason   → cloud-mid-tier API (escalation logged)
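
The mapping above could be expressed as a small lookup table. This is a sketch: the cloud URL and model names are placeholders, and the `selfHosted` flag is an assumption added so escalations to cloud endpoints can be logged.

```typescript
// Sketch of a capability-to-endpoint router. Hostnames and model names other
// than vllm-8b.internal:8000 are placeholders.
type Capability = "classify-intent" | "complex-reason";

interface Route {
  baseURL: string;
  model: string;
  selfHosted: boolean; // false means the call leaves the cluster and should be logged
}

const routes: Record<Capability, Route> = {
  "classify-intent": {
    baseURL: "http://vllm-8b.internal:8000/v1",
    model: "meta-llama/Llama-3.1-8B-Instruct",
    selfHosted: true,
  },
  "complex-reason": {
    baseURL: "https://cloud-provider.example/v1", // placeholder cloud endpoint
    model: "mid-tier-model",                      // placeholder model ID
    selfHosted: false,
  },
};

function resolveRoute(capability: Capability): Route {
  return routes[capability];
}
```

Because both kinds of endpoint speak the OpenAI-compatible API, the caller can build its client from `baseURL` and `model` without branching on the backend.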

Hybrid cloud fallback

vLLM does not eliminate cloud APIs — it reduces the volume that needs frontier quality.


Request → vLLM (8B instruct)
            ↓ confidence / eval score passes → return
            ↓ fails → cloud mid-tier or frontier (logged)

Log every escalation with the cheaper attempt, failure reason, and cost delta. Unlogged escalations hide whether self-hosting actually works.
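
A minimal sketch of that flow follows. The backends are injected as functions so the example stays independent of any SDK; the type names, the confidence field, and the threshold are assumptions, and how confidence is scored is left to the caller.

```typescript
// Sketch of cheap-first serving with logged escalation. All names are
// hypothetical; plug in real vLLM and cloud clients as the two backends.
interface Attempt {
  text: string;
  confidence: number; // score from your eval or classifier, 0..1
  costUsd: number;
}

interface EscalationLog {
  cheapAttempt: Attempt;
  reason: string;
  costDeltaUsd: number; // extra spend of the escalated call over the cheap one
}

type Backend = (prompt: string) => Promise<Attempt>;

async function answerWithFallback(
  prompt: string,
  cheap: Backend,
  expensive: Backend,
  minConfidence: number,
  log: (entry: EscalationLog) => void,
): Promise<Attempt> {
  const first = await cheap(prompt);
  if (first.confidence >= minConfidence) return first;

  // Escalate, and record the cheaper attempt, the failure reason, and the cost delta.
  const second = await expensive(prompt);
  log({
    cheapAttempt: first,
    reason: `confidence ${first.confidence} below threshold ${minConfidence}`,
    costDeltaUsd: second.costUsd - first.costUsd,
  });
  return second;
}
```

Aggregating the logged entries by feature shows the escalation rate, which is the number that decides whether the self-hosted tier is earning its GPU.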

Guardrails

Tier	Enforcement
Soft	Prometheus alerts on P95 latency and GPU cache >85%
Medium	API gateway rate limits per API key / feature tag
Hard	`max_tokens` enforced at gateway; CI blocks unbounded agent configs
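
The hard tier's `max_tokens` enforcement might look like the following at the gateway. This is a sketch with hypothetical names; the ceiling value would come from the per-feature budget.

```typescript
// Sketch: clamp max_tokens before forwarding a request to vLLM.
// ChatRequest mirrors only the fields this example needs.
interface ChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  max_tokens?: number;
}

function enforceMaxTokens(req: ChatRequest, ceiling: number): ChatRequest {
  const requested = req.max_tokens;
  // Missing or non-positive values get the ceiling; larger values are clamped.
  const capped =
    requested === undefined || requested <= 0 ? ceiling : Math.min(requested, ceiling);
  return { ...req, max_tokens: capped };
}
```

Returning a copy rather than mutating the input keeps the original request available for audit logs.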

Align with Article IV: fiscal ceilings — GPU-hour budgets are inference budgets.

Anti-patterns

Anti-pattern	Why it fails
Deploying vLLM for 10 requests/day	Ops overhead exceeds API cost
No quantization on memory-bound GPUs	OOM or tiny batch sizes
Serving 70B when 8B passes eval	5–10× hardware for marginal quality gain
Ignoring prefix caching for RAG with stable system prompts	Recomputes KV cache every request
Skipping auth on the OpenAI endpoint	Your GPU becomes a public mine

Troubleshooting

OOM on startup — reduce --max-model-len, enable quantization, or add GPUs with tensor parallelism.

High TTFT, low throughput — batch size too small; increase --max-num-seqs until latency breaks SLO.

Quality regression vs cloud — eval set too small or wrong; expand benchmark before blaming the model.

Prefix cache not hitting — dynamic content in system prompt prefix; move static content first per Prompt caching.
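
One way to keep the prefix stable is to assemble messages so that everything static comes first and per-request content comes last. The function below is a sketch with hypothetical names; whether blocks are actually reused still depends on running with prefix caching enabled.

```typescript
// Sketch: static system prompt first, dynamic content last, so consecutive
// requests share an identical token prefix.
type Message = { role: "system" | "user"; content: string };

function buildMessages(staticSystem: string, retrieved: string[], userQuery: string): Message[] {
  return [
    // Identical across requests: no timestamps, user names, or request IDs here.
    { role: "system", content: staticSystem },
    // Everything that varies per request goes after the shared prefix.
    { role: "user", content: `Context:\n${retrieved.join("\n")}\n\nQuestion: ${userQuery}` },
  ];
}
```

A common mistake is interpolating the current date or a session ID into the system prompt, which changes the first tokens and invalidates the shared prefix for every request.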

vLLM vs alternatives

Need	Better choice
Pull and chat in 60 seconds	Ollama
Hugging Face Hub + enterprise K8s	Text Generation Inference
CPU-only or Apple Silicon laptop	llama.cpp
One OpenAI URL over many backends	LocalAI
Maximum CUDA throughput, open weights	vLLM

Self-hosting — cost model and workload fit
Model routing — capability-based endpoint selection
Prompt caching — prefix stability for KV reuse
vLLM documentation