Text Generation Inference
Text Generation Inference (TGI) is Hugging Face’s production serving stack for transformer models. It targets teams already on the HF Hub who want OpenAI-compatible APIs, Kubernetes-native scaling, and enterprise support paths.
Choose TGI when your models live on Hugging Face and your deployment target is Kubernetes at scale — not when you want a laptop dev server (use Ollama or llama.cpp).
Expected impact
| Scenario | Typical outcome |
|---|---|
| Steady production traffic on HF models | 60–90% lower cost than cloud APIs at high GPU utilization |
| Sporadic internal experiments | Negative — cluster overhead dominates |
| Models requiring HF-specific features (flash attention, custom code) | TGI reduces glue code vs rolling your own vLLM config |
Benchmark against vLLM on your hardware — both are strong; the winner is workload-specific.
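The utilization dependence in the table above can be made concrete with some back-of-envelope arithmetic. The sketch below is only a rough cost model: the GPU hourly price, peak throughput, and API price are hypothetical placeholders, not measurements, and the function names are invented for illustration.

```typescript
// Rough cost model. All inputs are hypothetical placeholders, not benchmarks.
interface SelfHostInputs {
  gpuHourlyUsd: number;        // what one GPU node costs per hour
  peakTokensPerSecond: number; // throughput you measured at full load
  utilization: number;         // fraction of peak actually served, in (0, 1]
}

// Cost of one million tokens on self-hosted hardware.
function selfHostedCostPerMillion(inp: SelfHostInputs): number {
  if (inp.utilization <= 0 || inp.utilization > 1) {
    throw new RangeError("utilization must be in (0, 1]");
  }
  const tokensPerHour = inp.peakTokensPerSecond * 3600 * inp.utilization;
  return (inp.gpuHourlyUsd / tokensPerHour) * 1e6;
}

// Fractional saving relative to a per-million-token API price.
// A negative result means self-hosting costs more than the API.
function savingsVsApi(inp: SelfHostInputs, apiUsdPerMillion: number): number {
  return 1 - selfHostedCostPerMillion(inp) / apiUsdPerMillion;
}

const steadySavings = savingsVsApi(
  { gpuHourlyUsd: 2, peakTokensPerSecond: 1000, utilization: 0.8 }, 3);
const sporadicSavings = savingsVsApi(
  { gpuHourlyUsd: 2, peakTokensPerSecond: 1000, utilization: 0.05 }, 3);
console.log(steadySavings.toFixed(2), sporadicSavings.toFixed(2));
```

With these made-up inputs the busy cluster comes out well ahead of the API and the idle one far behind, which is the pattern the table describes. Plug in your own prices and measured throughput before drawing conclusions.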
Architecture
Client → Ingress / API Gateway
↓
TGI router (K8s Service)
↓
TGI replicas (continuous batching)
↓
GPU node pool
TGI handles:
- Continuous batching across concurrent requests
- Tensor parallelism for large models
- Quantization (bitsandbytes, GPTQ, AWQ depending on version)
- OpenAI-compatible /v1/chat/completions endpoint
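The OpenAI-compatible endpoint in the list above accepts a standard chat-completions payload. Here is a minimal sketch of building that request by hand. The base URL, the model label, and the helper name buildChatRequest are illustrative assumptions rather than TGI defaults.

```typescript
// Shape of an OpenAI-style chat message; only the fields used here.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatHttpRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Builds the HTTP request for the /v1/chat/completions path.
// baseUrl and modelLabel are caller-supplied assumptions.
function buildChatRequest(
  baseUrl: string,
  modelLabel: string,
  messages: ChatMessage[],
  maxTokens: number,
): ChatHttpRequest {
  return {
    // Strip trailing slashes so the path is not doubled.
    url: `${baseUrl.replace(/\/+$/, "")}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: modelLabel, messages, max_tokens: maxTokens }),
    },
  };
}

const chatReq = buildChatRequest(
  "http://localhost:8080/", "tgi", [{ role: "user", content: "Say hello." }], 64);
// Send with: const res = await fetch(chatReq.url, chatReq.init);
console.log(chatReq.url);
```

Always sending an explicit max_tokens keeps the client in line with the server-side token ceilings discussed later in this page.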
Quick start (Docker)
# Serve the model on localhost:8080; the ./data volume caches downloaded weights
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v "$PWD/data:/data" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --max-input-length 4096 \
  --max-total-tokens 8192

import OpenAI from "openai";
// Point the OpenAI SDK at the local TGI server
const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "tgi", // placeholder; the SDK requires a value
});

Kubernetes deployment
Production TGI runs as a Deployment with GPU node selectors:
| Concern | Pattern |
|---|---|
| Model weights | Init container or PVC synced from HF Hub; use HF_TOKEN for gated models |
| Scaling | HPA on custom metrics (queue depth, GPU utilization) |
| Upgrades | Blue/green replica sets — model swaps are deploy events |
| Secrets | HF_TOKEN in K8s secrets; never bake into images |
Hugging Face offers Inference Endpoints as managed TGI, which is useful when you want HF ops without running the cluster yourself. Meter managed endpoints like cloud APIs; meter self-managed TGI in GPU-hours.
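To see how scaling on a custom metric such as queue depth behaves, here is a sketch of the replica calculation the Kubernetes Horizontal Pod Autoscaler documents: desired = ceil(current × currentMetric / targetMetric), clamped to the configured bounds. The per-replica queue-depth metric and the function name are assumptions for illustration. The real HPA also applies a tolerance band and stabilization windows, which are omitted here.

```typescript
// Replica count following the documented HPA ratio formula,
// without the tolerance band or stabilization window.
function desiredReplicas(
  currentReplicas: number,
  currentQueueDepth: number, // average queue depth per replica (hypothetical metric)
  targetQueueDepth: number,  // the value you want each replica to sit at
  minReplicas: number,
  maxReplicas: number,
): number {
  if (targetQueueDepth <= 0) throw new RangeError("target must be positive");
  const raw = Math.ceil(currentReplicas * (currentQueueDepth / targetQueueDepth));
  // Clamp to the bounds configured on the autoscaler.
  return Math.min(maxReplicas, Math.max(minReplicas, raw));
}

// Two replicas, each with three times the target queue depth.
console.log(desiredReplicas(2, 12, 4, 1, 8));
```

Because GPU nodes are slow to provision, the upper bound matters as much as the target: it caps how far a traffic spike can push spend.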
Model selection on the Hub
TGI loads directly from model-id:
--model-id mistralai/Mistral-7B-Instruct-v0.3
Consider:
- License: Llama, Mistral, and others have use restrictions
- Gated models: require HF account approval and token
- Custom modeling code: pass the --trust-remote-code flag when required (audit the code first)
Match model tier to task per Model routing — do not serve 70B when 8B passes your eval.
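The rule of serving the smallest tier that passes your eval can be written down directly. The sketch below assumes you already have one eval score per candidate model. The model IDs, scores, and the TierResult shape are hypothetical.

```typescript
// One row per candidate model from your own eval run (hypothetical shape).
interface TierResult {
  modelId: string;
  paramsBillions: number;
  evalScore: number; // higher is better, on whatever scale your eval uses
}

// Returns the smallest model whose score meets the bar, or undefined if none do.
function smallestPassingModel(
  results: TierResult[],
  passScore: number,
): TierResult | undefined {
  // Copy before sorting so the caller's array is left untouched.
  const bySize = results.slice().sort((a, b) => a.paramsBillions - b.paramsBillions);
  for (const candidate of bySize) {
    if (candidate.evalScore >= passScore) return candidate;
  }
  return undefined;
}

const evalResults: TierResult[] = [
  { modelId: "example-org/model-70b", paramsBillions: 70, evalScore: 0.93 },
  { modelId: "example-org/model-8b", paramsBillions: 8, evalScore: 0.91 },
];
const chosenTier = smallestPassingModel(evalResults, 0.9);
console.log(chosenTier ? chosenTier.modelId : "no tier passes");
```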
Throughput and memory
| Parameter | Effect |
|---|---|
| --max-input-length | Caps prompt size; protects KV memory |
| --max-total-tokens | Input + output ceiling per request |
| --max-batch-prefill-tokens | Prefill batch budget |
| --quantize | AWQ/GPTQ paths when supported |
Required: enforce max_tokens at the client and server. TGI rejects oversize requests when limits are configured.
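On the client side, enforcing the limits amounts to checking the prompt against the input cap and clamping the output cap to whatever room is left under the total ceiling. The sketch below assumes you can count prompt tokens yourself (for example with the model's tokenizer). The type and function names are illustrative.

```typescript
// Mirror of the server flags --max-input-length and --max-total-tokens.
interface ServerLimits {
  maxInputLength: number;
  maxTotalTokens: number;
}

type OutputBudget =
  | { ok: true; maxTokens: number }
  | { ok: false; reason: string };

// Clamp a requested output cap so that input + output fits the server ceiling.
function planOutputBudget(
  inputTokens: number,
  requestedMaxTokens: number,
  serverLimits: ServerLimits,
): OutputBudget {
  if (inputTokens > serverLimits.maxInputLength) {
    return { ok: false, reason: "prompt exceeds max-input-length" };
  }
  const room = serverLimits.maxTotalTokens - inputTokens;
  if (room <= 0) {
    return { ok: false, reason: "no room left for output" };
  }
  return { ok: true, maxTokens: Math.min(requestedMaxTokens, room) };
}

// Limits taken from the quick-start flags: 4096 input, 8192 total.
const quickStartLimits: ServerLimits = { maxInputLength: 4096, maxTotalTokens: 8192 };
console.log(planOutputBudget(3000, 6000, quickStartLimits));
```

Rejecting early on the client avoids a round trip for requests the server would refuse anyway, and the clamp guarantees every request carries an output cap.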
Observability
TGI exposes Prometheus metrics. Key signals:
| Metric area | Action when bad |
|---|---|
| Queue depth sustained high | Scale replicas or reduce max-input-length |
| Time to first token | Check batch contention; consider dedicated model pools |
| GPU memory | Enable quantization or split models across GPUs |
Attribute spans with inference.backend=tgi and model_id for cost-per-feature dashboards.
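For quick checks outside a full Prometheus setup, the text exposition format is simple enough to read directly. The sketch below pulls the first sample of a named metric out of a scrape body. The metric names in the sample string (tgi_queue_size, tgi_batch_current_size) and their values are assumptions for illustration; confirm the names against your TGI version's /metrics output. Label values containing a closing brace are not handled.

```typescript
// Reads the first sample of a metric from Prometheus text exposition format.
// Returns undefined when the metric is absent or its value is not numeric.
function readMetric(exposition: string, metricName: string): number | undefined {
  for (const rawLine of exposition.split("\n")) {
    const line = rawLine.trim();
    // Skip blank lines and "# HELP" / "# TYPE" comment lines.
    if (line === "" || line.charAt(0) === "#") continue;
    // name, optional {labels}, whitespace, value
    const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)/.exec(line);
    if (match !== null && match[1] === metricName) {
      const value = Number(match[3]);
      return isNaN(value) ? undefined : value;
    }
  }
  return undefined;
}

// Made-up scrape body for illustration only.
const sampleExposition = [
  "# TYPE tgi_queue_size gauge",
  "tgi_queue_size 7",
  "# TYPE tgi_batch_current_size gauge",
  'tgi_batch_current_size{pool="a"} 3',
].join("\n");

console.log(readMetric(sampleExposition, "tgi_queue_size"));
```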
TGI vs vLLM
| Factor | TGI | vLLM |
|---|---|---|
| HF Hub integration | Native | Supported via model paths |
| Ecosystem momentum | HF enterprise path | Broad OSS community |
| Feature velocity | Tied to HF releases | Fast iteration on CUDA kernels |
| Managed option | Inference Endpoints | Third-party hosts |
Run both on your eval set and pick on measured throughput, latency, and quality — not brand preference.
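When comparing measured latency between engines, compare tail percentiles rather than averages. Below is a small nearest-rank percentile helper as a sketch. The sample arrays are invented numbers, not benchmark results for either engine.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new RangeError("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  // Sort a copy numerically; the default sort compares as strings.
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Invented time-to-first-token samples in milliseconds.
const engineATtftMs = [120, 95, 110, 400, 105, 98, 130, 101, 99, 115];
const engineBTtftMs = [140, 135, 138, 150, 142, 137, 139, 141, 136, 145];

console.log("A p50/p95:", percentile(engineATtftMs, 50), percentile(engineATtftMs, 95));
console.log("B p50/p95:", percentile(engineBTtftMs, 50), percentile(engineBTtftMs, 95));
```

In this made-up data the first engine has the better median but a much worse tail, which is exactly the kind of difference an average would hide.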
Guardrails
| Tier | Enforcement |
|---|---|
| Soft | Alerts on queue depth and P95 TTFT |
| Medium | Per-tenant rate limits at ingress |
| Hard | max-total-tokens enforced; CI blocks agents without output caps |
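The hard tier's CI rule can be a very small check. The sketch below assumes a hypothetical AgentConfig shape with an optional maxTokens field; adapt it to however your repository declares agents. It flags any agent with no cap, a non-positive cap, or a cap above the ceiling.

```typescript
// Hypothetical agent config shape; not part of TGI or any SDK.
interface AgentConfig {
  agentName: string;
  maxTokens?: number;
}

// Returns the names of agents that lack a usable output cap.
function findUncappedAgents(agents: AgentConfig[], ceiling: number): string[] {
  const offenders: string[] = [];
  for (const agent of agents) {
    const cap = agent.maxTokens;
    if (cap === undefined || cap <= 0 || cap > ceiling) {
      offenders.push(agent.agentName);
    }
  }
  return offenders;
}

const agentConfigs: AgentConfig[] = [
  { agentName: "summarizer", maxTokens: 512 },
  { agentName: "planner" },                  // no cap at all
  { agentName: "drafter", maxTokens: 16000 }, // cap above the ceiling
];
const uncapped = findUncappedAgents(agentConfigs, 4096);
// In CI, fail the job when this list is non-empty.
console.log(uncapped);
```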
Anti-patterns
| Anti-pattern | Why it fails |
|---|---|
| TGI for a single developer laptop | Docker + GPU passthrough friction for no gain |
| max-input-length at model maximum for all routes | KV cache exhaustion; terrible batching |
| No HF token rotation | Gated model pulls fail silently in CI |
| One giant deployment for unrelated model tiers | Noisy neighbor latency; cannot scale tiers independently |
Troubleshooting
Model download slow on pod start: pre-bake weights into a PVC or use HF cache volumes.
OOM after deploy: lower --max-input-length or enable quantization.
OpenAI client 404: confirm the /v1/chat/completions path and your TGI version; older builds differ.
Quality below vLLM on the same hardware: check for differing quantization or batch settings, then re-benchmark with matched configs.
Related
- Self-hosting — when on-prem beats cloud
- vLLM — alternative production engine
- Output and RAG — trim retrieval before TGI sees it
- TGI documentation