Text Generation Inference

Text Generation Inference (TGI) is Hugging Face’s production serving stack for transformer models. It targets teams already on the HF Hub who want OpenAI-compatible APIs, Kubernetes-native scaling, and enterprise support paths.

Choose TGI when your models live on Hugging Face and your deployment target is Kubernetes at scale — not when you want a laptop dev server (use Ollama or llama.cpp).

Expected impact

| Scenario | Typical outcome |
| --- | --- |
| Steady production traffic on HF models | 60–90% cost savings vs cloud API at high GPU utilization |
| Sporadic internal experiments | Negative: cluster overhead dominates |
| Models requiring HF-specific features (flash attention, custom code) | TGI reduces glue code vs rolling your own vLLM config |

Benchmark against vLLM on your hardware — both are strong; the winner is workload-specific.
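
Why utilization decides the outcome can be seen with a little arithmetic. The sketch below is illustrative only: the GPU price, throughput, and API price are hypothetical placeholders, and `selfHostedUsdPerMTok` and `savingsVsApi` are names invented for this example. Substitute numbers measured on your own cluster.

```typescript
// All inputs are hypothetical; replace them with your own measurements.
interface GpuCostInputs {
  gpuHourlyUsd: number;    // cost of one GPU node per hour
  tokensPerSecond: number; // sustained throughput while busy
  utilization: number;     // fraction of each hour spent serving traffic (0..1)
}

// Effective self-hosted cost per million tokens.
function selfHostedUsdPerMTok(c: GpuCostInputs): number {
  if (c.utilization <= 0 || c.tokensPerSecond <= 0) return Infinity;
  const tokensPerHour = c.tokensPerSecond * 3600 * c.utilization;
  return (c.gpuHourlyUsd / tokensPerHour) * 1e6;
}

// Positive means cheaper than the API; negative means more expensive.
function savingsVsApi(c: GpuCostInputs, apiUsdPerMTok: number): number {
  return 1 - selfHostedUsdPerMTok(c) / apiUsdPerMTok;
}

// Same hardware, same API price, different utilization.
const busySavings = savingsVsApi({ gpuHourlyUsd: 2, tokensPerSecond: 2000, utilization: 0.8 }, 1.0);
const idleSavings = savingsVsApi({ gpuHourlyUsd: 2, tokensPerSecond: 2000, utilization: 0.05 }, 1.0);
```

With these made-up inputs the busy cluster comes out roughly 65% cheaper than the API, while the mostly idle one costs several times more. That is the pattern the table describes.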

Architecture

Client → Ingress / API Gateway → TGI router (K8s Service) → TGI replicas (continuous batching) → GPU node pool

TGI handles:

  • Continuous batching across concurrent requests
  • Tensor parallelism for large models
  • Quantization (bitsandbytes, GPTQ, AWQ depending on version)
  • OpenAI-compatible /v1/chat/completions endpoint

Quick start (Docker)

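# Llama 3.1 is a gated model, so pass a Hub token into the container.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$PWD/data:/data" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --max-input-length 4096 \
  --max-total-tokens 8192

import OpenAI from "openai";

// Point the OpenAI SDK at the local TGI container started above.
const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "tgi", // placeholder; the SDK requires a non-empty string
});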
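
For readers who want to see what goes over the wire, here is a minimal sketch of the request the endpoint expects. `buildChatCall` is a hypothetical helper, not part of TGI or the OpenAI SDK. The `model: "tgi"` value is a placeholder that TGI's own examples use, since the server already knows which model it loaded.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatCall {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: builds the HTTP call without sending it.
function buildChatCall(baseUrl: string, messages: ChatMessage[], maxTokens: number): ChatCall {
  const root = baseUrl.replace(/\/+$/, ""); // tolerate trailing slashes
  return {
    url: `${root}/v1/chat/completions`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // Always send an explicit output cap.
    body: JSON.stringify({ model: "tgi", messages, max_tokens: maxTokens, stream: false }),
  };
}

const chatCall = buildChatCall(
  "http://localhost:8080/",
  [{ role: "user", content: "Say hello." }],
  64,
);
// To send it: await fetch(chatCall.url, chatCall)
```

Keeping the builder separate from the network call makes it easy to unit-test that every request carries `max_tokens`.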

Kubernetes deployment

Production TGI runs as a Deployment with GPU node selectors:

| Concern | Pattern |
| --- | --- |
| Model weights | Init container or PVC synced from HF Hub; use HF_TOKEN for gated models |
| Scaling | HPA on custom metrics (queue depth, GPU utilization) |
| Upgrades | Blue/green replica sets; model swaps are deploy events |
| Secrets | HF_TOKEN in K8s secrets; never bake into images |
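
The patterns above can be sketched as a manifest. The example builds a plain Deployment object in TypeScript and serializes it to JSON, which `kubectl apply -f` accepts. Every name in it is a placeholder for your own cluster's values: `tgi`, `hf-token`, `tgi-weights`, and the `node-pool: gpu` label are all hypothetical. The `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed. Treat it as a starting point, not a production-ready spec.

```typescript
// Hypothetical helper that returns a Kubernetes Deployment as a plain object.
function tgiDeployment(modelId: string, replicas: number) {
  return {
    apiVersion: "apps/v1",
    kind: "Deployment",
    metadata: { name: "tgi" },
    spec: {
      replicas,
      selector: { matchLabels: { app: "tgi" } },
      template: {
        metadata: { labels: { app: "tgi" } },
        spec: {
          nodeSelector: { "node-pool": "gpu" }, // assumed label on GPU nodes
          containers: [
            {
              name: "tgi",
              // Pin a specific tag in production instead of "latest".
              image: "ghcr.io/huggingface/text-generation-inference:latest",
              args: ["--model-id", modelId, "--max-input-length", "4096", "--max-total-tokens", "8192"],
              ports: [{ containerPort: 80 }],
              // Token comes from a Secret, never from the image.
              env: [
                { name: "HF_TOKEN", valueFrom: { secretKeyRef: { name: "hf-token", key: "token" } } },
              ],
              resources: { limits: { "nvidia.com/gpu": 1 } },
              volumeMounts: [
                { name: "weights", mountPath: "/data" },
                { name: "shm", mountPath: "/dev/shm" },
              ],
            },
          ],
          volumes: [
            { name: "weights", persistentVolumeClaim: { claimName: "tgi-weights" } },
            { name: "shm", emptyDir: { medium: "Memory", sizeLimit: "1Gi" } },
          ],
        },
      },
    },
  };
}

const manifestJson = JSON.stringify(tgiDeployment("meta-llama/Llama-3.1-8B-Instruct", 2), null, 2);
```

The `/dev/shm` volume mirrors the `--shm-size 1g` flag from the Docker quick start, and the `/data` mount mirrors its weights volume.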

Hugging Face offers Inference Endpoints as managed TGI, which is useful when you want HF ops without running the cluster yourself. Meter managed endpoints the way you meter cloud APIs; for self-managed TGI, meter GPU-hours.

Model selection on the Hub

TGI loads a model directly from the Hub when given its ID via --model-id:

--model-id mistralai/Mistral-7B-Instruct-v0.3

Consider:

  • License: Llama, Mistral, and others have use restrictions
  • Gated models: require HF account approval and a token
  • Custom modeling code: pass the --trust-remote-code flag when required (audit the code first)

Match model tier to task per Model routing — do not serve 70B when 8B passes your eval.
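
One way to make that rule mechanical is to pick the smallest model that clears your eval bar. This is a sketch only: `smallestPassing` is a hypothetical helper, and the model IDs and scores in the usage line are invented.

```typescript
interface TierResult {
  modelId: string;
  paramsB: number;   // parameter count in billions
  evalScore: number; // score on your own eval set, higher is better
}

// Returns the smallest model whose eval score meets the bar, or undefined if none does.
function smallestPassing(results: TierResult[], minScore: number): TierResult | undefined {
  return [...results]
    .sort((a, b) => a.paramsB - b.paramsB)
    .find((r) => r.evalScore >= minScore);
}

// Invented scores for illustration.
const tierPick = smallestPassing(
  [
    { modelId: "example-org/model-70b", paramsB: 70, evalScore: 0.93 },
    { modelId: "example-org/model-8b", paramsB: 8, evalScore: 0.91 },
  ],
  0.9,
);
```

Here the 8B model is chosen because it passes the 0.9 bar, even though the 70B model scores higher.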

Throughput and memory

| Parameter | Effect |
| --- | --- |
| --max-input-length | Caps prompt size; protects KV memory |
| --max-total-tokens | Input + output ceiling per request |
| --max-batch-prefill-tokens | Prefill batch budget |
| --quantize | AWQ/GPTQ paths when supported |
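
To see why these caps protect KV memory, it helps to estimate how much cache one token occupies. The sketch below uses the standard formula: one key and one value vector per layer per KV head. The Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128) is taken from the published model config as I understand it, so verify it against the model's config.json. The estimate ignores weights, activations, and allocator overhead.

```typescript
interface KvShape {
  layers: number;
  kvHeads: number;       // key/value heads (fewer than attention heads under GQA)
  headDim: number;
  bytesPerValue: number; // 2 for fp16/bf16
}

// Bytes of KV cache per token: a key and a value vector per layer per KV head.
function kvBytesPerToken(s: KvShape): number {
  return 2 * s.layers * s.kvHeads * s.headDim * s.bytesPerValue;
}

// Worst-case KV footprint of one request that uses its full token budget.
function kvBytesPerRequest(s: KvShape, maxTotalTokens: number): number {
  return kvBytesPerToken(s) * maxTotalTokens;
}

// Assumed shape for Llama-3.1-8B in fp16; check config.json before relying on it.
const llama8bShape: KvShape = { layers: 32, kvHeads: 8, headDim: 128, bytesPerValue: 2 };
const kvPerToken = kvBytesPerToken(llama8bShape);            // 131072 bytes = 128 KiB
const kvPerRequest = kvBytesPerRequest(llama8bShape, 8192);  // 1073741824 bytes = 1 GiB
```

Under these assumptions a single request at the 8192-token ceiling can pin about 1 GiB of cache, so doubling --max-total-tokens roughly halves how many worst-case requests fit in a batch.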

Required: enforce max_tokens at the client and server. TGI rejects oversize requests when limits are configured.
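
A client-side guard might look like the sketch below. It assumes the caller has already counted prompt tokens with the model's tokenizer, and that the limits mirror the server's --max-input-length and --max-total-tokens values. `fitToBudget` is a hypothetical helper, not a TGI API.

```typescript
interface ServerLimits {
  maxInputLength: number; // mirrors --max-input-length
  maxTotalTokens: number; // mirrors --max-total-tokens
}

interface BudgetCheck {
  ok: boolean;
  maxTokens: number; // output cap to send; 0 when the request is rejected
  reason?: string;
}

// Reject oversize prompts and clamp the output cap so prompt + output fits the ceiling.
function fitToBudget(promptTokens: number, requestedMaxTokens: number, limits: ServerLimits): BudgetCheck {
  if (promptTokens > limits.maxInputLength) {
    return {
      ok: false,
      maxTokens: 0,
      reason: `prompt has ${promptTokens} tokens; limit is ${limits.maxInputLength}`,
    };
  }
  const room = limits.maxTotalTokens - promptTokens;
  if (room <= 0) {
    return { ok: false, maxTokens: 0, reason: "no room left for output tokens" };
  }
  return { ok: true, maxTokens: Math.min(requestedMaxTokens, room) };
}

const quickStartLimits: ServerLimits = { maxInputLength: 4096, maxTotalTokens: 8192 };
const clamped = fitToBudget(4000, 5000, quickStartLimits); // output capped at 4192
const rejected = fitToBudget(5000, 100, quickStartLimits); // prompt too long
```

Failing fast on the client saves a round trip and gives callers a clearer error than a server-side rejection.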

Observability

TGI exposes Prometheus metrics. Key signals:

| Metric area | Action when bad |
| --- | --- |
| Queue depth sustained high | Scale replicas or reduce --max-input-length |
| Time to first token | Check batch contention; consider dedicated model pools |
| GPU memory | Enable quantization or split models across GPUs |
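
As a sketch of how queue depth can drive scaling, the example below reads a gauge from Prometheus text output and applies the replica formula that the Kubernetes HPA documents: ceil(current × currentMetric / targetMetric). The metric name `tgi_queue_size` is an assumption, so check your version's /metrics output for the exact name. The parser is deliberately simple and does not handle label values that contain `}`.

```typescript
// Read the first sample of a metric from Prometheus text exposition format.
function readGauge(metricsText: string, name: string): number | undefined {
  for (const line of metricsText.split("\n")) {
    if (line.startsWith("#")) continue; // skip HELP and TYPE comments
    const m = line.match(/^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)/);
    if (m && m[1] === name) return Number(m[3]);
  }
  return undefined;
}

// HPA rule: desired = ceil(current * currentMetric / targetMetric), floored at one replica.
function desiredReplicas(current: number, metricPerReplica: number, targetPerReplica: number): number {
  return Math.max(1, Math.ceil(current * (metricPerReplica / targetPerReplica)));
}

// Hypothetical scrape: average queue depth of 12 per replica against a target of 4.
const scrape = "# TYPE tgi_queue_size gauge\ntgi_queue_size 12\n";
const queueDepth = readGauge(scrape, "tgi_queue_size");
const scaledTo = desiredReplicas(2, queueDepth ?? 0, 4); // 2 replicas -> 6
```

In a real cluster the HPA does this calculation itself once the metric is exposed through a custom metrics adapter. The sketch only shows what it computes.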

Attribute spans with inference.backend=tgi and model_id for cost-per-feature dashboards.
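
A cost-per-feature rollup over such spans could be sketched as follows. Only `inference.backend` and `model_id` come from the text above. The `feature` attribute, the `gpuSeconds` field, and `costPerFeature` itself are hypothetical, so adapt them to whatever your tracing pipeline actually records.

```typescript
interface InferenceSpan {
  attributes: { "inference.backend": string; model_id: string; feature: string };
  gpuSeconds: number; // hypothetical: GPU time attributed to this span
}

// Sum GPU cost by feature for spans served by TGI.
function costPerFeature(spans: InferenceSpan[], usdPerGpuHour: number): Map<string, number> {
  const totals = new Map<string, number>();
  for (const s of spans) {
    if (s.attributes["inference.backend"] !== "tgi") continue; // other backends are metered separately
    const usd = (s.gpuSeconds / 3600) * usdPerGpuHour;
    const key = s.attributes.feature;
    totals.set(key, (totals.get(key) ?? 0) + usd);
  }
  return totals;
}

// Invented spans for illustration.
const featureCosts = costPerFeature(
  [
    { attributes: { "inference.backend": "tgi", model_id: "example-org/model-8b", feature: "search" }, gpuSeconds: 1800 },
    { attributes: { "inference.backend": "tgi", model_id: "example-org/model-8b", feature: "search" }, gpuSeconds: 1800 },
    { attributes: { "inference.backend": "tgi", model_id: "example-org/model-8b", feature: "chat" }, gpuSeconds: 3600 },
    { attributes: { "inference.backend": "vllm", model_id: "example-org/model-8b", feature: "chat" }, gpuSeconds: 3600 },
  ],
  2,
);
```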

TGI vs vLLM

| Factor | TGI | vLLM |
| --- | --- | --- |
| HF Hub integration | Native | Supported via model paths |
| Ecosystem momentum | HF enterprise path | Broad OSS community |
| Feature velocity | Tied to HF releases | Fast iteration on CUDA kernels |
| Managed option | Inference Endpoints | Third-party hosts |

Run both on your eval set and pick on measured throughput, latency, and quality — not brand preference.
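
A small summary helper keeps the comparison honest by reducing each run to the same three numbers. The sketch assumes you have already collected per-request latency and output token counts from each backend under matched settings. `percentile` uses the nearest-rank method, and both function names are invented for this example.

```typescript
interface RunSample {
  latencyMs: number;
  outputTokens: number;
}

// Nearest-rank percentile on a sorted copy of the data; p is in (0, 100].
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Reduce one benchmark run to median latency, tail latency, and throughput.
function summarizeRun(samples: RunSample[], wallClockSeconds: number) {
  const latencies = samples.map((s) => s.latencyMs);
  const totalTokens = samples.reduce((n, s) => n + s.outputTokens, 0);
  return {
    p50Ms: percentile(latencies, 50),
    p95Ms: percentile(latencies, 95),
    tokensPerSecond: totalTokens / wallClockSeconds,
  };
}
```

Call `summarizeRun` once per backend with the same prompts, concurrency, and token limits, then compare the results side by side with your quality scores.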

Guardrails

| Tier | Enforcement |
| --- | --- |
| Soft | Alerts on queue depth and P95 TTFT |
| Medium | Per-tenant rate limits at ingress |
| Hard | --max-total-tokens enforced; CI blocks agents without output caps |
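
The medium tier is often implemented as a token bucket per tenant. The sketch below is one such implementation, with the clock passed in so the behavior is deterministic. `TenantBucket` and `TenantLimiter` are names invented here, and a real ingress would normally use its built-in rate limiting rather than custom code.

```typescript
class TenantBucket {
  private tokens: number;
  private last: number;
  private capacity: number;
  private refillPerSec: number;

  constructor(capacity: number, refillPerSec: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity; // start full so short bursts are allowed
    this.last = nowMs;
  }

  // Refill for elapsed time, then spend one token if available.
  tryAcquire(nowMs: number): boolean {
    const elapsedSec = Math.max(0, nowMs - this.last) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.last = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

class TenantLimiter {
  private buckets = new Map<string, TenantBucket>();
  private capacity: number;
  private refillPerSec: number;

  constructor(capacity: number, refillPerSec: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
  }

  // Each tenant gets an independent bucket, created on first use.
  allow(tenant: string, nowMs: number): boolean {
    let bucket = this.buckets.get(tenant);
    if (!bucket) {
      bucket = new TenantBucket(this.capacity, this.refillPerSec, nowMs);
      this.buckets.set(tenant, bucket);
    }
    return bucket.tryAcquire(nowMs);
  }
}
```

For example, `new TenantLimiter(2, 1)` lets each tenant burst two requests and then sustain one per second, without one tenant's traffic consuming another's budget.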

Anti-patterns

| Anti-pattern | Why it fails |
| --- | --- |
| TGI for a single developer laptop | Docker + GPU passthrough friction for no gain |
| --max-input-length at model maximum for all routes | KV cache exhaustion; terrible batching |
| No HF token rotation | Gated model pulls fail silently in CI |
| One giant deployment for unrelated model tiers | Noisy neighbor latency; cannot scale tiers independently |

Troubleshooting

Model download slow on pod start — pre-bake weights into PVC or use HF cache volumes.

OOM after deploy — lower --max-input-length or enable quantization.

OpenAI client 404 — confirm /v1/chat/completions path and TGI version; older builds differ.

Quality below vLLM on same hardware — quantization or batch settings; re-benchmark with matched configs.
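
For the 404 case, a quick version check can rule out an old build. The sketch below compares a version string against a minimum. The threshold of 1.4.0 for the chat completions route is my recollection of when TGI introduced it and should be treated as an assumption to confirm in the release notes. The version string itself would come from your deployment, for example the image tag. The helper names are invented.

```typescript
// Parse "major.minor.patch", ignoring a leading "v" and any suffix such as "-dev0".
function parseVersion(v: string): [number, number, number] | undefined {
  const m = v.match(/^v?(\d+)\.(\d+)\.(\d+)/);
  return m ? [Number(m[1]), Number(m[2]), Number(m[3])] : undefined;
}

// True when version >= minimum; false if either string cannot be parsed.
function atLeast(version: string, minimum: string): boolean {
  const a = parseVersion(version);
  const b = parseVersion(minimum);
  if (!a || !b) return false;
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] > b[i];
  }
  return true;
}

// ASSUMPTION: the chat completions route first shipped in TGI 1.4.0. Verify before relying on it.
const CHAT_COMPLETIONS_MIN_VERSION = "1.4.0";

function supportsChatCompletions(tgiVersion: string): boolean {
  return atLeast(tgiVersion, CHAT_COMPLETIONS_MIN_VERSION);
}
```

If the check passes and the client still gets a 404, the usual culprit is a base URL that is missing the /v1 prefix or includes it twice.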
