llama.cpp
llama.cpp is a C/C++ inference runtime for GGUF-quantized models. It runs on CPUs, Apple Silicon, and modest GPUs with minimal dependencies — the stack behind many “local AI” tools and the engine inside Ollama and LocalAI.
Use llama.cpp when you need portable, low-footprint inference — not datacenter-scale batching.
Expected impact
| Scenario | Typical outcome |
|---|---|
| Offline laptop inference (Apple M-series) | Eliminates API cost for air-gapped work |
| CI smoke tests on CPU runners | Cheap validation without GPU nodes |
| Edge devices with tight RAM | GGUF quants fit where full weights cannot |
| Production API at high QPS | Usually wrong tool — use vLLM |
What makes llama.cpp different
| Property | llama.cpp | vLLM / TGI |
|---|---|---|
| Primary target | Single-machine, edge, CPU | Multi-GPU datacenter serving |
| Model format | GGUF quant files | Safetensors / HF Hub |
| Continuous batching | Available in llama-server (parallel slots), not tuned for high QPS | Core feature |
| Dependencies | Minimal (no Python required for llama-server) | Python + CUDA stack |
| Apple Silicon | Excellent Metal backend | Limited |
GGUF and quantization
Models ship as GGUF files with embedded quantization:
| Quant | Size vs FP16 | Quality | Typical use |
|---|---|---|---|
| Q4_K_M | ~4× smaller | Good for most tasks | Default starting point |
| Q5_K_M | ~3× smaller | Better than Q4 on code | IDE assistants |
| Q8_0 | ~2× smaller | Near full precision | Quality-sensitive local work |
| F16 | Baseline | Best | When VRAM allows |
Download from Hugging Face (TheBloke, bartowski, or official repos), or pull via Ollama, which uses llama.cpp under the hood.
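To make the size column concrete, here is a rough estimator. It is a sketch that assumes 2 bytes per parameter for FP16 and reuses the rounded ratios from the table; the function and constant names are illustrative. Real GGUF files often come out somewhat larger because some tensors are kept at higher precision and the file carries metadata.

```javascript
// Rough GGUF size estimate from parameter count.
// Ratios are the rounded "size vs FP16" values from the table (assumptions, not measurements).
const SIZE_RATIO_VS_FP16 = new Map([
  ["F16", 1],
  ["Q8_0", 2],
  ["Q5_K_M", 3],
  ["Q4_K_M", 4],
]);
const FP16_BYTES_PER_PARAM = 2; // 16 bits per weight

function estimateGgufGigabytes(paramsBillions, quant) {
  const ratio = SIZE_RATIO_VS_FP16.get(quant);
  if (ratio === undefined) {
    throw new Error(`Unknown quant: ${quant}`);
  }
  // billions of parameters * bytes per parameter = decimal gigabytes at FP16
  const fp16Gigabytes = paramsBillions * FP16_BYTES_PER_PARAM;
  return fp16Gigabytes / ratio;
}

console.log(estimateGgufGigabytes(8, "Q4_K_M")); // 4
```

Leave headroom beyond this figure: the context window (KV cache) and the OS need memory too.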
CLI usage
    # Build from source (or use brew install llama.cpp)
    ./llama-cli -m models/llama-3.1-8b-instruct.Q4_K_M.gguf \
      -p "Summarize: function add(a,b) { return a+b }" \
      -n 64

Flags that affect cost and quality:

- `-n` / `--predict`: max tokens to generate (always set this)
- `-c` / `--ctx-size`: context window; larger = more RAM, slower
- `-ngl` / `--gpu-layers`: offload layers to GPU; `99` for full GPU on CUDA/Metal
Server mode (OpenAI-compatible)
llama-server exposes an HTTP API compatible with OpenAI clients:
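    # --host 0.0.0.0 listens on all interfaces; use 127.0.0.1 for local-only access
    ./llama-server \
      -m models/llama-3.1-8b-instruct.Q4_K_M.gguf \
      --host 0.0.0.0 \
      --port 8080 \
      -c 8192 \
      -ngl 99

Point any OpenAI client at the server's /v1 base URL:

    import OpenAI from "openai";

    // Placeholder key: the client requires one; the server only checks it if started with --api-key
    const client = new OpenAI({
      baseURL: "http://localhost:8080/v1",
      apiKey: "llama",
    });

Use server mode when multiple tools (IDE, scripts, agents) share one local model process.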
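If you would rather not depend on the openai package, the same endpoint can be called with plain fetch. This is a minimal sketch, assuming the server from the example above is listening on port 8080; the helper names and the "local" model placeholder are hypothetical.

```javascript
// Build the request for llama-server's OpenAI-compatible chat endpoint.
// "local" is a placeholder model name (assumption); the key mirrors the client example.
function buildChatRequest(baseUrl, userPrompt, maxTokens) {
  return {
    url: `${baseUrl}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: "Bearer llama",
      },
      body: JSON.stringify({
        model: "local",
        messages: [{ role: "user", content: userPrompt }],
        max_tokens: maxTokens, // always cap generation length
      }),
    },
  };
}

// Send the request with the global fetch (Node 18+ or a browser).
// fetchImpl is injectable so the function can be exercised without a running server.
async function chat(baseUrl, userPrompt, maxTokens, fetchImpl = fetch) {
  const { url, init } = buildChatRequest(baseUrl, userPrompt, maxTokens);
  const res = await fetchImpl(url, init);
  if (!res.ok) {
    throw new Error(`llama-server returned HTTP ${res.status}`);
  }
  const data = await res.json();
  return data.choices[0].message.content;
}
```

A call would look like `chat("http://localhost:8080/v1", "Summarize: function add(a,b) { return a+b }", 64).then(console.log)`.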
Apple Silicon notes
Metal acceleration on M1/M2/M3/M4 makes llama.cpp the default choice for Mac developers:
- Unified memory means “VRAM” is shared with the OS — monitor memory pressure
- 16 GB RAM: stick to 7B–8B Q4; 32 GB+: 13B–14B Q4/Q5 viable
- Thermal throttle on sustained agent loops — set session caps
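One way to implement a session cap is a small budget object that the agent loop consults before each request. The sketch below is an assumption about what such a cap could look like; `createSessionCap`, `maxRequests`, and `maxMillis` are hypothetical names, and the right limits depend on your hardware.

```javascript
// Session cap: stop an agent loop after a request budget or a time budget,
// whichever is exhausted first. The clock is injectable for testing.
function createSessionCap(maxRequests, maxMillis, now = Date.now) {
  const startedAt = now();
  let used = 0;
  return {
    // Returns true and consumes one request if both budgets still have room.
    tryAcquire() {
      const elapsed = now() - startedAt;
      if (used >= maxRequests || elapsed >= maxMillis) {
        return false;
      }
      used += 1;
      return true;
    },
    // Requests left in the count budget (ignores the time budget).
    remaining() {
      return Math.max(0, maxRequests - used);
    },
  };
}

// Usage: cap a loop at 3 requests or 10 minutes.
const cap = createSessionCap(3, 10 * 60 * 1000);
let steps = 0;
while (cap.tryAcquire()) {
  steps += 1; // call the local model here
}
console.log(steps); // 3
```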
Metering
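llama-server can expose a Prometheus-compatible /metrics endpoint when started with --metrics, but it is disabled by default and reports server-side counters only. For comparisons across backends, instrument at the client or proxy: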
| Metric | How to capture |
|---|---|
| Tokens in/out | Parse API response usage fields or count locally |
| Wall-clock latency | Middleware on your OpenAI client |
| CPU/GPU utilization | top, Activity Monitor, nvidia-smi |
Tag spans with inference.backend=llama-cpp for fair comparison against Ollama and cloud APIs.
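The client-side approach can be sketched as a thin wrapper around any call that returns an OpenAI-style response. This assumes the response carries `usage.prompt_tokens` and `usage.completion_tokens`; `toSpan`, `meterCall`, and the `record` sink are hypothetical names, not part of llama.cpp or the openai package.

```javascript
// Turn an OpenAI-style response plus timestamps into a flat span record.
// Missing usage fields become null so gaps are visible downstream.
function toSpan(response, startMs, endMs) {
  const usage = (response && response.usage) || {};
  return {
    "inference.backend": "llama-cpp",
    latency_ms: endMs - startMs,
    tokens_in: usage.prompt_tokens ?? null,
    tokens_out: usage.completion_tokens ?? null,
  };
}

// Wrap a model call: measure wall-clock latency, then hand the span to `record`
// (a logger, tracer, or metrics client of your choice).
async function meterCall(callModel, record, now = Date.now) {
  const startMs = now();
  const response = await callModel();
  record(toSpan(response, startMs, now()));
  return response;
}
```

With the OpenAI client from the server section, usage could look like `meterCall(() => client.chat.completions.create({ ... }), console.log)`.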
Workload fit
| Profile | Fit |
|---|---|
| Coding assistants (solo dev) | Strong on Mac/Linux with Q5 8B |
| Batch / offline at scale | Weak — throughput per dollar lags vLLM on GPU |
| RAG pipelines | Acceptable for prototypes; trim chunks aggressively |
| Multi-tenant API | Poor — no production batching layer |
Guardrails
- Set `-n` (max tokens) on every CLI invocation
- Cap `-c` (context) per task; do not default to 128K
- For shared `llama-server`, add reverse-proxy rate limits
- Hybrid route: local llama.cpp for draft, cloud for production commits
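The first two guardrails can be enforced in code rather than by convention. The sketch below builds a llama-cli argument list that refuses to run without a token limit and clamps the context size; `buildCliArgs`, `MAX_CTX`, the 4096 default, and the model path are hypothetical values chosen for illustration.

```javascript
// Build llama-cli arguments that always set -n and cap -c.
// MAX_CTX is a hypothetical per-task ceiling, not a llama.cpp default.
const MAX_CTX = 8192;

function buildCliArgs({ model, prompt, maxTokens, ctxSize = 4096 }) {
  if (!Number.isInteger(maxTokens) || maxTokens <= 0) {
    throw new Error("maxTokens (-n) is required and must be a positive integer");
  }
  const cappedCtx = Math.min(ctxSize, MAX_CTX);
  return ["-m", model, "-p", prompt, "-n", String(maxTokens), "-c", String(cappedCtx)];
}

const args = buildCliArgs({
  model: "models/model.Q4_K_M.gguf", // placeholder path
  prompt: "Summarize: function add(a,b) { return a+b }",
  maxTokens: 64,
  ctxSize: 131072, // requested 128K, clamped to MAX_CTX
});
console.log(args.join(" "));
```

The resulting array can be passed to Node's `child_process.spawn("./llama-cli", args)`, which avoids shell quoting problems in the prompt.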
Anti-patterns
| Anti-pattern | Why it fails |
|---|---|
| CPU-only 70B inference for interactive IDE | Latency measured in minutes |
| Max context on 8 GB RAM Mac | Swap thrashing, system freeze |
| Skipping quant benchmarks | Q2 models save RAM but fail code tasks |
| Running llama-server without auth on LAN | Anyone on the network can use your model and compute; set --api-key or bind to 127.0.0.1 |
When to graduate
| Signal | Move to |
|---|---|
| Need ollama pull ergonomics | Ollama |
| Multiple backends, one API URL | LocalAI |
| >50 concurrent users, GPU cluster | vLLM |
| Hugging Face enterprise deploy | TGI |
Related
- Self-hosting — stack comparison
- Ollama — llama.cpp with batteries included
- Context hygiene — trim before local inference
- llama.cpp repository