
llama.cpp

llama.cpp is a C/C++ inference runtime for models in the GGUF format, which are usually quantized. It runs on CPUs, Apple Silicon, and modest GPUs with minimal dependencies. It is the stack behind many “local AI” tools, including Ollama and LocalAI.

Use llama.cpp when you need portable, low-footprint inference — not datacenter-scale batching.

Expected impact

| Scenario | Typical outcome |
| --- | --- |
| Offline laptop inference (Apple M-series) | Eliminates API cost for air-gapped work |
| CI smoke tests on CPU runners | Cheap validation without GPU nodes |
| Edge devices with tight RAM | GGUF quants fit where full weights cannot |
| Production API at high QPS | Usually the wrong tool; use vLLM |

What makes llama.cpp different

| Property | llama.cpp | vLLM / TGI |
| --- | --- | --- |
| Primary target | Single-machine, edge, CPU | Multi-GPU datacenter serving |
| Model format | GGUF quant files | Safetensors / HF Hub |
| Continuous batching | Limited (parallel slots on one machine) | Core feature |
| Dependencies | Minimal (no Python required for llama-server) | Python + CUDA stack |
| Apple Silicon | Excellent Metal backend | Limited |

GGUF and quantization

Models ship as GGUF files with embedded quantization:

| Quant | Size vs FP16 | Quality | Typical use |
| --- | --- | --- | --- |
| Q4_K_M | ~3.3× smaller | Good for most tasks | Default starting point |
| Q5_K_M | ~2.8× smaller | Better than Q4 on code | IDE assistants |
| Q8_0 | ~1.9× smaller | Near full precision | Quality-sensitive local work |
| F16 | Baseline | Best | When VRAM allows |
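
To make the size column concrete, the sketch below estimates file size from parameter count and bits per weight. The bits-per-weight figures for the K-quants are rough averages (an assumption; they vary a little by architecture), and `estimateGgufSizeGB` is a hypothetical helper, not part of llama.cpp.

```typescript
// Approximate bits per weight for common GGUF quantization types.
// Q8_0 and F16 follow from the formats; the K-quant figures are rough
// averages (assumption) and shift slightly between model architectures.
const BITS_PER_WEIGHT: Record<string, number> = {
  Q4_K_M: 4.85,
  Q5_K_M: 5.7,
  Q8_0: 8.5,
  F16: 16,
};

// Estimate the size of a GGUF file in gigabytes (10^9 bytes).
// Ignores metadata and tokenizer tables, which are small next to the weights.
function estimateGgufSizeGB(paramsBillions: number, quant: string): number {
  const bits = BITS_PER_WEIGHT[quant];
  if (bits === undefined) {
    throw new Error("unknown quant type: " + quant);
  }
  // params * bits / 8 = bytes; with params in billions the result is in GB.
  return (paramsBillions * bits) / 8;
}

// Example: an 8B-parameter model at each quant level.
const quants = ["Q4_K_M", "Q5_K_M", "Q8_0", "F16"];
for (const q of quants) {
  console.log(q + ": ~" + estimateGgufSizeGB(8, q).toFixed(1) + " GB");
}
```

Treat the result as a lower bound on memory use: the runtime also needs room for the KV cache and compute buffers.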

Download GGUF files from Hugging Face (TheBloke, bartowski, or official repos), or pull models via Ollama, which uses llama.cpp under the hood.

CLI usage

    # Build from source (or use brew install llama.cpp)
    ./llama-cli -m models/llama-3.2-8b-instruct.Q4_K_M.gguf \
      -p "Summarize: function add(a,b) { return a+b }" \
      -n 64

Flags that affect cost and quality:

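  • -n / --predict: maximum number of tokens to generate (always set this)
  • -c / --ctx-size: context window in tokens; larger values use more RAM and slow down prompt processing
  • -ngl / --gpu-layers: number of layers to offload to the GPU; a value at or above the model's layer count (for example 99) offloads every layer on CUDA or Metal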
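
To see why a larger -c costs more RAM, the sketch below estimates KV-cache size from the context length. It assumes an unquantized f16 cache and a single sequence, and the example shape (32 layers, 8 KV heads, head dimension 128) is that of an 8B Llama-3-style model; check your own model's GGUF metadata for the real values. `kvCacheBytes` is a hypothetical helper.

```typescript
// Shape parameters that determine KV-cache size. The example values below
// are assumptions for an 8B Llama-3-style model.
interface ModelShape {
  nLayers: number;   // transformer blocks
  nKvHeads: number;  // key/value heads (fewer than attention heads under GQA)
  headDim: number;   // dimension per head
}

// Bytes of KV cache for a given context size, assuming an f16 cache
// (2 bytes per element) and a single sequence.
function kvCacheBytes(shape: ModelShape, ctxSize: number, bytesPerElement: number = 2): number {
  // Factor 2: one K tensor and one V tensor per layer.
  return 2 * shape.nLayers * shape.nKvHeads * shape.headDim * ctxSize * bytesPerElement;
}

const llama8b: ModelShape = { nLayers: 32, nKvHeads: 8, headDim: 128 };
for (const ctx of [4096, 8192, 32768, 131072]) {
  const gib = kvCacheBytes(llama8b, ctx) / (1024 * 1024 * 1024);
  console.log("-c " + ctx + " -> ~" + gib.toFixed(1) + " GiB KV cache");
}
```

Under these assumptions the cache grows linearly: 1 GiB at -c 8192 and 16 GiB at -c 131072, on top of the model weights.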

Server mode (OpenAI-compatible)

llama-server exposes an HTTP API compatible with OpenAI clients:

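    # --host 0.0.0.0 listens on all interfaces; use 127.0.0.1 to keep the server local-only
    ./llama-server \
      -m models/llama-3.2-8b-instruct.Q4_K_M.gguf \
      --host 0.0.0.0 \
      --port 8080 \
      -c 8192 \
      -ngl 99

Then point an OpenAI client at the server:

    import OpenAI from "openai";

    const client = new OpenAI({
      baseURL: "http://localhost:8080/v1",
      // Any value works unless llama-server was started with --api-key
      apiKey: "llama",
    });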

Use server mode when multiple tools (IDE, scripts, agents) share one local model process.
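
For tools that do not use the OpenAI SDK, the sketch below shows roughly what such a client sends: it builds (without sending) a chat-completion request for the server. The model name "local" is a placeholder, and `buildChatRequest` is a hypothetical helper, not a llama.cpp or OpenAI API.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds (but does not send) an OpenAI-style chat-completion request.
// baseUrl is expected to end in /v1, as in the client configuration above.
function buildChatRequest(
  baseUrl: string,
  messages: ChatMessage[],
  maxTokens: number,
  apiKey?: string
): HttpRequest {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (apiKey) {
    headers["Authorization"] = "Bearer " + apiKey;
  }
  return {
    // Strip trailing slashes so the path is not doubled.
    url: baseUrl.replace(/\/+$/, "") + "/chat/completions",
    method: "POST",
    headers: headers,
    body: JSON.stringify({
      model: "local", // placeholder name (assumption)
      messages: messages,
      max_tokens: maxTokens, // always cap generation length
    }),
  };
}

const req = buildChatRequest(
  "http://localhost:8080/v1",
  [{ role: "user", content: "Summarize: function add(a,b) { return a+b }" }],
  64
);
console.log(req.url);
console.log(req.body);
```

The returned object can be handed to `fetch` or any other HTTP client; keeping construction separate from sending makes the token cap easy to test.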

Apple Silicon notes

Metal acceleration on M1/M2/M3/M4 makes llama.cpp a common default for Mac developers:

  • Unified memory means “VRAM” is shared with the OS — monitor memory pressure
  • 16 GB RAM: stick to 7B–8B Q4; 32 GB+: 13B–14B Q4/Q5 viable
  • Sustained agent loops can trigger thermal throttling; cap session length
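
The sizing guidance above can be turned into a quick budget check. This is a minimal sketch: the amount reserved for macOS and other apps is an assumption you should tune, and the example sizes are rough figures for an 8B Q4 model, not measurements.

```typescript
interface MemoryPlan {
  totalRamGB: number;      // unified memory on the machine
  modelFileGB: number;     // size of the GGUF file
  kvCacheGB: number;       // depends on -c; estimate or measure separately
  reservedForOsGB: number; // headroom for the OS and other apps (assumption)
}

// Spare memory in GB after the OS reserve, weights, and KV cache.
// A negative value means the plan is likely to push the system into swap.
function memoryHeadroomGB(plan: MemoryPlan): number {
  return plan.totalRamGB - plan.reservedForOsGB - plan.modelFileGB - plan.kvCacheGB;
}

function fitsInMemory(plan: MemoryPlan): boolean {
  return memoryHeadroomGB(plan) >= 0;
}

// Example: 16 GB Mac, 8B Q4 model (~4.9 GB), ~1 GB KV cache, 6 GB reserved.
const plan16: MemoryPlan = { totalRamGB: 16, modelFileGB: 4.9, kvCacheGB: 1, reservedForOsGB: 6 };
console.log("fits: " + fitsInMemory(plan16) + ", headroom GB: " + memoryHeadroomGB(plan16).toFixed(1));
```

A passing check is not a guarantee; watch memory pressure during a real session before relying on the configuration.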

Metering

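llama-server exposes a Prometheus-compatible /metrics endpoint only when started with --metrics, and llama-cli exposes none. For per-request cost attribution, instrument at the client or proxy: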

| Metric | How to capture |
| --- | --- |
| Tokens in/out | Parse API response usage fields or count locally |
| Wall-clock latency | Middleware on your OpenAI client |
| CPU/GPU utilization | top, Activity Monitor, nvidia-smi |

Tag spans with inference.backend=llama-cpp for fair comparison against Ollama and cloud APIs.
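
A client-side metering record might look like the sketch below. It assumes the response carries an OpenAI-style `usage` object with `prompt_tokens` and `completion_tokens`; apart from `inference.backend`, the field names and the `meterRequest` helper are illustrative.

```typescript
// Subset of the OpenAI-style usage object returned with a completion.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

// One metering record per request. Only "inference.backend" comes from the
// text above; the other field names are illustrative.
interface InferenceRecord {
  "inference.backend": string;
  tokensIn: number;
  tokensOut: number;
  latencyMs: number;
  tokensPerSecond: number;
}

function meterRequest(usage: Usage | undefined, startMs: number, endMs: number): InferenceRecord {
  // Fall back to zero when the server omits usage; count locally in that case.
  const tokensIn = usage ? usage.prompt_tokens : 0;
  const tokensOut = usage ? usage.completion_tokens : 0;
  const latencyMs = Math.max(0, endMs - startMs);
  // End-to-end rate: includes prompt processing, so it understates raw decode speed.
  const tokensPerSecond = latencyMs > 0 ? tokensOut / (latencyMs / 1000) : 0;
  return {
    "inference.backend": "llama-cpp",
    tokensIn: tokensIn,
    tokensOut: tokensOut,
    latencyMs: latencyMs,
    tokensPerSecond: tokensPerSecond,
  };
}

// Example: 20 prompt tokens, 50 generated tokens, 2 seconds wall-clock.
const sample = meterRequest({ prompt_tokens: 20, completion_tokens: 50 }, 1000, 3000);
console.log(JSON.stringify(sample));
```

Emitting the same record shape for Ollama and cloud backends, with a different `inference.backend` value, keeps the comparison like-for-like.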

Workload fit

| Profile | Fit |
| --- | --- |
| Coding assistants (solo dev) | Strong on Mac/Linux with Q5 8B |
| Batch / offline at scale | Weak: throughput per dollar lags vLLM on GPU |
| RAG pipelines | Acceptable for prototypes; trim chunks aggressively |
| Multi-tenant API | Poor: no production batching layer |

Guardrails

  • Set -n (max tokens) on every CLI invocation
  • Cap -c (context) per task — do not default to 128K
  • For shared llama-server, add reverse-proxy rate limits
  • Hybrid route: local llama.cpp for draft, cloud for production commits
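
One way to enforce the first two guardrails is to build llama-cli arguments only through a wrapper that looks up per-task caps. The task names and limits below are hypothetical, as is the `guardedArgs` helper; only the -m, -p, -n, and -c flags come from llama.cpp.

```typescript
interface TaskLimits {
  maxTokens: number; // passed as -n
  ctxSize: number;   // passed as -c
}

// Hypothetical per-task caps; tune these to your own workloads.
const TASK_LIMITS: Record<string, TaskLimits> = {
  autocomplete: { maxTokens: 64, ctxSize: 2048 },
  summarize: { maxTokens: 256, ctxSize: 8192 },
};

// Builds the llama-cli argument list for a known task. Unknown tasks are
// rejected so that no invocation runs without explicit -n and -c values.
function guardedArgs(task: string, modelPath: string, prompt: string): string[] {
  const limits = TASK_LIMITS[task];
  if (limits === undefined) {
    throw new Error("no limits defined for task: " + task);
  }
  return [
    "-m", modelPath,
    "-p", prompt,
    "-n", String(limits.maxTokens),
    "-c", String(limits.ctxSize),
  ];
}

console.log(guardedArgs("autocomplete", "models/model.gguf", "def add(").join(" "));
```

The array can be passed to whatever process launcher your tooling uses; failing on unknown tasks is a deliberate choice so that new workloads get a cap before they get a model.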

Anti-patterns

| Anti-pattern | Why it fails |
| --- | --- |
| CPU-only 70B inference for interactive IDE | Latency measured in minutes |
| Max context on 8 GB RAM Mac | Swap thrashing, system freeze |
| Skipping quant benchmarks | Q2 models save RAM but often fail code tasks |
| Running llama-server without auth on LAN | Anyone on the network can use your model and compute |

When to graduate

| Signal | Move to |
| --- | --- |
| Need ollama pull ergonomics | Ollama |
| Multiple backends, one API URL | LocalAI |
| >50 concurrent users, GPU cluster | vLLM |
| Hugging Face enterprise deploy | TGI |