llama.cpp – Tokenminning

llama.cpp is a C/C++ inference runtime for models in the GGUF format, which are typically quantized. It runs on CPUs, Apple Silicon, and modest GPUs with minimal dependencies. It is the stack behind many “local AI” tools and serves as an inference engine inside Ollama and LocalAI.

Use llama.cpp when you need portable, low-footprint inference — not datacenter-scale batching.

Expected impact

Scenario	Typical outcome
Offline laptop inference (Apple M-series)	Eliminates API cost for air-gapped work
CI smoke tests on CPU runners	Cheap validation without GPU nodes
Edge devices with tight RAM	GGUF quants fit where full weights cannot
Production API at high QPS	Usually wrong tool — use vLLM

What makes llama.cpp different

Property	llama.cpp	vLLM / TGI
Primary target	Single-machine, edge, CPU	Multi-GPU datacenter serving
Model format	GGUF quant files	Safetensors / HF Hub
Continuous batching	Limited	Core feature
Dependencies	Minimal (no Python required for `llama-server`)	Python + CUDA stack
Apple Silicon	Excellent Metal backend	Limited

GGUF and quantization

Models ship as GGUF files with embedded quantization:

Quant	Size vs FP16	Quality	Typical use
Q4_K_M	~3.3× smaller	Good for most tasks	Default starting point
Q5_K_M	~2.8× smaller	Better than Q4 on code	IDE assistants
Q8_0	~1.9× smaller	Near full precision	Quality-sensitive local work
F16	Baseline	Best	When VRAM allows
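
The size column follows from simple arithmetic: file size is roughly parameter count times bits per weight. The sketch below illustrates that estimate. The bits-per-weight figures are approximate, commonly cited values and should be treated as assumptions; real files vary with architecture and metadata. The function names are hypothetical.

```typescript
// Approximate bits per weight for common GGUF quant types.
// Assumption: rough averages; actual files differ slightly per model.
const BITS_PER_WEIGHT: Record<string, number> = {
  Q4_K_M: 4.85,
  Q5_K_M: 5.7,
  Q8_0: 8.5,
  F16: 16,
};

const BYTES_PER_GIB = 1024 * 1024 * 1024;

// Estimate the file size in GiB for a model with `paramsBillions` billion weights.
function estimateSizeGiB(paramsBillions: number, quant: string): number {
  const bpw = BITS_PER_WEIGHT[quant];
  if (bpw === undefined) {
    throw new Error("unknown quant type: " + quant);
  }
  const bytes = (paramsBillions * 1e9 * bpw) / 8;
  return bytes / BYTES_PER_GIB;
}

// Size ratio relative to F16 for the same parameter count.
function shrinkFactor(quant: string): number {
  return estimateSizeGiB(1, "F16") / estimateSizeGiB(1, quant);
}

// Print estimates for an 8B-parameter model.
for (const quant of Object.keys(BITS_PER_WEIGHT)) {
  console.log(quant, estimateSizeGiB(8, quant).toFixed(1), "GiB");
}
```

Under these assumptions an 8B model comes out at roughly 4.5 GiB for Q4_K_M and about 14.9 GiB for F16. Add headroom for the KV cache before deciding what fits.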

Download GGUF files from Hugging Face (TheBloke, bartowski, or official repos), or pull models via Ollama, which uses llama.cpp under the hood.

CLI usage


# Assumes llama-cli is already built from source; with brew install llama.cpp, call llama-cli without ./
./llama-cli -m models/llama-3.1-8b-instruct.Q4_K_M.gguf \
  -p "Summarize: function add(a,b) { return a+b }" \
  -n 64

Flags that affect cost and quality:

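-n / --predict: maximum number of tokens to generate (always set this)
-c / --ctx-size: context window in tokens; larger values need more RAM for the KV cache and slow prompt processing
-ngl / --gpu-layers: number of layers to offload to the GPU; a value at or above the model's layer count (such as 99) offloads everything on CUDA/Metal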
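
When scripts launch llama-cli, one way to make sure the token limit is never forgotten is to build the argument list from a typed options object. This is a minimal sketch, not part of llama.cpp: the `CliOptions` type and `buildLlamaCliArgs` function are hypothetical, and only the flags listed above are used.

```typescript
// Hypothetical options type for this sketch; the field names are not part of llama.cpp.
interface CliOptions {
  model: string; // path to a GGUF file, passed to -m
  prompt: string; // passed to -p
  maxTokens: number; // passed to -n; required so it is never omitted
  ctxSize?: number; // passed to -c when set
  gpuLayers?: number; // passed to -ngl when set
}

// Build an argument array. Passing an array to a process spawner avoids shell quoting issues.
function buildLlamaCliArgs(opts: CliOptions): string[] {
  if (opts.maxTokens <= 0 || Math.floor(opts.maxTokens) !== opts.maxTokens) {
    throw new Error("maxTokens must be a positive integer");
  }
  const cliArgs: string[] = ["-m", opts.model, "-p", opts.prompt, "-n", String(opts.maxTokens)];
  if (opts.ctxSize !== undefined) {
    cliArgs.push("-c", String(opts.ctxSize));
  }
  if (opts.gpuLayers !== undefined) {
    cliArgs.push("-ngl", String(opts.gpuLayers));
  }
  return cliArgs;
}

const exampleArgs = buildLlamaCliArgs({
  model: "models/model.Q4_K_M.gguf", // placeholder path
  prompt: "Summarize: function add(a,b) { return a+b }",
  maxTokens: 64,
  ctxSize: 4096,
  gpuLayers: 99,
});
console.log(exampleArgs.join(" "));
```

The resulting array can be handed to a process launcher such as Node's `child_process.spawn` along with the path to the llama-cli binary.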

Server mode (OpenAI-compatible)

llama-server exposes an HTTP API compatible with OpenAI clients:


# Binds to loopback only; to share on a LAN, use --host 0.0.0.0 together with --api-key
./llama-server \
  -m models/llama-3.1-8b-instruct.Q4_K_M.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  -c 8192 \
  -ngl 99


import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  // The SDK requires a non-empty key; llama-server only checks it when started with --api-key.
  apiKey: "llama",
});

Use server mode when multiple tools (IDE, scripts, agents) share one local model process.

Apple Silicon notes

Metal acceleration on M1/M2/M3/M4 makes llama.cpp the default choice for Mac developers:

Unified memory means “VRAM” is shared with the OS — monitor memory pressure
16 GB RAM: stick to 7B–8B Q4; 32 GB+: 13B–14B Q4/Q5 viable
Sustained agent loops can trigger thermal throttling; set session caps
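
To make the memory guidance concrete, the sketch below estimates whether weights plus KV cache fit in unified memory. It uses the standard KV-cache size formula for a transformer with grouped-query attention and ignores compute buffers, so treat the result as a lower bound. The model shape (32 layers, 8 KV heads, head dimension 128) is a Llama-3-8B-style assumption, and the 70% usable-memory fraction is an arbitrary illustrative choice.

```typescript
// Architecture numbers needed for the KV-cache estimate.
interface ModelShape {
  layers: number;
  kvHeads: number;
  headDim: number;
}

// Assumption: Llama-3-8B-style shape. Check your model's metadata for real values.
const LLAMA_8B_SHAPE: ModelShape = { layers: 32, kvHeads: 8, headDim: 128 };

const GIB_BYTES = 1024 * 1024 * 1024;

// KV cache bytes = 2 (K and V) * layers * kvHeads * headDim * context tokens * bytes per element.
// bytesPerElem defaults to 2, i.e. an f16 cache.
function kvCacheGiB(shape: ModelShape, ctxTokens: number, bytesPerElem: number = 2): number {
  const bytes = 2 * shape.layers * shape.kvHeads * shape.headDim * ctxTokens * bytesPerElem;
  return bytes / GIB_BYTES;
}

// True if weights plus KV cache fit within a fraction of total RAM,
// leaving the rest for the OS and other applications.
function fitsInMemory(
  weightsGiB: number,
  shape: ModelShape,
  ctxTokens: number,
  ramGiB: number,
  usableFraction: number = 0.7, // assumption for this sketch
): boolean {
  return weightsGiB + kvCacheGiB(shape, ctxTokens) <= ramGiB * usableFraction;
}

console.log("KV cache at 8K context:", kvCacheGiB(LLAMA_8B_SHAPE, 8192), "GiB");
console.log("8B Q4 at 8K on 16 GB:", fitsInMemory(4.6, LLAMA_8B_SHAPE, 8192, 16));
console.log("8B Q4 at 128K on 16 GB:", fitsInMemory(4.6, LLAMA_8B_SHAPE, 131072, 16));
```

For this shape the f16 KV cache costs 128 KiB per token, so an 8K context adds 1 GiB while a 128K context adds 16 GiB, more than the weights themselves.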

Metering

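llama-server can expose a Prometheus-compatible /metrics endpoint when started with --metrics, but it is off by default and reports server-wide aggregates. For per-request attribution, instrument at the client or proxy: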

Metric	How to capture
Tokens in/out	Parse API response `usage` fields or count locally
Wall-clock latency	Middleware on your OpenAI client
CPU/GPU utilization	`top`, Activity Monitor, `nvidia-smi`

Tag spans with inference.backend=llama-cpp for fair comparison against Ollama and cloud APIs.
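
A minimal sketch of client-side metering follows. It assumes the server returns the `usage` object in the OpenAI-compatible response shape (`prompt_tokens`, `completion_tokens`); the `Span` record layout and both function names are hypothetical and not tied to any tracing library.

```typescript
// Subset of the `usage` object in OpenAI-compatible chat completion responses.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

// One metering record per request. The layout is hypothetical.
interface Span {
  "inference.backend": string;
  tokensIn: number;
  tokensOut: number;
  latencyMs: number;
  tokensPerSecond: number;
}

// Convert a response's usage block plus wall-clock timestamps into a span.
function toSpan(usage: Usage, startedMs: number, endedMs: number, backend: string = "llama-cpp"): Span {
  const latencyMs = endedMs - startedMs;
  return {
    "inference.backend": backend,
    tokensIn: usage.prompt_tokens,
    tokensOut: usage.completion_tokens,
    latencyMs: latencyMs,
    // Wall-clock rate: includes prompt processing time, not just generation.
    tokensPerSecond: latencyMs > 0 ? (usage.completion_tokens * 1000) / latencyMs : 0,
  };
}

// Aggregate spans for a session or a backend comparison.
function summarize(spans: Span[]): { tokensIn: number; tokensOut: number; meanLatencyMs: number } {
  let tokensIn = 0;
  let tokensOut = 0;
  let totalLatencyMs = 0;
  for (const span of spans) {
    tokensIn += span.tokensIn;
    tokensOut += span.tokensOut;
    totalLatencyMs += span.latencyMs;
  }
  const meanLatencyMs = spans.length > 0 ? totalLatencyMs / spans.length : 0;
  return { tokensIn: tokensIn, tokensOut: tokensOut, meanLatencyMs: meanLatencyMs };
}

const demoSpans: Span[] = [
  toSpan({ prompt_tokens: 20, completion_tokens: 64 }, 1000, 3000),
  toSpan({ prompt_tokens: 35, completion_tokens: 128 }, 5000, 9000),
];
console.log(summarize(demoSpans));
```

In practice the timestamps would come from a clock read before and after each request, and the same record shape can be reused with a different `backend` value for Ollama or cloud calls.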

Workload fit

Profile	Fit
Coding assistants (solo dev)	Strong on Mac/Linux with Q5 8B
Batch / offline at scale	Weak — throughput per dollar lags vLLM on GPU
RAG pipelines	Acceptable for prototypes; trim chunks aggressively
Multi-tenant API	Poor: batching is limited and there are no per-tenant controls

Guardrails

Set -n (max tokens) on every CLI invocation
Cap -c (context) per task — do not default to 128K
For shared llama-server, add reverse-proxy rate limits
Hybrid route: local llama.cpp for draft, cloud for production commits
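
The guardrails above can also be enforced in code on the client side. The sketch below clamps `max_tokens` on every request and routes by stage. The cap values, the `Stage` type, and the function names are hypothetical policy choices for illustration only.

```typescript
// Subset of an OpenAI-compatible chat completion request.
interface ChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  max_tokens?: number;
}

// Hypothetical policy values; tune per task.
const MAX_TOKENS_CAP = 512;
const DEFAULT_MAX_TOKENS = 256;

// Return a copy of the request with max_tokens always set and clamped to [1, MAX_TOKENS_CAP].
function withCaps(req: ChatRequest): ChatRequest {
  const requested = req.max_tokens !== undefined ? req.max_tokens : DEFAULT_MAX_TOKENS;
  const clamped = Math.min(Math.max(1, requested), MAX_TOKENS_CAP);
  return { ...req, max_tokens: clamped };
}

// Hybrid routing: drafts stay local, production work goes to a cloud backend.
type Stage = "draft" | "production";

function chooseBackend(stage: Stage): "llama-cpp" | "cloud" {
  return stage === "draft" ? "llama-cpp" : "cloud";
}

const capped = withCaps({
  model: "local", // placeholder model name
  messages: [{ role: "user", content: "Refactor this function" }],
  max_tokens: 4096,
});
console.log(capped.max_tokens, chooseBackend("draft"));
```

Client-side caps complement reverse-proxy rate limits rather than replace them, since a shared llama-server cannot assume every client applies them.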

Anti-patterns

Anti-pattern	Why it fails
CPU-only 70B inference for interactive IDE	Latency measured in minutes
Max context on 8 GB RAM Mac	Swap thrashing, system freeze
Skipping quant benchmarks	Q2 models save RAM but often fail code tasks
Running llama-server without auth on LAN	Anyone on the network can send requests to your model; set --api-key or bind to 127.0.0.1

When to graduate

Signal	Move to
Need `ollama pull` ergonomics	Ollama
Multiple backends, one API URL	LocalAI
>50 concurrent users, GPU cluster	vLLM
Hugging Face enterprise deploy	TGI

Self-hosting — stack comparison
Ollama — llama.cpp with batteries included
Context hygiene — trim before local inference
llama.cpp repository