Ollama
Ollama is the fastest path from a downloaded open-weight model to a working local assistant. One binary, ollama pull, and an OpenAI-compatible HTTP API on port 11434. Most teams encounter Ollama through IDE integrations (Zed, Aider, Cline) before they run it in production.
This guide covers when Ollama saves money, how to meter GPU time fairly, and how to wire hybrid routing without quality regressions.
Expected impact
| Scenario | Typical outcome |
|---|---|
| Replace cloud API for exploratory coding | 60–100% savings on that traffic — if GPU is already owned |
| Replace cloud API on dedicated GPU you bought for Ollama | Break-even depends on utilization; idle GPU erodes savings |
| Replace frontier cloud for production user traffic | Often negative — quality and throughput gaps |
Ollama optimizes for developer experience, not maximum tokens per second. Treat it as a dev and exploration stack unless you have measured otherwise.
What Ollama is good at
- Local dev and IDE assistants: Zed, Aider, Continue, Open WebUI
- Quick model swaps: ollama pull llama3.2 and test in minutes
- Air-gapped or offline work: no API keys, no egress
- OpenAI-compatible clients: point baseURL at http://localhost:11434/v1
What Ollama is not
- A multi-tenant production serving layer with autoscaling and request queuing at datacenter scale
- Continuous batching across hundreds of concurrent users (use vLLM or TGI)
- A substitute for frontier models on hard reasoning without benchmark proof
Meter GPU time, not absence of API bills
Ollama does not send you an invoice. Your cost is hardware depreciation + electricity + opportunity cost of VRAM.
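The cost statement above can be turned into a per-request number. The following is a minimal sketch, not a definitive model: every field name is hypothetical, all inputs are your own measurements or estimates, idle power draw is ignored, and the opportunity cost of VRAM is left out because it has no general formula.

```typescript
// Hypothetical cost inputs; none of these names come from Ollama.
interface GpuCostInputs {
  hardwarePriceUsd: number;     // purchase price of the GPU
  usefulLifeHours: number;      // depreciation horizon in wall-clock hours
  powerDrawWatts: number;       // average draw under inference load
  electricityUsdPerKwh: number; // local electricity tariff
  utilization: number;          // fraction of wall-clock time the GPU is busy (0 < u <= 1)
}

// Cost of one busy GPU-second. Depreciation accrues on wall-clock time,
// so idle hours are charged to busy hours by dividing by utilization.
// Electricity is counted for busy time only (a simplification).
function costPerGpuSecond(c: GpuCostInputs): number {
  if (c.utilization <= 0 || c.utilization > 1) {
    throw new RangeError("utilization must be in (0, 1]");
  }
  const depreciationPerHour = c.hardwarePriceUsd / c.usefulLifeHours;
  const electricityPerHour = (c.powerDrawWatts / 1000) * c.electricityUsdPerKwh;
  return (depreciationPerHour / c.utilization + electricityPerHour) / 3600;
}

// Local cost of a single request that kept the GPU busy for gpuSeconds.
function localRequestCostUsd(gpuSeconds: number, c: GpuCostInputs): number {
  return gpuSeconds * costPerGpuSecond(c);
}
```

Halving utilization doubles the depreciation share of every request, which is the mechanism behind "idle GPU erodes savings".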
Track per session:
| Metric | Why it matters |
|---|---|
| Wall-clock per request | Slow local inference can cost more engineer time than API tokens |
| GPU utilization % | 7B model on a 24GB card at 15% load wastes headroom |
| VRAM resident models | Each loaded model consumes memory; ollama ps shows what’s loaded |
| Tokens generated | Compare against cloud pricing for the same model tier |
Prohibited: assuming local is free because there is no Stripe charge.
Required: tag spans with inference.backend=ollama and log GPU-seconds alongside token counts.
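One way to satisfy the tagging requirement is to build a span record from the timing fields that Ollama's native /api/chat and /api/generate responses report (durations in nanoseconds). This is a sketch under assumptions: only the inference.backend tag comes from this guide, the other attribute names and the span name are hypothetical, and GPU-seconds is approximated as prompt evaluation time plus generation time.

```typescript
// Subset of the fields in a native Ollama response; durations are nanoseconds.
interface OllamaTimings {
  model: string;
  prompt_eval_count?: number;    // may be absent, e.g. when the prompt was cached
  prompt_eval_duration?: number;
  eval_count: number;
  eval_duration: number;
}

// Hypothetical span shape; adapt it to your tracing library.
interface InferenceSpan {
  name: string;
  attributes: Record<string, string | number>;
}

const NANOS_PER_SECOND = 1e9;

function spanFromOllamaResponse(res: OllamaTimings): InferenceSpan {
  const promptNanos = res.prompt_eval_duration ?? 0;
  // Approximation: time spent evaluating the prompt plus generating tokens.
  const gpuSeconds = (promptNanos + res.eval_duration) / NANOS_PER_SECOND;
  return {
    name: "llm.inference",
    attributes: {
      "inference.backend": "ollama",
      "inference.model": res.model,
      "inference.gpu_seconds": gpuSeconds,
      "inference.prompt_tokens": res.prompt_eval_count ?? 0,
      "inference.completion_tokens": res.eval_count,
    },
  };
}
```

Logging GPU-seconds next to token counts lets a weekly review divide one by the other and compare the result against cloud pricing.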
Model selection
Ollama ships curated model families. Match model size to VRAM and task:
| VRAM | Practical models | Workload fit |
|---|---|---|
| 8 GB | 3B–7B quantized (Q4_K_M) | Completions, quick questions, exploration |
| 16 GB | 7B–13B quantized, or 8B at Q8 | Most IDE assistant tasks |
| 24 GB+ | 14B–32B quantized; 70B quantized needs roughly 40 GB+ or partial CPU offload | Heavier refactors; still below frontier cloud |
Quantization (Q4, Q5, Q8) trades quality for speed and memory. Benchmark on your prompts before committing — a Q4 model that fails code review costs more in rework than API tokens saved.
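A rough sizing check helps before pulling a model. The sketch below estimates weight memory as parameters times bits per weight; the function names are hypothetical, the bits-per-weight figures are approximations (about 4.8 for Q4_K_M, about 8.5 for Q8_0, 16 for FP16), and the 2 GiB overhead for KV cache and runtime is a placeholder you should replace with a measured value.

```typescript
const BYTES_PER_GIB = 1024 * 1024 * 1024;

// Approximate memory needed for the weights alone, in GiB.
function estimateWeightsGiB(paramsBillions: number, bitsPerWeight: number): number {
  const bytes = paramsBillions * 1e9 * (bitsPerWeight / 8);
  return bytes / BYTES_PER_GIB;
}

// True if weights plus an assumed overhead fit in the given VRAM.
// overheadGiB is a placeholder for KV cache and runtime buffers.
function fitsInVram(
  paramsBillions: number,
  bitsPerWeight: number,
  vramGiB: number,
  overheadGiB = 2,
): boolean {
  return estimateWeightsGiB(paramsBillions, bitsPerWeight) + overheadGiB <= vramGiB;
}
```

Under these assumptions, fitsInVram(7, 4.8, 8) holds while fitsInVram(8, 16, 16) does not. The estimate only narrows the candidates; confirm the real footprint with ollama ps.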
# Pull and run
ollama pull llama3.2
ollama run llama3.2
# List loaded models and memory
ollama ps
# Serve OpenAI-compatible API (default :11434)
ollama serve
OpenAI-compatible API
Most SDKs accept a custom base URL:
import OpenAI from "openai";
const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // required by SDK; Ollama ignores it
});
const response = await client.chat.completions.create({
  model: "llama3.2",
  messages: [{ role: "user", content: "Summarize this function in one line." }],
  max_tokens: 128,
});
Always set max_tokens. Local models do not enforce output ceilings unless you do.
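To make the max_tokens rule hard to forget, the request can pass through a small helper before it reaches the SDK. This is a sketch: withOutputCeiling and the ChatRequest type are hypothetical names, and the default ceiling of 512 is an arbitrary placeholder.

```typescript
// Minimal request shape for illustration; the real SDK type has more fields.
interface ChatRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  max_tokens?: number;
}

// Returns a copy of the request whose max_tokens is always set and never
// exceeds the ceiling. A missing max_tokens falls back to the ceiling itself.
function withOutputCeiling(
  req: ChatRequest,
  ceiling = 512,
): ChatRequest & { max_tokens: number } {
  if (!Number.isInteger(ceiling) || ceiling <= 0) {
    throw new RangeError("ceiling must be a positive integer");
  }
  const requested = req.max_tokens ?? ceiling;
  return { ...req, max_tokens: Math.min(requested, ceiling) };
}
```

The result can be passed straight to client.chat.completions.create(...), so callers that omit max_tokens still get a bounded response.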
IDE integration patterns
Zed
Zed supports Ollama as a provider. Configure in assistant settings, route exploration locally, and escalate to cloud mid-tier for commits. See Tokenminning in Zed.
Aider
Add Ollama to .aider.conf.yml:
model: ollama/llama3.2
Keep architect mode on cloud if local quality fails on multi-file edits.
Cline / Continue / Open WebUI
Any tool that accepts an OpenAI-compatible endpoint can target http://localhost:11434/v1. Narrow context attachments — local models degrade faster than frontier models on bloated prompts.
Hybrid routing
The highest-ROI pattern for most teams: local for draft, cloud for ship.
Exploration / completions → Ollama (7B–13B)
↓ quality check fails OR commit-bound work
Mid-tier cloud API → production merge
Every escalation must be logged:
- Cheaper local model attempted
- Failure signal (lint errors, test failures, human rejection)
- Cloud model selected and marginal cost delta
This implements Article II: the routing mandate for hybrid stacks.
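The routing rule and the three logged fields can be sketched as a single decision function. All type, field, and function names here are hypothetical, the signal priority order is an arbitrary choice, and cost figures are supplied by the caller rather than looked up.

```typescript
type Backend = "ollama" | "cloud-mid-tier";

interface TaskSignal {
  commitBound: boolean;
  lintErrors: number;
  testFailures: number;
  humanRejected: boolean;
}

// The three items the guide requires for every escalation.
interface EscalationLog {
  localModelAttempted: string;
  failureSignal: string;
  cloudModelSelected: string;
  marginalCostDeltaUsd: number;
}

interface RoutingDecision {
  backend: Backend;
  escalation?: EscalationLog;
}

// Returns the first signal that forces escalation, or null to stay local.
function escalationSignal(s: TaskSignal): string | null {
  if (s.humanRejected) return "human rejection";
  if (s.testFailures > 0) return "test failures";
  if (s.lintErrors > 0) return "lint errors";
  if (s.commitBound) return "commit-bound work";
  return null;
}

function route(
  s: TaskSignal,
  localModel: string,
  cloudModel: string,
  localCostUsd: number,
  cloudCostUsd: number,
): RoutingDecision {
  const signal = escalationSignal(s);
  if (signal === null) return { backend: "ollama" };
  return {
    backend: "cloud-mid-tier",
    escalation: {
      localModelAttempted: localModel,
      failureSignal: signal,
      cloudModelSelected: cloudModel,
      marginalCostDeltaUsd: cloudCostUsd - localCostUsd,
    },
  };
}
```

Because the escalation record is part of the return value, a caller cannot switch to the cloud backend without also producing the log entry.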
Guardrails
| Tier | Enforcement |
|---|---|
| Soft | Log Ollama requests with feature tags; weekly GPU utilization review |
| Medium | Block production routes from calling Ollama; dev keys only |
| Hard | CI rejects agent configs without max_tokens; per-session ceilings on agent loops |
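The Medium and Hard tiers lend themselves to a config check that CI can run. The sketch below assumes a hypothetical AgentConfig shape and detects an Ollama target by looking for the default port in the base URL, which is a heuristic that misses non-default ports.

```typescript
// Hypothetical agent config shape; adapt the fields to your own format.
interface AgentConfig {
  name: string;
  environment: "dev" | "production";
  baseURL: string;
  max_tokens?: number;
  maxLoopIterations?: number; // per-session ceiling on agent loops
}

// Returns a list of violations; an empty list means the config passes.
function validateAgentConfig(cfg: AgentConfig): string[] {
  const violations: string[] = [];
  if (cfg.max_tokens === undefined) {
    violations.push(`${cfg.name}: max_tokens is not set`);
  }
  if (cfg.maxLoopIterations === undefined) {
    violations.push(`${cfg.name}: no per-session ceiling on agent loops`);
  }
  // Heuristic: treat any URL on Ollama's default port as an Ollama target.
  if (cfg.environment === "production" && cfg.baseURL.includes(":11434")) {
    violations.push(`${cfg.name}: production route targets Ollama`);
  }
  return violations;
}
```

A CI step could run this over every agent config and fail the build when any list is non-empty.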
Anti-patterns
| Anti-pattern | Why it fails |
|---|---|
| Defaulting to local 70B for every IDE turn | Slow feedback loops; engineer time dominates |
| Running Ollama on a laptop battery during long agent sessions | Thermal throttle + latency spikes |
| Skipping benchmarks vs cloud mid-tier | Teams roll back and assume “local doesn’t work” |
| Loading five models simultaneously | VRAM thrashing; unpredictable latency |
| Treating Ollama as production serving for customer traffic | No autoscaling or SLO tooling out of the box; concurrency and queuing are basic compared with vLLM or TGI |
Troubleshooting
OOM / model won’t load — use a smaller quant (Q4_K_M), smaller model, or unload with ollama stop <model>.
Painfully slow — check GPU is actually used (nvidia-smi); CPU-only inference on large models is often slower than a cheap API.
Quality too low for commits — route only exploration locally; keep cloud mid-tier for merge-bound work.
Context too long — trim attachments per Context hygiene; local models have smaller effective context windows than their advertised limits.
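For the context-length case, attachments can be trimmed to a token budget before the request is built. This is a sketch with loud assumptions: the Attachment type and function names are hypothetical, the list is assumed to be ordered by relevance, and the 4-characters-per-token ratio is a rough heuristic, not the model's tokenizer.

```typescript
interface Attachment {
  path: string;
  content: string;
}

const CHARS_PER_TOKEN = 4; // rough heuristic, not a real tokenizer

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Keeps attachments in their given order while they fit the budget.
// An attachment that does not fit is skipped; a smaller one later in
// the list may still be kept.
function trimAttachments(attachments: Attachment[], tokenBudget: number): Attachment[] {
  const kept: Attachment[] = [];
  let used = 0;
  for (const a of attachments) {
    const cost = estimateTokens(a.content);
    if (used + cost > tokenBudget) continue;
    kept.push(a);
    used += cost;
  }
  return kept;
}
```

Setting the budget well below the advertised context window leaves room for the prompt and the response.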
When Ollama is not enough
Move to vLLM or TGI when you need:
- Continuous batching across concurrent users
- Kubernetes autoscaling and health probes
- Sub-100ms P95 at hundreds of QPS
- Multi-GPU tensor parallelism
For a unified OpenAI gateway over multiple backends, see LocalAI.
Related
- Self-hosting — when on-prem beats cloud
- Tokenminning in Zed — Ollama in a GPU-rendered IDE
- Model routing — cascade escalation patterns
- Ollama documentation