
Ollama

Ollama is the fastest path from a downloaded open-weight model to a working local assistant. Install one binary, run ollama pull, and you have an OpenAI-compatible HTTP API on port 11434. Most teams encounter Ollama through IDE integrations (Zed, Aider, Cline) before they run it in production.

This guide covers when Ollama saves money, how to meter GPU time fairly, and how to wire hybrid routing without quality regressions.

Expected impact

Scenario | Typical outcome
Replace cloud API for exploratory coding | 60–100% savings on that traffic, if the GPU is already owned
Replace cloud API on a dedicated GPU you bought for Ollama | Break-even depends on utilization; idle GPU erodes savings
Replace frontier cloud for production user traffic | Often negative: quality and throughput gaps

Ollama optimizes for developer experience, not maximum tokens per second. Treat it as a dev and exploration stack unless you have measured otherwise.
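The dependence of break-even on utilization can be made concrete with a small cost model. This is a minimal sketch, not a pricing tool: the interface, the field names, and every input value are hypothetical, and it assumes electricity is paid only while the GPU is busy (idle draw is ignored).

```typescript
// Hypothetical inputs; replace every value with your own measurements.
interface GpuCostInputs {
  hardwareCostUsd: number;      // purchase price of the GPU
  amortizationYears: number;    // depreciation horizon
  powerWatts: number;           // average draw while generating
  electricityUsdPerKwh: number; // local electricity rate
  utilization: number;          // fraction of wall-clock time spent generating, 0..1
  tokensPerSecond: number;      // measured throughput while busy
}

const HOURS_PER_YEAR = 24 * 365;

// Cost of owning the GPU for one wall-clock hour.
// Depreciation accrues whether or not the GPU is busy.
function hourlyCostUsd(c: GpuCostInputs): number {
  const depreciation = c.hardwareCostUsd / (c.amortizationYears * HOURS_PER_YEAR);
  // Simplifying assumption: power is charged only for the busy fraction.
  const electricity = (c.powerWatts / 1000) * c.electricityUsdPerKwh * c.utilization;
  return depreciation + electricity;
}

// Effective local price per million generated tokens.
function localUsdPerMillionTokens(c: GpuCostInputs): number {
  if (c.utilization <= 0 || c.tokensPerSecond <= 0) return Infinity;
  const tokensPerHour = c.tokensPerSecond * 3600 * c.utilization;
  return (hourlyCostUsd(c) / tokensPerHour) * 1e6;
}

// True when local generation undercuts the cloud price you would otherwise pay.
function localIsCheaper(c: GpuCostInputs, cloudUsdPerMillionTokens: number): boolean {
  return localUsdPerMillionTokens(c) < cloudUsdPerMillionTokens;
}
```

Because depreciation is a fixed hourly cost, halving utilization roughly doubles the effective price per token, which is why an idle dedicated GPU erodes the savings.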

What Ollama is good at

  • Local dev and IDE assistants: Zed, Aider, Continue, Open WebUI
  • Quick model swaps: ollama pull llama3.2 and test in minutes
  • Air-gapped or offline work: no API keys, no egress
  • OpenAI-compatible clients: point baseURL at http://localhost:11434/v1

What Ollama is not

  • A multi-tenant production serving layer with autoscaling and request queuing at datacenter scale
  • Continuous batching across hundreds of concurrent users (use vLLM or TGI)
  • A substitute for frontier models on hard reasoning without benchmark proof

Meter GPU time, not absence of API bills

Ollama does not send you an invoice. Your cost is hardware depreciation + electricity + opportunity cost of VRAM.

Track per session:

Metric | Why it matters
Wall-clock per request | Slow local inference can cost more engineer time than API tokens
GPU utilization % | A 7B model on a 24GB card at 15% load wastes headroom
VRAM resident models | Each loaded model consumes memory; ollama ps shows what’s loaded
Tokens generated | Compare against cloud pricing for the same model tier

Prohibited: assuming local is free because there is no Stripe charge.

Required: tag spans with inference.backend=ollama and log GPU-seconds alongside token counts.
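One way to produce those tags is to derive them from the timing fields that Ollama's native /api/chat and /api/generate responses report (durations are in nanoseconds). The sketch below is an assumption-laden illustration: only inference.backend comes from this guide, the other attribute names and the toSpanAttributes helper are hypothetical, and total_duration is used as a wall-clock proxy for GPU-seconds rather than a true device-time measurement.

```typescript
// Subset of the timing fields in Ollama's native API responses.
// prompt_eval_count may be absent, so it is treated as optional here.
interface OllamaTimings {
  model: string;
  total_duration: number;     // nanoseconds, includes model load time
  prompt_eval_count?: number; // prompt tokens evaluated
  eval_count: number;         // tokens generated
  eval_duration: number;      // nanoseconds spent generating
}

// Attribute names other than "inference.backend" are illustrative.
interface SpanAttributes {
  "inference.backend": "ollama";
  "inference.model": string;
  "inference.feature": string;
  "tokens.prompt": number;
  "tokens.generated": number;
  "gpu.seconds": number;
  "tokens.per_second": number;
}

const NS_PER_SECOND = 1e9;

// Converts one response's timings into span attributes for your tracer.
function toSpanAttributes(r: OllamaTimings, feature: string): SpanAttributes {
  const evalSeconds = r.eval_duration / NS_PER_SECOND;
  return {
    "inference.backend": "ollama",
    "inference.model": r.model,
    "inference.feature": feature,
    "tokens.prompt": r.prompt_eval_count ?? 0,
    "tokens.generated": r.eval_count,
    // Wall-clock proxy: the GPU is assumed to be occupied for the whole request.
    "gpu.seconds": r.total_duration / NS_PER_SECOND,
    "tokens.per_second": evalSeconds > 0 ? r.eval_count / evalSeconds : 0,
  };
}
```

Summing gpu.seconds per feature tag gives the weekly review a number to set against token counts and cloud pricing.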

Model selection

Ollama ships curated model families. Match model size to VRAM and task:

VRAM | Practical models | Workload fit
8 GB | 3B–7B quantized (Q4_K_M) | Completions, quick questions, exploration
16 GB | 7B–13B quantized, or 8B at Q8 (8B at FP16 needs about 16 GB for weights alone) | Most IDE assistant tasks
24 GB+ | 14B–32B or 70B quantized | Heavier refactors; still below frontier cloud

Quantization (Q4, Q5, Q8) trades quality for speed and memory. Benchmark on your prompts before committing — a Q4 model that fails code review costs more in rework than API tokens saved.
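A quick way to sanity-check a model against a card before pulling it is to estimate weight size from parameter count and bits per weight. The sketch below is a rough approximation: the bits-per-weight values are commonly cited averages rather than exact figures, and the 2 GB default overhead for KV cache and runtime buffers is an assumption that grows with context length.

```typescript
// Approximate average bits per weight; real files vary by architecture.
const BITS_PER_WEIGHT = {
  Q4_K_M: 4.8,
  Q5_K_M: 5.7,
  Q8_0: 8.5,
  F16: 16,
} as const;

type Quant = keyof typeof BITS_PER_WEIGHT;

// Weight size in GB: (billions of params) * (bits per weight) / (8 bits per byte).
function estimateWeightsGb(paramsBillions: number, quant: Quant): number {
  return (paramsBillions * BITS_PER_WEIGHT[quant]) / 8;
}

// overheadGb is a hypothetical allowance for KV cache and runtime buffers.
function fitsInVram(
  paramsBillions: number,
  quant: Quant,
  vramGb: number,
  overheadGb: number = 2,
): boolean {
  return estimateWeightsGb(paramsBillions, quant) + overheadGb <= vramGb;
}
```

For example, fitsInVram(8, "Q8_0", 16) passes while fitsInVram(8, "F16", 16) does not, since 8B parameters at 16 bits already fill the card before any cache is allocated. Treat the result as a first filter only; it says nothing about output quality.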

    # Pull and run
    ollama pull llama3.2
    ollama run llama3.2

    # List loaded models and memory
    ollama ps

    # Serve OpenAI-compatible API (default :11434)
    ollama serve

OpenAI-compatible API

Most SDKs accept a custom base URL:

    import OpenAI from "openai";

    const client = new OpenAI({
      baseURL: "http://localhost:11434/v1",
      apiKey: "ollama", // required by SDK; Ollama ignores it
    });

    const response = await client.chat.completions.create({
      model: "llama3.2",
      messages: [{ role: "user", content: "Summarize this function in one line." }],
      max_tokens: 128,
    });

Always set max_tokens. Local models do not enforce output ceilings unless you do.

IDE integration patterns

Zed

Zed supports Ollama as a provider. Configure in assistant settings, route exploration locally, and escalate to cloud mid-tier for commits. See Tokenminning in Zed.

Aider

Add Ollama to .aider.conf.yml:

model: ollama/llama3.2

Keep architect mode on cloud if local quality fails on multi-file edits.

Cline / Continue / Open WebUI

Any tool that accepts an OpenAI-compatible endpoint can target http://localhost:11434/v1. Narrow context attachments — local models degrade faster than frontier models on bloated prompts.

Hybrid routing

The highest-ROI pattern for most teams: local for draft, cloud for ship.

    Exploration / completions → Ollama (7B–13B)
            ↓ quality check fails OR commit-bound work
    Mid-tier cloud API → production merge

Every escalation must be logged:

  • Cheaper local model attempted
  • Failure signal (lint errors, test failures, human rejection)
  • Cloud model selected and marginal cost delta

This implements Article II: the routing mandate for hybrid stacks.
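The routing rule and the three logged fields can be sketched as a pair of pure functions. Everything here is illustrative: the type names, the signal strings, and the escalationFor helper are hypothetical, and a real router would obtain the lint, test, and review results from your own tooling.

```typescript
type Backend = "ollama" | "cloud-mid-tier";
type TaskKind = "exploration" | "completion" | "commit";

// Outcome of a local attempt; field names are illustrative.
interface LocalResult {
  model: string;
  lintErrors: number;
  testFailures: number;
  humanRejected: boolean;
}

interface EscalationRecord {
  localModelAttempted: string | null; // null when work went straight to cloud
  failureSignal: string;
  cloudModel: string;
  marginalCostUsd: number;            // cloud cost minus local cost
}

// Commit-bound work skips the local tier; everything else starts on Ollama.
function initialBackend(kind: TaskKind): Backend {
  return kind === "commit" ? "cloud-mid-tier" : "ollama";
}

// Returns the first failure signal found, or null if the local result passes.
function failureSignal(r: LocalResult): string | null {
  if (r.lintErrors > 0) return "lint_errors";
  if (r.testFailures > 0) return "test_failures";
  if (r.humanRejected) return "human_rejection";
  return null;
}

// Builds the log entry for an escalation, or null when none is needed.
function escalationFor(
  kind: TaskKind,
  local: LocalResult | null,
  cloudModel: string,
  cloudCostUsd: number,
  localCostUsd: number,
): EscalationRecord | null {
  const signal =
    kind === "commit" ? "commit_bound" : local ? failureSignal(local) : null;
  if (signal === null) return null;
  return {
    localModelAttempted: local ? local.model : null,
    failureSignal: signal,
    cloudModel,
    marginalCostUsd: cloudCostUsd - localCostUsd,
  };
}
```

Returning a record instead of logging inside the function keeps the policy testable and lets the caller decide where escalation logs go.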

Guardrails

Tier | Enforcement
Soft | Log Ollama requests with feature tags; weekly GPU utilization review
Medium | Block production routes from calling Ollama; dev keys only
Hard | CI rejects agent configs without max_tokens; per-session ceilings on agent loops
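The medium and hard tiers lend themselves to a small validation step that CI can run over agent configs. This is a sketch under stated assumptions: the AgentConfig shape and the SessionBudget class are hypothetical, and an Ollama target is detected only by its default port, which will miss instances served on a custom port.

```typescript
// Hypothetical agent config shape; adapt to whatever your agents actually load.
interface AgentConfig {
  name: string;
  baseURL: string;
  environment: "dev" | "production";
  max_tokens?: number;
}

// Assumption: Ollama is reachable on its default port.
const OLLAMA_DEFAULT_PORT = ":11434";

// Returns a list of violations; an empty list means the config passes.
function validateAgentConfig(cfg: AgentConfig): string[] {
  const violations: string[] = [];
  if (cfg.max_tokens === undefined || cfg.max_tokens <= 0) {
    violations.push(cfg.name + ": max_tokens is required");
  }
  const targetsOllama = cfg.baseURL.indexOf(OLLAMA_DEFAULT_PORT) !== -1;
  if (cfg.environment === "production" && targetsOllama) {
    violations.push(cfg.name + ": production routes must not call Ollama");
  }
  return violations;
}

// Per-session ceiling for agent loops: throws once the budget would be exceeded.
class SessionBudget {
  private used = 0;
  private readonly ceiling: number;

  constructor(ceiling: number) {
    this.ceiling = ceiling;
  }

  charge(tokens: number): void {
    if (this.used + tokens > this.ceiling) {
      throw new Error("session ceiling of " + this.ceiling + " tokens exceeded");
    }
    this.used += tokens;
  }

  remaining(): number {
    return this.ceiling - this.used;
  }
}
```

In CI, a non-empty violations list for any config would fail the build; at runtime, the agent loop calls charge before each request and stops when it throws.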

Anti-patterns

Anti-pattern | Why it fails
Defaulting to local 70B for every IDE turn | Slow feedback loops; engineer time dominates
Running Ollama on a laptop battery during long agent sessions | Thermal throttle + latency spikes
Skipping benchmarks vs cloud mid-tier | Teams roll back and assume “local doesn’t work”
Loading five models simultaneously | VRAM thrashing; unpredictable latency
Treating Ollama as production serving for customer traffic | No batching, queuing, or SLO tooling out of the box

Troubleshooting

OOM / model won’t load — use a smaller quant (Q4_K_M), smaller model, or unload with ollama stop <model>.

Painfully slow — check GPU is actually used (nvidia-smi); CPU-only inference on large models is often slower than a cheap API.

Quality too low for commits — route only exploration locally; keep cloud mid-tier for merge-bound work.

Context too long — trim attachments per Context hygiene; local models have smaller effective context windows than their advertised limits.
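Trimming attachments to a fixed budget before the request is sent is one way to keep prompts inside the window a local model handles well. The sketch below is a heuristic, not a tokenizer: the four-characters-per-token estimate is an assumption that varies by model and language, the Attachment shape is hypothetical, and the caller is assumed to pass attachments ordered by relevance.

```typescript
// Hypothetical attachment shape for an IDE assistant request.
interface Attachment {
  path: string;
  content: string;
}

// Crude estimate: about 4 characters per token. Real tokenizers differ.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keeps attachments in the given order, skipping any that would exceed the budget.
function trimToBudget(attachments: Attachment[], budgetTokens: number): Attachment[] {
  const kept: Attachment[] = [];
  let used = 0;
  for (const a of attachments) {
    const cost = estimateTokens(a.content);
    if (used + cost > budgetTokens) continue; // too large; try the next, smaller one
    kept.push(a);
    used += cost;
  }
  return kept;
}
```

Skipping an oversized file rather than stopping at it lets smaller, lower-ranked files still fit; if strict relevance order matters more, replace continue with break.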

When Ollama is not enough

Move to vLLM or TGI when you need:

  • Continuous batching across concurrent users
  • Kubernetes autoscaling and health probes
  • Sub-100ms P95 at hundreds of QPS
  • Multi-GPU tensor parallelism

For a unified OpenAI gateway over multiple backends, see LocalAI.
