Ollama – Tokenminning

Ollama is the fastest path from a downloaded open-weight model to a working local assistant: install one binary, run `ollama pull`, and you get an OpenAI-compatible HTTP API on port 11434. Most teams encounter Ollama through IDE integrations (Zed, Aider, Cline) before they run it in production.

This guide covers when Ollama saves money, how to meter GPU time fairly, and how to wire hybrid routing without quality regressions.

Expected impact

Scenario	Typical outcome
Replace cloud API for exploratory coding	60–100% savings on that traffic — if GPU is already owned
Replace cloud API on dedicated GPU you bought for Ollama	Break-even depends on utilization; idle GPU erodes savings
Replace frontier cloud for production user traffic	Often negative — quality and throughput gaps

Ollama optimizes for developer experience, not maximum tokens per second. Treat it as a dev and exploration stack unless you have measured otherwise.

What Ollama is good at

Local dev and IDE assistants — Zed, Aider, Continue, Open WebUI
Quick model swaps — ollama pull llama3.2 and test in minutes
Air-gapped or offline work — no API keys, no egress
OpenAI-compatible clients — point baseURL at http://localhost:11434/v1

What Ollama is not

A multi-tenant production serving layer with autoscaling and request queuing at datacenter scale
A continuous-batching engine for hundreds of concurrent users (use vLLM or TGI for that)
A substitute for frontier models on hard reasoning without benchmark proof

Meter GPU time, not absence of API bills

Ollama does not send you an invoice. Your cost is hardware depreciation + electricity + opportunity cost of VRAM.
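
As a rough illustration of that cost sum, the sketch below turns depreciation and electricity into an hourly figure and then into an effective price per million generated tokens. Every field name and input value is hypothetical. The opportunity cost of VRAM is left out because it has no obvious unit, and power draw is treated as constant for simplicity.

```typescript
// Hypothetical inputs: you supply all of these; none is an Ollama setting.
interface GpuCostInputs {
  hardwarePriceUsd: number;     // purchase price of the GPU or the whole box
  depreciationYears: number;    // straight-line depreciation period
  powerDrawWatts: number;       // assumed constant average draw
  electricityUsdPerKwh: number;
  utilization: number;          // fraction of wall-clock time spent generating, 0..1
  tokensPerSecond: number;      // measured throughput while busy
}

const HOURS_PER_YEAR = 24 * 365;

// Cost of owning and powering the GPU for one hour, busy or idle.
function hourlyCostUsd(c: GpuCostInputs): number {
  const depreciation = c.hardwarePriceUsd / (c.depreciationYears * HOURS_PER_YEAR);
  const electricity = (c.powerDrawWatts / 1000) * c.electricityUsdPerKwh;
  return depreciation + electricity;
}

// Effective USD per million generated tokens at the given utilization.
function usdPerMillionTokens(c: GpuCostInputs): number {
  if (c.utilization <= 0 || c.tokensPerSecond <= 0) {
    throw new RangeError("utilization and tokensPerSecond must be positive");
  }
  const tokensPerHour = c.tokensPerSecond * 3600 * c.utilization;
  return (hourlyCostUsd(c) / tokensPerHour) * 1_000_000;
}
```

Because the hourly cost is paid whether the GPU is busy or idle, the per-token figure scales inversely with utilization: under these assumptions, the same hardware at 10% utilization costs ten times more per token than at 100%.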

Track per session:

Metric	Why it matters
Wall-clock per request	Slow local inference can cost more engineer time than API tokens
GPU utilization %	A 7B model on a 24 GB card at 15% load wastes headroom
VRAM resident models	Each loaded model consumes memory; `ollama ps` shows what’s loaded
Tokens generated	Compare against cloud pricing for the same model tier

Prohibited: assuming local is free because there is no Stripe charge.

Required: tag spans with inference.backend=ollama and log GPU-seconds alongside token counts.
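
A minimal sketch of such a span record follows. Only the `inference.backend=ollama` tag comes from this guide; the other attribute names, the helper name, and the definition of GPU-seconds (wall-clock seconds multiplied by the number of GPUs held) are assumptions chosen for illustration.

```typescript
// Attribute names other than "inference.backend" are hypothetical.
interface InferenceSpan {
  "inference.backend": "ollama" | "cloud";
  "inference.model": string;
  "inference.feature": string;
  "usage.prompt_tokens": number;
  "usage.completion_tokens": number;
  "gpu.seconds": number;
}

function buildOllamaSpan(args: {
  model: string;
  feature: string;
  promptTokens: number;
  completionTokens: number;
  wallClockMs: number;
  gpuCount?: number; // defaults to a single GPU
}): InferenceSpan {
  const gpuCount = args.gpuCount ?? 1;
  return {
    "inference.backend": "ollama",
    "inference.model": args.model,
    "inference.feature": args.feature,
    "usage.prompt_tokens": args.promptTokens,
    "usage.completion_tokens": args.completionTokens,
    // Assumed definition: seconds the request held the GPU(s).
    "gpu.seconds": (args.wallClockMs / 1000) * gpuCount,
  };
}
```

The returned object can be handed to whatever tracing or logging layer you already use; nothing here depends on a specific telemetry library.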

Model selection

Ollama ships curated model families. Match model size to VRAM and task:

VRAM	Practical models	Workload fit
8 GB	3B–7B quantized (Q4_K_M)	Completions, quick questions, exploration
16 GB	7B–13B quantized, or 8B at Q8	Most IDE assistant tasks
24 GB+	14B–32B or 70B quantized	Heavier refactors; still below frontier cloud

Quantization (Q4, Q5, Q8) trades quality for speed and memory. Benchmark on your prompts before committing — a Q4 model that fails code review costs more in rework than API tokens saved.
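
To sanity-check a model against a VRAM budget before pulling it, a back-of-the-envelope estimate is parameters times bits per weight. The sketch below uses approximate bits-per-weight figures, which vary by model and quantization mix, and a hypothetical fixed headroom for the KV cache and runtime overhead. Treat the output as a first filter, not a guarantee.

```typescript
// Approximate bits per weight. These are rough assumptions, not exact file sizes.
const APPROX_BITS_PER_WEIGHT = {
  F16: 16,
  Q8_0: 8.5,
  Q5_K_M: 5.7,
  Q4_K_M: 4.8,
} as const;

type Quant = keyof typeof APPROX_BITS_PER_WEIGHT;

// Weights-only size in GB (10^9 bytes); ignores KV cache and activations.
function estimateWeightsGb(paramsBillions: number, quant: Quant): number {
  const bytes = paramsBillions * 1e9 * (APPROX_BITS_PER_WEIGHT[quant] / 8);
  return bytes / 1e9;
}

// headroomGb is a hypothetical allowance for KV cache and runtime overhead.
function fitsInVram(
  paramsBillions: number,
  quant: Quant,
  vramGb: number,
  headroomGb = 2,
): boolean {
  return estimateWeightsGb(paramsBillions, quant) + headroomGb <= vramGb;
}
```

Under these assumptions a 7B model at Q4_K_M needs roughly 4 GB for weights and fits an 8 GB card, while an 8B model at F16 needs about 16 GB for weights alone and does not fit a 16 GB card once headroom is counted.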


# Pull and run
ollama pull llama3.2
ollama run llama3.2
 
# List loaded models and memory
ollama ps
 
# Serve OpenAI-compatible API (default :11434)
ollama serve
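
The information behind `ollama ps` is also available over HTTP: Ollama's native API exposes `GET /api/ps`, whose response lists each loaded model with its total size and the portion resident in VRAM. The sketch assumes only the `name`, `size`, and `size_vram` fields (both sizes in bytes) and ignores the rest of the payload. The helper name is hypothetical.

```typescript
// Subset of the /api/ps payload used here; other fields are ignored.
interface PsModel {
  name: string;
  size: number;      // total bytes the loaded model occupies
  size_vram: number; // bytes resident in GPU memory
}

interface PsResponse {
  models: PsModel[];
}

interface LoadedModelSummary {
  name: string;
  vramGb: number;
  gpuFraction: number; // below 1 means part of the model sits in system RAM
}

function summarizeLoaded(ps: PsResponse): LoadedModelSummary[] {
  return ps.models.map((m) => ({
    name: m.name,
    vramGb: m.size_vram / 1e9,
    gpuFraction: m.size > 0 ? m.size_vram / m.size : 0,
  }));
}

// Usage, assuming a local server on the default port:
//   const res = await fetch("http://localhost:11434/api/ps");
//   console.table(summarizeLoaded((await res.json()) as PsResponse));
```

A `gpuFraction` well below 1 is a hint that the model has spilled out of VRAM, which usually shows up as a sharp drop in throughput.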

OpenAI-compatible API

Most SDKs accept a custom base URL:


import OpenAI from "openai";
 
const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // required by SDK; Ollama ignores it
});
 
const response = await client.chat.completions.create({
  model: "llama3.2",
  messages: [{ role: "user", content: "Summarize this function in one line." }],
  max_tokens: 128,
});

Always set max_tokens. Local models do not enforce output ceilings unless you do.
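
One way to make that rule hard to forget is to pass every request through a small helper that fills in or clamps `max_tokens`. The helper name and the default ceiling of 512 are hypothetical; pick a ceiling that matches your workload.

```typescript
// Returns a copy of the request with max_tokens always set and never
// above the ceiling. The original object is not mutated.
function withOutputCeiling<T extends { max_tokens?: number | null }>(
  params: T,
  ceiling = 512,
): T & { max_tokens: number } {
  const requested = params.max_tokens ?? ceiling;
  return { ...params, max_tokens: Math.min(requested, ceiling) };
}

// Usage sketch:
//   const request = withOutputCeiling(
//     { model: "llama3.2", messages: [{ role: "user", content: "Hi" }] },
//     128,
//   );
```

Routing all request construction through one function also gives CI a single place to check that a ceiling is applied.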

IDE integration patterns

Zed

Zed supports Ollama as a provider. Configure it in the assistant settings, route exploration to the local model, and escalate to a mid-tier cloud model for commits. See Tokenminning in Zed.

Aider

Add Ollama to .aider.conf.yml:


model: ollama/llama3.2

Keep architect mode on a cloud model if local quality fails on multi-file edits.

Cline / Continue / Open WebUI

Any tool that accepts an OpenAI-compatible endpoint can target http://localhost:11434/v1. Narrow context attachments — local models degrade faster than frontier models on bloated prompts.
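
Narrowing attachments can be as simple as a greedy pass over a relevance-ordered list with a token budget. The sketch below assumes roughly four characters per token, which is a crude heuristic rather than a real tokenizer, and it assumes the caller has already sorted attachments from most to least relevant. All names are hypothetical.

```typescript
interface Attachment {
  path: string;
  content: string;
}

// Crude estimate: about 4 characters per token. Not a tokenizer.
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// Keeps attachments in order until the budget is used up. Files that do
// not fit are skipped so that smaller, later files can still be included.
function selectAttachments(
  attachments: Attachment[],
  tokenBudget: number,
): Attachment[] {
  const selected: Attachment[] = [];
  let used = 0;
  for (const attachment of attachments) {
    const cost = approxTokens(attachment.content);
    if (used + cost > tokenBudget) continue;
    selected.push(attachment);
    used += cost;
  }
  return selected;
}
```

A stricter variant would stop at the first file that does not fit, which preserves relevance order at the cost of leaving budget unused.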

Hybrid routing

The highest-ROI pattern for most teams: local for draft, cloud for ship.


Exploration / completions → Ollama (7B–13B)
        ↓ quality check fails OR commit-bound work
Mid-tier cloud API → production merge

Every escalation must be logged with:

Cheaper local model attempted
Failure signal (lint errors, test failures, human rejection)
Cloud model selected and marginal cost delta

This implements Article II: the routing mandate for hybrid stacks.
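
The escalation rule and its log record can be sketched as a pure decision function. The type names, field names, and cost inputs are hypothetical; the two triggers (a failed quality check or commit-bound work) and the three logged items follow the list above.

```typescript
type FailureSignal = "lint_errors" | "test_failures" | "human_rejection";

interface LocalAttempt {
  model: string;             // the cheaper local model that was tried
  failures: FailureSignal[]; // empty when the quality check passed
  commitBound: boolean;      // true when the output is headed for a merge
}

interface EscalationRecord {
  localModelAttempted: string;
  failureSignals: FailureSignal[];
  reason: "quality_check_failed" | "commit_bound";
  cloudModel: string;
  marginalCostDeltaUsd: number; // estimated cloud cost minus local cost
}

// Returns null when the local result can be kept, otherwise the record
// that should be logged before calling the cloud model.
function decideEscalation(
  attempt: LocalAttempt,
  cloudModel: string,
  estCloudCostUsd: number,
  estLocalCostUsd: number,
): EscalationRecord | null {
  const failed = attempt.failures.length > 0;
  if (!failed && !attempt.commitBound) return null;
  return {
    localModelAttempted: attempt.model,
    failureSignals: attempt.failures,
    reason: failed ? "quality_check_failed" : "commit_bound",
    cloudModel,
    marginalCostDeltaUsd: estCloudCostUsd - estLocalCostUsd,
  };
}
```

Keeping the decision separate from the network calls makes the routing policy easy to unit test and to audit later from the logged records.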

Guardrails

Tier	Enforcement
Soft	Log Ollama requests with feature tags; weekly GPU utilization review
Medium	Block production routes from calling Ollama; dev keys only
Hard	CI rejects agent configs without `max_tokens`; per-session ceilings on agent loops
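
The medium and hard tiers can be approximated with a config check that runs in CI. The config shape below is hypothetical, and detecting Ollama by its default port in `baseURL` is a heuristic that assumes you have not moved the server to another port.

```typescript
// Hypothetical agent config shape; adapt the fields to your own format.
interface AgentConfig {
  name: string;
  environment: "dev" | "production";
  baseURL: string;
  max_tokens?: number;
}

// Returns a list of violations; an empty list means the config passes.
function validateAgentConfig(cfg: AgentConfig): string[] {
  const errors: string[] = [];
  if (cfg.max_tokens === undefined || cfg.max_tokens <= 0) {
    errors.push(`${cfg.name}: max_tokens must be set to a positive value`);
  }
  // Heuristic: treat the default Ollama port as a sign of a local backend.
  if (cfg.environment === "production" && cfg.baseURL.includes(":11434")) {
    errors.push(`${cfg.name}: production routes must not call Ollama`);
  }
  return errors;
}
```

In a CI job, a non-empty result would fail the build and print the messages.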

Anti-patterns

Anti-pattern	Why it fails
Defaulting to local 70B for every IDE turn	Slow feedback loops; engineer time dominates
Running Ollama on a laptop battery during long agent sessions	Thermal throttle + latency spikes
Skipping benchmarks vs cloud mid-tier	Teams roll back and assume “local doesn’t work”
Loading five models simultaneously	VRAM thrashing; unpredictable latency
Treating Ollama as production serving for customer traffic	No continuous batching, autoscaling, or SLO tooling out of the box

Troubleshooting

OOM / model won’t load: use a smaller quant (Q4_K_M) or a smaller model, or free VRAM by unloading other models with `ollama stop <model>`.

Painfully slow: check that the GPU is actually in use (`nvidia-smi`); CPU-only inference on large models is often slower than a cheap API.

Quality too low for commits: route only exploration locally; keep cloud mid-tier for merge-bound work.

Context too long: trim attachments per Context hygiene; local models often have smaller effective context windows than their advertised limits.
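
For the slow-inference case, a measured throughput number is more useful than an impression. Responses from Ollama's native `/api/generate` and `/api/chat` endpoints include `eval_count` (generated tokens) and `eval_duration` (nanoseconds); the sketch below assumes those two fields and applies a hypothetical threshold of 10 tokens per second, which you should replace with a baseline measured on your own hardware. These fields come from the native endpoints, not necessarily from the `/v1` compatibility layer.

```typescript
// Subset of the final response object from /api/generate or /api/chat.
interface OllamaTimings {
  eval_count: number;    // tokens generated
  eval_duration: number; // generation time in nanoseconds
}

function tokensPerSecond(t: OllamaTimings): number {
  if (t.eval_duration <= 0) return 0;
  return t.eval_count / (t.eval_duration / 1e9);
}

// The default threshold is arbitrary; calibrate it against a known-good run.
function looksTooSlow(t: OllamaTimings, minTokensPerSecond = 10): boolean {
  return tokensPerSecond(t) < minTokensPerSecond;
}
```

Logging this value per request makes a silent fallback to CPU visible as a step change in throughput.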

When Ollama is not enough

Move to vLLM or TGI when you need:

Continuous batching across concurrent users
Kubernetes autoscaling and health probes
Sub-100ms P95 at hundreds of QPS
Multi-GPU tensor parallelism

For a unified OpenAI gateway over multiple backends, see LocalAI.

Related

Self-hosting — when on-prem beats cloud
Tokenminning in Zed — Ollama in a GPU-rendered IDE
Model routing — cascade escalation patterns
Ollama documentation