LLM Token Economics & Engineering Guides

Tokenminning is the deliberate practice of reducing large language model (LLM) token consumption while preserving useful output quality. Teams treat inference tokens as a finite resource and optimize where spend does not convert to measurable value.

Press coverage, job posts, and internal docs may say "token minimizing" instead; that phrase describes the same practice. This wiki standardizes on tokenminning as the engineering term for instrumented, attributed cost control: meter usage, attribute spend per feature, and enforce inference budgets in production.

Tokenminning is the counter-move to tokenmaxxing: maximizing raw AI usage, often tracked on leaderboards by token volume rather than shipped outcomes. When inference bills caught up with that habit, tokenminning became standard engineering practice: meter first, then trim prompts, route models, and cap agent loops.
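As a rough illustration of the meter, attribute, and enforce loop described above, the sketch below records token usage per feature and refuses calls that would push a feature past its budget. The names `TokenMeter`, `BudgetExceeded`, and `record`, and the idea of a flat per-feature token budget, are assumptions made for this example and are not taken from any guide on this wiki.

```python
from collections import defaultdict
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a call would push a feature past its token budget."""


@dataclass
class TokenMeter:
    # Hypothetical per-feature budgets, in total tokens per billing period.
    budgets: dict[str, int]
    # Running token totals, keyed by feature name.
    usage: dict[str, int] = field(default_factory=lambda: defaultdict(int))

    def record(self, feature: str, input_tokens: int, output_tokens: int) -> int:
        """Attribute one call's tokens to a feature and enforce its budget.

        Returns the feature's running total after the call is recorded.
        Features without a configured budget are metered but never blocked.
        """
        total = input_tokens + output_tokens
        budget = self.budgets.get(feature)
        if budget is not None and self.usage[feature] + total > budget:
            raise BudgetExceeded(f"{feature} would exceed {budget} tokens")
        self.usage[feature] += total
        return self.usage[feature]
```

In a real stack the token counts would come from the provider's usage metadata on each response, and a budget breach would more likely trigger an alert or a fallback than a hard exception. The point of the sketch is only that spend is attributed to a named feature before any trimming or routing decision is made.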

Read the full definition →

Where should you start?

Follow the branches — every endpoint links to a guide on this wiki.

Each branch optimizes for a different cost model: remove inference entirely · stretch subscription quota · trim per-token API spend.


Remove AI entirely: Can you stop using AI? → Yes → Remove LLMs from your stack
Subscription / IDE: Can you stop using AI? → No → How do you pay for tokens? → Subscription → Prompt hygiene → IDE guides → Where to start
API billing: Can you stop using AI? → No → How do you pay for tokens? → API per-token → Metering first → Model routing → Prompt hygiene → Prompt caching → Context compaction → Output and RAG → Semantic caching → Local inference → Self-hosting deep dives
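The three paths above can also be read as a small decision function. The following is a minimal sketch that only encodes the branches listed here; the function name `starting_path` and the billing labels `"subscription"` and `"api"` are hypothetical.

```python
def starting_path(can_stop_using_ai: bool, billing: str = "") -> list[str]:
    """Return the ordered list of guides for one branch of the decision tree."""
    if can_stop_using_ai:
        # Cheapest option: no inference at all.
        return ["Remove LLMs from your stack"]
    if billing == "subscription":
        # Flat-rate plans: stretch the quota rather than cut per-token spend.
        return ["Prompt hygiene", "IDE guides", "Where to start"]
    if billing == "api":
        # Per-token billing: meter first, then work down the list.
        return [
            "Metering first",
            "Model routing",
            "Prompt hygiene",
            "Prompt caching",
            "Context compaction",
            "Output and RAG",
            "Semantic caching",
            "Local inference",
            "Self-hosting deep dives",
        ]
    raise ValueError(f"unknown billing model: {billing!r}")
```

For example, `starting_path(False, "api")[0]` yields `"Metering first"`, matching the first step of the API billing path.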

What’s on this wiki

Section	Teaser
Practice	Implementation guides ordered by leverage — metering, prompt hygiene, model routing, caching, and context compression. Each guide includes expected savings ranges and anti-patterns. Start with Where to start.
Constitution	Engineering law for production AI stacks — immutable metering, model routing rules, session caps, prompt lint rules, and CI blocks. Load-bearing guardrails, not style guides.
Manifesto	The philosophical case for treating inference as a scarce resource — why tokenmaxxing fails, what tokenminning advocates, and how unit economics should shape architecture.
Concepts	Definitions for token economics — input vs output tokens, context inflation, model selection, and the full tokenminning definition.
Self-hosting	When on-prem inference beats cloud APIs — GPU metering, workload fit, and deep dives on Ollama, vLLM, llama.cpp, TGI, and LocalAI.
IDEs	Per-editor guides for Cursor, Copilot, Claude Code, Cline, and others — same optimization sequence, different controls for subscription vs API billing.

Docs are also available via MCP and llms.txt for agents and IDE tooling.