What is tokenminning?
Tokenminning is the deliberate practice of reducing large language model (LLM) token consumption while preserving useful output quality. Teams treat inference tokens as a finite resource and cut spend that does not convert to measurable value.
It is the counter-move to tokenmaxxing: encouraging maximum AI usage, often tracked on leaderboards by raw token volume rather than shipped outcomes.
Tokenminning in one sentence
Match each task to the right model tier, trim bloated prompts and context, cap agent spend, and attribute token cost to features and customers — so AI bills grow with productivity, not with defaults and leaderboards.
Enterprise adoption
In mid-2026, major employers reversed course on volume-first AI culture:
| Company | What changed |
|---|---|
| Meta | Limited employee AI use after an exponential cost increase; removed token leaderboards while still planning billions in annual AI spend |
| Uber | Exhausted its projected AI budget for the year in four months; imposed monthly limits on coding tools |
| Walmart | Set caps on different AI products |
| Amazon | Removed tokenmaxxing leaderboards alongside Meta |
The shift moved tokenminning from a niche FinOps practice into mainstream engineering culture. On June 18, 2026, The New York Times covered the reversal in "Tech Workers Maxed Out Their A.I. Use. Now They're Trying to Minimize It."
From tokenmaxxing to tokenminning
Earlier in 2026, the message from many tech companies to employees was simple: use as much AI as possible. Engineers at Meta and Amazon even competed on internal leaderboards that ranked who consumed the most tokens.
A token is the unit of LLM usage that providers bill on, roughly a word fragment. The more tokens your organization burns, the higher the invoice from model vendors. Enterprise contracts combine subscription fees with per-token charges across tens of thousands of workers, so volume-based incentives scaled costs faster than productivity.
When the bills arrived, the tokenmaxxing era ended within months. The reversal underscored a lesson enterprise AI is still learning: volume is not a proxy for value.
Why costs spiraled
Several forces compounded token spend beyond what early budgets assumed.
Frontier models for everything
Newer, more capable models often cost more per token. Anthropic's Fable model, for example, was priced at roughly twice its predecessor, Opus. Many employees had fallen into the habit of reaching for the most powerful model for every task — even when a cheaper one would suffice.
Agentic workflows
Usage patterns shifted from short chat exchanges to agents that work on complex tasks for hours. A simple meeting summary might consume a few hundred tokens; generating code for a new feature can burn tens of thousands. Engineers running agent loops can rack up tens of thousands of dollars in token costs per month.
Context inflation
Longer context windows, multi-step agents, and accumulated tool outputs inflate input token volume on every turn. Per-token prices may look stable while total tokens per request grow 10× or more.
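As a rough illustration of how re-sent history inflates input volume, the sketch below models an agent that adds a fixed number of tokens to its history each turn and re-sends everything on the next call. The numbers (500 base tokens, 1,000 added per turn, 20 turns) are illustrative assumptions, not measurements from any system described here.

```python
def input_tokens_per_turn(base_tokens: int, added_per_turn: int, turns: int) -> list[int]:
    """Input tokens billed on each turn when the full history is re-sent every time."""
    return [base_tokens + added_per_turn * turn for turn in range(turns)]


# Assumed workload: 500-token system prompt, 1,000 tokens of new history per turn.
per_turn = input_tokens_per_turn(base_tokens=500, added_per_turn=1_000, turns=20)

first, last, total = per_turn[0], per_turn[-1], sum(per_turn)
print(first, last, total)  # 500 19500 200000
```

Under these assumptions the last turn sends 39 times as many input tokens as the first, even though the per-token price never changed.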
The measurement problem
CEOs who could not measure AI savviness often defaulted to a crude metric: who uses the most tokens? That philosophy promoted volume over efficiency.
Rob May, chief executive of Neurometric and author of The Tokenminning Manifesto, described the trap: leaderboard culture rewarded consumption, not outcomes. The fix is to measure output, not input.
Salesforce's Marc Benioff reported his company still planned to spend hundreds of millions on AI but now tracked "agentic work units" instead of raw tokens — a metric meant to capture shipped functionality, not model calls. Uber's chief operating officer, Andrew Macdonald, put it plainly: without a direct line from AI spend to useful features shipped, the trade is hard to justify.
Tokenminning is the engineering response: treat tokens as capital, attribute them to features and customers, and optimize where spend does not convert to value.
Core principles
Tokenminning is not about using less AI. It is about using AI strategically:
- Match model to task. Reserve frontier models for work that genuinely needs them. Route routine classification, summarization, and drafting to mid-tier or smaller models. Teams report savings of 60–90% from model selection alone.
- Measure input and output separately. Input tokens often dominate agent workflows because history and tool results are re-sent every turn. Trimming context is frequently higher leverage than shortening responses.
- Set hard budgets. Per-session caps, monthly tool limits, and CI-enforced ceilings prevent runaway agent loops — the same enforcement model described in the Tokenminning Constitution.
- Tie spend to outcomes. Attribute token cost to product features, customer tiers, or agent runs so finance and engineering share one ledger.
Andy Markus, AT&T's chief AI officer, summarized the practical pattern: use the most powerful models for the tasks that require them, and cheaper models for everything else. "For most use cases," he said, "the latest greatest frontier model isn't needed."
Examples of tokenminning in production
Tokenminning shows up as concrete engineering decisions, not abstract FinOps theory.
Model routing. A support bot classifies tickets with a small, fast model ($0.15 per million input tokens) and escalates only ambiguous cases to a frontier model ($15 per million). The user experience is unchanged for 90% of tickets; average cost per resolution drops sharply.
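A minimal sketch of this routing pattern, assuming the classifier returns a confidence score and that every ticket passes through the small model first. The 0.8 threshold, the 2,000 tokens per ticket, and the names `Tier`, `route`, and `blended_cost_per_ticket` are hypothetical; only the two prices come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_million_input: float


SMALL = Tier("small", 0.15)        # price from the example above
FRONTIER = Tier("frontier", 15.00)  # price from the example above


def route(confidence: float, threshold: float = 0.8) -> Tier:
    """Keep confident classifications on the small tier; escalate ambiguous ones."""
    return SMALL if confidence >= threshold else FRONTIER


def blended_cost_per_ticket(tokens_per_ticket: int, escalation_rate: float) -> float:
    """Average input cost per ticket: every ticket hits the small model,
    and only the escalated share also pays for the frontier model."""
    small_cost = tokens_per_ticket * SMALL.usd_per_million_input / 1_000_000
    frontier_cost = tokens_per_ticket * FRONTIER.usd_per_million_input / 1_000_000
    return small_cost + escalation_rate * frontier_cost


# Assumed workload: 2,000 input tokens per ticket, 10% of tickets escalated.
routed = blended_cost_per_ticket(2_000, escalation_rate=0.10)
frontier_only = 2_000 * FRONTIER.usd_per_million_input / 1_000_000
print(f"routed=${routed:.4f} frontier_only=${frontier_only:.4f}")
```

With these assumed figures the routed path costs about $0.0033 per ticket against $0.03 for sending everything to the frontier tier, roughly an 89% reduction in input cost.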
Context compression. An agent that summarized customer calls used to re-send the full 40-message transcript on every tool call. After summarizing history into a 500-token state object between steps, input volume per session fell from ~80,000 tokens to ~12,000—with no measurable quality regression on evaluation sets.
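One way such a compaction step might look, as a sketch only. The `summarize` callable stands in for whatever produces the state object (typically a cheap model call), and `estimate_tokens` is a crude whitespace count rather than a real tokenizer; both are assumptions for illustration.

```python
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def compact_history(
    messages: List[str],
    summarize: Callable[[List[str]], str],
    keep_last: int = 2,
    budget: int = 500,
) -> List[str]:
    """Replace older messages with one summary once history exceeds the budget.

    The most recent `keep_last` messages are kept verbatim so the agent
    still sees the immediate context it is acting on.
    """
    if sum(estimate_tokens(m) for m in messages) <= budget:
        return messages
    split = max(len(messages) - keep_last, 0)
    older, recent = messages[:split], messages[split:]
    if not older:
        return messages
    return [summarize(older)] + recent


# Hypothetical 40-message transcript, about 100 tokens per message.
transcript = ["word " * 100 for _ in range(40)]
compacted = compact_history(transcript, lambda older: f"summary of {len(older)} messages")
print(len(transcript), "->", len(compacted))  # 40 -> 3
```

Whether a summary preserves enough detail is an empirical question, which is why the example above checks for regressions on evaluation sets before adopting it.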
Prompt hygiene. A team removed ceremonial instructions ("You are a helpful assistant…") and replaced prose formatting rules with JSON schema constraints. One high-volume template shrank from 1,200 input tokens to 340 (about 72% fewer input tokens), saving roughly 22% on that feature alone at 8 million requests per month.
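The input-token cut and the feature-level saving differ because output tokens are billed too. The sketch below shows one way the figures could fit together; the output length (650 tokens) and both prices are assumptions chosen for illustration, not values from the source.

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one request with input and output tokens priced separately."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def saving_fraction(before: float, after: float) -> float:
    """Fraction saved when a quantity drops from `before` to `after`."""
    return 1 - after / before


INPUT_PRICE = 0.15   # assumed USD per million input tokens
OUTPUT_PRICE = 0.60  # assumed USD per million output tokens
OUTPUT_TOKENS = 650  # assumed average response length, unchanged by the prompt edit

before = cost_per_request(1_200, OUTPUT_TOKENS, INPUT_PRICE, OUTPUT_PRICE)
after = cost_per_request(340, OUTPUT_TOKENS, INPUT_PRICE, OUTPUT_PRICE)

input_cut = saving_fraction(1_200, 340)
feature_saving = saving_fraction(before, after)
monthly_saving_usd = (before - after) * 8_000_000

print(f"input cut {input_cut:.0%}, feature saving {feature_saving:.0%}, "
      f"about ${monthly_saving_usd:,.0f} per month")
```

Under these assumptions a 72% cut in input tokens yields about a 23% saving on the feature, because output tokens make up most of the per-request cost.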
Hard ceilings. A coding assistant enforces a 50,000-token monthly cap per engineer with graceful degradation: cheaper model, shorter context, then a clear "budget exhausted" message. Runaway agent loops that previously burned $800 in a weekend now stop at the orchestration layer.
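A sketch of how an orchestration layer might apply that degradation ladder before each request. The 75% and 90% thresholds, the context sizes, and the names `Plan` and `plan_request` are hypothetical; only the 50,000-token cap and the order of the steps come from the example above.

```python
from dataclasses import dataclass

MONTHLY_CAP = 50_000  # tokens per engineer per month, from the example above


@dataclass(frozen=True)
class Plan:
    model: str
    max_context_tokens: int
    allowed: bool = True
    message: str = ""


def plan_request(used_tokens: int, cap: int = MONTHLY_CAP) -> Plan:
    """Pick a model and context limit from the share of budget already spent."""
    remaining = cap - used_tokens
    if remaining <= 0:
        return Plan("none", 0, allowed=False, message="budget exhausted")
    if used_tokens >= 0.90 * cap:
        # Final step before the cap: cheaper model and a shorter context.
        return Plan("small", min(2_000, remaining))
    if used_tokens >= 0.75 * cap:
        # First step: switch to the cheaper model, keep the context size.
        return Plan("small", min(8_000, remaining))
    return Plan("frontier", min(8_000, remaining))


for used in (10_000, 40_000, 46_000, 50_000):
    print(used, plan_request(used))
```

Checking the budget before the call, rather than after the invoice, is what stops an unattended agent loop from spending past the cap.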
Attribution. Finance tags every inference call with feature, user_id, and session_id. When "document summarization" spikes 4× month-over-month, engineering can trace it to a RAG change—not debate whether "AI costs went up" in the abstract.
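A minimal sketch of that tagging and roll-up, assuming each call is logged as a record carrying the three tags plus token counts. The record shape and the helper names `tokens_by_feature` and `spikes` are hypothetical, and the sample data is invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class InferenceCall:
    feature: str
    user_id: str
    session_id: str
    input_tokens: int
    output_tokens: int


def tokens_by_feature(calls: Iterable[InferenceCall]) -> Dict[str, int]:
    """Total tokens (input plus output) attributed to each feature."""
    totals: Dict[str, int] = defaultdict(int)
    for call in calls:
        totals[call.feature] += call.input_tokens + call.output_tokens
    return dict(totals)


def spikes(previous: Dict[str, int], current: Dict[str, int], factor: float = 4.0) -> List[str]:
    """Features whose token use grew by at least `factor` month over month."""
    return [
        feature
        for feature, tokens in current.items()
        if previous.get(feature, 0) > 0 and tokens >= factor * previous[feature]
    ]


last_month = tokens_by_feature([
    InferenceCall("document summarization", "u1", "s1", 1_000, 200),
    InferenceCall("ticket triage", "u2", "s2", 300, 50),
])
this_month = tokens_by_feature([
    InferenceCall("document summarization", "u1", "s3", 4_000, 800),
    InferenceCall("ticket triage", "u2", "s4", 300, 50),
])
print(spikes(last_month, this_month))  # ['document summarization']
```

The same records can be grouped by `user_id` or `session_id` to trace a spike down to a customer tier or a single agent run.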
Tokenminning in practice
| Practice | What it avoids |
|---|---|
| Task-based model routing | Paying frontier rates for simple work |
| Context summarization and truncation | Context inflation on every agent turn |
| Cached and reused system prompts | Re-billing identical input on each request |
| Per-feature cost attribution | Unexplained "AI costs went up" line items |
| Monthly and per-session caps | Agent loops that run unbounded |
For step-by-step implementation, see Practice — especially where to start, model routing, and context hygiene.
Glossary
| Term | Definition |
|---|---|
| Tokenminning | Deliberately reducing LLM token consumption while preserving useful output; the named discipline and movement. |
| Token minimizing | Descriptive synonym for tokenminning; common in press and enterprise policy language. |
| Tokenmaxxing | Maximizing raw token volume—via leaderboards, unlimited credits, or defaulting to the largest context and frontier models without quality justification. |
| Token | A billing unit for LLM usage, roughly a word fragment; providers charge per input and output token. |
| Input tokens | Everything sent to the model: prompts, history, retrieved documents, tool schemas. Often dominates agent workloads. |
| Output tokens | Everything the model generates: replies, JSON, tool calls, reasoning text. Usually priced higher per token than input. |
| Context inflation | Growth in tokens per request over time even when per-token prices stay flat—driven by longer histories, RAG over-fetch, and agent loops. |
| Frontier model | The most capable (and typically most expensive) tier from a provider; justified for hard reasoning, not routine tasks. |
| Model selection / routing | Choosing the right model per task; cascade escalation only when cheaper tiers fail quality checks. |
| Agentic workflow | Multi-step AI systems that iterate, call tools, and refine—multiplying token use per user action. |
| Attribution | Tagging inference spend by feature, user, session, or customer so cost maps to product surfaces. |
| FinOps (AI) | Financial operations for AI infrastructure: metering, forecasting, unit economics, and budget enforcement. |
Frequently asked questions
The glossary above covers most terms. A few common questions:
Is tokenminning the same as "token minimizing"? Yes — tokenminning is the coined term for deliberately using fewer LLM tokens to get the same result. Press coverage often says "token minimizing"; engineering teams use tokenminning when they formalize the practice with routing, metering, and enforcement.
Is it about using less AI? No. The goal is strategic use — same features, lower waste. The floor is not zero tokens; it is the smallest spend that still yields correct results for the task.
How much can it save? Benchmarks vary, but production teams commonly report 60–90% savings from model selection and routing, with additional gains from context hygiene, prompt hygiene, and prompt caching. Measurement must come first; see Where to start.
Further reading
- The Constitution — engineering law for production AI cost control
- The Manifesto — the philosophical case for tokenminning
- Where to start — the correct order for minimizing token usage
- Practice guides — actionable techniques ordered by leverage