Tokenminning

What is tokenminning?

Tokenminning is the deliberate practice of reducing large language model (LLM) token consumption while preserving useful output quality. Teams treat inference tokens as a finite resource and cut spend that does not convert to measurable value.

It is the counter-move to tokenmaxxing, the practice of encouraging maximum AI usage, often tracked on leaderboards by raw token volume rather than shipped outcomes.

Tokenminning in one sentence

Match each task to the right model tier, trim bloated prompts and context, cap agent spend, and attribute token cost to features and customers — so AI bills grow with productivity, not with defaults and leaderboards.

Enterprise adoption

In mid-2026, major employers reversed course on volume-first AI culture:

Company | What changed
Meta | Limited employee AI use after an exponential cost increase; removed token leaderboards while still planning billions in annual AI spend
Uber | Exhausted its projected AI budget for the year in four months; imposed monthly limits on coding tools
Walmart | Set caps on different AI products
Amazon | Removed tokenmaxxing leaderboards alongside Meta

The shift moved tokenminning from niche FinOps practice into mainstream engineering culture. On June 18, 2026, The New York Times covered the reversal in "Tech Workers Maxed Out Their A.I. Use. Now They're Trying to Minimize It."

From tokenmaxxing to tokenminning

Earlier in 2026, the message from many tech companies to employees was simple: use as much AI as possible. Engineers at Meta and Amazon even competed on internal leaderboards that ranked who consumed the most tokens.

A token is a unit of LLM usage — roughly a word fragment — that providers bill on. The more tokens your organization burns, the higher the invoice from model vendors. Enterprise contracts combine subscription fees with per-token charges across tens of thousands of workers, so volume-based incentives scaled costs faster than productivity.

When the bills arrived, the tokenmaxxing era ended within months. The reversal underscored a lesson enterprise AI is still learning: volume is not a proxy for value.

Why costs spiraled

Several forces compounded token spend beyond what early budgets assumed.

Frontier models for everything

Newer, more capable models often cost more per token. Anthropic's Fable model, for example, was priced at roughly twice its predecessor, Opus. Many employees had fallen into the habit of reaching for the most powerful model for every task — even when a cheaper one would suffice.

Agentic workflows

Usage patterns shifted from short chat exchanges to agents that work on complex tasks for hours. A simple meeting summary might consume a few hundred tokens; generating code for a new feature can burn tens of thousands. Engineers running agent loops can rack up tens of thousands of dollars in token costs per month.

Context inflation

Longer context windows, multi-step agents, and accumulated tool outputs inflate input token volume on every turn. Per-token prices may look stable while total tokens per request grow 10× or more.
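The compounding effect can be sketched with a few lines of arithmetic. This is a simplified model, not a measurement: it assumes the full history is re-sent on every turn and that each turn adds a fixed number of tokens. The function names and the sample figures (a 500-token system prompt, 2,000 tokens added per turn) are illustrative assumptions.

```python
def input_tokens_at_turn(system_tokens: int, added_per_turn: int, turn: int) -> int:
    """Input tokens for one request at a 1-indexed turn.

    Assumption: every earlier turn left `added_per_turn` tokens in the
    history (message, tool output, reply), and all of it is re-sent.
    """
    return system_tokens + (turn - 1) * added_per_turn


def cumulative_input_tokens(system_tokens: int, added_per_turn: int, turns: int) -> int:
    """Total input tokens billed across a whole session under the same assumption."""
    total = 0
    for turn in range(1, turns + 1):
        total += input_tokens_at_turn(system_tokens, added_per_turn, turn)
    return total
```

Under these assumptions, the first request carries 500 input tokens, the twentieth carries 38,500, and the 20-turn session bills 390,000 input tokens in total, even though the per-token price never changed.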

The measurement problem

CEOs who could not measure AI savviness often defaulted to a crude metric: who uses the most tokens? That philosophy promoted volume over efficiency.

Rob May, chief executive of Neurometric and author of The Tokenminning Manifesto, described the trap: leaderboard culture rewarded consumption, not outcomes. The fix is to measure output, not input.

Salesforce's Marc Benioff said his company still planned to spend hundreds of millions on AI but now tracked "agentic work units" instead of raw tokens, a metric meant to capture shipped functionality rather than model calls. Uber's chief operating officer, Andrew Macdonald, put it plainly: without a direct line from AI spend to useful features shipped, the trade is hard to justify.

Tokenminning is the engineering response: treat tokens as capital, attribute them to features and customers, and cut spend that does not convert to value.

Core principles

Tokenminning is not about using less AI. It is about using AI strategically:

  1. Match model to task. Reserve frontier models for work that genuinely needs them. Route routine classification, summarization, and drafting to mid-tier or smaller models. Teams report savings of 60–90% from model selection alone.
  2. Measure input and output separately. Input tokens often dominate agent workflows because history and tool results are re-sent every turn. Trimming context is frequently higher leverage than shortening responses.
  3. Set hard budgets. Per-session caps, monthly tool limits, and CI-enforced ceilings prevent runaway agent loops — the same enforcement model described in the Tokenminning Constitution.
  4. Tie spend to outcomes. Attribute token cost to product features, customer tiers, or agent runs so finance and engineering share one ledger.

Andy Markus, AT&T's chief AI officer, summarized the practical pattern: use the most powerful models for the tasks that require them, and cheaper models for everything else. "For most use cases," he said, "the latest greatest frontier model isn't needed."

Examples of tokenminning in production

Tokenminning shows up as concrete engineering decisions, not abstract FinOps theory.

Model routing. A support bot classifies tickets with a small, fast model ($0.15 per million input tokens) and escalates only ambiguous cases to a frontier model ($15 per million). The user experience is unchanged for 90% of tickets; average cost per resolution drops sharply.
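A minimal sketch of that routing pattern follows, using the two prices quoted above. Everything else is an assumption made for illustration: the `Classification` type, the 0.8 confidence threshold, and the idea that the small model's confidence score decides escalation. The cost function counts input tokens only and assumes every ticket passes through the small model first.

```python
from dataclasses import dataclass

SMALL_PRICE = 0.15     # USD per million input tokens (figure from the example)
FRONTIER_PRICE = 15.0  # USD per million input tokens (figure from the example)


@dataclass
class Classification:
    label: str
    confidence: float  # hypothetical score from the small model, 0.0 to 1.0


def route(result: Classification, threshold: float = 0.8) -> str:
    """Keep confident classifications on the small model; escalate the rest."""
    return "small" if result.confidence >= threshold else "frontier"


def blended_input_cost(tickets: int, tokens_per_ticket: int, escalation_rate: float) -> float:
    """Input-token cost in USD when only a fraction of tickets is escalated."""
    total_tokens = tickets * tokens_per_ticket
    small_cost = total_tokens * SMALL_PRICE / 1_000_000
    frontier_cost = total_tokens * escalation_rate * FRONTIER_PRICE / 1_000_000
    return small_cost + frontier_cost
```

For a hypothetical 100,000 tickets at 1,000 input tokens each and a 10% escalation rate, this comes to about $165, against $1,500 if every ticket went to the frontier model.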

Context compression. An agent that summarized customer calls used to re-send the full 40-message transcript on every tool call. After summarizing history into a 500-token state object between steps, input volume per session fell from ~80,000 tokens to ~12,000—with no measurable quality regression on evaluation sets.
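One way to picture the change is sketched below. It is not the team's actual implementation: `summarize` stands in for whatever cheap model call produces the state object, and the four-characters-per-token estimate is a rough heuristic assumed here in place of a real tokenizer.

```python
from typing import Callable

CHARS_PER_TOKEN = 4  # rough heuristic (assumption); use the provider's tokenizer in practice


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def compact_history(
    transcript: list[str],
    summarize: Callable[[str], str],
    max_state_tokens: int = 500,
) -> str:
    """Collapse the full transcript into a bounded state string.

    `summarize` is a caller-supplied function (hypothetical), typically a
    call to a small model. The result is truncated to the token budget.
    """
    summary = summarize("\n".join(transcript))
    return summary[: max_state_tokens * CHARS_PER_TOKEN]


def step_input_tokens(state: str, new_message: str) -> int:
    """Input size for the next step: compact state plus the new message only."""
    return estimate_tokens(state) + estimate_tokens(new_message)
```

The design choice is that each step pays for a bounded state object rather than a transcript that grows with every message. Hard truncation is the crudest way to enforce the bound; checking the summary against an evaluation set, as the example describes, is what guards quality.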

Prompt hygiene. A team removed ceremonial instructions ("You are a helpful assistant…") and replaced prose formatting rules with JSON schema constraints. One high-volume template shrank from 1,200 input tokens to 340, a cut of about 72% in input tokens, saving roughly 22% on that feature alone at 8 million requests per month.
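The arithmetic behind a template trim like this is simple enough to check directly. The helper below is an illustrative sketch with a hypothetical name; it covers input tokens only, so it says nothing about the feature's total cost, which also depends on output tokens and pricing.

```python
def template_savings(old_tokens: int, new_tokens: int, requests_per_month: int) -> dict:
    """Input-token effect of shrinking one prompt template."""
    saved_per_request = old_tokens - new_tokens
    return {
        # Fraction of the template's input tokens that was removed.
        "input_reduction": saved_per_request / old_tokens,
        # Input tokens no longer sent each month at the given volume.
        "tokens_saved_per_month": saved_per_request * requests_per_month,
    }
```

With the figures from the example (1,200 tokens down to 340, at 8 million requests per month), the template sends 860 fewer input tokens per request, or 6.88 billion fewer per month.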

Hard ceilings. A coding assistant enforces a 50,000-token monthly cap per engineer with graceful degradation: cheaper model, shorter context, then a clear "budget exhausted" message. Runaway agent loops that previously burned $800 in a weekend now stop at the orchestration layer.
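A ceiling with graceful degradation might look roughly like the sketch below. The 50,000-token cap comes from the example; the tier thresholds (25% and 10% of budget remaining), the model names, and the `TokenBudget` class itself are assumptions chosen for illustration.

```python
from dataclasses import dataclass


class BudgetExhausted(Exception):
    """Raised when an engineer's monthly token budget is fully spent."""


@dataclass
class TokenBudget:
    cap: int = 50_000  # monthly cap per engineer, from the example
    used: int = 0

    def remaining(self) -> int:
        return max(0, self.cap - self.used)

    def record(self, tokens: int) -> None:
        """Add tokens actually consumed by a completed request."""
        self.used += tokens

    def plan(self, requested_tokens: int) -> dict:
        """Choose a model and token allowance for the next request.

        Degradation order: cheaper model first, then shorter context,
        then a clear refusal. Thresholds are illustrative assumptions.
        """
        left = self.remaining()
        if left == 0:
            raise BudgetExhausted("Monthly token budget exhausted")
        fraction_left = left / self.cap
        if fraction_left > 0.25:
            model, allowance = "frontier", requested_tokens
        elif fraction_left > 0.10:
            model, allowance = "small", requested_tokens
        else:
            model, allowance = "small", requested_tokens // 2
        # Never grant more than what is left in the budget.
        return {"model": model, "max_tokens": min(allowance, left)}
```

Calling `plan` before every model request and `record` after it keeps the check in the orchestration layer, so an agent loop hits `BudgetExhausted` instead of continuing to spend.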

Attribution. Finance tags every inference call with feature, user_id, and session_id. When "document summarization" spikes 4× month-over-month, engineering can trace it to a RAG change—not debate whether "AI costs went up" in the abstract.
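As a rough sketch of what that tagging enables, the snippet below aggregates tagged calls by feature and flags month-over-month spikes. The `InferenceCall` record, the `month` field, and both function names are hypothetical; a production ledger would more likely live in a metrics store or data warehouse than in memory.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceCall:
    feature: str
    user_id: str
    session_id: str
    month: str  # e.g. "2026-05" (assumed format)
    input_tokens: int
    output_tokens: int


def tokens_by_feature(calls: list[InferenceCall], month: str) -> dict[str, int]:
    """Sum input and output tokens per feature for one month."""
    totals: dict[str, int] = defaultdict(int)
    for call in calls:
        if call.month == month:
            totals[call.feature] += call.input_tokens + call.output_tokens
    return dict(totals)


def spiking_features(
    calls: list[InferenceCall],
    previous_month: str,
    current_month: str,
    factor: float = 4.0,
) -> list[str]:
    """Features whose token use grew by at least `factor` between two months."""
    before = tokens_by_feature(calls, previous_month)
    now = tokens_by_feature(calls, current_month)
    return sorted(
        feature
        for feature, total in now.items()
        if before.get(feature, 0) > 0 and total >= factor * before[feature]
    )
```

Because every record also carries `user_id` and `session_id`, the same data can be regrouped by customer or by agent run once a spiking feature has been identified.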

Tokenminning in practice

Practice | What it avoids
Task-based model routing | Paying frontier rates for simple work
Context summarization and truncation | Context inflation on every agent turn
Cached and reused system prompts | Re-billing identical input on each request
Per-feature cost attribution | Unexplained "AI costs went up" line items
Monthly and per-session caps | Agent loops that run unbounded

For step-by-step implementation, see Practice — especially where to start, model routing, and context hygiene.

Glossary

Term | Definition
Tokenminning | Deliberately reducing LLM token consumption while preserving useful output; the named discipline and movement.
Token minimizing | Descriptive synonym for tokenminning; common in press and enterprise policy language.
Tokenmaxxing | Maximizing raw token volume through leaderboards, unlimited credits, or defaulting to the largest context and frontier models without quality justification.
Token | A billing unit for LLM usage, roughly a word fragment; providers charge per input and output token.
Input tokens | Everything sent to the model: prompts, history, retrieved documents, tool schemas. Often dominates agent workloads.
Output tokens | Everything the model generates: replies, JSON, tool calls, reasoning text. Usually priced higher per token than input.
Context inflation | Growth in tokens per request over time even when per-token prices stay flat, driven by longer histories, RAG over-fetch, and agent loops.
Frontier model | The most capable (and typically most expensive) tier from a provider; justified for hard reasoning, not routine tasks.
Model selection / routing | Choosing the right model per task; cascade escalation only when cheaper tiers fail quality checks.
Agentic workflow | Multi-step AI systems that iterate, call tools, and refine, multiplying token use per user action.
Attribution | Tagging inference spend by feature, user, session, or customer so cost maps to product surfaces.
FinOps (AI) | Financial operations for AI infrastructure: metering, forecasting, unit economics, and budget enforcement.

Frequently asked questions

The glossary above covers most terms. A few common questions:

Is tokenminning the same as "token minimizing"? Yes — tokenminning is the coined term for deliberately using fewer LLM tokens to get the same result. Press coverage often says "token minimizing"; engineering teams use tokenminning when they formalize the practice with routing, metering, and enforcement.

Is it about using less AI? No. The goal is strategic use — same features, lower waste. The floor is not zero tokens; it is the smallest spend that still yields correct results for the task.

How much can it save? Benchmarks vary, but production teams commonly report 60–90% from model selection and routing, with additional gains from context hygiene, prompt hygiene, and prompt caching. Measurement must come first — see Where to start.


Tokenminning · Built by Narev