Tokenminning in Windsurf

Windsurf (Devin Desktop) re-sends context on every Cascade turn: indexed codebase snippets, pinned files, rules, MCP tool schemas, and growing chat history. Most IDE spend comes from long Cascade sessions, frontier models on routine work, and heavy configuration—not from one verbose reply.

Work through the sections below in order. For the general technique stack, see Where to start. For underlying patterns, see Context hygiene, Model routing, and Prompt hygiene.

Quick checklist

Check the usage meter in Windsurf or your plan page and note daily/weekly quota burn.
Use SWE-1.5 or other free models for routine Cascade work. Reserve Claude/GPT frontier models for tasks that actually need them.
Prefer glob and manual rule triggers over always_on. Trim global rules (6,000-character cap) and workspace rules (12,000 per file).
Disable MCP servers and tools you are not using this week. Cascade has a 100-tool ceiling.
Start a new Cascade conversation for each task—not one marathon thread.
Add .codeiumignore entries for generated assets, vendored deps, and build output so indexing and retrieval stay lean.

Typical impact when you follow the list: 50–80% quota savings by routing routine work to SWE models; 20–50% on input by trimming rules and MCP; 30–60% less context growth from shorter Cascade sessions. Benchmark on your own usage meter—your mix of Cascade vs Command vs Tab and default models will differ from anyone else’s.

How Windsurf bills a request

Each Cascade turn sends your prompt plus everything Windsurf attaches: open files, RAG-retrieved snippets from the indexed codebase, pinned context, active rules, and conversation history.

Self-serve plans (Pro, Max, Teams) use a quota-based system: a daily and weekly allowance that refreshes automatically. Quota consumption scales with tokens processed, and cost per token varies by model. Short requests with narrow context burn less than long threads over large codebases. Free models like SWE-1.5 do not count against quota at all.

Enterprise may bill in Agent Compute Units (ACUs) or legacy prompt credits depending on contract. Legacy credit plans charge per Cascade message to a premium model (one credit per prompt regardless of how many tool steps follow).

Windsurf also uses prompt caching on frontier models. Follow-up messages in the same conversation with the same model reuse cached context at reduced cost—similar to Cursor’s cache read tokens. Switching models mid-thread loses that cache. See the token pricing example in Windsurf’s quota docs for a worked breakdown (input cache write, cache read, output, tool calls).
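To make the caching effect concrete, here is a minimal sketch of the per-turn arithmetic. The `Rates` values used in the note below are hypothetical placeholders, not Windsurf's or any provider's actual pricing, and tool-call charges are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical USD rates per million tokens (placeholders, not real pricing)."""
    input_cache_write: float
    cache_read: float
    output: float


def turn_cost(rates: Rates, cached_tokens: int, new_input_tokens: int,
              output_tokens: int) -> float:
    """Cost of one turn in USD.

    cached_tokens    -- prefix already cached from earlier turns (billed as cache read)
    new_input_tokens -- input seen for the first time (billed as cache write)
    output_tokens    -- tokens the model generates
    """
    total = (cached_tokens * rates.cache_read
             + new_input_tokens * rates.input_cache_write
             + output_tokens * rates.output)
    return total / 1_000_000
```

With assumed rates of `Rates(input_cache_write=3.75, cache_read=0.30, output=15.0)`, a follow-up turn with 40,000 cached tokens, 2,000 new input tokens, and 1,000 output tokens comes to about $0.035. The same turn right after a model switch, where all 42,000 input tokens are uncached, comes to about $0.173.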

What does not burn Cascade quota:

Command (Cmd/Ctrl+I) inline edits — no premium credits required
Autocomplete / Tab — separate from Cascade metering
Auto-generated Memories — creating and retrieving memories does not consume credits (but retrieved memories still add tokens when attached)

1. Measure first

Where to look:

In-editor usage meter — remaining daily/weekly quota and reset timing
windsurf.com/subscription/manage-plan — plan details and extra-usage balance
Cascade Stats for Nerds (stats icon on chat messages) — per-message context statistics
Plans and Usage — how allowances, extra usage, and enterprise ACUs work

Teams / Enterprise: Analytics in the dashboard, or the Cascade Analytics API for model usage, credit consumption, and tool statistics.

After a heavy Cascade week, check whether quota burned on frontier models, long sessions, or configuration bloat. That tells you which section below to prioritize.

2. Match the model to the task

See AI Models for current rates and availability. This is Windsurf’s version of Model routing: default cheap, escalate only on failure.

Start here:

Tab / Autocomplete: completions and single-line edits (Autocomplete overview)
Command (Cmd/Ctrl+I): current-file inline generation and edits without premium quota
SWE-1.5 / SWE-1.6: log checks, grep-style questions, renames, most Cascade agent work (free on quota plans)
Claude Sonnet / GPT mid-tier: multi-file refactors when SWE models stall
Claude Opus / GPT frontier / thinking variants: deep debugging or novel design only

Costs more than you expect:

Frontier third-party models: quota scales with tokens processed; large indexed codebases inflate every turn
SWE-1.5 Fast / fast Opus variants: higher per-token cost for speed (extra usage pricing)
Thinking / extended-reasoning models: extra reasoning tokens bill as output
Switching models mid-conversation: cache miss on the new provider; start a new Cascade chat instead

Windsurf’s own guidance: stay on one frontier model per task so caching kicks in, and use free SWE models for routine work.
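The "default cheap, escalate only on failure" habit can be modeled as a short ladder. This is a sketch of the manual workflow, not a Windsurf API: the tier names in `LADDER` are illustrative placeholders, and `attempt` is a hypothetical callable standing in for "open a fresh Cascade chat on that model and see whether the task gets done".

```python
from typing import Callable, Optional, Sequence, Tuple

# Cheapest tier first. Placeholder labels, not real Windsurf model identifiers.
LADDER: Tuple[str, ...] = ("swe-1.5", "claude-sonnet", "claude-opus")


def route(task: str,
          attempt: Callable[[str, str], Optional[str]],
          ladder: Sequence[str] = LADDER) -> Tuple[str, str]:
    """Try each tier in order and return (model, result) for the first success.

    `attempt(model, task)` returns a result string on success or None on failure.
    Each call stands for a new conversation, so no cache is carried across tiers.
    """
    for model in ladder:
        result = attempt(model, task)
        if result is not None:
            return model, result
    raise RuntimeError(f"all tiers failed for task: {task}")
```

The point of the ordering is that the expensive tiers are only reached after a cheaper one has actually failed, rather than being the default for every message.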

3. Trim what rides along every request

Input bloat in Windsurf usually comes from indexing surface area, pinned context, and rules—not your prompt text alone.

Indexing and retrieval

Windsurf indexes your full local codebase and retrieves snippets via RAG (Context Awareness). Fast Context uses SWE-grep subagents to retrieve faster with less irrelevant code, but the indexed surface still matters.

Add a .codeiumignore (gitignore syntax) for node_modules/, build output, generated code, and large assets
Enterprise: optional global ~/.codeium/.codeiumignore applies across all workspaces (Cascade overview)
Pin only what retrieval misses; see Context pinning below
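One way to decide what belongs in .codeiumignore is to measure which top-level directories dominate the workspace by size. The helper names below are hypothetical, and byte size is only a rough proxy for indexing surface; treat the output as candidates to review, not as a list to paste blindly.

```python
import os
from collections import Counter
from typing import List


def bytes_per_top_level_dir(root: str) -> Counter:
    """Sum regular-file sizes under each top-level directory of `root`.

    Files directly in `root` are counted under the key ".".
    """
    totals: Counter = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted.
            if os.path.isfile(path) and not os.path.islink(path):
                totals[top] += os.path.getsize(path)
    return totals


def ignore_candidates(root: str, limit: int = 5) -> List[str]:
    """Largest top-level directories, formatted as gitignore-style entries."""
    ranked = bytes_per_top_level_dir(root).most_common()
    return [f"{name}/" for name, _size in ranked if name != "."][:limit]
```

Running `ignore_candidates(".")` from the workspace root typically surfaces dependency and build directories first; source directories that appear in the list should stay indexed.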

Context pinning and persistent context

The Chat Advanced tab and Cascade Customizations panel hold persistent context:

Pinned Contexts: files, directories, and snippets always in scope. Pin only what you need; over-pinning slows responses and inflates tokens
Custom Chat Instructions: short orientation prompts. Keep them brief
Active Document: current file gets special focus; close unrelated tabs when possible

For deterministic, minimal context, use @-mentions (Chat overview): @filename, @function, @directory, @diff. Prefer @my_function over pasting entire files.

Rules

Rules in .devin/rules/ (preferred) or .windsurf/rules/ are injected based on their activation mode:

Mode	Context cost
`always_on` / root `AGENTS.md`	Every Cascade message
`model_decision`	Description always; full rule on demand
`glob`	Only when matching files are touched
`manual` (`@rule-name`)	Only when you @mention it

Prefer glob and manual over always_on
Global rules cap at 6,000 characters; workspace rules at 12,000 per file
Do not duplicate the same instructions in rules, AGENTS.md, and .windsurfrules
Use AGENTS.md for directory-scoped conventions instead of always-on root rules when possible
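The character caps quoted above are easy to check mechanically. The sketch below assumes rule files are `.md` files sitting directly in the rules directory and that the cap applies to the whole file, front matter included; both are assumptions, so confirm them against the rules documentation before relying on the numbers.

```python
from pathlib import Path
from typing import List, Tuple

GLOBAL_RULE_CAP = 6_000      # characters, global rules (figure quoted above)
WORKSPACE_RULE_CAP = 12_000  # characters per workspace rule file (figure quoted above)


def oversized_rules(rules_dir: str,
                    cap: int = WORKSPACE_RULE_CAP) -> List[Tuple[str, int]]:
    """Return (file name, character count) for each rule file longer than `cap`.

    Assumes rules are *.md files directly inside `rules_dir`.
    """
    over = []
    for path in sorted(Path(rules_dir).glob("*.md")):
        length = len(path.read_text(encoding="utf-8"))
        if length > cap:
            over.append((path.name, length))
    return over
```

For example, `oversized_rules(".windsurf/rules")` lists workspace rules past the per-file cap. Calling it with a much lower `cap` is a quick way to find the longest rules when deciding what to move from always_on to glob or manual.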

MCP servers

Each enabled MCP server adds tool schemas to Cascade context, even when no tool is called. Cascade supports up to 100 total tools.

Disable servers and individual tools you are not using this week
One narrow, task-specific server beats five overlapping ones
Config lives in ~/.codeium/windsurf/mcp_config.json (MCP docs)
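A quick audit of MCP configuration might look like the sketch below. It assumes the config file keeps servers under a top-level `mcpServers` object, which is an assumption about the file layout rather than something stated above. Tool counts are not stored in the config, so `tools_per_server` is a hypothetical mapping you would fill in by hand from the MCP panel.

```python
import json
from typing import Dict, List, Tuple

TOOL_CEILING = 100  # total Cascade tool limit quoted above


def configured_servers(config_text: str) -> List[str]:
    """Server names found under the (assumed) top-level 'mcpServers' key."""
    return sorted(json.loads(config_text).get("mcpServers", {}))


def tool_budget(tools_per_server: Dict[str, int],
                ceiling: int = TOOL_CEILING) -> Tuple[int, int]:
    """Return (total tools enabled, headroom left under the ceiling).

    Negative headroom means the enabled tools exceed the ceiling.
    """
    total = sum(tools_per_server.values())
    return total, ceiling - total
```

Every server listed by `configured_servers` is a candidate for "am I using this this week?". The headroom from `tool_budget` only covers the hard ceiling; schema tokens are paid on every turn well before that limit is reached.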

Memories

Auto-generated Memories do not consume quota, but Cascade retrieves them when relevant—adding tokens. For durable, team-shareable knowledge, write a Rule or AGENTS.md entry instead of asking Cascade to “remember” it.

`@` mentions and codebase search

Cascade has Search, Analyze, and Fast Context built in. You rarely need to pin an entire package for a one-line fix.

Prefer a focused prompt: “fix spacing in Navbar.tsx only”
Use @function or @filename instead of pasting full file contents
Submit with ⌘⏎, which forces codebase context, only when needed (Chat overview)

See Context hygiene for the general just-in-time retrieval pattern.

New conversation per task

Start a new Cascade conversation when you finish one task and begin another, when you switch models, when follow-up turns feel slow or expensive, or when Cascade loops on a stuck problem. Every new message includes prior context—longer threads mean bigger token packages per turn.
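The growth is easy to see with a little arithmetic. The sketch below is a simplified model: it assumes a fixed base context per turn plus a constant amount of history added by each earlier turn, and it ignores prompt caching, which lowers the price of repeated tokens but not their count.

```python
def cumulative_input_tokens(turns: int, base_context: int,
                            tokens_added_per_turn: int) -> int:
    """Total input tokens sent across a conversation.

    Turn i (0-based) resends the base context plus the history
    accumulated by the i turns before it.
    """
    return sum(base_context + i * tokens_added_per_turn for i in range(turns))


# One 30-turn marathon thread versus three 10-turn conversations,
# using illustrative numbers (8,000 base tokens, 1,500 added per turn).
marathon = cumulative_input_tokens(30, 8_000, 1_500)    # 892,500 tokens
split = 3 * cumulative_input_tokens(10, 8_000, 1_500)   # 442,500 tokens
```

Under these assumed numbers, splitting the same 30 turns into three conversations roughly halves the input tokens sent, because history cost grows with the square of thread length.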

4. Write tighter prompts

Windsurf-specific versions of Prompt hygiene and Prompt engineering:

Too broad:


Fix this bug. Also review the whole auth system and suggest improvements.

Scoped:


Fix ONLY the null check in auth/login.ts line 42.
No explanations. Max 1 file changed.

Batch related fixes in one message instead of five separate Cascade turns. Review diffs before accepting—each rejected revision is another output bill.

5. Set spending guardrails

Windsurf does not enforce your inference budget. You set the limits.

Glance at the usage meter after heavy Cascade sessions
Know your plan’s daily and weekly allowance (Quota docs)
Enable extra-usage caps on your plan page before on-demand API billing kicks in
Team admins: usage configuration API for per-user add-on caps; MCP whitelist to block unapproved servers
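Since the limits are yours to set, a simple pacing check can turn a glance at the usage meter into a decision. This is a hypothetical helper, not a Windsurf feature: you supply the fraction of the allowance used (read off the meter) and the fraction of the daily or weekly period that has elapsed.

```python
def pace_ratio(quota_used_fraction: float, period_elapsed_fraction: float) -> float:
    """Ratio of quota burn to time elapsed; 1.0 means exactly on pace."""
    if not 0 < period_elapsed_fraction <= 1:
        raise ValueError("period_elapsed_fraction must be in (0, 1]")
    if not 0 <= quota_used_fraction <= 1:
        raise ValueError("quota_used_fraction must be in [0, 1]")
    return quota_used_fraction / period_elapsed_fraction


def over_pace(quota_used_fraction: float, period_elapsed_fraction: float,
              tolerance: float = 1.2) -> bool:
    """True when burn is running more than `tolerance` times ahead of the clock."""
    return pace_ratio(quota_used_fraction, period_elapsed_fraction) > tolerance
```

For instance, 60% of the weekly allowance gone with 40% of the week elapsed gives a ratio of about 1.5, which is a cue to route more work to free models or shorten sessions. The 1.2 default tolerance is an arbitrary starting point.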

For metering and caps in products you ship, see Article I and Article IV.

Troubleshooting

Quota drains fast on SWE-1.5: unlikely on current plans, where SWE-1.5 is free. Check that you are not on a legacy credit plan or using a frontier default.

High token count per message: over-pinned context, always_on rules, or a large indexed codebase. Trim pins, switch rules to glob or manual, and expand .codeiumignore.

High output: verbose Cascade replies, a thinking model, or many revision cycles. Write tighter prompts, use a SWE model, and review diffs before accepting.

Slow retrieval, irrelevant files: indexing too broad or pins too wide. Add .codeiumignore entries and narrow pins to the “lowest common denominator” directory (context pinning best practices).

Spike after enabling MCP: tool schemas attach on every turn. Disable unused servers and tools; stay under the 100-tool limit.

Spike after switching models mid-chat: cache miss on the new provider. Start a new conversation when switching.

Hit daily cap early: long Cascade threads with Claude/GPT defaults. Start a new conversation per task and route routine work to SWE models.

When Windsurf optimization is not enough

Trimming Windsurf configuration does not fix production agent loops. If customer-facing features dominate spend, instrument with per-feature tags and apply Context hygiene, Prompt caching, and Output and RAG. Narev provides normalized USD across providers if you need cross-provider cost math.