Tokenminning in Windsurf
Windsurf (Devin Desktop) re-sends context on every Cascade turn: indexed codebase snippets, pinned files, rules, MCP tool schemas, and growing chat history. Most IDE spend comes from long Cascade sessions, frontier models on routine work, and heavy configuration—not from one verbose reply.
Work through the sections below in order. For the general technique stack, see Where to start. For underlying patterns, see Context hygiene, Model routing, and Prompt hygiene.
Quick checklist
- Check the usage meter in Windsurf or your plan page and note daily/weekly quota burn.
- Use SWE-1.5 or other free models for routine Cascade work. Reserve Claude/GPT frontier models for tasks that actually need them.
- Prefer glob and manual rule triggers over always_on. Trim global rules (6,000-character cap) and workspace rules (12,000 characters per file).
- Disable MCP servers and tools you are not using this week. Cascade has a 100-tool ceiling.
- Start a new Cascade conversation for each task—not one marathon thread.
- Add .codeiumignore entries for generated assets, vendored deps, and build output so indexing and retrieval stay lean.
Typical impact when you follow the list: 50–80% quota savings by routing routine work to SWE models; 20–50% on input by trimming rules and MCP; 30–60% less context growth from shorter Cascade sessions. Benchmark on your own usage meter—your mix of Cascade vs Command vs Tab and default models will differ from anyone else’s.
How Windsurf bills a request
Each Cascade turn sends your prompt plus everything Windsurf attaches: open files, RAG-retrieved snippets from the indexed codebase, pinned context, active rules, and conversation history.
Self-serve plans (Pro, Max, Teams) use a quota-based system: a daily and weekly allowance that refreshes automatically. Quota consumption scales with tokens processed, and cost per token varies by model. Short requests with narrow context burn less than long threads over large codebases. Free models like SWE-1.5 do not count against quota at all.
Enterprise may bill in Agent Compute Units (ACUs) or legacy prompt credits depending on contract. Legacy credit plans charge per Cascade message to a premium model (one credit per prompt regardless of how many tool steps follow).
Windsurf also uses prompt caching on frontier models. Follow-up messages in the same conversation with the same model reuse cached context at reduced cost—similar to Cursor’s cache read tokens. Switching models mid-thread loses that cache. See the token pricing example in Windsurf’s quota docs for a worked breakdown (input cache write, cache read, output, tool calls).
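To make the caching effect concrete, here is a minimal arithmetic sketch. The per-million-token prices and token counts are hypothetical placeholders, not Windsurf's published rates, and tool calls are left out for brevity. It compares a follow-up turn that reuses cached context with the same turn sent after a model switch.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Hypothetical USD prices per million tokens; not Windsurf's real rates."""
    input_cache_write: float = 3.75
    cache_read: float = 0.30
    output: float = 15.00


def turn_cost(new_input: int, cached_input: int, output: int, rates: Rates) -> float:
    """Cost of one turn: new context is written to cache, prior context is read from it."""
    total = (
        new_input * rates.input_cache_write
        + cached_input * rates.cache_read
        + output * rates.output
    )
    return total / 1_000_000


rates = Rates()
# Turn 1: 40k tokens of fresh context, nothing cached yet.
first = turn_cost(new_input=40_000, cached_input=0, output=1_000, rates=rates)
# Turn 2 on the same model: the 41k prior tokens are cache reads, 2k are new.
follow_up = turn_cost(new_input=2_000, cached_input=41_000, output=1_000, rates=rates)
# Turn 2 after switching models: all 43k tokens are written fresh.
after_switch = turn_cost(new_input=43_000, cached_input=0, output=1_000, rates=rates)

print(f"first={first:.4f} follow_up={follow_up:.4f} after_switch={after_switch:.4f}")
```

Under these assumed rates the cached follow-up costs a fraction of the post-switch turn, which is why staying on one model per conversation matters. Substitute the figures from the quota docs to get numbers for your own plan.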
What does not burn Cascade quota:
- Command (Cmd/Ctrl+I) inline edits: no premium credits required
- Autocomplete / Tab: metered separately from Cascade
- Auto-generated Memories: creating and retrieving memories does not consume credits (but retrieved memories still add tokens when attached)
1. Measure first
Where to look:
- In-editor usage meter — remaining daily/weekly quota and reset timing
- windsurf.com/subscription/manage-plan — plan details and extra-usage balance
- Cascade Stats for Nerds (stats icon on chat messages) — per-message context statistics
- Plans and Usage — how allowances, extra usage, and enterprise ACUs work
Teams / Enterprise: Analytics in the dashboard, or the Cascade Analytics API for model usage, credit consumption, and tool statistics.
After a heavy Cascade week, check whether quota burned on frontier models, long sessions, or configuration bloat. That tells you which section below to prioritize.
2. Match the model to the task
See AI Models for current rates and availability. This is Windsurf’s version of Model routing: default cheap, escalate only on failure.
Start here:
- Tab / Autocomplete: completions and single-line edits (Autocomplete overview)
- Command (Cmd/Ctrl+I): current-file inline generation and edits without premium quota
- SWE-1.5 / SWE-1.6: log checks, grep-style questions, renames, most Cascade agent work (free on quota plans)
- Claude Sonnet / GPT mid-tier: multi-file refactors when SWE models stall
- Claude Opus / GPT frontier / thinking variants: deep debugging or novel design only
Costs more than you expect:
- Frontier third-party models: quota scales with tokens processed; large indexed codebases inflate every turn
- SWE-1.5 Fast / fast Opus variants: higher per-token cost for speed (extra usage pricing)
- Thinking / extended-reasoning models: extra reasoning tokens bill as output
- Switching models mid-conversation: cache miss on the new provider; start a new Cascade chat instead
Windsurf’s own guidance: stay on one frontier model per task so caching kicks in, and use free SWE models for routine work.
3. Trim what rides along every request
Input bloat in Windsurf usually comes from indexing surface area, pinned context, and rules—not your prompt text alone.
Indexing and retrieval
Windsurf indexes your full local codebase and retrieves snippets via RAG (Context Awareness). Fast Context uses SWE-grep subagents to retrieve faster with less irrelevant code, but the indexed surface still matters.
- Add a .codeiumignore (gitignore syntax) for node_modules/, build output, generated code, and large assets
- Enterprise: optional global ~/.codeium/.codeiumignore applies across all workspaces (Cascade overview)
- Pin only what retrieval misses (see Context Pinning below)
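One rough way to find candidates for .codeiumignore is to see which top-level directories hold the most bytes. The sketch below is a hypothetical helper, not a Windsurf tool. File size is only a proxy for indexed surface, so treat the output as a list of places to look rather than a token count.

```python
import os
from pathlib import Path


def bytes_per_top_level_dir(root: Path) -> dict[str, int]:
    """Return the total file size in bytes under each top-level directory of root."""
    totals: dict[str, int] = {}
    for entry in root.iterdir():
        # Only directories are candidates; .git is not part of the working tree.
        if not entry.is_dir() or entry.name == ".git":
            continue
        size = 0
        for dirpath, _dirnames, filenames in os.walk(entry):
            for filename in filenames:
                file_path = Path(dirpath) / filename
                if file_path.is_file():  # skips broken symlinks
                    size += file_path.stat().st_size
        totals[entry.name] = size
    return totals


if __name__ == "__main__":
    totals = bytes_per_top_level_dir(Path("."))
    largest = sorted(totals.items(), key=lambda item: item[1], reverse=True)[:10]
    for name, size in largest:
        print(f"{size / 1_000_000:10.1f} MB  {name}/")
```

Run it from the repository root. Directories such as build output or vendored dependencies that dominate the list are usually the first entries worth ignoring.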
Context pinning and persistent context
The Chat Advanced tab and Cascade Customizations panel hold persistent context:
- Pinned Contexts: files, directories, and snippets always in scope. Pin only what you need; over-pinning slows responses and inflates tokens
- Custom Chat Instructions: short orientation prompts. Keep them brief
- Active Document: current file gets special focus; close unrelated tabs when possible
For deterministic, minimal context, use @-mentions (Chat overview): @filename, @function, @directory, @diff. Prefer @my_function over pasting entire files.
Rules
Rules in .devin/rules/ (preferred) or .windsurf/rules/ are injected based on their activation mode:
| Mode | Context cost |
|---|---|
| always_on / root AGENTS.md | Every Cascade message |
| model_decision | Description always; full rule on demand |
| glob | Only when matching files are touched |
| manual (@rule-name) | Only when you @mention it |
- Prefer glob and manual over always_on
- Global rules cap at 6,000 characters; workspace rules at 12,000 per file
- Do not duplicate the same instructions in rules, AGENTS.md, and .windsurfrules
- Use AGENTS.md for directory-scoped conventions instead of always-on root rules when possible
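The character caps above are easy to check mechanically. The following is a minimal sketch, assuming rule files are Markdown files with a .md extension stored directly in the rules directory. That layout is an assumption, so adjust the glob pattern if your rules are stored differently.

```python
from pathlib import Path

GLOBAL_RULE_CAP = 6_000      # characters, per the limits stated above
WORKSPACE_RULE_CAP = 12_000  # characters per file


def oversized_rules(rules_dir: Path, cap: int) -> list[tuple[str, int]]:
    """Return (file name, character count) for every rule file over the cap."""
    if not rules_dir.is_dir():
        return []
    results = []
    # Assumption: rules are *.md files directly inside rules_dir.
    for path in sorted(rules_dir.glob("*.md")):
        length = len(path.read_text(encoding="utf-8"))
        if length > cap:
            results.append((path.name, length))
    return results


if __name__ == "__main__":
    for directory in (Path(".devin/rules"), Path(".windsurf/rules")):
        for name, length in oversized_rules(directory, WORKSPACE_RULE_CAP):
            print(f"{directory / name}: {length} chars (cap {WORKSPACE_RULE_CAP})")
```

Running this from the workspace root prints nothing when every rule file fits. Pass GLOBAL_RULE_CAP and your global rules location to check those separately.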
MCP servers
Each enabled MCP server adds tool schemas to Cascade context, even when no tool is called. Cascade supports up to 100 total tools.
- Disable servers and individual tools you are not using this week
- One narrow, task-specific server beats five overlapping ones
- Config lives in ~/.codeium/windsurf/mcp_config.json (MCP docs)
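A quick audit of which servers are configured can start from that file. This sketch assumes the config keeps its servers under a top-level "mcpServers" object, which is an assumption about the file layout, so check it against the MCP docs. It lists servers only; the number of tools each one exposes is not visible in the config.

```python
import json
from pathlib import Path

# Path taken from the text above; the "mcpServers" key is an assumption.
CONFIG_PATH = Path.home() / ".codeium" / "windsurf" / "mcp_config.json"


def configured_servers(config: dict) -> list[str]:
    """Return the sorted names of all MCP servers declared in a parsed config."""
    return sorted(config.get("mcpServers", {}))


if __name__ == "__main__":
    if CONFIG_PATH.exists():
        config = json.loads(CONFIG_PATH.read_text(encoding="utf-8"))
        for name in configured_servers(config):
            print(name)
    else:
        print(f"No config found at {CONFIG_PATH}")
```

Any server on the list that you have not used this week is a candidate for disabling.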
Memories
Auto-generated Memories do not consume quota, but Cascade retrieves them when relevant—adding tokens. For durable, team-shareable knowledge, write a Rule or AGENTS.md entry instead of asking Cascade to “remember” it.
@ mentions and codebase search
Cascade has Search, Analyze, and Fast Context built in. You rarely need to pin an entire package for a one-line fix.
- Prefer a focused prompt: “fix spacing in Navbar.tsx only”
- Use @function or @filename instead of pasting full file contents
- Submit with ⌘⏎ to force codebase context only when needed (Chat overview)
See Context hygiene for the general just-in-time retrieval pattern.
New conversation per task
Start a new Cascade conversation when you finish one task and begin another, when you switch models, when follow-up turns feel slow or expensive, or when Cascade loops on a stuck problem. Every new message includes prior context—longer threads mean bigger token packages per turn.
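The growth is easy to underestimate because it compounds: each turn re-sends everything before it. The sketch below uses hypothetical numbers (8k tokens of rules, pins, and schemas per turn, 3k tokens of new history per turn) to compare one 30-turn thread with three 10-turn threads.

```python
def total_input_tokens(turns: int, base_context: int, tokens_per_turn: int) -> int:
    """Sum the input tokens sent across a thread where each turn re-sends prior history."""
    total = 0
    history = 0
    for _ in range(turns):
        # Every turn carries the fixed context, all earlier turns, and the new message.
        total += base_context + history + tokens_per_turn
        history += tokens_per_turn
    return total


# Hypothetical workload: the same 30 turns, as one thread or as three.
marathon = total_input_tokens(turns=30, base_context=8_000, tokens_per_turn=3_000)
split = 3 * total_input_tokens(turns=10, base_context=8_000, tokens_per_turn=3_000)

print(f"marathon={marathon:,} split={split:,} ratio={split / marathon:.2f}")
```

With these assumed numbers the marathon thread sends 1,635,000 input tokens and the split version 735,000, about 45% as many. Prompt caching lowers the price of re-sent tokens but does not remove them, so the shape of the curve still holds.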
4. Write tighter prompts
Windsurf-specific versions of Prompt hygiene and Prompt engineering:
Too broad:
Fix this bug. Also review the whole auth system and suggest improvements.
Scoped:
Fix ONLY the null check in auth/login.ts line 42.
No explanations. Max 1 file changed.
Batch related fixes in one message instead of five separate Cascade turns. Review diffs before accepting: each rejected revision is another output bill.
5. Set spending guardrails
Windsurf does not enforce your inference budget. You set the limits.
- Glance at the usage meter after heavy Cascade sessions
- Know your plan’s daily and weekly allowance (Quota docs)
- Enable extra-usage caps on your plan page before on-demand API billing kicks in
- Team admins: usage configuration API for per-user add-on caps; MCP whitelist to block unapproved servers
For metering and caps in products you ship, see Article I and Article IV.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Quota drains fast on SWE-1.5 | Unlikely on current plans (SWE-1.5 is free) | Check you are not on a legacy credit plan or using a frontier default |
| High token count per message | Over-pinned context, always_on rules, or large indexed codebase | Trim pins; switch rules to glob/manual; expand .codeiumignore |
| High output | Verbose Cascade, thinking model, or many revision cycles | Tighter prompts; SWE model; review before accepting |
| Slow retrieval, irrelevant files | Indexing too broad or pins too wide | Add a .codeiumignore; narrow pins to the “lowest common denominator” directory (context pinning best practices) |
| Spike after enabling MCP | Tool schemas attach per turn | Disable unused servers/tools; stay under the 100-tool limit |
| Spike after switching models mid-chat | Cache miss on new provider | New conversation when switching |
| Hit daily cap early | Long Cascade threads with Claude/GPT defaults | New conversation per task; route routine work to SWE models |
When Windsurf optimization is not enough
Trimming Windsurf configuration does not fix production agent loops. If customer-facing features dominate spend, instrument with per-feature tags and apply Context hygiene, Prompt caching, and Output and RAG. Narev provides normalized USD across providers if you need cross-provider cost math.