Teaching LLMs to Stop Wasting Tokens
How we reduced tool calls by 45% using a simple credit budget system instead of prompt engineering. LLMs respond better to scarce resources than vague instructions.
An LLM with tool access is like a junior developer with sudo privileges and zero impulse control. Ask them to find a string in a codebase, and they'll happily read 47 entire files when a single grep would do.
We experienced this firsthand. Our agent was burning through tokens reading every file in sight, even when a quick scan would have answered the question. The LLM was wildly inefficient, and inefficiency at scale gets expensive fast.
The budget system
Instead of trying to prompt our way out of this (we tried; it didn't work), we added a simple constraint: tools now cost credits.
Every tool has a cost:
- grep or list_directory: 1 credit
- read_file: 5-10 credits depending on size
- parse_ast: 15 credits
- run_tests: 20 credits
The LLM starts each task with a budget of about 100 credits. Every tool call decrements it. When the budget runs low, it has to think harder about what it actually needs.
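In Python pseudocode, the accounting looks roughly like this. This is a minimal sketch, not our production code: the ToolBudget class, the file-size cutoff, and the exact "bump to 10 credits" rule are illustrative stand-ins.

```python
# Sketch of the credit accounting. Names and thresholds are illustrative.
TOOL_COSTS = {
    "grep": 1,
    "list_directory": 1,
    "read_file": 5,    # bumped to 10 for large files
    "parse_ast": 15,
    "run_tests": 20,
}

class ToolBudget:
    def __init__(self, credits: int = 100):
        self.credits = credits

    def charge(self, tool_name: str, file_size: int = 0) -> bool:
        """Deduct the tool's cost; return False if the call is unaffordable."""
        cost = TOOL_COSTS[tool_name]
        if tool_name == "read_file" and file_size > 50_000:  # arbitrary cutoff
            cost = 10
        if cost > self.credits:
            return False  # surfaced to the model as "budget exceeded"
        self.credits -= cost
        return True
```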
It became even more useful (and cool) when we gave it an extend_budget tool. It can request more credits, but it has to justify why and by how much. No handwaving. Specific reasoning or the request gets denied.
Example:
Tool: extend_budget
Reasoning: "Need to read 3 config files (30 credits) to trace the authentication flow
across modules. Current approach of grepping found the entry point but not the full chain."
Requested: 30 credits
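Deciding whether to grant the request is the interesting part. Here's a rough sketch of the gate, reusing the ToolBudget sketch from above; the "specific enough" check and the cap are placeholder heuristics, not the exact policy we run.

```python
# Extension gate sketch. The vagueness check and MAX_EXTENSION are assumptions.
MAX_EXTENSION = 50  # assumed cap, not from the post

def extend_budget(budget: ToolBudget, reasoning: str, requested: int) -> bool:
    """Grant extra credits only for bounded, concrete requests."""
    too_vague = len(reasoning.split()) < 10   # handwaving
    too_greedy = requested > MAX_EXTENSION    # oversized ask
    if too_vague or too_greedy:
        return False  # denied; the model has to retry with better reasoning
    budget.credits += requested
    return True
```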
The results
45% reduction in tool calls across 100+ code reviews. Zero drop in accuracy. The LLM just got more strategic. It greps before reading. It lists directories before parsing. It "thinks".
The budget extensions are fascinating too. About 12% of tasks request them, and 90% of those requests are legitimate (complex refactoring reviews, architectural analysis). We're now using the 10% that aren't to tune the default budget.
What didn't work
We tried prompt engineering first: "Only use tools when necessary," "Prefer cheaper tools," "Think before reading files." The LLM nodded politely and kept reading everything.
We tried fixed limits: "Maximum 10 tool calls per task." It hit the limit every time, even on trivial reviews.
LLMs just respond better to scarce resources than vague instructions. Who knew? (Anyone who's worked with humans, probably.)
What's next
We're experimenting with dynamic budgets based on task complexity. Give bigger credit allocations to large or complex PRs.
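Roughly something like this, where the thresholds and increments are placeholders we'd still have to tune, not numbers we've settled on:

```python
# Hypothetical sizing rule for the dynamic-budget experiment.
def initial_budget(files_changed: int, lines_changed: int) -> int:
    budget = 100                                   # today's flat default
    if files_changed > 10 or lines_changed > 500:
        budget += 50                               # large PR: more reads and parses
    if files_changed > 30 or lines_changed > 2000:
        budget += 100                              # sprawling refactor
    return budget
```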
If you're building agents with tool access and watching your bills climb, try adding a budget. It's surprisingly effective at teaching LLMs to be less... enthusiastic.
Building CodeReviewr, an AI code review tool that charges per token instead of per developer. Turns out we're pretty motivated to optimize token usage.