The assumption baked into most LLM agent designs today is that tools make agents better. Give a model access to web search, code execution, and API endpoints, and it should outperform a model that reasons purely in context. Two papers that appeared on arXiv this week push back on that assumption with different levels of force — and together they trace a coherent critique.

The first, “Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents”, introduces what the authors call a tool-use tax. Using a Factorized Intervention Framework that isolates three factors independently — prompt formatting overhead, tool-calling protocol overhead, and the actual gains from tool execution — they show that in the presence of semantic distractors, tool-augmented reasoning does not necessarily beat native chain-of-thought. The protocol overhead can exceed the benefit. The proposed remedy is G-STEP, a lightweight inference-time gating classifier that decides whether to invoke a tool for a given reasoning step. The authors are careful to note that G-STEP is a partial fix: gains from smarter dispatch are limited by the model’s underlying tool-interaction capability.
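To make the gating idea concrete, here is a minimal sketch of what an inference-time gate in the spirit of G-STEP might look like. The features, weights, and threshold are illustrative assumptions of mine, not the paper's actual classifier:

```python
from dataclasses import dataclass

@dataclass
class StepFeatures:
    needs_external_info: float  # e.g. unresolved entities or dates in the step
    distractor_density: float   # how much irrelevant material surrounds the step
    model_confidence: float     # confidence of the native, no-tool continuation

def gate_score(f: StepFeatures, w=(1.2, -0.8, -1.0), bias=0.0) -> float:
    """Linear score; positive means a tool call is expected to help."""
    return (w[0] * f.needs_external_info
            + w[1] * f.distractor_density
            + w[2] * f.model_confidence
            + bias)

def should_call_tool(f: StepFeatures, threshold: float = 0.0) -> bool:
    return gate_score(f) > threshold

# A step that clearly needs fresh information: the gate lets the call through.
print(should_call_tool(StepFeatures(0.9, 0.2, 0.3)))   # True
# A step the model can finish natively amid heavy distractors: the gate blocks it.
print(should_call_tool(StepFeatures(0.1, 0.8, 0.9)))   # False
```

The point of a gate like this is not the particular features but where it sits: the decision to dispatch happens before the protocol cost is paid, per reasoning step, rather than being left to whatever token the model happens to emit.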

None of this will surprise practitioners who have spent time debugging tool-calling agents in production. The model needs to format the function call correctly, parse and validate the result, integrate it into its current context, and continue — each step is an error surface. What the paper does usefully is quantify the overhead in isolation, which is harder than it sounds. Most benchmarks measure end-to-end task performance; teasing apart the format cost, the protocol cost, and the execution gain requires careful experimental design.
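To see where those error surfaces live, here is a hypothetical tool-call round trip with each failure point marked. The helper names (generate_call, the registry lookup) are made up for illustration and real frameworks differ, but every numbered stage is a place where protocol overhead can quietly eat the gain:

```python
import json

def tool_round_trip(model, tool_registry, context: str) -> str:
    raw = model.generate_call(context)              # 1. the model must emit valid call syntax
    try:
        call = json.loads(raw)                      # 2. the call must parse
    except json.JSONDecodeError:
        return context + "\n[tool call dropped: malformed JSON]"

    tool = tool_registry.get(call.get("name"))      # 3. the named tool must exist
    if tool is None:
        return context + "\n[tool call dropped: unknown tool]"

    try:
        result = tool(**call.get("arguments", {}))  # 4. arguments must match the signature
    except TypeError as exc:
        return context + f"\n[tool error: {exc}]"

    # 5. the result must be folded back into the context without crowding out
    #    the reasoning that is already there.
    return context + "\n[tool result] " + json.dumps(result)
```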

The second paper comes from the opposite direction. “Position: Agentic AI Orchestration Should Be Bayes-Consistent”, a 30-author position paper accepted at ICML 2026, argues that the control layer of an LLM agent system — the orchestrator that decides what action to take next — should apply Bayesian decision theory. Making the underlying LLMs fully Bayesian is computationally intractable, but applying Bayesian principles at the orchestration level is both feasible and, the authors argue, clearly correct. A Bayes-consistent orchestrator maintains calibrated probability distributions over task-relevant unknowns, updates those beliefs as tool results arrive, and selects actions — including tool invocations — by expected utility rather than by pattern-matching on the last token the model emitted.
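A toy rendering of that framing looks like the sketch below. The numbers for tool cost and success probability, and the binary "is the answer already derivable from context" hypothesis, are my own simplifying assumptions rather than anything specified in the paper:

```python
def posterior(prior: float, lik_if_answerable: float, lik_if_not: float) -> float:
    """Bayes update for a binary hypothesis after one observation."""
    num = lik_if_answerable * prior
    return num / (num + lik_if_not * (1.0 - prior))

def choose_action(p_answerable: float,
                  tool_cost: float = 0.15,     # protocol overhead, i.e. the tool-use tax
                  tool_success: float = 0.9):  # chance the tool actually resolves the unknown
    eu_answer_now = p_answerable                           # utility 1 if right, 0 if wrong
    eu_call_tool = max(p_answerable, tool_success) - tool_cost
    action = "answer_now" if eu_answer_now >= eu_call_tool else "call_tool"
    return action, eu_answer_now, eu_call_tool

# Belief is high that the context already suffices: paying the tax isn't worth it.
print(choose_action(0.85))   # answers now (0.85 vs 0.75 expected utility)
# Belief is low, so the expected gain from the tool outweighs its cost.
print(choose_action(0.30))   # calls the tool (0.30 vs 0.75 expected utility)

# After evidence strongly consistent with "answerable" arrives (say, a search
# result), the belief moves up and the next dispatch decision changes with it.
print(posterior(0.30, 0.8, 0.1))   # ~0.77
```

Even in this stripped-down form, the structure matches the paper's prescription: beliefs are explicit and updated as evidence arrives, and the tool call has to earn its cost in expectation.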

Put the two papers together and the picture is coherent: tool calls are more expensive than benchmarks suggest, the rational response to that cost is an orchestrator that reasons under uncertainty about whether to call a tool, and the naive “call a tool whenever the model outputs a tool-use token” design that most frameworks use is neither rational nor efficient. The G-STEP paper offers a discriminative gating approach (learn to classify tool-call vs no-tool-call); the Bayesian paper offers a more principled probabilistic framing of the same problem.
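For contrast, the dispatch loop that most frameworks implement today is roughly the following. This is schematic, not any particular library's code, and it exists only to make the "neither rational nor efficient" charge concrete:

```python
def naive_dispatch(model_output: dict, tools: dict):
    # Fire whenever the model emits a tool-call structure. There is no
    # estimate of whether the call's expected benefit clears its protocol cost.
    if "tool_call" in model_output:
        call = model_output["tool_call"]
        return tools[call["name"]](**call["arguments"])
    return model_output.get("text", "")
```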

Both papers arrive in the same week as Lars Faye’s “Agentic Coding Is a Trap”, which touched a nerve — 313 upvotes on Hacker News — by arguing that heavy agent reliance causes developer skill atrophy. Faye’s concern is about humans, not system design. But the underlying observation rhymes with what the arXiv papers find: adding autonomous action doesn’t automatically improve outcomes, and the costs are often diffuse enough that you don’t notice them until you measure carefully.

The interesting design question the Tool-Use Tax paper leaves open is what the right comparison baseline is. If you are comparing “agent with tools” against “agent without tools” on the same task, and the task requires information the model doesn’t have in context, then of course the tool wins — you need the tool. The paper’s contribution is showing that even when the tool is available and relevant, the protocol overhead can make the model worse than no-tool reasoning in distractor-rich environments. The practical implication is that agents should be designed to avoid calling tools they don’t need, which sounds obvious but requires exactly the kind of calibrated uncertainty the Bayesian paper describes.

Neither paper produces a ready-to-deploy system. But the G-STEP framing and the Bayes-consistent orchestration framework point toward a generation of agent architectures where dispatch decisions are first-class — where “should I call this tool?” gets the same careful treatment as “what should I say to the user?”