The naive approach to scaling a research agent is to run more agents in parallel. Spawn N instances, have each one search independently, merge the results at the end. This seems reasonable until you look at what happens in practice: the agents duplicate effort constantly, each one following similar search threads because they have no awareness of what the others are doing. Worse, the aggregation step becomes unmanageable — after N independent agents each accumulate a full context of search traces and documents, combining their findings requires reading all of it, which blows out the orchestrator’s context window long before you get to synthesis.
Argus (published May 15) addresses this by separating the concerns that naïve parallelism conflates.
Two Jobs, Two Agents
The key architectural move is splitting the system into a Searcher and a Navigator. The Searcher is a standard ReAct agent: it takes a sub-query, conducts web and document searches, and returns evidence traces. There’s nothing unusual here — any trained research agent can be the Searcher. The Navigator is the new piece.
The Navigator maintains an evidence graph — a structured representation of what has been found, what questions remain open, and how pieces of evidence relate to each other. Rather than spawning searchers randomly, the Navigator inspects the current state of the evidence graph, identifies specific gaps (questions that haven’t been answered, claims that need verification, sources that contradict each other), and dispatches Searchers with targeted queries to fill those gaps. When enough evidence is assembled, the Navigator synthesizes a final answer with source tracing.
This is meaningfully different from “run agents in parallel and merge.” The Navigator isn’t aggregating independent reports; it’s conducting a coordinated search where each Searcher’s effort is directed by a shared understanding of what’s already known.
How the Navigator Is Trained
The Navigator’s dispatch-and-synthesis behavior is trained with reinforcement learning, while the Searchers remain standard ReAct agents trained separately. The RL training covers three tasks: verifying whether gathered evidence actually answers the question, deciding which gaps to dispatch and how to formulate the sub-queries, and synthesizing final answers from assembled evidence.
The separation is operationally significant: you can scale the number of Searchers from 1 to 64 without retraining either component. The Navigator adapts its dispatch behavior based on graph state, not on a hardcoded assumption about how many Searchers are available. In practice, both Navigator and Searchers share a 35B-A3B MoE backbone, which allows flexible deployment.
The Context Math
One of the more practical results in the paper is what the Navigator’s context looks like at scale. With 64 parallel Searchers each gathering evidence, the aggregation problem for a simple orchestrator would be enormous — potentially hundreds of thousands of tokens of raw search traces to read. The Navigator never sees those traces directly. It reads the evidence graph, which is a compressed, structured representation of what was found. Even with 64 Searchers, Navigator context stays under 21.5K tokens.
This isn’t just efficient — it’s the mechanism that makes BrowseComp performance possible. BrowseComp is a benchmark that requires tracking down obscure, multi-hop facts across many web sources; it’s designed to be hard precisely because naive retrieval fails on it. Argus reaches 86.2 with 64 Searchers, which the authors report as surpassing every proprietary agent they tested. With a single Searcher, the improvement over the base model is +5.5 points; with 8 Searchers, it’s +12.7 points (average across 8 benchmarks). The gains from parallelism are real, and they compound with the architectural approach rather than washing out due to duplication.
What This Suggests
Argus fits a pattern that’s becoming clearer in multi-agent research: the bottleneck in scaling agent systems usually isn’t the individual agents, it’s the coordination mechanism. An orchestrator that just fans out and aggregates doesn’t know what it doesn’t know. One that maintains structured state about the current research frontier — and uses that state to direct future effort — does.
The evidence graph is essentially a knowledge structure that makes the agent’s uncertainty explicit and actionable. That’s a different framing from “more context” or “more compute”: it’s about representing the shape of an ongoing investigation, not just accumulating information.
Whether this approach generalizes beyond research tasks to other agentic workflows — code review, debugging, analysis — is an open question. The structured-state pattern seems broadly applicable, but the specifics of what goes into the evidence graph, and how the Navigator decides it has enough to synthesize, may be heavily domain-dependent. For deep web research specifically, the Argus architecture seems to have found a configuration that actually scales.
