Problem 21.10 in the Kourovka Notebook asks whether every finite group admits a “just finite presentation.” The Kourovka Notebook is a running list of open problems in group theory, maintained since 1965, and Problem 21.10 had been open long enough to feel settled into the permanent backlog.

Marc Lackenby, a mathematician at Oxford, used DeepMind's new AI Co-Mathematician to work on it. The system generated a candidate proof. A dedicated reviewer agent flagged a gap in the argument. Lackenby looked at the gap and realized he knew how to fill it. After his intervention, the system completed the proof.

The answer turned out to be yes: every finite group has such a presentation. Whether the AI "solved" the problem or Lackenby did is a question that quickly becomes unproductive. What the story illustrates is the collaboration pattern the paper is designed around: the system handles broad exploration and catches its own errors; the human supplies the insight that the machine surfaced a need for but could not produce.

The Architecture

The system is organized as a hierarchy of asynchronous agents. At the top sits a project coordinator that accepts high-level goals and steers overall research direction. Below it, workstream coordinators manage parallel lines of inquiry — literature review, library development, counterexample search, formal verification. Each workstream can spawn specialized sub-agents: a search agent for pulling relevant papers, a coding agent for computational exploration, and Gemini Deep Think acting as a proof verifier. The whole structure communicates through a shared message system and a persistent file store that preserves both successful results and explicitly labeled dead ends.
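
To make that shape concrete, here is a minimal sketch in Python of a coordinator fanning out to concurrent workstreams over a shared message queue and file store. The class names, fields, and interfaces are my assumptions for illustration; the paper describes the architecture but does not publish its implementation.

```python
import asyncio
from dataclasses import dataclass, field

# Illustrative only: the class and field names here are assumptions,
# not the paper's implementation. The point is the shape: a coordinator
# fans out to concurrent workstreams, which share a message bus and a
# persistent file store.

@dataclass
class Message:
    sender: str
    content: str

@dataclass
class FileStore:
    """Stores results, including explicitly labeled dead ends."""
    entries: dict = field(default_factory=dict)

    def save(self, key: str, value: str, dead_end: bool = False) -> None:
        self.entries[key] = {"value": value, "dead_end": dead_end}

async def workstream(name: str, bus: asyncio.Queue, store: FileStore) -> None:
    # Stand-in for literature review, library development, etc. A real
    # workstream would spawn its own sub-agents (search, coding, review).
    result = f"{name}: partial result"
    store.save(name, result)
    await bus.put(Message(sender=name, content=result))

async def coordinator(goal: str) -> FileStore:
    bus: asyncio.Queue = asyncio.Queue()
    store = FileStore()
    streams = ["literature_review", "library_dev", "counterexample_search"]
    # Workstreams run concurrently rather than in sequence.
    await asyncio.gather(*(workstream(s, bus, store) for s in streams))
    while not bus.empty():
        msg = await bus.get()
        print(f"[{goal}] {msg.sender} -> {msg.content}")
    return store

asyncio.run(coordinator("Kourovka 21.10"))
```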

That last part matters. Failed explorations aren't discarded; they're logged with enough context to be audited. The working document that agents produce carries margin annotations linking claims to supporting computations, so a mathematician reviewing the output can trace any step. The system also includes hard constraints preventing agents from declaring success before review cycles are complete; there's no fast path to claiming the theorem is proved.
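
A sketch of what that discipline could look like as a data model, again with hypothetical names: claims carry links to their evidence, dead ends are logged rather than deleted, and success is gated on completed review. The paper describes the behavior, not this schema.

```python
from dataclasses import dataclass, field

# Hypothetical data model for the audit trail described above.

@dataclass
class Claim:
    statement: str
    evidence: list[str]       # links to supporting computations
    reviewed: bool = False    # set by a reviewer agent, never by the prover

@dataclass
class WorkingDocument:
    claims: list[Claim] = field(default_factory=list)
    dead_ends: list[str] = field(default_factory=list)  # kept, not discarded

    def log_dead_end(self, context: str) -> None:
        # Failed explorations stay auditable.
        self.dead_ends.append(context)

    def declare_success(self) -> bool:
        # Hard constraint: no success until every claim passes review.
        return bool(self.claims) and all(c.reviewed for c in self.claims)

doc = WorkingDocument()
doc.claims.append(Claim("Lemma 3 holds for n <= 12",
                        evidence=["computation/run_017"]))
doc.log_dead_end("Induction on group order fails at the base case; see notes")
assert not doc.declare_success()   # blocked until review completes
doc.claims[0].reviewed = True
assert doc.declare_success()
```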

The system is deliberately not a push-button tool. The human researcher can steer mid-task, redirect workstreams, and inject mathematical intuition when the system surfaces a need for it. Lackenby's contribution to the Kourovka problem wasn't incidental: he provided the key insight that a reviewer agent had correctly identified as missing from the machine-generated argument.
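
In code, that steering surface might be no more than a small command set. This is a guess at the shape, not the paper's API; the command names and the example lemma are placeholders.

```python
from enum import Enum, auto

# Hypothetical steering commands; the paper describes the capability
# (mid-task redirection, human-supplied insight), not this interface.

class Steer(Enum):
    REDIRECT = auto()   # point a workstream at a new line of inquiry
    INJECT = auto()     # supply an insight the system flagged as missing

def apply_steering(command: Steer, payload: str,
                   workstreams: dict[str, str]) -> None:
    if command is Steer.REDIRECT:
        workstreams["counterexample_search"] = payload
    elif command is Steer.INJECT:
        # The Lackenby pattern: a reviewer flags a gap, the human fills
        # it, and the proof workstream resumes with the new lemma.
        workstreams["verification"] = f"resume with lemma: {payload}"

streams = {"counterexample_search": "idle", "verification": "blocked on gap"}
apply_steering(Steer.INJECT, "every finite group embeds in a symmetric group",
               streams)
print(streams["verification"])
```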

What the Numbers Actually Say

The paper evaluates the system on FrontierMath Tier 4, a benchmark of 48 research-level problems designed by Epoch AI so that each amounts to a short-term research project for a professional mathematician. The AI Co-Mathematician scores 48%, correctly solving 23 of the 48 problems, a new high on this tier of the benchmark among all evaluated AI systems.

The base model — Gemini 3.1 Pro evaluated directly — scores 19% on the same benchmark.

That 29-percentage-point gap is where the interesting claim lives. It doesn't come from a better underlying model; it comes from scaffolding. Parallel workstreams, reviewer agents, iterative refinement, and the persistent audit trail collectively produce roughly 2.5× the performance of the raw model on hard mathematics: a large effect from architecture and process, not from model capability. The paper also reports a bespoke internal benchmark of 100 research-level problems with code-checkable answers, on which the system outperforms standalone Gemini models by a similar margin.
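
For concreteness, both headline figures reduce to one line of arithmetic each, using the numbers as reported in the paper:

```python
solved, total = 23, 48
system = solved / total   # 0.479, reported as 48%
base = 0.19               # Gemini 3.1 Pro, evaluated directly
print(f"gap:  {(system - base) * 100:.0f} percentage points")  # ~29
print(f"lift: {system / base:.1f}x the base model")            # ~2.5x
```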

The system also integrates with AlphaProof and AlphaEvolve for formal verification and algorithmic search — DeepMind’s earlier specialized tools feed into this general-purpose research infrastructure rather than being replaced by it.

What This Is and Isn’t

The paper is careful not to overclaim. The Kourovka story is presented as a case study in the collaboration model, not as evidence that the AI could have solved the problem alone. That framing is honest: the reviewer agent found the gap, but it took Lackenby’s mathematical judgment to close it. The system’s strength is throughput on exploration and reliable error-catching at scale; the human’s value is the leap from a flagged problem to a solution.

This is a tool that makes a certain kind of mathematical work faster: literature search, computational verification, counterexample generation, the bookkeeping of a multi-thread research project. It doesn’t replace the step where a mathematician looks at a gap and sees how to fill it — but it may make that step arrive sooner, more cleanly labeled, with the supporting evidence already assembled.

The 48% on FrontierMath Tier 4 is a striking headline. But the more durable result is probably simpler: an agentic system built around review cycles and explicit failure logging outperforms its base model by a factor of 2.5 on hard math problems, and the gain comes from process design rather than model scale. For anyone building research-grade agents, the architecture is worth reading carefully.