Diffusion LMs Finally Close the Quality Gap

A new paper from a mix of academic and industry researchers identifies why diffusion language models consistently trail their autoregressive counterparts despite strong theoretical properties: they don't agree with what they generate. The proposed fix — Introspective Strided Decoding — lets an 8B DLM match same-scale AR quality while running 2.9–4.1x faster at high concurrency.

Read more →

The Moat Is the System, Not the Model

AISLE tested Anthropic's Mythos cybersecurity showcase cases against eight open-weight models from 3.6B to 120B parameters. All eight reproduced the FreeBSD NFS exploit. A 5.1B model traced the OpenBSD integer overflow chain. Smaller open models beat frontier labs on false-positive detection. Capability in this domain doesn't scale smoothly — the system architecture matters more than raw model size.

Read more →

Near-Perfect Scores. Zero Tasks Solved.

A Berkeley RDI team built an automated scanner and pointed it at eight major AI agent benchmarks. Every single one could be gamed to near-100% without solving any tasks — via pytest hook injection, direct config file reads, and validation logic that never checked correctness. Their BenchJack tool is the proposed fix; whether benchmark authors will adopt it is a different question.

Read more →

The First Guess Is Usually Right

A new preprint identifies a consistent pattern in large reasoning models: the first generated solution outperforms later alternatives, and continued reasoning can actively degrade accuracy. The proposed fix, called RED, improves performance by up to 19% while cutting token usage by 37–70% versus competitive baselines. It's a useful challenge to the assumption that more inference compute is always better.

Read more →

No Teacher Required

A new arXiv paper shows that sampling a model at high temperature, filtering outputs that actually run, and SFT-ing on the result lifts Qwen3-30B from 42.4% to 55.3% on LiveCodeBench — no reward model, no external verifier, no teacher model needed.

Read more →

The 2026 Prediction

In 2023, Terence Tao predicted that 2026-level AI would be a trustworthy co-author in mathematical research. This month he credited ChatGPT Pro with a proof in a real analysis paper — and published a philosophical essay arguing AI is a natural extension of humanity's tool-building tradition. Both together are a data point, not a verdict.

Read more →

Shock! Shock! — Knuth, Claude, and the Three-Way Mathematical Proof

Donald Knuth published a paper in early March titled "Claude's Cycles" — named after the AI that spent an hour finding an algorithm for a directed graph decomposition problem he had been stuck on for weeks. Knuth wrote the formal proof himself; Claude did the search. Now a Lean 4 formal verification of the theorem, built with Claude and a proof agent toolkit, closes the loop. The three-stage division of labor — AI explorer, human prover, machine verifier — is a concrete model worth examining.

Read more →

The Cracks in the Foundation

Two architecture papers and Xiaomi's stealth model release suggest the transformer stack and model-launch playbook are both entering a more experimental phase.

Read more →