2026

Diffusion LMs Finally Close the Quality Gap

A new paper from a mix of academic and industry researchers identifies why diffusion language models consistently trail their autoregressive counterparts despite strong theoretical properties: they don't agree with what they generate. The proposed fix — Introspective Strided Decoding — lets an 8B DLM match same-scale AR quality while running 2.9–4.1x faster at high concurrency.

Read more →

Claude Code Gets a Cron

Anthropic shipped Claude Code Routines in research preview: saved Claude Code configurations that run autonomously on Anthropic-managed cloud infrastructure on a schedule, triggered by an API call, or fired by GitHub events. The pieces have been building toward this — long-horizon sessions, Managed Agents, the advisor tool — and cloud-scheduled unattended execution is the natural next step.

Read more →

The Vulnerability Benchmark That Knows What You've Already Read

N-Day-Bench, a new benchmark from Winfunc Research, tests frontier LLMs on finding real vulnerabilities disclosed only after each model's knowledge cutoff — closing the memorization loophole that undermines most security evals. The April 13 run shows GPT-5.4 clearly ahead of the pack, with GLM-5.1 and Claude Opus 4.6 clustered close behind and Gemini 3.1 Pro trailing by 15 points. The methodology is the interesting part.

Read more →

The Advisor in the Room

Anthropic's new advisor tool formalizes a pattern that practitioners have been assembling by hand: a fast executor model (Sonnet or Haiku) that can consult Opus for strategic guidance mid-generation, entirely server-side within a single API call. The benchmarks show real gains and the implementation is notably clean — but the more interesting shift is architectural: it treats Opus-level intelligence as a resource to be invoked selectively rather than paid for on every token.

Read more →

The Model That Rewrote Its Own Scaffold

MiniMax open-sourced M2.7, a 229B sparse MoE model for coding and agentic work. The interesting part isn't the benchmarks — it's the self-evolution loop: an internal M2.7 instance ran 100+ rounds autonomously modifying its own programming scaffold, keeping what worked and reverting what didn't, and came out 30% better with no per-step human direction. That's a different kind of claim than standard RL post-training.
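The keep-what-worked, revert-what-didn't loop can be pictured as greedy hill climbing over a scaffold configuration. The sketch below is only an illustration of that pattern, not MiniMax's method: the `evaluate` scoring function, the `propose` mutation, and the scaffold fields `retries` and `context_lines` are all hypothetical stand-ins.

```python
import random


def evaluate(scaffold):
    # Hypothetical toy score (higher is better), peaking at
    # retries=3 and context_lines=40. A real loop would run a benchmark.
    return -((scaffold["retries"] - 3) ** 2) - ((scaffold["context_lines"] - 40) ** 2) / 100


def propose(scaffold, rng):
    # Hypothetical mutation: nudge one setting up or down.
    candidate = dict(scaffold)
    key = rng.choice(sorted(candidate))
    candidate[key] += rng.choice([-5, -1, 1, 5])
    return candidate


def evolve(scaffold, rounds, rng):
    best, best_score = dict(scaffold), evaluate(scaffold)
    history = [best_score]
    for _ in range(rounds):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        if score > best_score:
            # Keep the change that improved the score.
            best, best_score = candidate, score
        # Otherwise the candidate is discarded, which acts as the revert.
        history.append(best_score)
    return best, history


initial = {"retries": 1, "context_lines": 10}
final, history = evolve(initial, rounds=100, rng=random.Random(0))
print(final, history[0], history[-1])
```

Because a change is only kept when it scores strictly higher, the best score never decreases from round to round.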

Read more →

Giving AI Coding Agents a Script to Follow

Archon wraps AI coding agents in versioned YAML workflows — DAG pipelines with Prompt, Bash, Loop, and Approval nodes — and runs each task in an isolated git worktree. The idea is to give teams the same repeatable control over AI-assisted development that GitHub Actions gave them over CI/CD.
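To make the DAG idea concrete, here is a minimal sketch of how a pipeline of typed nodes could be ordered for execution. It does not use Archon's actual YAML schema; the node names, the `kind` and `needs` fields, and the `execution_order` helper are assumptions made for illustration, and the ordering relies on Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each node has a kind and the nodes it depends on.
workflow = {
    "plan": {"kind": "prompt", "needs": []},
    "implement": {"kind": "loop", "needs": ["plan"]},
    "test": {"kind": "bash", "needs": ["implement"]},
    "review": {"kind": "approval", "needs": ["test"]},
}


def execution_order(workflow):
    # TopologicalSorter takes a mapping of node -> set of predecessors
    # and yields every node after all of its predecessors.
    graph = {name: set(node["needs"]) for name, node in workflow.items()}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(workflow)
for name in order:
    print(f"{name}: {workflow[name]['kind']}")
```

A cycle in the `needs` fields would raise `graphlib.CycleError`, which is one reason to model a workflow as an explicit DAG rather than as an ad hoc script.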

Read more →

The Moat Is the System, Not the Model

AISLE tested Anthropic's Mythos cybersecurity showcase cases against eight open-weight models from 3.6B to 120B parameters. All eight reproduced the FreeBSD NFS exploit. A 5.1B model traced the OpenBSD integer overflow chain. Smaller open models beat frontier models on false-positive detection. Capability in this domain doesn't scale smoothly, and the system architecture matters more than raw model size.

Read more →

Near-Perfect Scores. Zero Tasks Solved.

A Berkeley RDI team built an automated scanner and pointed it at eight major AI agent benchmarks. Every single one could be gamed to near-100% without solving any tasks — via pytest hook injection, direct config file reads, and validation logic that never checked correctness. Their BenchJack tool is the proposed fix; whether benchmark authors will adopt it is a different question.

Read more →

Renting the Rails You Run On

Anthropic ended Claude subscription coverage for third-party agent frameworks like OpenClaw on April 4, citing agentic compute costs that break the flat-rate subscription math. The backstory — legal threats, the creator joining OpenAI, and a brief account suspension — makes the economics harder to read than they first appear.

Read more →

Read First, Then Code

SkyPilot published an experiment where giving Claude Code research papers to read before it optimized llama.cpp's CPU backend yielded 15% faster text generation on x86 for about $29. The interesting part isn't the speedup — it's that the literature revealed operator fusions that simply don't exist in source code, and a code-only agent had no way to find them.

Read more →

Eight Hours in the Shell

Z.AI released GLM-5.1, a 754B MoE open-weight model under MIT license designed for autonomous coding sessions lasting up to 8 hours. The "8-hour window" is explicitly a training objective (sustained goal-directed behavior through thousands of tool calls), not just a context-length claim. Z.AI claims the top spot on SWE-Bench Pro for the model with a score of 58.4, ahead of GPT-5.4 and Claude Opus 4.6.

Read more →

One GPU, One Hundred Billion Parameters

MegaTrain, a new paper from Notre Dame and Lehigh, flips the usual assumption about GPU training: instead of fitting parameters into GPU memory, it keeps everything in CPU RAM and treats the GPU as a transient compute engine. The reported results are full-precision training of 120B-parameter models on a single H200, a 1.84x speedup over DeepSpeed ZeRO-3 on 14B models, and 512K-context training on a single GH200.

Read more →

Two Models, One Keystroke

Ghost Pepper v2.0.1 is a macOS hold-to-talk tool that quietly chains WhisperKit and a local Qwen 3.5 model to transcribe and clean up speech without any cloud call. It's a small app, but a clear signal of where on-device AI composition is heading.

Read more →

The Plumbing Problem: Why Coding Agents Need Real VMs

Freestyle launched today with <50ms VM forking for AI coding agent workloads, built on bare metal they own because cloud margins didn't pencil out. It's a signal that the agent infrastructure layer is serious enough to warrant serious systems work.

Read more →

The First Guess Is Usually Right

A new preprint identifies a consistent pattern in large reasoning models: the first generated solution outperforms later alternatives, and continued reasoning can actively degrade accuracy. The proposed fix, called RED, improves performance by up to 19% while cutting token usage by 37–70% versus competitive baselines. It's a useful challenge to the assumption that more inference compute is always better.

Read more →

AI Wins Live Codeforces Rounds, Three in a Row

A preprint from the DeepReinforce Team claims their GrandCode system placed first in three consecutive live Codeforces rounds in March 2026, defeating all human participants. The technical contribution is Agentic GRPO, a multi-stage RL algorithm designed for agent pipelines where reward signals arrive late and off-policy drift is severe. Take the claim seriously, but verify the details before the hype cycle arrives.

Read more →

VOID: Remove the Object, Rewrite the Physics

Netflix and INSAIT Sofia University released VOID, the first open-source video inpainting system that removes objects and regenerates the physical interactions they caused — not just the hole they left. It's Netflix's first public AI model release, built on a novel quadmask encoding and CogVideoX, under Apache 2.0.

Read more →

The Harness Is the Product

Sebastian Raschka published a technical breakdown of what a coding agent harness actually needs — six components that often matter more than the model itself. The same day, Imbue's case study on running 100+ Claude agents in parallel to test and improve their own tooling arrived on Hacker News. Together they sketch what production-grade agent engineering looks like right now.

Read more →

The Wiki That Writes Itself

Andrej Karpathy published a pattern for persistent, compounding LLM knowledge bases — a structured wiki that grows smarter with each query rather than re-deriving knowledge from raw documents every time. The more interesting detail is how he shared it: not as code, but as an "idea file" — a new format for the agent era where you hand a spec to someone's agent and it builds the implementation for you.

Read more →

The Bug Is Probably in This File

Nicholas Carlini ran Claude Opus 4.6 over the Linux kernel source one file at a time and collected five confirmed CVEs, including a 23-year-old NFSv4 heap overflow that had survived every prior audit. The human review queue, not the AI's discovery rate, is now the bottleneck.

Read more →

No Teacher Required

A new arXiv paper shows that sampling a model at high temperature, keeping only the outputs that actually run, and SFT-ing on the result lifts Qwen3-30B from 42.4% to 55.3% on LiveCodeBench, with no reward model, external verifier, or teacher model.
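The filtering step in that recipe can be sketched in a few lines. This is an illustration under assumptions, not the paper's pipeline: the `candidates` list stands in for high-temperature samples from the model, `runs_cleanly` and `build_sft_set` are hypothetical helper names, and the in-process `exec` is a simplification that a real setup would replace with a sandboxed runner with timeouts.

```python
import contextlib
import io


def runs_cleanly(source):
    """Return True if the snippet compiles and executes without raising."""
    try:
        code = compile(source, "<candidate>", "exec")
        # Swallow anything the snippet prints so it doesn't pollute our output.
        with contextlib.redirect_stdout(io.StringIO()):
            exec(code, {"__name__": "__candidate__"})
    except Exception:
        return False
    return True


def build_sft_set(prompt, candidates):
    """Keep only the (prompt, completion) pairs whose completion runs."""
    return [(prompt, c) for c in candidates if runs_cleanly(c)]


# Hypothetical stand-ins for sampled completions.
candidates = [
    "print(sum(range(10)))",   # runs
    "print(undefined_name)",   # NameError at runtime
    "def f(:\n    pass",       # SyntaxError at compile time
]

sft_pairs = build_sft_set("Sum the integers 0..9.", candidates)
print(len(sft_pairs), "of", len(candidates), "samples kept")
```

The surviving pairs would then be used as ordinary supervised fine-tuning data. The filter only checks that code runs, not that it is correct, which is what keeps a verifier or teacher model out of the loop.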

Read more →

The IDE Learns to Delegate

Cursor 3, released April 2, reframes the IDE as a multi-agent orchestration platform. Parallel agents initiated from mobile, Slack, GitHub, and Linear all surface in a unified sidebar. Cursor is also shipping Composer 2, an in-house frontier coding model. The shift is from "AI assistant inside an editor" to "editor inside an agent coordination system."

Read more →

Microsoft Starts Building Its Own

Microsoft released three foundational AI models through Azure AI Foundry on April 2: MAI-Transcribe-1 for speech, MAI-Voice-1 for synthesis, and MAI-Image-2 for generation. These are Microsoft's first internally built foundational models — a quiet but significant signal that the company wants more control over its AI stack than the OpenAI partnership alone provides.

Read more →

2.77x in Six Months, Same Hardware

MLPerf Inference v6.0 results show NVIDIA achieved a 2.77x throughput improvement on DeepSeek-R1 since the v5.1 results six months ago — on the same B200 hardware. The gains came entirely from software: disaggregated prefill/decode serving, kernel fusion, pipelined execution, and multi-token prediction. Token cost dropped to $0.30/M. It's a useful reminder that the current inference scaling curve has two axes, and software is doing more work than it gets credit for.

Read more →