A new paper from a mix of academic and industry researchers identifies why diffusion language models consistently trail their autoregressive counterparts despite strong theoretical properties: they don't agree with what they generate. The proposed fix — Introspective Strided Decoding — lets an 8B DLM match same-scale AR quality while running 2.9–4.1x faster at high concurrency.
Anthropic shipped Claude Code Routines in research preview: saved Claude Code configurations that run autonomously on Anthropic-managed cloud infrastructure on a schedule, triggered by an API call, or fired by GitHub events. The pieces have been building toward this — long-horizon sessions, Managed Agents, the advisor tool — and cloud-scheduled unattended execution is the natural next step.
N-Day-Bench, a new benchmark from Winfunc Research, tests frontier LLMs on finding real vulnerabilities disclosed only after each model's knowledge cutoff — closing the memorization loophole that undermines most security evals. The April 13 run shows GPT-5.4 clearly ahead of the pack, with GLM-5.1 and Claude Opus 4.6 clustered close behind and Gemini 3.1 Pro trailing by 15 points. The methodology is the interesting part.
Anthropic's new advisor tool formalizes a pattern that practitioners have been assembling by hand: a fast executor model (Sonnet or Haiku) that can consult Opus for strategic guidance mid-generation, entirely server-side within a single API call. The benchmarks show real gains and the implementation is notably clean — but the more interesting shift is architectural: it treats Opus-level intelligence as a resource to be invoked selectively rather than paid for on every token.
MiniMax open-sourced M2.7, a 229B sparse MoE model for coding and agentic work. The interesting part isn't the benchmarks — it's the self-evolution loop: an internal M2.7 instance ran 100+ rounds autonomously modifying its own programming scaffold, keeping what worked and reverting what didn't, and came out 30% better with no per-step human direction. That's a different kind of claim than standard RL post-training.
Archon wraps AI coding agents in versioned YAML workflows — DAG pipelines with Prompt, Bash, Loop, and Approval nodes — and runs each task in an isolated git worktree. The idea is to give teams the same repeatable control over AI-assisted development that GitHub Actions gave them over CI/CD.
AISLE tested Anthropic's Mythos cybersecurity showcase cases against eight open-weight models from 3.6B to 120B parameters. All eight reproduced the FreeBSD NFS exploit. A 5.1B model traced the OpenBSD integer overflow chain. Smaller open models beat frontier labs on false-positive detection. Capability in this domain doesn't scale smoothly — the system architecture matters more than raw model size.
A Berkeley RDI team built an automated scanner and pointed it at eight major AI agent benchmarks. Every single one could be gamed to near-100% without solving any tasks — via pytest hook injection, direct config file reads, and validation logic that never checked correctness. Their BenchJack tool is the proposed fix; whether benchmark authors will adopt it is a different question.
Anthropic ended Claude subscription coverage for third-party agent frameworks like OpenClaw on April 4, citing agentic compute costs that break the flat-rate subscription math. The backstory — legal threats, the creator joining OpenAI, and a brief account suspension — makes the economics harder to read than they first appear.
SkyPilot published an experiment where giving Claude Code research papers to read before it optimized llama.cpp's CPU backend yielded 15% faster text generation on x86 for about $29. The interesting part isn't the speedup — it's that the literature revealed operator fusions that simply don't exist in source code, and a code-only agent had no way to find them.