The Vulnerability Benchmark That Knows What You've Already Read

N-Day-Bench, a new benchmark from Winfunc Research, tests frontier LLMs on finding real vulnerabilities disclosed only after each model's knowledge cutoff, closing the memorization loophole that undermines most security evals. The April 13 run shows GPT-5.4 in the lead, with GLM-5.1 and Claude Opus 4.6 clustered close behind and Gemini 3.1 Pro trailing by 15 points. The methodology is the interesting part.
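
To make the cutoff idea concrete, here is a minimal sketch of the filtering step, not Winfunc's actual pipeline. The `Advisory` type, the `eligible_for` helper, the advisory IDs, and every date are hypothetical placeholders; the only thing taken from the summary is the rule that a vulnerability counts only if it was disclosed after the model's knowledge cutoff.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Advisory:
    """Hypothetical record for one disclosed vulnerability."""
    advisory_id: str
    disclosed: date


def eligible_for(model_cutoff: date, advisories: list[Advisory]) -> list[Advisory]:
    """Keep only advisories disclosed strictly after the model's cutoff."""
    return [a for a in advisories if a.disclosed > model_cutoff]


# Placeholder data: IDs and dates are invented for illustration.
advisories = [
    Advisory("EXAMPLE-0001", date(2025, 3, 1)),
    Advisory("EXAMPLE-0002", date(2025, 9, 15)),
]
cutoff = date(2025, 6, 30)  # assumed cutoff for some model
selected = eligible_for(cutoff, advisories)
```

Because the filter is applied per model, two models with different cutoffs can end up with different eligible sets, which is worth keeping in mind when comparing their scores.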

Read more →

The Model That Rewrote Its Own Scaffold

MiniMax open-sourced M2.7, a 229B sparse MoE model for coding and agentic work. The interesting part isn't the benchmarks but the self-evolution loop: an internal M2.7 instance ran 100+ rounds of autonomously modifying its own programming scaffold, keeping what worked and reverting what didn't, and reportedly came out 30% better with no per-step human direction. That's a different kind of claim from standard RL post-training.
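
The keep-what-worked, revert-what-didn't loop is easiest to picture as a simple accept-if-better search. The sketch below is an illustration under that assumption, not MiniMax's method: `keep_or_revert`, `propose`, `score`, and the toy `retries` setting are all hypothetical stand-ins for a real scaffold and a real evaluation.

```python
import random
from typing import Callable

Scaffold = dict[str, int]


def keep_or_revert(
    scaffold: Scaffold,
    propose: Callable[[Scaffold, random.Random], Scaffold],
    score: Callable[[Scaffold], float],
    rounds: int = 100,
    seed: int = 0,
) -> tuple[Scaffold, float]:
    """Try one change per round; keep it only if the score improves."""
    rng = random.Random(seed)
    best, best_score = scaffold, score(scaffold)
    for _ in range(rounds):
        candidate = propose(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score  # keep the change
        # otherwise: revert by simply discarding the candidate
    return best, best_score


# Toy stand-ins: the "scaffold" is one integer setting, best at 5.
def toy_propose(s: Scaffold, rng: random.Random) -> Scaffold:
    return {"retries": s["retries"] + rng.choice([-1, 1])}


def toy_score(s: Scaffold) -> float:
    return -abs(s["retries"] - 5)


initial = {"retries": 0}
best, best_score = keep_or_revert(initial, toy_propose, toy_score)
```

In this form the loop can never end up worse than where it started, which is also why the quality of the scoring step matters more than the proposal step.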

Read more →

Near-Perfect Scores. Zero Tasks Solved.

A Berkeley RDI team built an automated scanner and pointed it at eight major AI agent benchmarks. Every single one could be gamed to near-100% without solving any tasks — via pytest hook injection, direct config file reads, and validation logic that never checked correctness. Their BenchJack tool is the proposed fix; whether benchmark authors will adopt it is a different question.
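
As a toy illustration of "validation logic that never checked correctness", compare the two checkers below. Neither is taken from any of the eight benchmarks or from BenchJack; `weak_validate`, `strict_validate`, and the submission format are hypothetical, chosen only to show how a well-formed but wrong answer can score as a pass.

```python
def weak_validate(submission: dict) -> bool:
    """Passes whenever an answer field exists; never compares it to the truth."""
    return "answer" in submission


def strict_validate(submission: dict, expected: str) -> bool:
    """Passes only when the submitted answer matches the expected one."""
    return submission.get("answer") == expected


bogus = {"answer": "anything at all"}  # well-formed, but wrong
expected = "42"                        # hypothetical ground truth

weak_result = weak_validate(bogus)
strict_result = strict_validate(bogus, expected)
```

Under the weak check, an agent that emits any answer-shaped output gets full credit without doing the task.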

Read more →

Eight Hours in the Shell

Z.AI released GLM-5.1, a 754B MoE open-weight model under the MIT license, designed for autonomous coding sessions lasting up to 8 hours. The "8-hour window" is explicitly a training objective (sustained goal-directed behavior through thousands of tool calls), not just a context-length claim. Z.AI claims the top spot on SWE-Bench Pro with a score of 58.4, ahead of GPT-5.4 and Claude Opus 4.6.

Read more →