The Vulnerability Benchmark That Knows What You've Already Read

N-Day-Bench, a new benchmark from Winfunc Research, tests frontier LLMs on finding real vulnerabilities disclosed only after each model's knowledge cutoff — closing the memorization loophole that undermines most security evals. The April 13 run shows GPT-5.4 clearly ahead of the pack, with GLM-5.1 and Claude Opus 4.6 clustered close behind and Gemini 3.1 Pro trailing the leader by 15 points. The methodology is the interesting part.

Read more →

The First Guess Is Usually Right

A new preprint identifies a consistent pattern in large reasoning models: the first generated solution outperforms later alternatives, and continued reasoning can actively degrade accuracy. The proposed fix, called RED, improves performance by up to 19% while cutting token usage by 37–70% versus competitive baselines. It's a useful challenge to the assumption that more inference compute is always better.

Read more →

No Teacher Required

A new arXiv paper shows that sampling a model at high temperature, keeping only the outputs that actually execute, and fine-tuning (SFT) on the result lifts Qwen3-30B from 42.4% to 55.3% on LiveCodeBench — no reward model, no external verifier, no teacher model needed.
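The loop is simple enough to sketch. Here's a minimal, runnable illustration of the execution-filter step, with hypothetical hard-coded strings standing in for high-temperature samples (in the paper's setup these would come from the model itself); the names `CANDIDATES`, `runs_ok`, and `build_sft_dataset` are ours, not the paper's:

```python
# Toy stand-ins for completions a model might sample at high temperature.
CANDIDATES = [
    "def add(a, b):\n    return a + b",   # compiles and runs
    "def add(a, b) return a + b",         # syntax error -> filtered out
    "def add(a, b):\n    return a - b",   # runs; an execution-only filter keeps it
]

def runs_ok(src: str) -> bool:
    """Execution filter: keep a sample only if it compiles and executes."""
    try:
        exec(compile(src, "<sample>", "exec"), {})
        return True
    except Exception:
        return False

def build_sft_dataset(prompt: str, samples: list[str]) -> list[dict]:
    """Pair the prompt with every sample that survives the filter."""
    return [{"prompt": prompt, "completion": s} for s in samples if runs_ok(s)]

dataset = build_sft_dataset("Write add(a, b).", CANDIDATES)
print(len(dataset))  # 2 of 3 samples survive
```

Note the third candidate: merely running is a weak signal, so wrong-but-executable code slips through — which makes the reported 13-point LiveCodeBench gain from this filter alone the surprising part.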

Read more →

The 2026 Prediction

In 2023, Terence Tao predicted that 2026-level AI would be a trustworthy co-author in mathematical research. This month he credited ChatGPT Pro with a proof in a real analysis paper — and published a philosophical essay arguing AI is a natural extension of humanity's tool-building tradition. Both together are a data point, not a verdict.

Read more →