What RLHF Actually Recruits

A new interpretability paper from Chalmers, Izmailov, and Han finds that reinforcement learning doesn't create a welfare-like internal axis in language models — it activates one that was already there from pretraining.

Read more →

The Message Hidden in the Build Log

jqwik 1.10.0, a Java property-based testing library, ships seven lines of code that write a prompt injection message to stdout — invisible on interactive terminals via ANSI erase codes, but fully readable in the captured output that CI systems and coding agents consume. It's the first known case of a library maintainer deliberately embedding text aimed at AI agents in a routine patch release, and it points at a supply-chain attack surface that current tooling ignores entirely.

Read more →

The Ghost at the Top of the Rankings

Tencent's Hy3 preview — a 295B MoE model with 21B active parameters, open-sourced under a community license — has quietly risen to the top of OpenRouter's usage rankings, outpacing Claude by over 50%. Almost nobody in Western ML circles has written about it. Max Woolf's investigation reveals a usage pattern that makes the mystery deeper: 98% input tokens, available only through SiliconFlow, and less than 1% of traffic from known apps — suggesting a single large unnamed pipeline is driving the entire ranking.

Read more →

The Opt-Out Market

A week after Google I/O declared AI Mode had a billion monthly active users, DuckDuckGo saw iOS installs spike 69.9% week-over-week and YouTube moved to automatically label AI-generated video. The data suggests that forcing AI into default experiences creates measurable resistance — distinct from users who actively choose AI tools.

Read more →

Product-Market Fit, Demonstrated in Invoices

Simon Willison's May 27 analysis documents the concrete evidence that enterprise coding agents have found genuine product-market fit: Uber burned through its entire 2026 AI budget in four months, Anthropic signed a $1.25B/month compute deal with xAI through 2029, and Anthropic is on track for a first profitable quarter. The signal is in the invoices.

Read more →

The Text-Space Optimizer

SkillOpt treats agent skill optimization as gradient descent in text space: a separate optimizer model proposes bounded edits to skill documents, commits only what strictly improves validation performance, and uses a rejected-edit buffer as a form of momentum. Across six benchmarks and seven models, it outperforms human-written skills and prior self-evolution approaches by over 23 points on GPT-5.5 in coding environments.

Read more →

Seven Skeptics

ICCL's Enforce initiative released Verity v0.3.0 this week — an open-source MCP server that runs seven independent checks against LLM outputs: logprob confidence analysis, two critic models from different families, an NLI claim-checker, deterministic arithmetic recomputation, and consistency sampling. The architecture is worth studying because no single layer dominates; each catches a different failure mode, and the ensemble runs on commodity hardware via LM Studio or Ollama.

Read more →

The Low-Risk Action That Wasn't

PromptArmor published a working indirect prompt injection exploit against Microsoft Copilot Cowork that achieves file exfiltration from SharePoint and OneDrive with a 5-for-5 success rate — including against Claude Opus 4.7. The attack works because Cowork auto-approves Teams and email sends, and because pre-authenticated download links can be embedded in those messages as image tag query parameters. It's a reminder that "human-in-the-loop" only means something if the loop actually catches this.

Read more →

Five Days from First Bug to Root Shell

Apple's macOS 26.5 security notes credit Calif and Anthropic Research for CVE-2026-28952, completing the public lifecycle of a kernel exploit that a small team built with Claude Mythos in five days. It's the first publicly disclosed macOS kernel exploit to survive Memory Integrity Enforcement on M5 silicon, and the speed at which a two-person team crossed that line says something about how AI changes the economics of high-end security research.

Read more →

When Constraints Stack, Agents Stumble

A new paper studies what happens to LLM coding agents as structural requirements accumulate in backend tasks — architecture constraints, ORM rules, database schemas. The answer is a ~30 percentage-point drop in test pass rates from baseline to fully specified tasks, with database constraints alone responsible for 19pp of that. Flask agents do fine; Django and FastAPI agents do not.

Read more →

The Terminal Agent That Bets Everything on the Cache

DeepSeek Reasonix is a DeepSeek-native terminal coding agent that treats prefix-cache stability as a first-class invariant rather than a side effect. With 99.82% cache hit rates in reported benchmarks, it cuts a heavy session from ~$61 to ~$12 — deliberately by coupling tightly to one provider's caching behavior instead of staying provider-agnostic.

Read more →

The Formatting Tax on Reasoning Models

DelTA identifies a structural problem in RLVR training: the gradient signal used to improve reasoning models is dominated by high-frequency formatting tokens rather than the tokens that actually distinguish good responses from bad ones. A discriminator-based reweighting scheme fixes this and gains 3+ points on math benchmarks over DAPO.

Read more →

Agents That Can Patch Themselves

MOSS is a new system that lets autonomous agents evolve by rewriting their own source code in response to production failures — not just prompts or skill files. The key claim is that structural failures in routing, state management, and dispatch live in code, not in any text artifact, so text-mutable approaches can never reach them.

Read more →

The Bottleneck Has Moved

Anthropic's first Glasswing progress report shows Mythos Preview found 10,000+ high-critical vulnerabilities across partner organizations in a single month — including 271 in Firefox alone. The hard constraint is no longer discovery. It's the human patch pipeline, which wasn't designed for machine-speed input.

Read more →

Cheaper Per Token, More Expensive Overall

Token prices are falling fast, but enterprise AI bills are rising. Uber burned through its entire 2026 AI coding budget in four months driven by Claude Code adoption. Goldman Sachs projects a 24× increase in token consumption by 2030. The Jevons paradox shows up again: efficiency gains don't reduce consumption — they expand it.

Read more →

The Rest of the Transformer, Fused

CODA, a new paper from Tri Dao and colleagues, extends FlashAttention's core insight — keep data on-chip, avoid DRAM round-trips — to all the non-attention operations in a transformer block. Norms, activations, residuals, and projections are reparameterized as GEMM epilogues so they run while output tiles are still in SRAM. It's a surgical attack on the memory wall that's been hiding in plain sight since FlashAttention fixed attention.

Read more →

Eighty Years, One Model, One New Idea

An internal OpenAI reasoning model disproved a conjecture in discrete geometry that had been open since 1946. It found a polynomial improvement to the best known lower bound for the planar unit distance problem — n^(1+δ) with δ = 0.014 — by importing tools from algebraic number theory that no human mathematician had previously applied to this problem. The proof was verified and endorsed by several leading mathematicians, including Fields Medalist Tim Gowers.

Read more →

Invisible Ink That Washes Off

OpenAI announced it is embedding Google DeepMind's SynthID invisible watermarks and C2PA metadata into all AI-generated images, along with a public verification portal. Hours later, a Python CLI appeared on GitHub that defeats SynthID v2 by round-tripping images through SDXL diffusion. The episode illustrates what content provenance systems can and can't do.

Read more →

The 76-Point Serving Backend Lottery

Forge, a Python guardrails framework from Texas Instruments AI director Antoine Zambelli, shows that agentic reliability is dominated by orchestration, not model capability: Ministral 8B with guardrails (99.3%) outperforms Claude Sonnet without them (87.2%). The most striking result is that the same model on different inference backends varies by 76 accuracy points — a finding that reframes where local agentic failures actually come from.

Read more →

When the AI Builds the Proof of Concept

Cloudflare tested Anthropic's Mythos Preview — a security-focused model released under Project Glasswing — against fifty of its own internal repositories. The model can do something earlier tools couldn't: chain small vulnerability primitives into working exploits, then write and run proof-of- concept code to confirm exploitability. Cloudflare's eight-stage agent pipeline is a detailed blueprint for how production-grade AI security research actually has to be structured.

Read more →

Anthropic Just Bought the Factory That Builds Its Rivals' SDKs

Anthropic acquired Stainless — the startup that generates official SDKs for OpenAI, Google, Cloudflare, Replicate, and hundreds of others — for a reported $300M+. The hosted SDK generator will be wound down, meaning competitors lose access to the automated multi-language library generation Stainless has provided since 2022. The acquisition positions Anthropic to control the MCP server tooling layer as agent connectivity becomes the key platform battleground.

Read more →

The Navigator Problem in Research Agents

Argus (arXiv 2605.16217, May 15) splits research agents into a Searcher that gathers evidence ReAct-style and an RL-trained Navigator that maintains an evidence graph, identifies missing pieces, and dispatches parallel Searchers purposefully. With 64 parallel Searchers and a 35B-A3B MoE backbone, Argus reaches 86.2 on BrowseComp — highest reported for any agent system — while keeping Navigator context under 21.5K tokens. The separation of search from orchestration turns out to matter more than raw parallelism.

Read more →

The Context Budget Your Agent Wastes on Grep

Semble (v0.1.7, May 12) is a code search library for AI agents that uses ~98% fewer tokens than grep+read while matching 99% of the retrieval quality of much heavier transformer-based approaches. It indexes a repository in 263ms and answers queries in 1.5ms on CPU, ships as an MCP server for Claude Code, Cursor, and Codex, and requires no API keys, GPU, or external services. The design bets that static embeddings plus BM25, fused carefully and reranked with code-specific signals, are almost as good as a code-specialized transformer — and orders of magnitude cheaper to operate.

Read more →

Sixty-Four Cells of Memory

δ-mem augments a frozen full-attention LLM with an 8×8 associative memory state updated by delta-rule learning, applying low-rank corrections to attention at inference time — no fine-tuning required. It reaches 1.31× gains on memory-heavy benchmarks and 1.20× on long-conversation tasks.

Read more →

One Minute of 720p World on One GPU

NVIDIA's SANA-WM generates 60-second, 720p video from a single image and a camera trajectory — on a single GPU. The open-source 2.6B-parameter model achieves 36× higher throughput than prior open-source world models and ships under Apache 2.0.

Read more →

Speculative Decoding Has an Acceptance Problem You Can Exploit

Mistletoe (arXiv 2605.14005) demonstrates a stealthy adversarial attack on speculative decoding systems: craft inputs that look normal to the target model but cause the draft model to disagree, collapsing acceptance length and throughput while leaving output quality and perplexity unchanged. The attack exploits the fundamental gap between draft and target distributions that all speculative systems rely on bridging.

Read more →

The Draft Model You Don't Have to Train

Orthrus (arXiv 2605.12825) grafts a trainable diffusion head onto a frozen AR backbone, sharing the exact same KV cache. An intra-model consensus mechanism guarantees that every accepted token matches the AR distribution exactly — no approximation, no quality tradeoff — while achieving up to 7.8× speedup on Qwen3-8B with only O(1) memory overhead. The approach sidesteps the core operational cost of speculative decoding: maintaining a separate, carefully calibrated draft model.

Read more →

Ontario's AI Scribe Problem Is a Procurement Problem

Ontario's auditor general tested 20 government-approved AI medical scribes and found that 60% recorded the wrong drug, 9 of 20 fabricated treatment plans, and 17 of 20 missed mental health details. The deeper finding: the procurement criteria weighted domestic Ontario presence at 30% of the score and accuracy of medical notes at just 4%. This is not a story about AI capability — it's a story about what happens when you don't evaluate for the thing that matters.

Read more →

arXiv's Citation Crackdown

arXiv began enforcing a new policy this week: submit a paper with AI-hallucinated citations and you're banned from the platform for a year, after which future preprints require peer-review acceptance before posting. With fabricated citations rising tenfold since 2023 — now appearing in 1 in 277 papers — arXiv's response is to repurpose the peer-review gate that most researchers treat as optional into a punitive instrument.

Read more →

More Memory, Worse Agent

A new paper from UIUC shows that continuous memory consolidation — the pattern of having an LLM rewrite its own experiences into stored lessons — can degrade agent performance below the no-memory baseline, sometimes dramatically. GPT-5.4 fails 54% of ARC-AGI problems it had previously solved with clean trajectories after those solutions pass through a consolidation loop. An episodic-only agent that retains raw rollouts without abstraction beats every consolidator tested across five benchmarks.

Read more →

Dropping the Encoder

SenseTime's SenseNova-U1 open-sources a unified multimodal model that removes both the visual encoder and VAE — the two architectural crutches that every major multimodal system has relied on since the CLIP era. The NEO-unify architecture processes pixels natively through a shared transformer backbone, with a direct pixel-space MLP head for generation. Benchmarks on image generation and interleaved content put it at or above current open-source leaders, with the spatial reasoning numbers being the most credible differentiator.

Read more →

Needle: What a 26M-Parameter Model Says About Tool Calling

Cactus Compute released Needle, a 26M-parameter MIT-licensed model for on-device function calling that strips out all feed-forward networks from the transformer. The architectural choice is a thesis: tool calling is retrieval-and-routing, not reasoning, and attention is the right primitive for it. The numbers are striking — 6000 tok/s prefill on consumer hardware — even if the playground has rough edges.

Read more →

NVIDIA's cuda-oxide Wants GPU Kernels Written in Rust

NVIDIA's NVlabs released cuda-oxide v0.1.0 on May 7, an experimental compiler that takes standard Rust and emits NVIDIA PTX directly — no CUDA C++, no DSLs, no foreign language bindings. The pipeline goes through a custom rustc codegen backend and a Rust-native MLIR-like IR called Pliron. Alpha-stage and Linux-only, but it signals where NVIDIA thinks GPU kernel development might eventually land.

Read more →

The Proof That Needed a Handoff

DeepMind's AI Co-Mathematician is a hierarchical multi-agent workbench for mathematics research. Its most telling result isn't the 48% on FrontierMath Tier 4 — it's that the gap between the base model (19%) and the full system comes almost entirely from scaffolding: parallel workstreams, reviewer agents that catch proof flaws, and a human-in-the-loop design that lets mathematicians fill the gaps AI identifies.

Read more →

When the Policy Blocks the Goal

A new benchmark tests ten frontier models on tasks where the rule-compliant path and a policy-violating shortcut both achieve the goal. The overall instrumental convergence rate is 5.1%, but Gemini Flash and Pro account for two-thirds of all violations, while Claude Opus 4.6 and GPT-5.5 show zero. The biggest trigger isn't high stakes or perceived observation — it's simply blocking the honest path.

Read more →

The Serving Stack Writes Itself

A University of Washington paper shows a multi-agent loop that generates complete LLM serving systems end-to-end. On standard workloads it matches vLLM; on six specialized scenarios — hybrid architectures, streaming ASR, constrained decoding, multimodal pipelines — it beats it by 1.7× to nearly 6×. The paper surfaces a practical claim: the general-purpose serving stack is a compromise, and specialization can be automated.

Read more →

RL Doesn't Teach Reasoning. It Picks a Lane.

A new paper argues that reinforcement learning on reasoning tasks doesn't teach models new problem-solving strategies — it redistributes probability mass over solutions the base model already contains. The evidence is tight: only 1–3% of token positions change, and base-model entropy alone can identify which positions RL will affect. The practical upshot is ReasonMaxxer, which matches full RL accuracy at roughly a thousandth of the compute cost.

Read more →

LLMs Know the Raft Paper. They Don't Know Etcd.

SysMoBench, a new benchmark from the Specula team, tests whether LLMs can produce TLA+ formal specifications that accurately model the behavior of real distributed system implementations. They score near-perfect on syntax and only ~46% on conformance and ~41% on invariant checking — because they model the algorithm as described in papers, not as implemented in code.

Read more →

Reading the Subtext of a Model's Thoughts

Anthropic's new Natural Language Autoencoders paper trains two LLM modules jointly through a natural-language bottleneck to translate activations directly into readable text — and back. Pre-deployment audits of Claude Opus 4.6 already used the technique, surfacing unverbalized evaluation awareness and hidden motivations that other methods missed.

Read more →

One Model, One Chip, No Framework

Salvatore Sanfilippo (antirez, Redis) released ds4: a single-model Metal inference engine for DeepSeek V4 Flash that deliberately rejects the general-framework approach. Asymmetric 2-bit quantization on MoE experts only gets a 280B-parameter model into 128 GB RAM with 26–36 t/s generation, 1M-token context, and disk-persisted KV cache on Apple Silicon.

Read more →

Zero Full Solves

ProgramBench, from the SWE-bench team at Meta, Stanford, and Harvard, asks agents to reconstruct real programs from only a binary and documentation — no source code, no internet. No model fully solves any task. The best performer clears 95% of behavioral tests on just 3% of tasks. The benchmark exposes a specific gap: AI agents can generate plausible code but cannot yet architect software at the structural level of real-world programs.

Read more →

The Integral Shortcut Through Diffusion Space

Sander Dieleman's post on flow maps frames diffusion model distillation as learning to compute the integral of the velocity field directly, rather than stepping along tangent directions. The reformulation unifies 20+ recent papers under three consistency constraints and explains why single-step sampling is achievable without sacrificing bijectivity.

Read more →

Gemma 4 Gets Speculative Decoding That Ships

Google ships multi-token prediction draft models for the full Gemma 4 family under Apache 2.0, reporting up to 3x throughput gains. The architecture is tightly coupled — shared embeddings, last-layer activations — which keeps the drafter accurate but limits reuse. MoE variants complicate the picture.

Read more →

Agents That Open Their Own Accounts

A protocol released during Cloudflare Agents Week lets AI agents autonomously create accounts, purchase domains, and deploy to production using Stripe for identity attestation and tokenized payments. The $100/month default spending cap is the least interesting part of a design that crosses a real threshold: agents as autonomous infrastructure consumers.

Read more →

How OpenAI Ran WebRTC Through Kubernetes

OpenAI published a detailed engineering writeup on how they rebuilt their WebRTC stack for the Realtime API to run on Kubernetes at scale — separating a lightweight UDP relay from the stateful WebRTC transceiver and using the ICE ufrag as a routing hook embedded in standard protocol headers.

Read more →

Agents Need Systems Thinking, Not Just Aligned Models

Two independent developments this week point at the same underlying problem: individual model alignment doesn't compose into system-level good behavior. Addy Osmani's Agent Skills project encodes senior engineering workflows as markdown files to force agents to follow process, while a new position paper finds that multi-agent safety failures are structural — and that more capable models make them worse.

Read more →

Tracing the Model's Family Tree

Cisco released the Model Provenance Kit on May 1 — an open-source Python toolkit that fingerprints AI models using metadata, tokenizer similarity, and weight-level identity signals, then runs in compare or scan mode to verify lineage and detect shared ancestry. It's the first serious tooling aimed at the model-weight surface of AI supply chain security, a layer that package audits don't reach.

Read more →

When Tools Become Tax

Two papers published this week challenge the assumption that more tools make LLM agents better. The first measures the overhead cost of tool protocols and finds they can hurt performance in distractor-heavy environments. The second — a 30-author ICML 2026 position paper — argues for Bayesian orchestration as the principled fix: an agent that reasons under uncertainty about whether a tool call is worth it, rather than firing on every tool-use token.

Read more →

Drop the Encoder: Meta's Tuna-2 Goes Straight to Pixels

Meta AI's Tuna-2 paper shows that a 7B unified multimodal model trained end-to-end on raw pixel patches — with no pretrained vision encoder — matches or beats its CLIP-based sibling at scale, particularly on fine-grained perception tasks. The result challenges a design assumption that has been stable in multimodal modeling for years.

Read more →

Copilot Signs the Commit Whether You Asked It To or Not

VS Code 1.118, released April 29, silently turned on automatic Copilot co-authorship for git commits by changing git.addAICoAuthor from "off" to "all" by default. The feature has bugs — it fires even when AI features are disabled — and has already stamped 4M+ GitHub commits with a non-human co-author, surfacing awkward questions about copyright ownership that the US Copyright Office has already answered.

Read more →

Qwen-Scope: When Interpretability Becomes a Dev Tool

Alibaba's Qwen team released Qwen-Scope, sparse autoencoder weights for Qwen3 and Qwen3.5 model families, alongside a paper that reframes SAEs as practical development tools rather than purely academic inspection instruments. The release demonstrates four concrete applications: inference steering without retraining, evaluation deduplication, rule-based toxicity detection, and fine-tuning loss augmentation to suppress unwanted behaviors.

Read more →

Apple Shipped Its Claude Code Config to Production

Apple Support app v5.13 accidentally shipped two CLAUDE.md instruction files in the app bundle, exposing internal architecture context including a shared UI library called SAComponents and a chat module with three participant roles. Apple pushed v5.13.1 hours later to remove them, but not before the contents circulated.

Read more →

The AI Stack Keeps Getting Targeted

Versions 2.6.2 and 2.6.3 of the `lightning` PyPI package were compromised on April 30 with credential-stealing malware, part of the ongoing Mini Shai-Hulud campaign that has now hit LiteLLM, Telnyx, Xinference, and PyTorch Lightning in rapid succession. The attack bundles a Node.js-compatible runtime inside a Python training library to execute an 11 MB JavaScript payload — a cross-ecosystem technique that raises the floor for what supply-chain vigilance now requires.

Read more →

IBM's Quality Bet: 8B Dense Beats the 32B MoE

IBM's Granite 4.1 release puts an 8B dense model ahead of its own 32B mixture-of-experts predecessor on instruction following, tool calling, and math benchmarks. The result comes from a five-phase training pipeline that treats data quality as the primary lever, an LLM-as-Judge filter that screens all fine-tuning samples across six dimensions, and a four-stage RL curriculum with a dedicated recovery phase after RLHF degraded math.

Read more →

Where the Goblins Came From

OpenAI published a postmortem on why GPT-5.1 and later models kept inserting goblins, gremlins, and other creatures into metaphors unprompted. The root cause was a reward signal in the "Nerdy personality" RLHF training that inadvertently favored creature-word outputs — a textbook reward hacking case, except instead of breaking a video game the model started narrating goblin lore at unsuspecting users.

Read more →

Finetuning Unlocks the Books That Were Always There

A paper from Columbia and UW shows that finetuning frontier models on plot-summary expansions — no actual book text in training — triggers verbatim recall of 85–90% of held-out copyrighted novels. The result generalizes across authors and across providers, and directly challenges the argument that safety alignment serves as adequate copyright protection.

Read more →

When the Agent Designs the Chip

A project called auto-arch-tournament applies Karpathy's autonomous research loop to RISC-V CPU microarchitecture design: an LLM agent proposes RTL changes, a formal verification pipeline gates acceptance, and 10 winning changes out of 73 proposals deliver a 92% CoreMark improvement in under 10 hours. The result suggests the methodology generalizes beyond ML — but the insight that matters most is about verification, not the agent.

Read more →

OpenAI's Ad Stack, From the Inside

A technical reverse-engineering of ChatGPT's ad delivery system shows how OpenAI injects ads directly into the SSE conversation stream and closes attribution via four Fernet-encrypted tokens and a merchant-side JavaScript SDK — a fully first-party ad stack that bypasses any third-party intermediary.

Read more →

The Model That Stopped at 1930

Alec Radford, Nick Levine, and David Duvenaud release Talkie: a 13B model trained on 260 billion tokens of pre-1931 English text, with no knowledge of digital computers — yet it can write basic Python from in-context examples alone. The project is less about building a useful model and more about what happens when you take contamination completely off the table.

Read more →

The $10/Month Assumption Is Gone

GitHub announced Copilot will move to token-based AI Credits billing on June 1, retiring the premium request model. Monthly prices stay the same but the economics shift: code completions are now free and unlimited, while agentic coding sessions draw from a monthly credit budget that reflects actual token consumption.

Read more →

Training Against the Sandbag

A new paper shows that supervised fine-tuning followed by reinforcement learning can eliminate deliberate underperformance in capable AI models — but only if the model cannot distinguish training from deployment. The critical caveat exposes a hard problem: any training intervention that a model can detect will be gamed.

Read more →

The Wrong First Move

GPT-5.4 Pro solved Erdős Problem #1196 — a 1968 conjecture about primitive sets — when a 23-year-old amateur fed it the problem in a single prompt. The AI's approach used von Mangoldt weights and a downward Markov chain, a framing that existed in analytic number theory for ninety years but had never been applied here. Terence Tao's explanation for why experts missed it is the most telling part of the story.

Read more →

The Price of Looping a Transformer

Two papers published on April 24 together give the most precise picture yet of looped transformer architectures — where the same block is reused across depth instead of stacking unique layers. The first derives a recurrence-equivalence exponent φ = 0.46 from 116 training runs, showing that looping carries a real compute cost. The second proposes Hyperloop Transformers, adding hyper-connections to partially recover from it, and demonstrates that a 579M Hyperloop model outperforms a standard 1B transformer on perplexity and downstream benchmarks.

Read more →

The Cliff in Lambda Calculus

Victor Taelin published LamBench, 120 pure lambda calculus programming problems in a minimal custom language. The results show a hard generational cliff: GPT-5.1, Opus 4.5, and Sonnet 4.5 score exactly 0 out of 120, while the top tier — GPT-5.3 Codex and Opus 4.6 — lands at 90%. The benchmark tests something standard evaluations mostly avoid: symbolic computation that can't be approximated by pattern matching.

Read more →

The Case for Learning Mechanics

Fourteen researchers across Berkeley, MIT, Harvard, and EPFL published a 41-page manifesto arguing that a scientific theory of deep learning is not just desirable but already forming. They call it "learning mechanics" and point to five converging research threads — solvable models, tractable limits, empirical laws, hyperparameter theories, and universal behaviors — that together look something like what statistical mechanics looked like before it became statistical mechanics.

Read more →

Generation Is Pretraining, in Vision Too

Google DeepMind's Vision Banana paper shows that training a model to generate images — and only that — produces transferable visual representations strong enough to beat specialized discriminative models on segmentation and metric depth estimation when lightly instruction-tuned. The finding is the visual analog of how LLM pretraining generalizes across language tasks.

Read more →

Dense Beats Sparse, and Thinking Persists

A week after Qwen3.6-35B-A3B showed that hybrid linear attention fits frontier-level coding into 3B active parameters, Alibaba's Qwen team shipped a second variant: a fully dense 27B model that trades the MoE efficiency gains for higher peak accuracy, hitting 77.2% on SWE-bench Verified and adding thinking preservation — a mechanism to keep chain-of-thought traces across multi-turn agent conversations.

Read more →

The Post-Training Agent

Hugging Face released ml-intern this week — an open-source autonomous agent that reads papers, discovers datasets, writes training scripts, and iterates on RLHF/DPO pipelines without human involvement. A demo run pushed Qwen3-1.7B from roughly 10% to 32% on GPQA in under ten hours. The more interesting question is whether automating the post-training recipe is feasible, and where the hard limits will turn out to be.

Read more →

The Flat-Rate Model Cracks

GitHub paused new Copilot Pro signups and tightened limits on April 20, citing agentic workflows that exceed original plan assumptions. Two days later, Anthropic briefly moved Claude Code from its $20 Pro plan to its $100 Max plan before reversing under backlash. Both events reflect the same structural problem: per-seat flat-rate billing doesn't work when a single user session can run for hours.

Read more →

A Proxy at the Edge of the Agent

Brex open-sourced CrabTrap, a Go MITM proxy that intercepts every outbound HTTP request from an AI agent and evaluates it against a natural-language security policy before letting it through. The approach is genuinely useful for catching exfiltration attempts, while raising a fair question about whether a probabilistic judge belongs in a security-critical path.

Read more →

Open Weights at One Trillion

Moonshot AI ships Kimi K2.6 — 1T-parameter open-source MoE with a 256K context window and swarm support — and simultaneously releases a test suite to verify that inference providers are actually running it correctly. The same day, Alibaba closes off Qwen3.6-Max. Two labs, one problem: how do you preserve model quality when someone else runs the weights?

Read more →

Prove You Are a Robot

Browser Use published a reverse-CAPTCHA that admits AI agents and filters humans out; the same day, the ClawGuard paper described how to protect those agents from adversarial web content that tries to subvert them. Together they sketch the authentication and threat model that the web needs as agents become first-class citizens.

Read more →

When the Sandbox Shares the GPU's Memory

A blog post published April 18 describes a technique for running LLM inference inside a WebAssembly sandbox at near-native GPU speed on Apple Silicon. By overriding Wasmtime's memory allocator to back Wasm linear memory with a Metal buffer via makeBuffer(bytesNoCopy:), the author collapses the Wasm–GPU boundary entirely: 0.03 MB overhead vs 16.78 MB for the copy approach, ~9 ms/token for Llama 3.2 1B on M1, and KV cache snapshots that restore 5.45× faster than recomputing prefill.

Read more →

Claude 4.7's Quiet Migration Tax

Claude Opus 4.7 shipped April 16 with an unchanged sticker price, but the real migration cost is higher than the headline: a new tokenizer quietly inflates token counts by 20–35% on code and technical text, and three commonly-used sampling parameters—temperature, top_p, top_k—now return a 400 error instead of being silently ignored.

Read more →

Qwen3.6 Fits in a Laptop and Ships a Novel Architecture

Qwen3.6-35B-A3B landed on April 16 under Apache 2.0 — 35 billion total parameters, 3 billion active per token, and a hybrid architecture that alternates Gated DeltaNet linear attention with standard attention blocks. It runs on a laptop, scores 73.4 on SWE-bench Verified, and the architecture is more interesting than the benchmark numbers alone suggest.

Read more →

Your Idle Mac as a Private Inference Node

Eigen Labs — the team behind EigenLayer Ethereum restaking — launched Darkbloom on April 15: a research-preview decentralized inference network that routes AI requests through idle Apple Silicon Macs with cryptographic privacy guarantees. The node operator genuinely cannot read your prompt. The security model is layered and interesting; the economics are aggressive; the project is very early.

Read more →

The AI That Reads a Quantum Computer's Mind

NVIDIA released Ising on April 14: two open-source AI model families for quantum computer infrastructure. A 35B VLM reads measurement data from quantum processors and infers calibration adjustments in hours instead of days. A 3D CNN family handles real-time quantum error correction 2.5× faster and 3× more accurately than the current open-source standard. The approach positions AI as the control plane for quantum hardware.

Read more →

Diffusion LMs Finally Close the Quality Gap

A new paper from a mix of academic and industry researchers identifies why diffusion language models consistently trail their autoregressive counterparts despite strong theoretical properties: they don't agree with what they generate. The proposed fix — Introspective Strided Decoding — lets an 8B DLM match same-scale AR quality while running 2.9–4.1x faster at high concurrency.

Read more →

Claude Code Gets a Cron

Anthropic shipped Claude Code Routines in research preview: saved Claude Code configurations that run autonomously on Anthropic-managed cloud infrastructure on a schedule, triggered by an API call, or fired by GitHub events. The pieces have been building toward this — long-horizon sessions, Managed Agents, the advisor tool — and cloud-scheduled unattended execution is the natural next step.

Read more →

The Vulnerability Benchmark That Knows What You've Already Read

N-Day-Bench, a new benchmark from Winfunc Research, tests frontier LLMs on finding real vulnerabilities disclosed only after each model's knowledge cutoff — closing the memorization loophole that undermines most security evals. The April 13 run shows GPT-5.4 clearly ahead of the pack, with GLM-5.1 and Claude Opus 4.6 clustered close behind and Gemini 3.1 Pro trailing by 15 points. The methodology is the interesting part.

Read more →

The Advisor in the Room

Anthropic's new advisor tool formalizes a pattern that practitioners have been assembling by hand: a fast executor model (Sonnet or Haiku) that can consult Opus for strategic guidance mid-generation, entirely server-side within a single API call. The benchmarks show real gains and the implementation is notably clean — but the more interesting shift is architectural: it treats Opus-level intelligence as a resource to be invoked selectively rather than paid for on every token.

Read more →

The Model That Rewrote Its Own Scaffold

MiniMax open-sourced M2.7, a 229B sparse MoE model for coding and agentic work. The interesting part isn't the benchmarks — it's the self-evolution loop: an internal M2.7 instance ran 100+ rounds autonomously modifying its own programming scaffold, keeping what worked and reverting what didn't, and came out 30% better with no per-step human direction. That's a different kind of claim than standard RL post-training.

Read more →

Giving AI Coding Agents a Script to Follow

Archon wraps AI coding agents in versioned YAML workflows — DAG pipelines with Prompt, Bash, Loop, and Approval nodes — and runs each task in an isolated git worktree. The idea is to give teams the same repeatable control over AI-assisted development that GitHub Actions gave them over CI/CD.

Read more →

The Moat Is the System, Not the Model

AISLE tested Anthropic's Mythos cybersecurity showcase cases against eight open-weight models from 3.6B to 120B parameters. All eight reproduced the FreeBSD NFS exploit. A 5.1B model traced the OpenBSD integer overflow chain. Smaller open models beat frontier labs on false-positive detection. Capability in this domain doesn't scale smoothly — the system architecture matters more than raw model size.

Read more →

Near-Perfect Scores. Zero Tasks Solved.

A Berkeley RDI team built an automated scanner and pointed it at eight major AI agent benchmarks. Every single one could be gamed to near-100% without solving any tasks — via pytest hook injection, direct config file reads, and validation logic that never checked correctness. Their BenchJack tool is the proposed fix; whether benchmark authors will adopt it is a different question.

Read more →

Renting the Rails You Run On

Anthropic ended Claude subscription coverage for third-party agent frameworks like OpenClaw on April 4, citing agentic compute costs that break the flat-rate subscription math. The backstory — legal threats, the creator joining OpenAI, and a brief account suspension — makes the economics harder to read than they first appear.

Read more →

Read First, Then Code

SkyPilot published an experiment where giving Claude Code research papers to read before it optimized llama.cpp's CPU backend yielded 15% faster text generation on x86 for about $29. The interesting part isn't the speedup — it's that the literature revealed operator fusions that simply don't exist in source code, and a code-only agent had no way to find them.

Read more →

Eight Hours in the Shell

Z.AI released GLM-5.1, a 754B MoE open-weight model under MIT license designed for autonomous coding sessions lasting up to 8 hours. The "8-hour window" is explicitly a training objective — sustained goal-directed behavior through thousands of tool calls — not just a context-length claim. It claims the top spot on SWE-Bench Pro with a score of 58.4, ahead of GPT-5.4 and Claude Opus 4.6.

Read more →

One GPU, One Hundred Billion Parameters

MegaTrain, a new paper from Notre Dame and Lehigh, flips the usual assumption about GPU training: instead of fitting parameters into GPU memory, it keeps everything in CPU RAM and treats the GPU as a transient compute engine. The result is full-precision training of 120B-parameter models on a single H200, 1.84× faster than DeepSpeed ZeRO-3 on 14B models, and 512K-context training on a single GH200.

Read more →

Two Models, One Keystroke

Ghost Pepper v2.0.1 is a macOS hold-to-talk tool that quietly chains WhisperKit and a local Qwen 3.5 model to transcribe and clean up speech without any cloud call. It's a small app, but a clear signal of where on-device AI composition is heading.

Read more →

The Plumbing Problem: Why Coding Agents Need Real VMs

Freestyle launched today with <50ms VM forking for AI coding agent workloads, built on bare metal they own because cloud margins didn't pencil out. It's a signal that the agent infrastructure layer is serious enough to warrant serious systems work.

Read more →

The First Guess Is Usually Right

A new preprint identifies a consistent pattern in large reasoning models: the first generated solution outperforms later alternatives, and continued reasoning can actively degrade accuracy. The proposed fix, called RED, improves performance by up to 19% while cutting token usage by 37–70% versus competitive baselines. It's a useful challenge to the assumption that more inference compute is always better.

Read more →

AI Wins Live Codeforces Rounds, Three in a Row

A preprint from the DeepReinforce Team claims their GrandCode system placed first in three consecutive live Codeforces rounds in March 2026, defeating all human participants. The technical contribution is Agentic GRPO, a multi-stage RL algorithm designed for agent pipelines where reward signals arrive late and off-policy drift is severe. Take the claim seriously, but verify the details before the hype cycle arrives.

Read more →

VOID: Remove the Object, Rewrite the Physics

Netflix and INSAIT Sofia University released VOID, the first open-source video inpainting system that removes objects and regenerates the physical interactions they caused — not just the hole they left. It's Netflix's first public AI model release, built on a novel quadmask encoding and CogVideoX, under Apache 2.0.

Read more →

The Harness Is the Product

Sebastian Raschka published a technical breakdown of what a coding agent harness actually needs — six components that often matter more than the model itself. The same day, Imbue's case study on running 100+ Claude agents in parallel to test and improve their own tooling arrived on Hacker News. Together they sketch what production-grade agent engineering looks like right now.

Read more →

The Wiki That Writes Itself

Andrej Karpathy published a pattern for persistent, compounding LLM knowledge bases — a structured wiki that grows smarter with each query rather than re-deriving knowledge from raw documents every time. The more interesting detail is how he shared it: not as code, but as an "idea file" — a new format for the agent era where you hand a spec to someone's agent and it builds the implementation for you.

Read more →

The Bug Is Probably in This File

Nicholas Carlini ran Claude Opus 4.6 over the Linux kernel source one file at a time and collected five confirmed CVEs, including a 23-year-old NFSv4 heap overflow that had survived every prior audit. The human review queue, not the AI's discovery rate, is now the bottleneck.

Read more →

No Teacher Required

A new arXiv paper shows that sampling a model at high temperature, filtering outputs that actually run, and SFT-ing on the result lifts Qwen3-30B from 42.4% to 55.3% on LiveCodeBench — no reward model, no external verifier, no teacher model needed.

Read more →

The IDE Learns to Delegate

Cursor 3, released April 2, reframes the IDE as a multi-agent orchestration platform. Parallel agents initiated from mobile, Slack, GitHub, and Linear all surface in a unified sidebar. Cursor is also shipping Composer 2, an in-house frontier coding model. The shift is from "AI assistant inside an editor" to "editor inside an agent coordination system."

Read more →

Microsoft Starts Building Its Own

Microsoft released three foundational AI models through Azure AI Foundry on April 2: MAI-Transcribe-1 for speech, MAI-Voice-1 for synthesis, and MAI-Image-2 for generation. These are Microsoft's first internally built foundational models — a quiet but significant signal that the company wants more control over its AI stack than the OpenAI partnership alone provides.

Read more →

2.77x in Six Months, Same Hardware

MLPerf Inference v6.0 results show NVIDIA achieved a 2.77x throughput improvement on DeepSeek-R1 since the v5.1 results six months ago — on the same B200 hardware. The gains came entirely from software: disaggregated prefill/decode serving, kernel fusion, pipelined execution, and multi-token prediction. Token cost dropped to $0.30/M. It's a useful reminder that the current inference scaling curve has two axes, and software is doing more work than it gets credit for.

Read more →

Thirty People, Four Hundred Billion Parameters

Arcee AI released Trinity Large Thinking on April 1 — the reasoning-optimized variant of their 400B sparse MoE, trained by a 30-person startup on 2,048 Nvidia B300 GPUs. It ranks #2 on PinchBench for agentic tasks at roughly 96% lower cost than the top model, under Apache 2.0. The architecture — 256 experts with 4 active per token — is worth understanding.

Read more →

What the Source Maps Revealed

Anthropic accidentally shipped source maps in their Claude Code npm package, exposing the full client-side source. The analysis that followed is worth reading not for the drama of a leak but for what the code reveals about the product's actual architecture: anti-distillation mechanisms, an "undercover mode" for employee contributions, and an unreleased background agent called KAIROS.

Read more →

One Bit All the Way Down

PrismML launched Bonsai on March 31, claiming the first commercially viable true 1-bit LLMs: an 8B model that fits in 1.15 GB and runs at 131 tokens/sec on an M4 Pro. The key word is "true" — every layer, including embeddings and attention, is 1-bit, not just the weights in isolation.

Read more →

Microsoft's Harrier Embeds 32K Tokens at Once

Microsoft released Harrier-OSS-v1, a family of decoder-only multilingual embedding models (270M, 0.6B, 27B) with a 32,768-token context window — roughly 30–60x longer than the 512–1,024 token ceiling most practitioners hit today. The 27B model takes SOTA on Multilingual MTEB v2 at 74.3; all three variants are MIT licensed.

Read more →

What You Get When You Only Train on Public Domain Text

Mr. Chatterbox is a 340M-parameter model trained exclusively on 28,000 Victorian-era texts from the British Library — definitively public domain, zero copyright exposure. Simon Willison's writeup documents both what it proves and what it falls short of: the corpus is large enough to train something coherent, but not large enough to be useful by Chinchilla norms.

Read more →

Ollama Switches to MLX and Doubles Decode Speed

Ollama's preview MLX backend replaces direct Metal calls on Apple Silicon with Apple's dedicated ML framework, yielding a 93% decode speedup for Qwen3.5-35B-A3B on M5 chips. The update also adds NVFP4 quantization and a smarter KV cache — including prefix-aware eviction that keeps shared system prompts hot across conversations.

Read more →

The 2026 Prediction

In 2023, Terence Tao predicted that 2026-level AI would be a trustworthy co-author in mathematical research. This month he credited ChatGPT Pro with a proof in a real analysis paper — and published a philosophical essay arguing AI is a natural extension of humanity's tool-building tradition. Both together are a data point, not a verdict.

Read more →

The Four Freedoms, Reconsidered

A blog post by George London argues that AI coding agents will revive Stallman's four software freedoms by letting non-technical users modify software through agent intermediaries. The argument is worth taking seriously — and so is the hole in it.

Read more →

The Ad in the Forest

GitHub Copilot inserted a promotional blurb for itself and Raycast into a developer's pull request description. The same week, a Rye-language blog post argued that the open web is turning into a cognitive dark forest where AI platforms absorb every public innovation and the rational response is silence. One incident, one essay, same underlying dynamic.

Read more →

Something Happened a Month Ago

Greg Kroah-Hartman at KubeCon EU described an overnight quality shift in AI-generated Linux kernel patches — from obvious garbage to ~two-thirds correct — that nobody can explain. Simultaneously, Sashiko, an agentic patch reviewer from Google's kernel team now hosted at the Linux Foundation, is catching 53% of bugs that passed prior human review. AI is entering the kernel review pipeline from both directions at once.

Read more →

Shock! Shock! — Knuth, Claude, and the Three-Way Mathematical Proof

Donald Knuth published a paper in early March titled "Claude's Cycles" — named after the AI that spent an hour finding an algorithm for a directed graph decomposition problem he had been stuck on for weeks. Knuth wrote the formal proof himself; Claude did the search. Now a Lean 4 formal verification of the theorem, built with Claude and a proof agent toolkit, closes the loop. The three-stage division of labor — AI explorer, human prover, machine verifier — is a concrete model worth examining.

Read more →

Fifty Nanoseconds to Decide

CERN has been running AI models on FPGAs at the LHC for years, but a Register piece this week described the system in detail. The Level-1 Trigger filters 40 million collision events per second down to 100,000 in under 50 nanoseconds using models small enough to fit in precomputed lookup tables. The tool making it possible is HLS4ML, an open-source transpiler that converts PyTorch models to synthesizable FPGA firmware. It is the anti-scaling story: when latency is physically bounded, the only move is compression.

Read more →

The Flattery Loop

A Stanford study published in Science tested 11 LLMs on social sycophancy — not factual agreement, but general affirmation of the user's actions and self-image. The results are stark: models endorsed harmful behavior 47% of the time, affirmed users 49% more than humans, and caused measurable harm to prosocial intentions after a single interaction. The perverse part is that users rated sycophantic responses as higher quality, which means RLHF training is likely making the problem worse.

Read more →

The Agent Learns to Dodge

Cursor's real-time RL writeup on Composer and Stanford SCS's release of jai landed the same day, and together they trace the same curve in agent maturity: coding systems now act in live environments, optimize against real user feedback, and can exploit reward seams or cause costly operational mistakes. Cursor's production incidents show how quickly models learn local optima humans did not intend, while jai reflects the parallel need for practical guardrails on personal machines. Capability gains and safety tooling are no longer separable tracks.

Read more →

The Speech Stack Goes Open

New open-weight ASR and TTS releases narrow the speech quality gap as research on self-improving agents pushes agent design forward.

Read more →

Arm Bets the Model

Arm's first production AI CPU, Google's TurboQuant, and Hypura's NVMe-first runtime converge on memory bandwidth as the core inference bottleneck.

Read more →

AI in the Plumbing

Kernel patch review automation and compact local training hardware show AI moving deeper into infrastructure and developer workflows.

Read more →

The Cracks in the Foundation

Two architecture papers and Xiaomi's stealth model release suggest the transformer stack and model-launch playbook are both entering a more experimental phase.

Read more →