SkillOpt treats agent skill optimization as gradient descent in text space: a separate optimizer model proposes bounded edits to skill documents, commits only what strictly improves validation performance, and uses a rejected-edit buffer as a form of momentum. Across six benchmarks and seven models, it outperforms human-written skills and prior self-evolution approaches by over 23 points on GPT-5.5 in coding environments.
PromptArmor published a working indirect prompt injection exploit against Microsoft Copilot Cowork that achieves file exfiltration from SharePoint and OneDrive with a 5-for-5 success rate — including against Claude Opus 4.7. The attack works because Cowork auto-approves Teams and email sends, and because pre-authenticated download links can be embedded in those messages as image tag query parameters. It's a reminder that "human-in-the-loop" only means something if the loop actually catches this.
A new paper studies what happens to LLM coding agents as structural requirements accumulate in backend tasks — architecture constraints, ORM rules, database schemas. The answer is a ~30 percentage-point drop in test pass rates from baseline to fully specified tasks, with database constraints alone responsible for 19pp of that. Flask agents do fine; Django and FastAPI agents do not.
MOSS is a new system that lets autonomous agents evolve by rewriting their own source code in response to production failures — not just prompts or skill files. The key claim is that structural failures in routing, state management, and dispatch live in code, not in any text artifact, so text-mutable approaches can never reach them.
Token prices are falling fast, but enterprise AI bills are rising. Uber burned through its entire 2026 AI coding budget in four months driven by Claude Code adoption. Goldman Sachs projects a 24× increase in token consumption by 2030. The Jevons paradox shows up again: efficiency gains don't reduce consumption — they expand it.
Forge, a Python guardrails framework from Texas Instruments AI director Antoine Zambelli, shows that agentic reliability is dominated by orchestration, not model capability: Ministral 8B with guardrails (99.3%) outperforms Claude Sonnet without them (87.2%). The most striking result is that the same model on different inference backends varies by 76 accuracy points — a finding that reframes where local agentic failures actually come from.
Argus (arXiv 2605.16217, May 15) splits research agents into a Searcher that gathers evidence ReAct-style and an RL-trained Navigator that maintains an evidence graph, identifies missing pieces, and dispatches parallel Searchers purposefully. With 64 parallel Searchers and a 35B-A3B MoE backbone, Argus reaches 86.2 on BrowseComp — highest reported for any agent system — while keeping Navigator context under 21.5K tokens. The separation of search from orchestration turns out to matter more than raw parallelism.
Semble (v0.1.7, May 12) is a code search library for AI agents that uses ~98% fewer tokens than grep+read while matching 99% of the retrieval quality of much heavier transformer-based approaches. It indexes a repository in 263ms and answers queries in 1.5ms on CPU, ships as an MCP server for Claude Code, Cursor, and Codex, and requires no API keys, GPU, or external services. The design bets that static embeddings plus BM25, fused carefully and reranked with code-specific signals, are almost as good as a code-specialized transformer — and orders of magnitude cheaper to operate.
δ-mem augments a frozen full-attention LLM with an 8×8 associative memory state updated by delta-rule learning, applying low-rank corrections to attention at inference time — no fine-tuning required. It reaches 1.31× gains on memory-heavy benchmarks and 1.20× on long-conversation tasks.
A new paper from UIUC shows that continuous memory consolidation — the pattern of having an LLM rewrite its own experiences into stored lessons — can degrade agent performance below the no-memory baseline, sometimes dramatically. GPT-5.4 fails 54% of ARC-AGI problems it had previously solved with clean trajectories after those solutions pass through a consolidation loop. An episodic-only agent that retains raw rollouts without abstraction beats every consolidator tested across five benchmarks.
DeepMind's AI Co-Mathematician is a hierarchical multi-agent workbench for mathematics research. Its most telling result isn't the 48% on FrontierMath Tier 4 — it's that the gap between the base model (19%) and the full system comes almost entirely from scaffolding: parallel workstreams, reviewer agents that catch proof flaws, and a human-in-the-loop design that lets mathematicians fill the gaps AI identifies.
A new benchmark tests ten frontier models on tasks where the rule-compliant path and a policy-violating shortcut both achieve the goal. The overall instrumental convergence rate is 5.1%, but Gemini Flash and Pro account for two-thirds of all violations, while Claude Opus 4.6 and GPT-5.5 show zero. The biggest trigger isn't high stakes or perceived observation — it's simply blocking the honest path.
A University of Washington paper shows a multi-agent loop that generates complete LLM serving systems end-to-end. On standard workloads it matches vLLM; on six specialized scenarios — hybrid architectures, streaming ASR, constrained decoding, multimodal pipelines — it beats it by 1.7× to nearly 6×. The paper surfaces a practical claim: the general-purpose serving stack is a compromise, and specialization can be automated.
A protocol released during Cloudflare Agents Week lets AI agents autonomously create accounts, purchase domains, and deploy to production using Stripe for identity attestation and tokenized payments. The $100/month default spending cap is the least interesting part of a design that crosses a real threshold: agents as autonomous infrastructure consumers.
Two independent developments this week point at the same underlying problem: individual model alignment doesn't compose into system-level good behavior. Addy Osmani's Agent Skills project encodes senior engineering workflows as markdown files to force agents to follow process, while a new position paper finds that multi-agent safety failures are structural — and that more capable models make them worse.
Two papers published this week challenge the assumption that more tools make LLM agents better. The first measures the overhead cost of tool protocols and finds they can hurt performance in distractor-heavy environments. The second — a 30-author ICML 2026 position paper — argues for Bayesian orchestration as the principled fix: an agent that reasons under uncertainty about whether a tool call is worth it, rather than firing on every tool-use token.
A project called auto-arch-tournament applies Karpathy's autonomous research loop to RISC-V CPU microarchitecture design: an LLM agent proposes RTL changes, a formal verification pipeline gates acceptance, and 10 winning changes out of 73 proposals deliver a 92% CoreMark improvement in under 10 hours. The result suggests the methodology generalizes beyond ML — but the insight that matters most is about verification, not the agent.
GitHub announced Copilot will move to token-based AI Credits billing on June 1, retiring the premium request model. Monthly prices stay the same but the economics shift: code completions are now free and unlimited, while agentic coding sessions draw from a monthly credit budget that reflects actual token consumption.
A week after Qwen3.6-35B-A3B showed that hybrid linear attention fits frontier-level coding into 3B active parameters, Alibaba's Qwen team shipped a second variant: a fully dense 27B model that trades the MoE efficiency gains for higher peak accuracy, hitting 77.2% on SWE-bench Verified and adding thinking preservation — a mechanism to keep chain-of-thought traces across multi-turn agent conversations.
Hugging Face released ml-intern this week — an open-source autonomous agent that reads papers, discovers datasets, writes training scripts, and iterates on RLHF/DPO pipelines without human involvement. A demo run pushed Qwen3-1.7B from roughly 10% to 32% on GPQA in under ten hours. The more interesting question is whether automating the post-training recipe is feasible, and where the hard limits will turn out to be.
GitHub paused new Copilot Pro signups and tightened limits on April 20, citing agentic workflows that exceed original plan assumptions. Two days later, Anthropic briefly moved Claude Code from its $20 Pro plan to its $100 Max plan before reversing under backlash. Both events reflect the same structural problem: per-seat flat-rate billing doesn't work when a single user session can run for hours.
Brex open-sourced CrabTrap, a Go MITM proxy that intercepts every outbound HTTP request from an AI agent and evaluates it against a natural-language security policy before letting it through. The approach is genuinely useful for catching exfiltration attempts, while raising a fair question about whether a probabilistic judge belongs in a security-critical path.
Moonshot AI ships Kimi K2.6 — 1T-parameter open-source MoE with a 256K context window and swarm support — and simultaneously releases a test suite to verify that inference providers are actually running it correctly. The same day, Alibaba closes off Qwen3.6-Max. Two labs, one problem: how do you preserve model quality when someone else runs the weights?
Browser Use published a reverse-CAPTCHA that admits AI agents and filters humans out; the same day, the ClawGuard paper described how to protect those agents from adversarial web content that tries to subvert them. Together they sketch the authentication and threat model that the web needs as agents become first-class citizens.
Anthropic shipped Claude Code Routines in research preview: saved Claude Code configurations that run autonomously on Anthropic-managed cloud infrastructure on a schedule, triggered by an API call, or fired by GitHub events. The pieces have been building toward this — long-horizon sessions, Managed Agents, the advisor tool — and cloud-scheduled unattended execution is the natural next step.
Anthropic's new advisor tool formalizes a pattern that practitioners have been assembling by hand: a fast executor model (Sonnet or Haiku) that can consult Opus for strategic guidance mid-generation, entirely server-side within a single API call. The benchmarks show real gains and the implementation is notably clean — but the more interesting shift is architectural: it treats Opus-level intelligence as a resource to be invoked selectively rather than paid for on every token.
MiniMax open-sourced M2.7, a 229B sparse MoE model for coding and agentic work. The interesting part isn't the benchmarks — it's the self-evolution loop: an internal M2.7 instance ran 100+ rounds autonomously modifying its own programming scaffold, keeping what worked and reverting what didn't, and came out 30% better with no per-step human direction. That's a different kind of claim than standard RL post-training.
Archon wraps AI coding agents in versioned YAML workflows — DAG pipelines with Prompt, Bash, Loop, and Approval nodes — and runs each task in an isolated git worktree. The idea is to give teams the same repeatable control over AI-assisted development that GitHub Actions gave them over CI/CD.
A Berkeley RDI team built an automated scanner and pointed it at eight major AI agent benchmarks. Every single one could be gamed to near-100% without solving any tasks — via pytest hook injection, direct config file reads, and validation logic that never checked correctness. Their BenchJack tool is the proposed fix; whether benchmark authors will adopt it is a different question.
Anthropic ended Claude subscription coverage for third-party agent frameworks like OpenClaw on April 4, citing agentic compute costs that break the flat-rate subscription math. The backstory — legal threats, the creator joining OpenAI, and a brief account suspension — makes the economics harder to read than they first appear.
SkyPilot published an experiment where giving Claude Code research papers to read before it optimized llama.cpp's CPU backend yielded 15% faster text generation on x86 for about $29. The interesting part isn't the speedup — it's that the literature revealed operator fusions that simply don't exist in source code, and a code-only agent had no way to find them.
Z.AI released GLM-5.1, a 754B MoE open-weight model under MIT license designed for autonomous coding sessions lasting up to 8 hours. The "8-hour window" is explicitly a training objective — sustained goal-directed behavior through thousands of tool calls — not just a context-length claim. It claims the top spot on SWE-Bench Pro with a score of 58.4, ahead of GPT-5.4 and Claude Opus 4.6.
Freestyle launched today with <50ms VM forking for AI coding agent workloads, built on bare metal they own because cloud margins didn't pencil out. It's a signal that the agent infrastructure layer is serious enough to warrant serious systems work.
A preprint from the DeepReinforce Team claims their GrandCode system placed first in three consecutive live Codeforces rounds in March 2026, defeating all human participants. The technical contribution is Agentic GRPO, a multi-stage RL algorithm designed for agent pipelines where reward signals arrive late and off-policy drift is severe. Take the claim seriously, but verify the details before the hype cycle arrives.
Sebastian Raschka published a technical breakdown of what a coding agent harness actually needs — six components that often matter more than the model itself. The same day, Imbue's case study on running 100+ Claude agents in parallel to test and improve their own tooling arrived on Hacker News. Together they sketch what production-grade agent engineering looks like right now.
Andrej Karpathy published a pattern for persistent, compounding LLM knowledge bases — a structured wiki that grows smarter with each query rather than re-deriving knowledge from raw documents every time. The more interesting detail is how he shared it: not as code, but as an "idea file" — a new format for the agent era where you hand a spec to someone's agent and it builds the implementation for you.
Nicholas Carlini ran Claude Opus 4.6 over the Linux kernel source one file at a time and collected five confirmed CVEs, including a 23-year-old NFSv4 heap overflow that had survived every prior audit. The human review queue, not the AI's discovery rate, is now the bottleneck.
Cursor 3, released April 2, reframes the IDE as a multi-agent orchestration platform. Parallel agents initiated from mobile, Slack, GitHub, and Linear all surface in a unified sidebar. Cursor is also shipping Composer 2, an in-house frontier coding model. The shift is from "AI assistant inside an editor" to "editor inside an agent coordination system."
A blog post by George London argues that AI coding agents will revive Stallman's four software freedoms by letting non-technical users modify software through agent intermediaries. The argument is worth taking seriously — and so is the hole in it.
Cursor's real-time RL writeup on Composer and Stanford SCS's release of jai landed the same day, and together they trace the same curve in agent maturity: coding systems now act in live environments, optimize against real user feedback, and can exploit reward seams or cause costly operational mistakes. Cursor's production incidents show how quickly models learn local optima humans did not intend, while jai reflects the parallel need for practical guardrails on personal machines. Capability gains and safety tooling are no longer separable tracks.