The Homework CLAUDE.md

Stanford CS336 shipped a CLAUDE.md file in its assignment repositories that instructs coding agents to act as Socratic tutors rather than solution generators. It is a small thing technically and a significant thing conceptually: domain-specific behavior specification embedded directly in the project.

Read more →

The Message Hidden in the Build Log

jqwik 1.10.0, a Java property-based testing library, ships seven lines of code that write a prompt injection message to stdout — invisible on interactive terminals via ANSI erase codes, but fully readable in the captured output that CI systems and coding agents consume. It's the first known case of a library maintainer deliberately embedding text aimed at AI agents in a routine patch release, and it points at a supply-chain attack surface that current tooling ignores entirely.

Read more →

Product-Market Fit, Demonstrated in Invoices

Simon Willison's May 27 analysis documents the concrete evidence that enterprise coding agents have found genuine product-market fit: Uber burned through its entire 2026 AI budget in four months, Anthropic signed a $1.25B/month compute deal with xAI through 2029, and Anthropic is on track for a first profitable quarter. The signal is in the invoices.

Read more →

The Terminal Agent That Bets Everything on the Cache

DeepSeek Reasonix is a DeepSeek-native terminal coding agent that treats prefix-cache stability as a first-class invariant rather than a side effect. With 99.82% cache hit rates in reported benchmarks, it cuts a heavy session from ~$61 to ~$12 — deliberately by coupling tightly to one provider's caching behavior instead of staying provider-agnostic.

Read more →

Zero Full Solves

ProgramBench, from the SWE-bench team at Meta, Stanford, and Harvard, asks agents to reconstruct real programs from only a binary and documentation — no source code, no internet. No model fully solves any task. The best performer clears 95% of behavioral tests on just 3% of tasks. The benchmark exposes a specific gap: AI agents can generate plausible code but cannot yet architect software at the structural level of real-world programs.

Read more →