One GPU, One Hundred Billion Parameters

MegaTrain, a new paper from Notre Dame and Lehigh, flips the usual assumption about GPU training: instead of fitting parameters into GPU memory, it keeps everything in CPU RAM and treats the GPU as a transient compute engine. The result is full-precision training of 120B-parameter models on a single H200, a 1.84× speedup over DeepSpeed ZeRO-3 at 14B scale, and 512K-context training on a single GH200.
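The core pattern is easy to sketch. Here's a minimal toy version of the idea, assuming nothing about the paper's actual implementation: all weights stay resident in (simulated) host memory, and a small device buffer stages one layer at a time for compute. Every name here is illustrative, not from MegaTrain.

```python
# Toy sketch of a CPU-resident / GPU-transient execution loop.
# HOST_WEIGHTS and DEVICE_BUFFER are stand-ins, not real APIs.

HOST_WEIGHTS = {f"layer{i}": [float(i)] * 4 for i in range(6)}  # lives in CPU RAM
DEVICE_BUFFER = {}  # holds at most one layer's weights at a time

def stage(layer_name):
    """Copy one layer host -> device, evicting whatever was there."""
    DEVICE_BUFFER.clear()
    DEVICE_BUFFER[layer_name] = list(HOST_WEIGHTS[layer_name])

def forward(x):
    # Stream layers through the device one at a time; the GPU is a
    # transient compute engine, never the parameter store.
    for name in HOST_WEIGHTS:
        stage(name)
        w = DEVICE_BUFFER[name]
        x = [xi + wi for xi, wi in zip(x, w)]  # stand-in for the layer op
        assert len(DEVICE_BUFFER) == 1        # device never holds >1 layer
    return x

out = forward([0.0] * 4)
```

The real system obviously overlaps transfers with compute and handles gradients and optimizer state too; the point of the sketch is just the inversion of ownership — host memory is the source of truth, device memory is scratch.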

Read more →

The Plumbing Problem: Why Coding Agents Need Real VMs

Freestyle launched today with <50ms VM forking for AI coding agent workloads, built on bare metal they own because cloud margins didn't pencil out. It's a signal that the agent infrastructure layer has matured enough to warrant real systems work.

Read more →

2.77x in Six Months, Same Hardware

MLPerf Inference v6.0 results show NVIDIA achieved a 2.77x throughput improvement on DeepSeek-R1 since the v5.1 results six months ago — on the same B200 hardware. The gains came entirely from software: disaggregated prefill/decode serving, kernel fusion, pipelined execution, and multi-token prediction. Token cost dropped to $0.30/M. It's a useful reminder that the current inference scaling curve has two axes, and software is doing more work than it gets credit for.
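Disaggregated prefill/decode is the most structural of those software changes, and the shape of it fits in a few lines. This is a toy illustration of the serving split, not NVIDIA's implementation: a prefill worker processes the whole prompt once and hands off a KV cache, and decode then generates one token per step against that cache.

```python
# Illustrative-only sketch of disaggregated prefill/decode serving.

def prefill(prompt_tokens):
    # Prefill is compute-bound: build the attention KV cache for every
    # prompt position in a single pass.
    return [("k%d" % t, "v%d" % t) for t in prompt_tokens]

def decode(kv_cache, steps):
    # Decode is memory-bandwidth-bound: each step reads the whole cache
    # and appends exactly one new entry.
    out = []
    for _ in range(steps):
        next_tok = len(kv_cache)  # stand-in for the model's sampling step
        kv_cache.append(("k%d" % next_tok, "v%d" % next_tok))
        out.append(next_tok)
    return out

cache = prefill([0, 1, 2])
tokens = decode(cache, steps=3)
```

Because the two phases stress different resources, running them on separate worker pools lets each pool be sized and batched independently — which is where much of the throughput gain comes from.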

Read more →

Ollama Switches to MLX and Doubles Decode Speed

Ollama's preview MLX backend replaces direct Metal calls on Apple Silicon with Apple's dedicated ML framework, yielding a 93% decode speedup for Qwen3.5-35B-A3B on M5 chips. The update also adds NVFP4 quantization and a smarter KV cache — including prefix-aware eviction that keeps shared system prompts hot across conversations.
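Prefix-aware eviction is worth unpacking. Here's a hypothetical sketch of the policy — not Ollama's actual code — where cache entries shared by multiple live conversations are skipped when picking an eviction victim, so a common system prompt stays resident while per-chat state churns.

```python
# Hypothetical prefix-aware KV-cache eviction policy (illustrative only).
from collections import OrderedDict

class PrefixAwareCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # prefix -> kv blob, in LRU order
        self.refcount = {}             # conversations sharing each prefix

    def put(self, prefix, kv, shared_by=1):
        if prefix in self.entries:
            self.entries.move_to_end(prefix)
        self.entries[prefix] = kv
        self.refcount[prefix] = shared_by
        while len(self.entries) > self.capacity:
            self._evict_one()

    def _evict_one(self):
        # Evict the least-recently-used entry that is NOT shared; fall
        # back to plain LRU only if everything is shared.
        for prefix in self.entries:
            if self.refcount.get(prefix, 0) <= 1:
                del self.entries[prefix]
                return
        self.entries.popitem(last=False)

cache = PrefixAwareCache(capacity=2)
cache.put("system-prompt", "kv0", shared_by=3)  # hot across conversations
cache.put("chat-a", "kv1")
cache.put("chat-b", "kv2")  # evicts chat-a, not the shared prompt
```

A plain LRU would have evicted the system prompt here (it's the oldest entry); weighting eviction by sharing is what keeps the expensive-to-recompute prefix hot.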

Read more →

Fifty Nanoseconds to Decide

CERN has been running AI models on FPGAs at the LHC for years, but a Register piece this week described the system in detail. The Level-1 Trigger filters 40 million collision events per second down to 100,000 in under 50 nanoseconds using models small enough to fit in precomputed lookup tables. The tool making it possible is HLS4ML, an open-source transpiler that converts PyTorch models to synthesizable FPGA firmware. It is the anti-scaling story: when latency is physically bounded, the only move is compression.
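The lookup-table trick deserves a sketch of its own. This is an illustration of the principle, not HLS4ML itself: if the input can be quantized to few enough bits, a small model's entire input-to-output map can be enumerated offline, and "inference" on the FPGA collapses to a single table read.

```python
# Illustrative lookup-table inference: precompute every output offline,
# then serve with one array index. tiny_model is a stand-in, not a real
# trained network.

BITS = 4  # 4-bit quantized input -> only 16 possible inputs

def tiny_model(x):
    # Stand-in for a small trained classifier evaluated offline.
    return 1 if (x * 3 + 1) % 16 > 7 else 0

# Offline: enumerate the full quantized input space once.
LUT = [tiny_model(x) for x in range(2 ** BITS)]

def trigger_decision(x_quantized):
    # Online: no multiplies, no activations -- one table read, which is
    # what makes a fixed sub-microsecond latency budget achievable.
    return LUT[x_quantized]
```

The compression pressure is severe — table size grows exponentially in input bits — which is exactly why the models have to be tiny and aggressively quantized.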

Read more →

Arm Bets the Model

Arm's first production AI CPU, Google's TurboQuant, and Hypura's NVMe-first runtime converge on memory bandwidth as the core inference bottleneck.

Read more →

AI in the Plumbing

Kernel patch review automation and compact local training hardware show AI moving deeper into infrastructure and developer workflows.

Read more →