MegaTrain, a new paper from Notre Dame and Lehigh, flips the usual assumption about GPU training: instead of fitting parameters into GPU memory, it keeps everything in CPU RAM and treats the GPU as a transient compute engine. The result is full-precision training of 120B-parameter models on a single H200, 1.84× faster than DeepSpeed ZeRO-3 on 14B models, and 512K-context training on a single GH200.
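The paper's actual streaming schedule isn't reproduced here, but the core offload pattern it relies on — parameters resident in CPU RAM, each layer staged onto the device only for its own compute, optimizer updates applied host-side — can be sketched in a few lines. This is a NumPy toy under those assumptions: device transfers are simulated with array copies, and `OffloadedMLP` is a hypothetical name, not MegaTrain's API.

```python
import numpy as np

class OffloadedMLP:
    """Toy offloaded trainer: all weights live in host (CPU) RAM; each layer
    is staged onto the 'device' only while it computes, then evicted."""

    def __init__(self, sizes, seed=0):
        rng = np.random.default_rng(seed)
        # parameters stay resident in CPU RAM for the whole run
        self.host_weights = [rng.standard_normal((a, b)) * 0.1
                             for a, b in zip(sizes, sizes[1:])]

    def forward(self, x):
        acts = [x]
        for w in self.host_weights:
            dev_w = w.copy()                    # stand-in for an H2D transfer
            acts.append(np.maximum(acts[-1] @ dev_w, 0.0))  # ReLU layer "on device"
            del dev_w                           # weight block evicted after use
        return acts

    def backward_and_step(self, acts, grad_out, lr=0.05):
        # walk layers in reverse, re-staging each weight only to compute grads
        for i in reversed(range(len(self.host_weights))):
            w = self.host_weights[i]            # re-staged from host RAM
            g = grad_out * (acts[i + 1] > 0)    # ReLU derivative
            grad_w = acts[i].T @ g              # weight gradient "on device"
            grad_out = g @ w.T                  # gradient w.r.t. layer input
            # optimizer step applied in host RAM; no state is GPU-resident
            self.host_weights[i] = w - lr * grad_w
```

The point of the pattern is that peak device memory is one layer's weights plus activations, independent of total parameter count — which is what lets a 120B model fit through a single H200.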
Freestyle launched today with sub-50ms VM forking for AI coding-agent workloads, built on bare metal they own because cloud margins didn't pencil out. It's a signal that the agent-infrastructure layer is mature enough to warrant serious systems work.
MLPerf Inference v6.0 results show NVIDIA achieved a 2.77× throughput improvement on DeepSeek-R1 since the v5.1 results six months ago, on the same B200 hardware. The gains came entirely from software: disaggregated prefill/decode serving, kernel fusion, pipelined execution, and multi-token prediction. Token cost dropped to $0.30/M. It's a useful reminder that the current inference scaling curve has two axes, and software is doing more work than it gets credit for.
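Disaggregated serving is the headline technique: prompt processing (compute-bound prefill) and token generation (memory-bound decode) run on separate workers, with the KV state handed off between them so the two phases never contend for one device. A toy two-thread sketch of the pattern — this is not NVIDIA's serving stack; `Request`, the dummy next-token rule, and the queue-based handoff are all illustrative:

```python
from dataclasses import dataclass
from queue import Queue
from threading import Thread

@dataclass
class Request:
    prompt: list      # token ids
    max_new: int      # tokens to generate

def prefill_worker(inbox: Queue, handoff: Queue):
    """Processes whole prompts, then ships the resulting KV state onward."""
    while True:
        req = inbox.get()
        if req is None:              # shutdown sentinel
            handoff.put(None)
            return
        kv = list(req.prompt)        # stand-in for the prompt's KV cache
        handoff.put((req, kv))       # "KV transfer" to the decode tier

def decode_worker(handoff: Queue, results: Queue):
    """Generates one token at a time, reusing the transferred KV state."""
    while True:
        item = handoff.get()
        if item is None:
            return
        req, kv = item
        out = []
        for _ in range(req.max_new):
            tok = sum(kv) % 97       # dummy next-token rule
            out.append(tok)
            kv.append(tok)           # KV grows by one entry per step
        results.put((req.prompt, out))
```

Because each tier is a queue consumer, prefill and decode capacity can be scaled independently to match the workload's prompt-to-generation ratio.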
Ollama's preview MLX backend replaces direct Metal calls on Apple Silicon with Apple's dedicated ML framework, yielding a 93% decode speedup for Qwen3.5-35B-A3B on M5 chips. The update also adds NVFP4 quantization and a smarter KV cache — including prefix-aware eviction that keeps shared system prompts hot across conversations.
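Prefix-aware eviction amounts to refcounting cache blocks that are shared across conversations and evicting only unshared ones, so a system prompt used by every session never gets pushed out. A hypothetical sketch — Ollama's actual data structures aren't described in the release, and `PrefixAwareKVCache` and its API are invented for illustration:

```python
from collections import OrderedDict

class PrefixAwareKVCache:
    """Toy KV-block cache: blocks shared by multiple conversations are
    refcounted and skipped during eviction; unshared blocks go LRU-first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()      # prefix key -> (refcount, kv payload)

    def get(self, key):
        if key in self.blocks:
            self.blocks.move_to_end(key)         # mark recently used
            return self.blocks[key][1]
        return None                               # cache miss

    def put(self, key, kv, refs=1):
        if key in self.blocks:
            cnt, _ = self.blocks[key]
            self.blocks[key] = (cnt + refs, kv)
            self.blocks.move_to_end(key)
        else:
            self._evict_if_needed()
            self.blocks[key] = (refs, kv)

    def release(self, key):
        """A conversation ends; drop its reference to a shared block."""
        if key in self.blocks:
            cnt, kv = self.blocks[key]
            self.blocks[key] = (max(cnt - 1, 0), kv)

    def _evict_if_needed(self):
        if len(self.blocks) < self.capacity:
            return
        # evict the least-recently-used block no conversation still shares
        for key, (cnt, _) in self.blocks.items():
            if cnt <= 1:
                del self.blocks[key]
                return
        self.blocks.popitem(last=False)  # everything shared: plain LRU fallback
```

The payoff is that a hot shared prefix costs its KV memory once, while per-conversation suffixes churn through the LRU tail.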
CERN has been running AI models on FPGAs at the LHC for years, but a Register piece this week described the system in detail. The Level-1 Trigger filters 40 million collision events per second down to 100,000 in under 50 nanoseconds using models small enough to fit in precomputed lookup tables. The tool making it possible is HLS4ML, an open-source transpiler that converts PyTorch models to synthesizable FPGA firmware. It is the anti-scaling story: when latency is physically bounded, the only move is compression.
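The lookup-table trick works because a sufficiently quantized model has a small, finite input space: enumerate it offline, bake every output into a table, and online inference collapses to a single memory read. A Python toy under those assumptions — HLS4ML's real flow emits synthesizable HLS C++ for the FPGA, and the tiny linear "trigger" model here is invented:

```python
import itertools
import numpy as np

BITS, N_FEATURES = 2, 4          # 2-bit features -> (2**2)**4 = 256 inputs
LEVELS = 1 << BITS

def model(q):
    """The network we want to bake: a linear accept/reject decision."""
    w = np.array([0.5, -1.0, 0.25, 0.75])
    return float(q @ w) > 0.0

def index(q):
    """Pack quantized features into one base-LEVELS table index."""
    idx = 0
    for v in q:
        idx = idx * LEVELS + int(v)
    return idx

# offline: run the model once on every possible quantized input
lut = np.zeros(LEVELS ** N_FEATURES, dtype=bool)
for q in itertools.product(range(LEVELS), repeat=N_FEATURES):
    lut[index(q)] = model(np.array(q, dtype=float))

def trigger(q):
    return bool(lut[index(q)])   # online inference = one table read
```

At 256 entries this fits in a single FPGA block RAM, which is why latency stops depending on model depth at all — the compression happens before the first event arrives.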