OpenAI published a detailed engineering writeup on how they rebuilt their WebRTC stack for the Realtime API to run on Kubernetes at scale — separating a lightweight UDP relay from the stateful WebRTC transceiver and using the ICE ufrag as a routing hook embedded in standard protocol headers.
A blog post published April 18 describes a technique for running LLM inference inside a WebAssembly sandbox at near-native GPU speed on Apple Silicon. By overriding Wasmtime's memory allocator to back Wasm linear memory with a Metal buffer via makeBuffer(bytesNoCopy:), the author collapses the Wasm–GPU boundary entirely: 0.03 MB overhead vs 16.78 MB for the copy approach, ~9 ms/token for Llama 3.2 1B on M1, and KV cache snapshots that restore 5.45× faster than recomputing prefill.
Eigen Labs — the team behind EigenLayer Ethereum restaking — launched Darkbloom on April 15: a research-preview decentralized inference network that routes AI requests through idle Apple Silicon Macs with cryptographic privacy guarantees. The node operator genuinely cannot read your prompt. The security model is layered and interesting; the economics are aggressive; the project is very early.
MegaTrain, a new paper from Notre Dame and Lehigh, flips the usual assumption about GPU training: instead of fitting parameters into GPU memory, it keeps everything in CPU RAM and treats the GPU as a transient compute engine. The result is full-precision training of 120B-parameter models on a single H200, 1.84× faster than DeepSpeed ZeRO-3 on 14B models, and 512K-context training on a single GH200.
Freestyle launched today with <50ms VM forking for AI coding agent workloads, built on bare metal they own because cloud margins didn't pencil out. It's a signal that the agent infrastructure layer is serious enough to warrant serious systems work.
MLPerf Inference v6.0 results show NVIDIA achieved a 2.77x throughput improvement on DeepSeek-R1 since the v5.1 results six months ago — on the same B200 hardware. The gains came entirely from software: disaggregated prefill/decode serving, kernel fusion, pipelined execution, and multi-token prediction. Token cost dropped to $0.30/M. It's a useful reminder that the current inference scaling curve has two axes, and software is doing more work than it gets credit for.
Ollama's preview MLX backend replaces direct Metal calls on Apple Silicon with Apple's dedicated ML framework, yielding a 93% decode speedup for Qwen3.5-35B-A3B on M5 chips. The update also adds NVFP4 quantization and a smarter KV cache — including prefix-aware eviction that keeps shared system prompts hot across conversations.
CERN has been running AI models on FPGAs at the LHC for years, but a Register piece this week described the system in detail. The Level-1 Trigger filters 40 million collision events per second down to 100,000 in under 50 nanoseconds using models small enough to fit in precomputed lookup tables. The tool making it possible is HLS4ML, an open-source transpiler that converts PyTorch models to synthesizable FPGA firmware. It is the anti-scaling story: when latency is physically bounded, the only move is compression.