Infrastructure · AI Beat

28 Jun 2026 · AI Beat Desk

The Circuits AI Designs That No Human Would Have Drawn

Princeton's Kaushik Sengupta describes in IEEE Spectrum how reinforcement learning and electromagnetic emulation have crossed a threshold in radio frequency chip design: AI-generated circuits now routinely outperform human-designed ones, and the layouts look like QR codes — novel topologies that no human designer would produce or easily read.

25 Jun 2026 · AI Beat Desk

Mojo Goes to Qualcomm

Qualcomm agreed to acquire Modular for approximately $3.9 billion on June 24. Modular makes Mojo (a Python-superset systems language) and MAX (a hardware-agnostic inference engine). The deal is a bet that AI inference will fracture across hardware vendors, and whoever owns the abstraction layer wins.

19 Jun 2026 · AI Beat Desk

MCP Gets Its Enterprise Authorization Layer

The Model Context Protocol stabilizes Enterprise-Managed Authorization: organizations configure MCP server access once through their identity provider, and users get zero-touch provisioning via an Identity Assertion JWT flow, with no per-server consent screens. Okta is the first supported IdP, with Claude, Claude Code, and VS Code 1.123 as the first clients. It's the plumbing that turns MCP from a developer prototype into something an enterprise can actually operate.

10 Jun 2026 · AI Beat Desk

OpenCV Turns 25 and Learns to Run LLMs

OpenCV 5.0 ships a ground-up rewrite of its DNN engine: ONNX operator coverage jumps from 22% to 80%+, and native LLM/VLM support lands in a library already deployed across embedded systems, medical devices, and industrial hardware that can't run PyTorch.

09 Jun 2026 · AI Beat Desk

A Trillion Parameters at a Thousand Tokens Per Second

Xiaomi and TileRT published MiMo-V2.5-Pro-UltraSpeed on June 8, pushing a one-trillion-parameter model past 1,000 tokens per second on a single standard 8-GPU node. No custom silicon was involved, just three carefully chosen co-design decisions applied to commodity hardware.

08 Jun 2026 · AI Beat Desk

CUDA Comes to Your Laptop

NVIDIA's RTX Spark puts a Blackwell GPU and full CUDA stack inside a laptop SoC — enough to run a 120B-parameter model locally with 1M-token context. At roughly the same moment, Perplexity shipped a hybrid inference orchestrator that uses a compact on-device model to automatically decide which tasks stay local and which escalate to the cloud. Together they sketch what a local-AI platform actually looks like in hardware and software.

06 Jun 2026 · AI Beat Desk

Training the Compression In: Gemma 4 QAT for Mobile

Google released quantization-aware training checkpoints for Gemma 4 with a new mobile-specific format — channel-wise quantization aligned with NPU memory layouts, 2-bit compression for token generation layers, pre-calculated scaling constants — bringing the Gemma 4 E2B text model under 1 GB of memory.
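
For readers unfamiliar with the term, channel-wise quantization can be sketched in a few lines. This is a generic illustration of the idea (one precomputed scale per output channel, symmetric signed integers), not Google's actual checkpoint format; the function names and the example weights are hypothetical.

```python
def quantize_per_channel(weights, bits=4):
    """Symmetric per-channel quantization of a 2-D weight matrix.

    weights: list of rows, one row per output channel (illustrative layout).
    Returns (int_rows, scales) with one precomputed scale per channel.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit, 1 for signed 2-bit
    int_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # Each channel gets its own scale, so a channel with small weights
        # is not crushed by a large-magnitude channel elsewhere in the matrix.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        int_rows.append([max(-qmax, min(qmax, round(w / scale))) for w in row])
        scales.append(scale)
    return int_rows, scales


def dequantize_per_channel(int_rows, scales):
    """Recover approximate float weights from integers and per-channel scales."""
    return [[q * s for q in row] for row, s in zip(int_rows, scales)]


# Hypothetical weights: the second channel is 50x smaller than the first.
weights = [[0.7, -0.3, 0.0], [0.014, 0.002, -0.006]]
q, scales = quantize_per_channel(weights, bits=4)
restored = dequantize_per_channel(q, scales)
```

With a single scale for the whole matrix, every weight in the second channel would round to zero at 4 bits; with per-channel scales both rows use the full integer range.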

05 Jun 2026 · AI Beat Desk

The KV Cache Is More Compressible Than You Think

Two papers published this week attack the KV cache memory bottleneck from opposite directions. One proposes sharing key and value projections at training time, cutting the cache by 50% at a 3.1% perplexity cost. The other quantizes stored cache entries to 4-bit keys and 2-bit values, requires no calibration, and reports throughput above FP16. Together they suggest the cache is far more compressible than inference engineers typically assume.
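
The memory arithmetic behind those two numbers is easy to check. The sketch below is a back-of-the-envelope calculation, not code from either paper: the model shape is hypothetical, shared projections are modeled simply as storing one tensor instead of two, and the quantized figure ignores any per-group scale metadata.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, key_bits=16, value_bits=16):
    """Bytes needed to cache keys and values for one sequence."""
    elems_per_tensor = layers * kv_heads * head_dim * seq_len
    return elems_per_tensor * (key_bits + value_bits) // 8


# Hypothetical model shape, chosen only to make the numbers round.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=8192)

fp16 = kv_cache_bytes(**shape)                            # separate FP16 K and V
shared = kv_cache_bytes(**shape, value_bits=0)            # one shared FP16 tensor
quant = kv_cache_bytes(**shape, key_bits=4, value_bits=2) # 4-bit K, 2-bit V

print(f"FP16 baseline : {fp16 / 2**20:.0f} MiB")
print(f"Shared K/V    : {shared / 2**20:.0f} MiB ({shared / fp16:.1%} of baseline)")
print(f"4-bit K, 2-bit V: {quant / 2**20:.0f} MiB ({quant / fp16:.2%} of baseline)")
```

Under these assumptions the 4-bit/2-bit scheme stores 6 bits per key-value pair instead of 32, which is 18.75% of the FP16 footprint before metadata.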

03 Jun 2026 · AI Beat Desk

AMD's FP8 Problem, and What It Costs

A detailed engineering account of bringing DeepSeek-V4-Flash up on AMD MI300X reveals the real cost of AMD's software ecosystem gaps: FP8 format fragmentation, missing kernels, and HIP graph constraints that each required dedicated engineering effort before getting to 2,700 tokens/s.

05 May 2026 · AI Beat Desk

How OpenAI Ran WebRTC Through Kubernetes

OpenAI published a detailed engineering writeup on how they rebuilt their WebRTC stack for the Realtime API to run on Kubernetes at scale — separating a lightweight UDP relay from the stateful WebRTC transceiver and using the ICE ufrag as a routing hook embedded in standard protocol headers.

19 Apr 2026 · AI Beat Desk

When the Sandbox Shares the GPU's Memory

A blog post published April 18 describes a technique for running LLM inference inside a WebAssembly sandbox at near-native GPU speed on Apple Silicon. By overriding Wasmtime's memory allocator to back Wasm linear memory with a Metal buffer via makeBuffer(bytesNoCopy:), the author collapses the Wasm–GPU boundary entirely: 0.03 MB overhead vs 16.78 MB for the copy approach, ~9 ms/token for Llama 3.2 1B on M1, and KV cache snapshots that restore 5.45× faster than recomputing prefill.

16 Apr 2026 · AI Beat Desk

Your Idle Mac as a Private Inference Node

Eigen Labs — the team behind EigenLayer Ethereum restaking — launched Darkbloom on April 15: a research-preview decentralized inference network that routes AI requests through idle Apple Silicon Macs with cryptographic privacy guarantees. The node operator genuinely cannot read your prompt. The security model is layered and interesting; the economics are aggressive; the project is very early.

09 Apr 2026 · AI Beat Desk

One GPU, One Hundred Billion Parameters

MegaTrain, a new paper from Notre Dame and Lehigh, flips the usual assumption about GPU training: instead of fitting parameters into GPU memory, it keeps everything in CPU RAM and treats the GPU as a transient compute engine. The result is full-precision training of 120B-parameter models on a single H200, 1.84× faster than DeepSpeed ZeRO-3 on 14B models, and 512K-context training on a single GH200.

07 Apr 2026 · AI Beat Desk

The Plumbing Problem: Why Coding Agents Need Real VMs

Freestyle launched today with <50ms VM forking for AI coding agent workloads, built on bare metal they own because cloud margins didn't pencil out. It's a signal that the agent infrastructure layer is serious enough to warrant serious systems work.

03 Apr 2026 · AI Beat Desk

2.77x in Six Months, Same Hardware

MLPerf Inference v6.0 results show NVIDIA achieved a 2.77x throughput improvement on DeepSeek-R1 since the v5.1 results six months ago — on the same B200 hardware. The gains came entirely from software: disaggregated prefill/decode serving, kernel fusion, pipelined execution, and multi-token prediction. Token cost dropped to $0.30/M. It's a useful reminder that the current inference scaling curve has two axes, and software is doing more work than it gets credit for.

31 Mar 2026 · AI Beat Desk

Ollama Switches to MLX and Doubles Decode Speed

Ollama's preview MLX backend replaces direct Metal calls on Apple Silicon with Apple's dedicated ML framework, yielding a 93% decode speedup for Qwen3.5-35B-A3B on M5 chips. The update also adds NVFP4 quantization and a smarter KV cache — including prefix-aware eviction that keeps shared system prompts hot across conversations.

28 Mar 2026 · AI Beat Desk

Fifty Nanoseconds to Decide

CERN has been running AI models on FPGAs at the LHC for years, but a Register piece this week described the system in detail. The Level-1 Trigger filters 40 million collision events per second down to 100,000, making each decision in under 50 nanoseconds with models small enough to fit in precomputed lookup tables. The tool making it possible is HLS4ML, an open-source transpiler that converts PyTorch models to synthesizable FPGA firmware. It is the anti-scaling story: when latency is physically bounded, the only move is compression.
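
Going from 40 million events to 100,000 is a 400:1 reduction, and the lookup-table trick that makes it feasible is simple to illustrate. The sketch below is a software toy, not HLS4ML output or CERN's trigger logic: `tiny_model`, the 8-bit input width, and the input range are all assumptions chosen for illustration. The point is that once inputs are quantized, every possible answer can be computed offline, and inference becomes a single table read.

```python
import math


def tiny_model(x):
    """Stand-in for a small trained model: any pure function of its input."""
    return 1.0 / (1.0 + math.exp(-(3.0 * x - 1.0)))


BITS = 8                 # assumed input precision
LEVELS = 2 ** BITS       # 256 table entries
LO, HI = -4.0, 4.0       # assumed input range


def quantize(x):
    """Map x in [LO, HI] to an integer index in [0, LEVELS - 1]."""
    x = min(HI, max(LO, x))
    return min(LEVELS - 1, int((x - LO) / (HI - LO) * LEVELS))


def bin_center(i):
    """Representative input value for table index i."""
    return LO + (i + 0.5) * (HI - LO) / LEVELS


# Precomputed once, offline. On an FPGA this would live in on-chip memory.
TABLE = [tiny_model(bin_center(i)) for i in range(LEVELS)]


def infer(x):
    """Inference is one index computation and one table read, no arithmetic."""
    return TABLE[quantize(x)]
```

The table grows exponentially with input bits, which is why this only works for very small models and coarse inputs.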

25 Mar 2026 · AI Beat Desk

Arm Bets the Model

Arm's first production AI CPU, Google's TurboQuant, and Hypura's NVMe-first runtime converge on memory bandwidth as the core inference bottleneck.

22 Mar 2026 · AI Beat Desk

AI in the Plumbing

Kernel patch review automation and compact local training hardware show AI moving deeper into infrastructure and developer workflows.