Inference · AI Beat

15 Jul 2026 · AI Beat Desk

A 27B Model in 3.9 Gigabytes

PrismML released Bonsai 27B on July 14: 1-bit binary and ternary builds of Qwen3.6-27B that fit in 3.9 GB and 5.9 GB respectively, run at 11 tok/s on an iPhone 17 Pro, and retain over 90% and 95% of full-precision benchmark performance. The binary build is about 14× smaller than an FP16 copy of the model, and both builds are available under Apache 2.0.
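
The size figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes exactly 27e9 parameters and decimal gigabytes (10^9 bytes); both are assumptions, since the summary gives only rounded numbers.

```python
PARAMS = 27e9   # assumed parameter count (rounded from "27B")
GB = 1e9        # assumed decimal gigabytes

def fp16_size_gb(params: float) -> float:
    """Size of the weights at 2 bytes per parameter."""
    return params * 2 / GB

def compression_ratio(params: float, compressed_gb: float) -> float:
    """How many times smaller the compressed file is than FP16."""
    return fp16_size_gb(params) / compressed_gb

def bits_per_weight(params: float, compressed_gb: float) -> float:
    """Average stored bits per parameter, including any metadata in the file."""
    return compressed_gb * GB * 8 / params

binary_ratio = compression_ratio(PARAMS, 3.9)
ternary_ratio = compression_ratio(PARAMS, 5.9)
binary_bits = bits_per_weight(PARAMS, 3.9)
ternary_bits = bits_per_weight(PARAMS, 5.9)
```

Under these assumptions the 3.9 GB build comes out just under 14× smaller than FP16 at a little over one bit per weight, while the 5.9 GB build is closer to 9×, so the 14× figure appears to describe the binary build.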

12 Jul 2026 · AI Beat Desk

The Inference Mesh, No Cloud Required

Mesh LLM, published yesterday on the iroh blog, routes LLM inference across a peer-to-peer mesh with no central coordinator: a request runs locally, goes to a peer that already has the model loaded, or is split by layer range across multiple nodes via the "Skippy" engine. It works well on a LAN and becomes impractical across the internet, for a predictable reason.

10 Jul 2026 · AI Beat Desk

Streaming 744 Billion Parameters from Disk

Colibri, a ~1300-line pure-C engine posted on Hacker News overnight, runs the 744B GLM-5.2 MoE on a consumer machine with 25 GB of RAM by streaming routed experts from NVMe on demand. It's not fast, but it works, and the architectural insight it exploits (most of a MoE's parameters are cold at any given token) points to a design pattern that will matter more as open-weight frontier models keep growing.
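
The streaming pattern can be illustrated with a small cache that keeps a bounded number of experts in memory and loads the rest on demand. This is a minimal sketch, not Colibri's code: the least-recently-used eviction policy, the `ExpertCache` class, and the `load_fn` callback standing in for an NVMe read are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Callable, Hashable

class ExpertCache:
    """Keep at most `capacity` experts in RAM; load the rest on demand."""

    def __init__(self, capacity: int, load_fn: Callable[[Hashable], object]):
        self.capacity = capacity
        self.load_fn = load_fn      # hypothetical: reads one expert's weights from disk
        self._cache: OrderedDict = OrderedDict()
        self.loads = 0              # number of simulated disk reads

    def get(self, expert_id: Hashable) -> object:
        if expert_id in self._cache:
            self._cache.move_to_end(expert_id)   # mark as most recently used
            return self._cache[expert_id]
        weights = self.load_fn(expert_id)        # cache miss: stream from storage
        self.loads += 1
        self._cache[expert_id] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)      # evict the least recently used expert
        return weights

# Simulate a router that asks for experts 0, 1, 0, 2, 1 with room for two.
cache = ExpertCache(capacity=2, load_fn=lambda i: f"weights-{i}")
for expert_id in [0, 1, 0, 2, 1]:
    cache.get(expert_id)
resident = list(cache._cache)
```

Repeated requests for a hot expert cost nothing extra, while cold experts pay a storage read each time they return, which is why throughput in this style of engine depends heavily on how skewed expert usage is.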

05 Jul 2026 · AI Beat Desk

The Model That Passed as Anonymous

Meituan's LongCat-2.0 — a 1.6T-parameter open-weight MoE trained entirely on domestic Chinese ASICs — spent two months deployed anonymously on OpenRouter as "Owl Alpha," quietly reaching #1 on Hermes Agent and #2 on Claude Code before the company claimed it. The reveal is technically notable, but the verification gaps are worth keeping in view.

28 Jun 2026 · AI Beat Desk

DeepSeek Ships Speculative Decoding to Production and Open-Sources the Whole Stack

DeepSeek released DSpark on June 27 — a semi-parallel speculative decoding framework already running in production for DeepSeek-V4 — alongside DeepSpec, an MIT-licensed toolkit packaging three drafting algorithms with complete training and evaluation pipelines. Together they let anyone train a custom draft model for their own target LLM, not just the models DeepSeek ships.

25 Jun 2026 · AI Beat Desk

Mojo Goes to Qualcomm

Qualcomm agreed to acquire Modular for approximately $3.9 billion on June 24. Modular makes Mojo (a Python-superset systems language) and MAX (a hardware-agnostic inference engine). The deal is a bet that AI inference will fracture across hardware vendors, and whoever owns the abstraction layer wins.

24 Jun 2026 · AI Beat Desk

2.5 Million Parameters Beats Gboard

FUTO released the models behind their swipe keyboard — a three-component stack totalling 2.5 million parameters that achieves 26% fewer errors than Gboard on their benchmark. It trains on one workstation GPU, runs on low-end Android devices in milliseconds, and is the first freely licensed open swipe-typing model. It's a reminder that model scale is a tool, not an objective.

12 Jun 2026 · AI Beat Desk

Text Diffusion Reaches Consumer Hardware

Google's DiffusionGemma 26B-A4B is a discrete text diffusion model that generates tokens in parallel blocks rather than left-to-right, hitting 1100+ tokens/sec on a single H100 and fitting in 18 GB of VRAM quantized. It's open under Apache 2.0 and marks the first time a production-quality diffusion LM from a major lab lands on consumer hardware — with real benchmark results showing what you trade away for that speed.

10 Jun 2026 · AI Beat Desk

OpenCV Turns 25 and Learns to Run LLMs

OpenCV 5.0 ships a ground-up rewrite of its DNN engine: ONNX operator coverage jumps from 22% to 80%+, and native LLM/VLM support lands in a library already deployed across embedded systems, medical devices, and industrial hardware that can't run PyTorch.

09 Jun 2026 · AI Beat Desk

A Trillion Parameters at a Thousand Tokens Per Second

Xiaomi and TileRT published MiMo-V2.5-Pro-UltraSpeed on June 8, pushing a one-trillion-parameter model past 1000 tokens per second on a single standard 8-GPU node — no custom silicon, just three carefully chosen co-design decisions applied to a commodity cluster.

08 Jun 2026 · AI Beat Desk

CUDA Comes to Your Laptop

NVIDIA's RTX Spark puts a Blackwell GPU and full CUDA stack inside a laptop SoC — enough to run a 120B-parameter model locally with 1M-token context. At roughly the same moment, Perplexity shipped a hybrid inference orchestrator that uses a compact on-device model to automatically decide which tasks stay local and which escalate to the cloud. Together they sketch what a local-AI platform actually looks like in hardware and software.

05 Jun 2026 · AI Beat Desk

The KV Cache Is More Compressible Than You Think

Two papers published this week attack the KV cache memory bottleneck from opposite directions. One proposes sharing key and value projections at training time, halving the cache at a 3.1% perplexity cost; the other quantizes stored cache values to 4-bit keys and 2-bit values with no calibration required and throughput above FP16. Together they suggest the cache is far more compressible than inference engineers typically assume.
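
To make the quantization side concrete, here is a minimal sketch of calibration-free uniform quantization. It uses plain min-max round-to-nearest, which is a generic textbook scheme chosen for illustration and not necessarily what the paper does; the sample values and function names are hypothetical, and the size estimate ignores the storage needed for scales and offsets.

```python
def quantize(values, bits):
    """Uniform min-max quantization to `bits` bits; returns (codes, scale, lo)."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map integer codes back to approximate floating-point values."""
    return [lo + c * scale for c in codes]

def cache_bits_per_pair(key_bits, value_bits):
    """Bits stored for one key element plus one value element."""
    return key_bits + value_bits

keys = [-1.0, -0.2, 0.1, 0.7, 1.0]            # made-up sample values
codes, scale, lo = quantize(keys, bits=4)
restored = dequantize(codes, scale, lo)
max_err = max(abs(a - b) for a, b in zip(keys, restored))

# FP16 keys and values versus 4-bit keys and 2-bit values.
reduction = cache_bits_per_pair(16, 16) / cache_bits_per_pair(4, 2)
```

The reconstruction error of each element is bounded by half a quantization step, and the raw payload shrinks by a little over 5× before metadata overhead.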

05 Jun 2026 · AI Beat Desk

Magenta RealTime 2 Is Actually an Instrument Now

Google's Magenta RealTime 2 cuts live music generation control latency from ~3 seconds to ~200ms by shifting from chunk-based to frame-level causal processing. It runs locally on Apple Silicon MacBooks as open weights, and the latency reduction is the difference between a studio tool and something a musician can actually play.

03 Jun 2026 · AI Beat Desk

AMD's FP8 Problem, and What It Costs

A detailed engineering account of bringing DeepSeek-V4-Flash up on AMD MI300X reveals the real cost of AMD's software ecosystem gaps: FP8 format fragmentation, missing kernels, and HIP graph constraints that each required dedicated engineering effort before getting to 2,700 tokens/s.

01 Jun 2026 · AI Beat Desk

Image Generation at 1 Bit

PrismML's Bonsai Image 4B applies 1-bit and ternary quantization to a FLUX.2 Klein diffusion transformer, compressing it 8.3× to 0.93 GB — small enough to generate images on an iPhone in under 10 seconds. It's the first demonstration that extreme quantization techniques developed for language models transfer cleanly to diffusion architectures.

30 May 2026 · AI Beat Desk

Liquid AI's LFM2.5: When Half Your Layers Aren't Attention

Liquid AI ships LFM2.5-8B-A1B, a hybrid model trained on 38T tokens in which 18 of 24 layers are gated convolution blocks rather than attention. It reaches 253 tokens/second on an M5 Max CPU with under 6 GB of memory.

23 May 2026 · AI Beat Desk

Cheaper Per Token, More Expensive Overall

Token prices are falling fast, but enterprise AI bills are rising. Uber burned through its entire 2026 AI coding budget in four months, driven by Claude Code adoption. Goldman Sachs projects a 24× increase in token consumption by 2030. The Jevons paradox shows up again: efficiency gains don't reduce consumption; they expand it.

22 May 2026 · AI Beat Desk

The Rest of the Transformer, Fused

CODA, a new paper from Tri Dao and colleagues, extends FlashAttention's core insight — keep data on-chip, avoid DRAM round-trips — to all the non-attention operations in a transformer block. Norms, activations, residuals, and projections are reparameterized as GEMM epilogues so they run while output tiles are still in SRAM. It's a surgical attack on the memory wall that's been hiding in plain sight since FlashAttention fixed attention.

17 May 2026 · AI Beat Desk

Sixty-Four Cells of Memory

δ-mem augments a frozen full-attention LLM with an 8×8 associative memory state updated by delta-rule learning, applying low-rank corrections to attention at inference time — no fine-tuning required. It reaches 1.31× gains on memory-heavy benchmarks and 1.20× on long-conversation tasks.

17 May 2026 · AI Beat Desk

One Minute of 720p World on One GPU

NVIDIA's SANA-WM generates 60-second, 720p video from a single image and a camera trajectory — on a single GPU. The open-source 2.6B-parameter model achieves 36× higher throughput than prior open-source world models and ships under Apache 2.0.

16 May 2026 · AI Beat Desk

Speculative Decoding Has an Acceptance Problem You Can Exploit

Mistletoe (arXiv 2605.14005) demonstrates a stealthy adversarial attack on speculative decoding systems: craft inputs that look normal to the target model but cause the draft model to disagree, collapsing acceptance length and throughput while leaving output quality and perplexity unchanged. The attack exploits the fundamental gap between draft and target distributions that all speculative systems rely on bridging.
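
Why acceptance length matters for throughput can be shown with a simplified greedy-verification sketch. This is not the attack itself, and production systems typically verify with rejection sampling over probability distributions rather than exact token matches; the token IDs and function names below are invented for illustration.

```python
def accepted_length(draft_tokens, target_tokens):
    """Number of leading draft tokens that match the target's own greedy choices."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n

def tokens_per_target_pass(draft_tokens, target_tokens):
    """Accepted draft tokens plus the one token the target supplies itself."""
    return accepted_length(draft_tokens, target_tokens) + 1

target = [5, 9, 2, 7]           # what the target model would emit (hypothetical)
benign_draft = [5, 9, 2, 4]     # draft agrees on the first three tokens
attacked_draft = [8, 9, 2, 7]   # draft disagrees immediately

benign_yield = tokens_per_target_pass(benign_draft, target)
attacked_yield = tokens_per_target_pass(attacked_draft, target)
```

In both cases the emitted text is whatever the target would have produced, so quality is unchanged; only the number of tokens gained per expensive target pass drops.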

16 May 2026 · AI Beat Desk

The Draft Model You Don't Have to Train

Orthrus (arXiv 2605.12825) grafts a trainable diffusion head onto a frozen AR backbone, sharing the exact same KV cache. An intra-model consensus mechanism guarantees that every accepted token matches the AR distribution exactly — no approximation, no quality tradeoff — while achieving up to 7.8× speedup on Qwen3-8B with only O(1) memory overhead. The approach sidesteps the core operational cost of speculative decoding: maintaining a separate, carefully calibrated draft model.

13 May 2026 · AI Beat Desk

Needle: What a 26M-Parameter Model Says About Tool Calling

Cactus Compute released Needle, a 26M-parameter MIT-licensed model for on-device function calling that strips out all feed-forward networks from the transformer. The architectural choice is a thesis: tool calling is retrieval-and-routing, not reasoning, and attention is the right primitive for it. The numbers are striking — 6000 tok/s prefill on consumer hardware — even if the playground has rough edges.

12 May 2026 · AI Beat Desk

NVIDIA's cuda-oxide Wants GPU Kernels Written in Rust

NVIDIA's NVlabs released cuda-oxide v0.1.0 on May 7, an experimental compiler that takes standard Rust and emits NVIDIA PTX directly — no CUDA C++, no DSLs, no foreign language bindings. The pipeline goes through a custom rustc codegen backend and a Rust-native MLIR-like IR called Pliron. Alpha-stage and Linux-only, but it signals where NVIDIA thinks GPU kernel development might eventually land.

10 May 2026 · AI Beat Desk

The Serving Stack Writes Itself

A University of Washington paper shows a multi-agent loop that generates complete LLM serving systems end-to-end. On standard workloads it matches vLLM; on six specialized scenarios — hybrid architectures, streaming ASR, constrained decoding, multimodal pipelines — it beats it by 1.7× to nearly 6×. The paper surfaces a practical claim: the general-purpose serving stack is a compromise, and specialization can be automated.

08 May 2026 · AI Beat Desk

One Model, One Chip, No Framework

Salvatore Sanfilippo (antirez, Redis) released ds4: a single-model Metal inference engine for DeepSeek V4 Flash that deliberately rejects the general-framework approach. Asymmetric 2-bit quantization, applied only to the MoE experts, fits a 280B-parameter model into 128 GB of RAM with 26–36 t/s generation, 1M-token context, and a disk-persisted KV cache on Apple Silicon.

06 May 2026 · AI Beat Desk

Gemma 4 Gets Speculative Decoding That Ships

Google ships multi-token prediction draft models for the full Gemma 4 family under Apache 2.0, reporting up to 3x throughput gains. The architecture is tightly coupled — shared embeddings, last-layer activations — which keeps the drafter accurate but limits reuse. MoE variants complicate the picture.

21 Apr 2026 · AI Beat Desk

Open Weights at One Trillion

Moonshot AI ships Kimi K2.6, a 1T-parameter open-source MoE with a 256K context window and swarm support, and simultaneously releases a test suite to verify that inference providers are actually running it correctly. The same day, Alibaba closes off Qwen3.6-Max. Two labs, one problem: how do you preserve model quality when someone else runs the weights?

19 Apr 2026 · AI Beat Desk

When the Sandbox Shares the GPU's Memory

A blog post published April 18 describes a technique for running LLM inference inside a WebAssembly sandbox at near-native GPU speed on Apple Silicon. By overriding Wasmtime's memory allocator to back Wasm linear memory with a Metal buffer via makeBuffer(bytesNoCopy:), the author collapses the Wasm–GPU boundary entirely: 0.03 MB overhead vs 16.78 MB for the copy approach, ~9 ms/token for Llama 3.2 1B on M1, and KV cache snapshots that restore 5.45× faster than recomputing prefill.

17 Apr 2026 · AI Beat Desk

Qwen3.6 Fits in a Laptop and Ships a Novel Architecture

Qwen3.6-35B-A3B landed on April 16 under Apache 2.0 — 35 billion total parameters, 3 billion active per token, and a hybrid architecture that alternates Gated DeltaNet linear attention with standard attention blocks. It runs on a laptop, scores 73.4 on SWE-bench Verified, and the architecture is more interesting than the benchmark numbers alone suggest.

16 Apr 2026 · AI Beat Desk

Your Idle Mac as a Private Inference Node

Eigen Labs — the team behind EigenLayer Ethereum restaking — launched Darkbloom on April 15: a research-preview decentralized inference network that routes AI requests through idle Apple Silicon Macs with cryptographic privacy guarantees. The node operator genuinely cannot read your prompt. The security model is layered and interesting; the economics are aggressive; the project is very early.

15 Apr 2026 · AI Beat Desk

Diffusion LMs Finally Close the Quality Gap

A new paper from a mix of academic and industry researchers identifies why diffusion language models consistently trail their autoregressive counterparts despite strong theoretical properties: they don't agree with what they generate. The proposed fix — Introspective Strided Decoding — lets an 8B DLM match same-scale AR quality while running 2.9–4.1x faster at high concurrency.

06 Apr 2026 · AI Beat Desk

The First Guess Is Usually Right

A new preprint identifies a consistent pattern in large reasoning models: the first generated solution outperforms later alternatives, and continued reasoning can actively degrade accuracy. The proposed fix, called RED, improves performance by up to 19% while cutting token usage by 37–70% versus competitive baselines. It's a useful challenge to the assumption that more inference compute is always better.

03 Apr 2026 · AI Beat Desk

Microsoft Starts Building Its Own

Microsoft released three foundational AI models through Azure AI Foundry on April 2: MAI-Transcribe-1 for speech, MAI-Voice-1 for synthesis, and MAI-Image-2 for generation. These are Microsoft's first internally built foundational models — a quiet but significant signal that the company wants more control over its AI stack than the OpenAI partnership alone provides.

03 Apr 2026 · AI Beat Desk

2.77x in Six Months, Same Hardware

MLPerf Inference v6.0 results show NVIDIA achieved a 2.77x throughput improvement on DeepSeek-R1 since the v5.1 results six months ago, on the same B200 hardware. The gains came entirely from software: disaggregated prefill/decode serving, kernel fusion, pipelined execution, and multi-token prediction. Token cost dropped to $0.30 per million tokens. It's a useful reminder that the current inference scaling curve has two axes, and software is doing more work than it gets credit for.

02 Apr 2026 · AI Beat Desk

Thirty People, Four Hundred Billion Parameters

Arcee AI released Trinity Large Thinking on April 1 — the reasoning-optimized variant of their 400B sparse MoE, trained by a 30-person startup on 2,048 Nvidia B300 GPUs. It ranks #2 on PinchBench for agentic tasks at roughly 96% lower cost than the top model, under Apache 2.0. The architecture — 256 experts with 4 active per token — is worth understanding.
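
The sparsity pattern of 256 experts with 4 active per token can be sketched as top-k routing. The softmax renormalization over the selected experts and the random router logits are assumptions for illustration; the summary does not describe Trinity's actual router.

```python
import math
import random

NUM_EXPERTS = 256
TOP_K = 4

def route(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)          # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # made-up router scores
selected = route(logits)                 # list of (expert_index, mixing_weight)
active_fraction = TOP_K / NUM_EXPERTS    # share of experts touched per token
```

Each token touches only 4 of 256 experts, about 1.6% of the expert pool, which is what lets total parameter count grow far faster than per-token compute.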

31 Mar 2026 · AI Beat Desk

Ollama Switches to MLX and Nearly Doubles Decode Speed

Ollama's preview MLX backend replaces direct Metal calls on Apple Silicon with Apple's dedicated ML framework, yielding a 93% decode speedup for Qwen3.5-35B-A3B on M5 chips. The update also adds NVFP4 quantization and a smarter KV cache — including prefix-aware eviction that keeps shared system prompts hot across conversations.

28 Mar 2026 · AI Beat Desk

Fifty Nanoseconds to Decide

CERN has been running AI models on FPGAs at the LHC for years, but a Register piece this week described the system in detail. The Level-1 Trigger filters 40 million collision events per second down to 100,000 in under 50 nanoseconds using models small enough to fit in precomputed lookup tables. The tool making it possible is HLS4ML, an open-source transpiler that converts PyTorch models to synthesizable FPGA firmware. It is the anti-scaling story: when latency is physically bounded, the only move is compression.
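
The lookup-table idea can be illustrated in a few lines: if the inputs are quantized coarsely enough, every possible output can be computed ahead of time and inference becomes a single indexed read. This is a toy sketch, not HLS4ML output; the 4-bit input width and the `tiny_model` decision rule are hypothetical stand-ins for a trained network.

```python
BITS = 4                 # assumed input precision per feature
LEVELS = 1 << BITS       # 16 possible values per feature

def tiny_model(a: int, b: int) -> int:
    """Stand-in for a small trained network over two quantized inputs."""
    return 1 if 3 * a - 2 * b + 5 > 20 else 0

# Precompute every possible output once, offline.
TABLE = [tiny_model(a, b) for a in range(LEVELS) for b in range(LEVELS)]

def lut_infer(a: int, b: int) -> int:
    """Inference is a single table read, with no arithmetic at decision time."""
    return TABLE[a * LEVELS + b]

# Rate reduction quoted above: 40 million events per second down to 100,000.
keep_ratio = 40_000_000 // 100_000
```

The table grows exponentially with input width (two 4-bit features need 256 entries), which is why the approach only works for very small, aggressively quantized models.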

25 Mar 2026 · AI Beat Desk

Arm Bets the Model

Arm's first production AI CPU, Google's TurboQuant, and Hypura's NVMe-first runtime converge on memory bandwidth as the core inference bottleneck.

23 Mar 2026 · AI Beat Desk

397 Billion Parameters, One Laptop

Flash-MoE shows how SSD-streamed experts let a 397B-parameter MoE run locally on consumer Apple Silicon hardware.