Liquid AI ships LFM2.5-8B-A1B, a 38T-token trained hybrid model where 18 of 24 layers are gated convolution blocks rather than attention — and it reaches 253 tokens/second on an M5 Max CPU with under 6 GB of memory.
Token prices are falling fast, but enterprise AI bills are rising. Uber burned through its entire 2026 AI coding budget in four months driven by Claude Code adoption. Goldman Sachs projects a 24× increase in token consumption by 2030. The Jevons paradox shows up again: efficiency gains don't reduce consumption — they expand it.
CODA, a new paper from Tri Dao and colleagues, extends FlashAttention's core insight — keep data on-chip, avoid DRAM round-trips — to all the non-attention operations in a transformer block. Norms, activations, residuals, and projections are reparameterized as GEMM epilogues so they run while output tiles are still in SRAM. It's a surgical attack on the memory wall that's been hiding in plain sight since FlashAttention fixed attention.
δ-mem augments a frozen full-attention LLM with an 8×8 associative memory state updated by delta-rule learning, applying low-rank corrections to attention at inference time — no fine-tuning required. It reaches 1.31× gains on memory-heavy benchmarks and 1.20× on long-conversation tasks.
NVIDIA's SANA-WM generates 60-second, 720p video from a single image and a camera trajectory — on a single GPU. The open-source 2.6B-parameter model achieves 36× higher throughput than prior open-source world models and ships under Apache 2.0.
Mistletoe (arXiv 2605.14005) demonstrates a stealthy adversarial attack on speculative decoding systems: craft inputs that look normal to the target model but cause the draft model to disagree, collapsing acceptance length and throughput while leaving output quality and perplexity unchanged. The attack exploits the fundamental gap between draft and target distributions that all speculative systems rely on bridging.
Orthrus (arXiv 2605.12825) grafts a trainable diffusion head onto a frozen AR backbone, sharing the exact same KV cache. An intra-model consensus mechanism guarantees that every accepted token matches the AR distribution exactly — no approximation, no quality tradeoff — while achieving up to 7.8× speedup on Qwen3-8B with only O(1) memory overhead. The approach sidesteps the core operational cost of speculative decoding: maintaining a separate, carefully calibrated draft model.
Cactus Compute released Needle, a 26M-parameter MIT-licensed model for on-device function calling that strips out all feed-forward networks from the transformer. The architectural choice is a thesis: tool calling is retrieval-and-routing, not reasoning, and attention is the right primitive for it. The numbers are striking — 6000 tok/s prefill on consumer hardware — even if the playground has rough edges.
NVIDIA's NVlabs released cuda-oxide v0.1.0 on May 7, an experimental compiler that takes standard Rust and emits NVIDIA PTX directly — no CUDA C++, no DSLs, no foreign language bindings. The pipeline goes through a custom rustc codegen backend and a Rust-native MLIR-like IR called Pliron. Alpha-stage and Linux-only, but it signals where NVIDIA thinks GPU kernel development might eventually land.
A University of Washington paper shows a multi-agent loop that generates complete LLM serving systems end-to-end. On standard workloads it matches vLLM; on six specialized scenarios — hybrid architectures, streaming ASR, constrained decoding, multimodal pipelines — it beats it by 1.7× to nearly 6×. The paper surfaces a practical claim: the general-purpose serving stack is a compromise, and specialization can be automated.
Salvatore Sanfilippo (antirez, Redis) released ds4: a single-model Metal inference engine for DeepSeek V4 Flash that deliberately rejects the general-framework approach. Asymmetric 2-bit quantization on MoE experts only gets a 280B-parameter model into 128 GB RAM with 26–36 t/s generation, 1M-token context, and disk-persisted KV cache on Apple Silicon.
Google ships multi-token prediction draft models for the full Gemma 4 family under Apache 2.0, reporting up to 3x throughput gains. The architecture is tightly coupled — shared embeddings, last-layer activations — which keeps the drafter accurate but limits reuse. MoE variants complicate the picture.
Moonshot AI ships Kimi K2.6 — 1T-parameter open-source MoE with a 256K context window and swarm support — and simultaneously releases a test suite to verify that inference providers are actually running it correctly. The same day, Alibaba closes off Qwen3.6-Max. Two labs, one problem: how do you preserve model quality when someone else runs the weights?
A blog post published April 18 describes a technique for running LLM inference inside a WebAssembly sandbox at near-native GPU speed on Apple Silicon. By overriding Wasmtime's memory allocator to back Wasm linear memory with a Metal buffer via makeBuffer(bytesNoCopy:), the author collapses the Wasm–GPU boundary entirely: 0.03 MB overhead vs 16.78 MB for the copy approach, ~9 ms/token for Llama 3.2 1B on M1, and KV cache snapshots that restore 5.45× faster than recomputing prefill.
Qwen3.6-35B-A3B landed on April 16 under Apache 2.0 — 35 billion total parameters, 3 billion active per token, and a hybrid architecture that alternates Gated DeltaNet linear attention with standard attention blocks. It runs on a laptop, scores 73.4 on SWE-bench Verified, and the architecture is more interesting than the benchmark numbers alone suggest.
Eigen Labs — the team behind EigenLayer Ethereum restaking — launched Darkbloom on April 15: a research-preview decentralized inference network that routes AI requests through idle Apple Silicon Macs with cryptographic privacy guarantees. The node operator genuinely cannot read your prompt. The security model is layered and interesting; the economics are aggressive; the project is very early.
A new paper from a mix of academic and industry researchers identifies why diffusion language models consistently trail their autoregressive counterparts despite strong theoretical properties: they don't agree with what they generate. The proposed fix — Introspective Strided Decoding — lets an 8B DLM match same-scale AR quality while running 2.9–4.1x faster at high concurrency.
A new preprint identifies a consistent pattern in large reasoning models: the first generated solution outperforms later alternatives, and continued reasoning can actively degrade accuracy. The proposed fix, called RED, improves performance by up to 19% while cutting token usage by 37–70% versus competitive baselines. It's a useful challenge to the assumption that more inference compute is always better.
Microsoft released three foundational AI models through Azure AI Foundry on April 2: MAI-Transcribe-1 for speech, MAI-Voice-1 for synthesis, and MAI-Image-2 for generation. These are Microsoft's first internally built foundational models — a quiet but significant signal that the company wants more control over its AI stack than the OpenAI partnership alone provides.
MLPerf Inference v6.0 results show NVIDIA achieved a 2.77x throughput improvement on DeepSeek-R1 since the v5.1 results six months ago — on the same B200 hardware. The gains came entirely from software: disaggregated prefill/decode serving, kernel fusion, pipelined execution, and multi-token prediction. Token cost dropped to $0.30/M. It's a useful reminder that the current inference scaling curve has two axes, and software is doing more work than it gets credit for.
Arcee AI released Trinity Large Thinking on April 1 — the reasoning-optimized variant of their 400B sparse MoE, trained by a 30-person startup on 2,048 Nvidia B300 GPUs. It ranks #2 on PinchBench for agentic tasks at roughly 96% lower cost than the top model, under Apache 2.0. The architecture — 256 experts with 4 active per token — is worth understanding.
Ollama's preview MLX backend replaces direct Metal calls on Apple Silicon with Apple's dedicated ML framework, yielding a 93% decode speedup for Qwen3.5-35B-A3B on M5 chips. The update also adds NVFP4 quantization and a smarter KV cache — including prefix-aware eviction that keeps shared system prompts hot across conversations.
CERN has been running AI models on FPGAs at the LHC for years, but a Register piece this week described the system in detail. The Level-1 Trigger filters 40 million collision events per second down to 100,000 in under 50 nanoseconds using models small enough to fit in precomputed lookup tables. The tool making it possible is HLS4ML, an open-source transpiler that converts PyTorch models to synthesizable FPGA firmware. It is the anti-scaling story: when latency is physically bounded, the only move is compression.