Efficiency · AI Beat

03 Jul 2026 · AI Beat Desk

RL Post-Training Lives in the Middle

A new paper finds that reinforcement learning gains in transformers concentrate almost entirely in a narrow band of middle layers. Training just one layer at roughly 40–60% network depth can match or exceed full-parameter RL fine-tuning. The finding challenges the assumption that all layers participate equally in post-training and has practical implications for compute-efficient alignment.

23 Jun 2026 · AI Beat Desk

Give Early Layers More

A paper submitted yesterday finds that reducing MLP width monotonically from early to late transformer layers — using a cosine schedule — consistently improves performance across three scales and four architectures at zero additional cost. Later layers refine the residual stream rather than transform it, so the standard uniform allocation gives too much capacity to the wrong end of the network.
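As a rough illustration of what a cosine width schedule could look like, the sketch below assigns each layer an MLP width that falls along a half-cosine from the first layer to the last. The blurb does not give the paper's formula, so everything here is an assumption: the function name `cosine_mlp_widths`, the endpoint scales of 1.5× and 0.5×, and the rounding to a multiple of 64 are all hypothetical. The endpoints are chosen symmetric around 1.0 so that the total width stays close to a uniform allocation, which is one possible reading of "zero additional cost".

```python
import math


def cosine_mlp_widths(num_layers, base_width, max_scale=1.5, min_scale=0.5, multiple_of=64):
    """Return per-layer MLP widths that shrink from early to late layers.

    Hypothetical sketch: the scales and rounding are illustrative assumptions,
    not values taken from the paper.
    """
    widths = []
    for i in range(num_layers):
        # Depth fraction in [0, 1]: 0 for the first layer, 1 for the last.
        t = i / (num_layers - 1) if num_layers > 1 else 0.0
        # Half-cosine: max_scale at t=0, min_scale at t=1, monotonically decreasing.
        scale = min_scale + 0.5 * (max_scale - min_scale) * (1 + math.cos(math.pi * t))
        # Round to a hardware-friendly multiple (assumed convention).
        width = round(scale * base_width / multiple_of) * multiple_of
        widths.append(max(multiple_of, width))
    return widths


widths = cosine_mlp_widths(num_layers=12, base_width=4096)
uniform_total = 12 * 4096
print(widths)
print(sum(widths), "vs uniform", uniform_total)
```

With symmetric endpoints the early layers get up to 1.5× the uniform width and the last layer gets 0.5×, while the summed width stays near the uniform budget. A different choice of endpoints would shift that budget.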

23 Jun 2026 · AI Beat Desk

The Inpainting Model That Skipped the Attention

HUST's Moebius (0.22B) matches FLUX.1-Fill-Dev (11.9B) on six image inpainting benchmarks at 15× the inference speed. Two mechanisms make it work: Local-λ Mix Interaction blocks that replace quadratic spatial attention with fixed-size linear matrices, and adaptive multi-granularity latent-space distillation. For inpainting specifically, attention overhead appears to be the actual bottleneck — not parameter count. Weights are out.

06 Jun 2026 · AI Beat Desk

Training the Compression In: Gemma 4 QAT for Mobile

Google released quantization-aware training checkpoints for Gemma 4 with a new mobile-specific format — channel-wise quantization aligned with NPU memory layouts, 2-bit compression for token generation layers, pre-calculated scaling constants — bringing the Gemma 4 E2B text model under 1 GB of memory.

05 Jun 2026 · AI Beat Desk

The KV Cache Is More Compressible Than You Think

Two papers published this week attack the KV cache memory bottleneck from opposite directions. One shares key and value projections at training time, halving the cache at a 3.1% perplexity cost. The other quantizes the stored cache to 4-bit keys and 2-bit values, requires no calibration, and reports throughput above FP16. Together they suggest the cache is far more compressible than inference engineers typically assume.
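To put the two reductions in concrete terms, here is a back-of-the-envelope memory calculation. It is a sketch under stated assumptions, not either paper's accounting: the model shape (32 layers, 8 KV heads, head dimension 128, 8,192 tokens) is a hypothetical example, `kv_cache_bytes` is an illustrative helper, shared projections are modeled as storing one tensor instead of two, and the quantized figure ignores any per-group scale or zero-point metadata.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   key_bits=16, value_bits=16, shared_kv=False):
    """Approximate KV cache size in bytes for one sequence.

    Assumption: with shared_kv=True a single stored tensor serves as both
    keys and values, so only key_bits are counted per element.
    """
    elems = num_layers * num_kv_heads * head_dim * seq_len
    bits_per_elem = key_bits if shared_kv else key_bits + value_bits
    return elems * bits_per_elem / 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

baseline = kv_cache_bytes(**shape)                             # FP16 keys and values
shared = kv_cache_bytes(**shape, shared_kv=True)               # one shared K/V tensor
quantized = kv_cache_bytes(**shape, key_bits=4, value_bits=2)  # 4-bit K, 2-bit V

gib = 1024 ** 3
print(f"FP16 baseline: {baseline / gib:.4f} GiB")
print(f"Shared K/V:    {shared / gib:.4f} GiB ({shared / baseline:.0%} of baseline)")
print(f"4-bit/2-bit:   {quantized / gib:.4f} GiB ({quantized / baseline:.2%} of baseline)")
```

Under these assumptions the quantized cache uses 6 bits per key/value pair instead of 32, so the raw payload shrinks to 18.75% of the FP16 baseline. Real implementations store some quantization metadata, so the practical saving would be somewhat smaller.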

16 May 2026 · AI Beat Desk

The Draft Model You Don't Have to Train

Orthrus (arXiv 2605.12825) grafts a trainable diffusion head onto a frozen AR backbone, sharing the exact same KV cache. An intra-model consensus mechanism guarantees that every accepted token matches the AR distribution exactly — no approximation, no quality tradeoff — while achieving up to 7.8× speedup on Qwen3-8B with only O(1) memory overhead. The approach sidesteps the core operational cost of speculative decoding: maintaining a separate, carefully calibrated draft model.

26 Apr 2026 · AI Beat Desk

The Price of Looping a Transformer

Two papers published on April 24 together give the most precise picture yet of looped transformer architectures, in which the same block is reused across depth instead of stacking unique layers. The first derives a recurrence-equivalence exponent φ = 0.46 from 116 training runs, showing that looping carries a real compute cost. The second proposes Hyperloop Transformers, which add hyper-connections to partially recover that cost, and demonstrates that a 579M Hyperloop model outperforms a standard 1B transformer on perplexity and downstream benchmarks.

09 Apr 2026 · AI Beat Desk

One GPU, One Hundred Billion Parameters

MegaTrain, a new paper from Notre Dame and Lehigh, flips the usual assumption about GPU training: instead of fitting parameters into GPU memory, it keeps everything in CPU RAM and treats the GPU as a transient compute engine. The result is full-precision training of 120B-parameter models on a single H200, 1.84× faster than DeepSpeed ZeRO-3 on 14B models, and 512K-context training on a single GH200.

06 Apr 2026 · AI Beat Desk

The First Guess Is Usually Right

A new preprint identifies a consistent pattern in large reasoning models: the first generated solution outperforms later alternatives, and continued reasoning can actively degrade accuracy. The proposed fix, called RED, improves performance by up to 19% while cutting token usage by 37–70% versus competitive baselines. It's a useful challenge to the assumption that more inference compute is always better.