Quantization · AI Beat

15 Jul 2026 · AI Beat Desk

A 27B Model in 3.9 Gigabytes

PrismML released Bonsai 27B on July 14: 1-bit binary and ternary builds of Qwen3.6-27B that fit in 3.9 GB and 5.9 GB and retain over 90% and 95% of full-precision benchmark performance, respectively, while running at 11 tok/s on an iPhone 17 Pro. The compression factor is around 14× versus FP16 for the binary build, and the models are available under Apache 2.0.
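The size figures in this summary can be sanity-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check, not anything from PrismML: it assumes exactly 27 billion parameters, decimal gigabytes (10^9 bytes), an FP16 baseline of 2 bytes per parameter, and that the whole file is weights. The function names are hypothetical.

```python
# Back-of-the-envelope check of the reported sizes.
# Assumptions (not from the source): exactly 27e9 parameters,
# 1 GB = 1e9 bytes, and the whole file consists of weights.

def effective_bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Average number of stored bits per parameter."""
    return (size_gb * 1e9 * 8) / (params_billion * 1e9)


def compression_vs_fp16(size_gb: float, params_billion: float) -> float:
    """Ratio of the FP16 footprint (2 bytes per parameter) to the given size."""
    fp16_gb = params_billion * 2
    return fp16_gb / size_gb


binary_bits = effective_bits_per_weight(3.9, 27)
ternary_bits = effective_bits_per_weight(5.9, 27)
binary_ratio = compression_vs_fp16(3.9, 27)
ternary_ratio = compression_vs_fp16(5.9, 27)

print(f"binary:  {binary_bits:.2f} bits/weight, {binary_ratio:.1f}x vs FP16")
print(f"ternary: {ternary_bits:.2f} bits/weight, {ternary_ratio:.1f}x vs FP16")
```

Under these assumptions the binary build works out to a little over one bit per weight, which is consistent with the "around 14×" figure; the ternary build lands at a lower ratio because its file is larger.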

09 Jun 2026 · AI Beat Desk

A Trillion Parameters at a Thousand Tokens Per Second

Xiaomi and TileRT published MiMo-V2.5-Pro-UltraSpeed on June 8, pushing a one-trillion-parameter model past 1000 tokens per second on a single standard 8-GPU node, with no custom silicon: just three carefully chosen co-design decisions applied to commodity hardware.

06 Jun 2026 · AI Beat Desk

Training the Compression In: Gemma 4 QAT for Mobile

Google released quantization-aware training checkpoints for Gemma 4 with a new mobile-specific format — channel-wise quantization aligned with NPU memory layouts, 2-bit compression for token generation layers, pre-calculated scaling constants — bringing the Gemma 4 E2B text model under 1 GB of memory.

05 Jun 2026 · AI Beat Desk

The KV Cache Is More Compressible Than You Think

Two papers published this week attack the KV cache memory bottleneck from opposite directions: one proposes sharing key and value projections at training time for a 50% cache reduction with 3.1% perplexity cost, the other quantizes stored cache values to 4-bit keys and 2-bit values with no calibration required and throughput above FP16. Together they suggest the cache is far more compressible than inference engineers typically assume.
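To make the two savings concrete, here is a rough per-token cache calculation. It is a sketch under stated assumptions, not either paper's method: the model dimensions (32 layers, 8 KV heads, head dimension 128) are hypothetical, the baseline stores keys and values at 16 bits each, shared projections are modeled as storing one tensor instead of two, and any overhead for quantization scales or zero-points is ignored.

```python
# Rough KV cache size per token under different storage schemes.
# Model dimensions below are hypothetical, chosen only for illustration.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       k_bits: int, v_bits: int) -> float:
    """Bytes of cache written per generated token across all layers."""
    elements = layers * kv_heads * head_dim  # elements in K (same count for V)
    return elements * (k_bits + v_bits) / 8


LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128  # assumed, not from either paper

fp16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, k_bits=16, v_bits=16)
# Shared key/value projection: modeled as storing one 16-bit tensor, not two.
shared = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, k_bits=16, v_bits=0)
# 4-bit keys and 2-bit values, ignoring scale and zero-point overhead.
k4v2 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, k_bits=4, v_bits=2)

print(f"FP16 baseline: {fp16:.0f} bytes/token")
print(f"shared K/V:    {shared:.0f} bytes/token ({shared / fp16:.0%} of baseline)")
print(f"K4/V2 quant:   {k4v2:.0f} bytes/token ({k4v2 / fp16:.2%} of baseline)")
```

The ratios do not depend on the assumed dimensions: sharing halves the cache, while 4-bit keys with 2-bit values store 6 bits where FP16 stores 32. Real implementations carry some metadata overhead that this sketch leaves out.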

01 Jun 2026 · AI Beat Desk

Image Generation at 1 Bit

PrismML's Bonsai Image 4B applies 1-bit and ternary quantization to a FLUX.2 Klein diffusion transformer, compressing it 8.3× to 0.93 GB — small enough to generate images on an iPhone in under 10 seconds. It's the first demonstration that extreme quantization techniques developed for language models transfer cleanly to diffusion architectures.

08 May 2026 · AI Beat Desk

One Model, One Chip, No Framework

Salvatore Sanfilippo (antirez, Redis) released ds4: a single-model Metal inference engine for DeepSeek V4 Flash that deliberately rejects the general-framework approach. Asymmetric 2-bit quantization, applied only to the MoE experts, fits a 280B-parameter model into 128 GB of RAM on Apple Silicon, with 26–36 tok/s generation, a 1M-token context, and a disk-persisted KV cache.

26 Apr 2026 · AI Beat Desk

The Price of Looping a Transformer

Two papers published on April 24 together give the most precise picture yet of looped transformer architectures — where the same block is reused across depth instead of stacking unique layers. The first derives a recurrence-equivalence exponent φ = 0.46 from 116 training runs, showing that looping carries a real compute cost. The second proposes Hyperloop Transformers, adding hyper-connections to partially recover from it, and demonstrates that a 579M Hyperloop model outperforms a standard 1B transformer on perplexity and downstream benchmarks.

01 Apr 2026 · AI Beat Desk

One Bit All the Way Down

PrismML launched Bonsai on March 31, claiming the first commercially viable true 1-bit LLMs: an 8B model that fits in 1.15 GB and runs at 131 tokens/sec on an M4 Pro. The key word is "true" — every layer, including embeddings and attention, is 1-bit, not just the weights in isolation.