Training the Compression In: Gemma 4 QAT for Mobile

When Google released Gemma 4 12B last week, the architecture change got the attention: eliminating separate vision and audio encoders in favor of feeding raw patches and waveforms directly into the LLM backbone. The deployment question — how small can this actually run? — got a partial answer this week with the release of QAT checkpoints.

The distinction between quantization-aware training and standard post-training quantization matters more than the terminology suggests. In post-training quantization, you train a full-precision model, then compress it afterward, accepting whatever quality degradation the weight rounding introduces. The model was never asked to be robust to compression; it just happens to tolerate it to some degree. QAT inserts simulated quantization into the forward pass during training — the model’s gradients propagate through quantization-corrupted activations, so the weights that result are learned to function under the conditions they’ll actually face at inference time. The quality gap between QAT and PTQ is real, and it widens as bit depth drops, which is what makes QAT necessary for aggressive mobile targets.

Google released two formats. The Q4_0 format targets laptops and consumer GPU deployments where 4-bit weights are the standard tradeoff point. The more interesting piece is the new mobile-specific format, designed around the actual hardware constraints of phone NPUs rather than generic quantization schemes.

Mobile NPU architecture is not generic. Accelerator memory layouts prefer channel-wise quantization — the quantization grid aligned to channels rather than blocks or tensors — because it matches how the hardware accesses weight memory. Dynamic scaling constant computation at inference time adds NPU latency that doesn’t exist in CPU-based inference; pre-calculating those constants during export removes that overhead. And the specific memory bandwidth constraints of mobile DRAM make the bit depth of the token generation layers — the decoder layers invoked for every output token — the dominant bottleneck in autoregressive generation.

The mobile format attacks that bottleneck directly with 2-bit compression on the token generation layers, keeping core reasoning layers at higher precision. This is an asymmetric bet: 2-bit introduces substantially more rounding error than 4-bit, but the QAT-trained weights are designed to absorb it, and the memory bandwidth reduction for decode is proportional to the bit depth reduction. Prefill (processing the prompt) benefits less from this, since prefill is compute-bound rather than memory-bandwidth-bound at typical context lengths.

The headline result for the Gemma 4 E2B text-only model: under 1 GB of total memory, 2x faster inference on mobile-class NPUs compared to FP16, and quality within a few percentage points of the reference model on standard benchmarks. Checkpoints are available in GGUF and compressed-tensor formats, making them directly usable in llama.cpp and MLX without additional tooling.

The less-than-1-GB threshold matters in a specific way for device integration. Applications running on mobile OSes live in a shared memory environment where the OS can evict resident processes under memory pressure. A model that needs 1.5 GB competes directly with other apps for that headroom; one that fits in under a gigabyte can often stay resident. This is not a theoretical concern — it’s what determines whether a model actually gets deployed in production applications versus remaining a demo.

The pattern across Gemma 4’s releases — encoder-free multimodal architecture, then QAT checkpoints optimized to the hardware characteristics of the deployment target — looks like a coordinated push toward genuine edge usability rather than theoretical compressibility. Getting a multimodal model below 1 GB in a way that’s validated against real mobile hardware is a more concrete milestone than it might appear.