Gemma 4 12B Goes Encoder-Free
Google DeepMind's Gemma 4 12B drops the separate encoder stacks that multimodal models conventionally rely on, feeding raw pixel patches and audio waveforms directly into the LLM backbone through lightweight linear projections. The result fits in 16 GB of RAM, accepts native audio, and fine-tunes as a single unified model.
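To make the encoder-free idea concrete, here is a minimal sketch in plain Python of how raw patches and waveform frames could be turned into backbone tokens with nothing but a linear projection. It is an illustration, not Gemma's actual implementation: the patch size, frame length, model width, random weights, and the helper names (patchify, frame_audio, linear_project) are all assumptions chosen for the example.

```python
import random

random.seed(0)
D_MODEL = 16  # hypothetical backbone width, far smaller than any real model


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append the C channel values
            patches.append(flat)
    return patches


def frame_audio(waveform, frame_len):
    """Cut a raw waveform into consecutive fixed-length frames (tail is dropped)."""
    count = len(waveform) // frame_len
    return [waveform[i * frame_len:(i + 1) * frame_len] for i in range(count)]


def random_matrix(rows, cols):
    """Stand-in for learned projection weights."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear_project(vectors, weight, bias):
    """Apply y = W x + b to each input vector; weight has shape d_model x d_in."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in vectors
    ]


# Toy inputs: an 8x8 RGB image and 1600 raw audio samples.
image = [[[random.random() for _ in range(3)] for _ in range(8)] for _ in range(8)]
waveform = [random.uniform(-1.0, 1.0) for _ in range(1600)]

patches = patchify(image, patch=4)            # 4 patches of 4*4*3 = 48 values
frames = frame_audio(waveform, frame_len=400)  # 4 frames of 400 samples

image_tokens = linear_project(patches, random_matrix(D_MODEL, 48), [0.0] * D_MODEL)
audio_tokens = linear_project(frames, random_matrix(D_MODEL, 400), [0.0] * D_MODEL)

# Both modalities now live in the same embedding space and can be
# concatenated into one sequence for the language-model backbone.
sequence = image_tokens + audio_tokens
```

The point of the sketch is that each modality needs only one weight matrix before it reaches the backbone, so there is no separate encoder to train, store, or fine-tune alongside the language model.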
