The Rest of the Transformer, Fused
CODA, a new paper from Tri Dao and colleagues, extends FlashAttention's core insight — keep data on-chip, avoid DRAM round-trips — to all the non-attention operations in a transformer block. Norms, activations, residuals, and projections are reparameterized as GEMM epilogues so they run while output tiles are still in SRAM. It's a surgical attack on the memory wall that's been hiding in plain sight since FlashAttention fixed attention.
Read more →
