The Rest of the Transformer, Fused

CODA, a new paper from Tri Dao and colleagues, extends FlashAttention's core insight (keep data on-chip, avoid DRAM round-trips) to all the non-attention operations in a transformer block. Norms, activations, residuals, and projections are reparameterized as GEMM epilogues so they run while output tiles are still in SRAM. It's a surgical attack on the part of the memory wall that has been hiding in plain sight since FlashAttention tackled attention.

Read more →

The Price of Looping a Transformer

Two papers published on April 24 together give the most precise picture yet of looped transformer architectures, where the same block is reused across depth instead of stacking unique layers. The first derives a recurrence-equivalence exponent φ = 0.46 from 116 training runs, showing that looping carries a real compute cost. The second proposes Hyperloop Transformers, which add hyper-connections to partially recover that cost, and demonstrates that a 579M-parameter Hyperloop model outperforms a standard 1B-parameter transformer on perplexity and downstream benchmarks.

Read more →