The KV Cache Is More Compressible Than You Think

Two papers published this week attack the KV cache memory bottleneck from opposite directions. One shares the key and value projections at training time, cutting the cache by 50% at a 3.1% perplexity cost. The other quantizes the stored cache to 4-bit keys and 2-bit values, requires no calibration, and reports throughput above the FP16 baseline. Together they suggest the cache is far more compressible than inference engineers typically assume.
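To put rough numbers on those two claims, here is a back-of-the-envelope sketch of the cache size arithmetic. The model shape (32 layers, 8 KV heads, head dimension 128, 8,192 tokens) and the function `kv_cache_bytes` are hypothetical and chosen only for illustration. The sketch also assumes that sharing projections means a single stored tensor serves as both keys and values, and it ignores the scale and zero-point metadata a real quantizer stores, so actual savings would be somewhat smaller.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   key_bits=16, value_bits=16, shared_kv=False):
    """Approximate KV cache size in bytes for one sequence.

    Assumption: quantization metadata (scales, zero points) is ignored.
    """
    # Number of elements in the key tensor (the value tensor has the same count).
    elems = layers * kv_heads * head_dim * seq_len
    if shared_kv:
        # Assumption: one stored tensor is reused as both keys and values.
        total_bits = elems * key_bits
    else:
        total_bits = elems * (key_bits + value_bits)
    return total_bits // 8


# Hypothetical model shape, not taken from either paper.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=8192)

baseline = kv_cache_bytes(**shape)                             # FP16 keys and values
shared = kv_cache_bytes(**shape, shared_kv=True)               # shared K/V tensor
quantized = kv_cache_bytes(**shape, key_bits=4, value_bits=2)  # 4-bit K, 2-bit V

print(f"FP16 baseline: {baseline / 2**20:.0f} MiB")
print(f"Shared K/V:    {shared / baseline:.2%} of baseline")
print(f"4-bit K, 2-bit V: {quantized / baseline:.2%} of baseline")
```

Under these assumptions the shared-projection cache is half the baseline, while the 4-bit/2-bit cache averages 3 bits per element against 16, or 18.75% of the FP16 footprint before metadata overhead.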
