IBM released Granite 4.1 on April 29, a family of dense language models in 3B, 8B, and 30B sizes under Apache 2.0. The headline claim — that the 8B instruct variant matches or outperforms the previous Granite 4.0 32B mixture-of-experts model — is the kind of thing vendors say often enough that it’s easy to skim past. But the specifics of how IBM achieved it are worth unpacking, because they reflect a different set of design choices than most recent model releases.

Dense over MoE, with a reason

The conventional wisdom in frontier model design has been drifting toward mixture-of-experts architectures because they let you scale parameter counts while keeping inference compute roughly constant. A 32B MoE model activates only a fraction of its parameters per token, so it can match the quality of a dense model while running faster per token. The tradeoff is that MoE models have higher memory requirements (you need all the experts in memory, even if only a few fire) and less predictable latency.
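To make that compute-versus-memory tradeoff concrete, here is a rough sketch with hypothetical numbers — the expert count, routing, and shared-parameter fraction below are illustrative stand-ins, not the actual configuration of any Granite model:

```python
# Illustrative only: hypothetical expert counts and routing, not Granite's actual config.

def moe_active_params(total_params_b, num_experts, experts_per_token, shared_fraction=0.2):
    """Rough estimate of parameters touched per token in an MoE model.

    shared_fraction approximates attention and embedding weights that every
    token uses regardless of which experts the router picks.
    """
    shared = total_params_b * shared_fraction
    expert_pool = total_params_b - shared
    active_experts = expert_pool * (experts_per_token / num_experts)
    return shared + active_experts

# A hypothetical 32B MoE with 16 experts and top-2 routing:
active = moe_active_params(32, num_experts=16, experts_per_token=2)
print(f"~{active:.1f}B params of compute per token, but all 32B must sit in memory.")

# A dense 8B model computes and stores the same 8B every token: predictable, smaller footprint.
print("Dense 8B: ~8B params per token, ~8B in memory.")
```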

IBM chose dense transformers for Granite 4.1 explicitly for enterprise reasons: predictable token latency and stable costs. For a training pipeline or batch inference job, the variability introduced by expert routing isn’t necessarily a problem. For a production API with SLA requirements, it can be. Whether that justification holds in practice depends on the deployment context, but it’s a coherent position, not just a capability gap dressed up as a feature.

The 8B model’s ArenaHard score of 69.0 beats the old 32B MoE, and its BFCL V3 tool-calling score of 68.3 beats the older model’s 64.7. Math results are similarly favorable. This doesn’t mean an 8B model is generally better than a 32B one — it means IBM’s 8B is better than IBM’s previous 32B, which was trained differently. The comparison is primarily a statement about training quality, not evidence that model size is irrelevant.

Five phases, one judge

IBM trained on approximately 15 trillion tokens across five distinct phases, each with a different data mixture. Phase one is broad pre-training. Later phases shift toward higher-quality technical, scientific, and mathematical data — standard practice. What’s less standard is how they handled fine-tuning: every training sample was evaluated by an LLM-as-Judge across six quality dimensions before being included. Samples that failed were dropped. This is an aggressive filtering stance that trades data volume for data quality.
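A minimal sketch of that kind of filter is below. The dimension names, prompt wording, and score threshold are assumptions for illustration — IBM has not published its six criteria or its judging prompt — and the judge call is a stand-in for whatever LLM endpoint does the scoring:

```python
import json

# Hypothetical dimension names; IBM's actual six criteria are not spelled out in the release.
DIMENSIONS = ["correctness", "instruction_following", "completeness",
              "clarity", "safety", "formatting"]

JUDGE_PROMPT = """Rate the following training sample from 1 to 5 on each dimension:
{dimensions}

Sample:
{sample}

Reply with a JSON object mapping each dimension name to an integer score."""


def keep_sample(sample_text: str, judge_fn, min_score: int = 4) -> bool:
    """Score one sample with an LLM judge and keep it only if every dimension passes."""
    prompt = JUDGE_PROMPT.format(dimensions=", ".join(DIMENSIONS), sample=sample_text)
    scores = json.loads(judge_fn(prompt))  # judge_fn is any callable that queries an LLM
    return all(scores.get(d, 0) >= min_score for d in DIMENSIONS)


def filter_dataset(samples, judge_fn):
    """Aggressive filtering: a sample that fails even one dimension is dropped entirely."""
    return [s for s in samples if keep_sample(s, judge_fn)]
```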

The fine-tuning stage was followed by a four-stage reinforcement learning curriculum. The specific detail that stands out is the third stage: a dedicated math recovery phase. IBM found that RLHF — the first RL stage — degraded mathematical reasoning relative to the supervised fine-tuned checkpoint. This is a known phenomenon in RL training: the reward signal for general instruction following doesn’t always reinforce math-specific reasoning patterns, and the model can regress. IBM’s response was to add an additional RL phase specifically targeting math, using reward signals calibrated to mathematical correctness rather than general preference.
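The general shape of a correctness-calibrated reward — as opposed to a learned preference model — looks roughly like the sketch below. This is an illustration of the idea, not IBM’s implementation; the answer format is borrowed from GSM8K-style data and is an assumption:

```python
import re

def math_reward(model_output: str, reference_answer: str) -> float:
    """Reward keyed to mathematical correctness rather than general preference.

    Assumes the final answer is reported as '#### <value>' at the end of the
    output, a GSM8K-style convention; IBM's actual format may differ.
    """
    match = re.search(r"####\s*(-?[\d.,]+)", model_output)
    if match is None:
        return 0.0  # no parseable final answer, no reward
    try:
        predicted = float(match.group(1).replace(",", ""))
        target = float(reference_answer.replace(",", ""))
    except ValueError:
        return 0.0
    return 1.0 if abs(predicted - target) < 1e-6 else 0.0
```

The key property is that the signal is verifiable: the model only gets rewarded when the answer is actually right, which is exactly the pressure that general preference rewards fail to apply.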

The transparency here is useful. Acknowledging that RLHF hurt math and that they had to actively fix it with a subsequent stage is the kind of detail that gets left out of most model announcements. It implies that RLHF applied without careful capability monitoring can actively make a model worse in measurable ways, which has implications for anyone running their own fine-tuning pipelines.

Context and tooling

Granite 4.1 supports a 512K token context window, extended through staged training rather than directly targeting 512K from scratch. IBM’s earlier work on efficient context extension suggests this was done with careful degradation testing at short contexts — extended models often show performance drops on standard short-context benchmarks when context extension is applied aggressively.
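The kind of regression check implied here is simple to sketch: evaluate the extended-context checkpoint against its short-context baseline on the same benchmarks and flag any drop beyond a tolerance. The benchmark names and scoring hook below are placeholders, not IBM’s evaluation harness:

```python
# Placeholder benchmark list; substitute whatever short-context evals matter for the model.
SHORT_CONTEXT_BENCHMARKS = ["mmlu", "arena_hard", "bfcl_v3"]

def check_short_context_regression(score_fn, baseline_ckpt, extended_ckpt, tolerance=0.01):
    """score_fn(checkpoint, benchmark) -> float; return benchmarks that regressed past tolerance."""
    regressions = {}
    for bench in SHORT_CONTEXT_BENCHMARKS:
        base = score_fn(baseline_ckpt, bench)
        extended = score_fn(extended_ckpt, bench)
        if base - extended > tolerance:
            regressions[bench] = (base, extended)
    return regressions
```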

The models are available on Hugging Face and via watsonx, in base and instruction-tuned variants, with optional FP8 quantization. The Apache 2.0 license permits commercial use and modification.
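If the checkpoints follow the naming pattern of earlier Granite releases on the Hugging Face hub, loading the instruct variant would look roughly like this — the repository ID below is a guess, not a confirmed name, so check the hub before running:

```python
# The model ID is a guess based on earlier Granite naming conventions; verify on the hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.1-8b-instruct"  # hypothetical repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Summarize the Apache 2.0 license in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```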

Granite 4.1 isn’t trying to compete on raw benchmark rankings with Gemma 4 or the latest open-weight releases from Qwen or DeepSeek. It’s positioned as a production-ready, predictable-cost option for enterprise workloads — a narrower goal, but one that’s coherent given where IBM’s customer base sits. The interesting signal is in the training pipeline details, which suggest that deliberate multi-stage RL with per-stage capability monitoring is worth the engineering overhead, even at 8B scale.