Most model announcements lead with benchmark numbers and bury the method. MiniMax M2.7, released yesterday on HuggingFace, is worth reading in the opposite order — start with the process, then decide whether the numbers are credible.
The architecture is a sparse Mixture-of-Experts: 229 billion total parameters, 10 billion active per forward pass, 256 local experts with 8 activated per token. 62 layers, hidden size 3072, RoPE and QK RMSNorm, 204K context. That configuration is competitive — it puts inference cost closer to a 10B dense model while (in principle) drawing on a much larger parameter pool for any given input. The 10B active params is similar to what Z.AI’s GLM-5.1 uses in its much larger 754B MoE, which we covered last week. MoE at this activation ratio is becoming the default architecture for open-weight models with real-world deployment ambitions.
The benchmarks are strong for an open-weight model: 56.22% on SWE-Pro (MiniMax says this matches GPT-5.3-Codex), 57.0% on Terminal Bench 2, 55.6% on VIBE-Pro (a full project delivery benchmark), 66.6% medal rate on MLE Bench Lite across 22 ML competitions run autonomously over 24-hour windows. On GDPval-AA, a professional task delivery benchmark, it posts an ELO of 1495 — MiniMax claims the highest score among open-source models. Those are real numbers on benchmarks that test sustained agentic execution, not just one-shot generation. Whether they hold up in varied deployment contexts is something practitioners will establish quickly.
But the part that’s worth spending time on is the self-evolution loop.
During development, MiniMax gave an internal M2.7 instance its own programming scaffold — the agent harness, tool definitions, and evaluation setup used for its own training — and let it run. Over more than 100 rounds, the model executed a cycle: analyze which trajectories failed, plan scaffold modifications, implement changes, run evaluations, compare results, decide to keep or revert. No human directed each step. The outcome was a 30% performance improvement on their internal coding benchmarks.
This is worth being precise about. It is not the model retraining itself or updating its own weights. It’s the model acting as an agent on the infrastructure around it — the scaffolding, the tooling, the prompt structure, the evaluation harness. That’s a narrower claim than self-modification, but also a more interesting one in practice: the scaffold is often what determines whether a model is useful on a hard task, and writing good scaffolding is tedious work that most teams underinvest in. An agent that can iteratively improve its own harness against a concrete objective is doing something genuinely useful, and 100+ rounds of autonomous iteration on agentic infrastructure with measurable improvement is not a trivial result to demonstrate.
The loop structure — analyze failures → plan → modify → evaluate → keep or revert — is also a reasonable prototype for what robust agent-based development looks like. GLM-5.1’s “staircase” pattern was about goal persistence across long task chains; M2.7’s self-evolution loop is about iterative improvement of the development environment itself. These are different capabilities, and it’s interesting that both appeared in the same week.
On licensing: the HuggingFace page says “Modified MIT,” and there’s already an active community discussion pointing out that the actual license restricts commercial deployment without prior written authorization from MiniMax. That’s not MIT in the way most people use the term. The weights are publicly available and usable for research; commercial use requires contacting MiniMax directly. Worth knowing before you build something on top of it.
MiniMax M2.7 on HuggingFace. The technical announcement covers the self-evolution methodology in more detail, including the specific scaffold components the model modified.
