The Post-Training Agent

Post-training is where LLMs get shaped into useful things. Base models — the pretrained weights that come out of the initial run — are capable but unfocused: they’ll complete a sentence, but they won’t reliably follow instructions, stay on-topic, or refuse harmful requests. The months of work that turn a base model into GPT-something or Claude-something happen in post-training: supervised fine-tuning, reinforcement learning from human feedback, direct preference optimization, and various hybrids. It’s expensive in compute, but even more expensive in researcher time. Someone has to know which datasets to use, which training techniques to combine, how to detect and recover from reward collapse, and when the model is actually done improving.

Hugging Face’s ml-intern, released earlier this week, is an attempt to automate that workflow with an agent. The architecture is a multi-step loop built on smolagents: it browses arXiv and Hugging Face Papers to identify relevant training techniques, traverses citation graphs, finds and inspects datasets on the Hub, writes training scripts (using Hugging Face’s TRL library), launches compute jobs via Hugging Face Jobs, reads evaluation outputs, and decides whether to continue or adjust the recipe. The whole loop can run up to 300 iterations, with a context manager that compacts message history at 170k tokens to avoid blowing the window on long runs. There’s also a “doom loop detector” that watches for repeated tool usage patterns and breaks the cycle — a small piece of engineering that matters a lot in practice, since unconstrained agents tend to get stuck spinning on the same failed action.

The demo result is concrete enough to be meaningful: starting from Qwen3-1.7B (which scores around 10% on GPQA out of the box), the agent produced a training run that pushed the model to 32% on the same benchmark in under ten hours. That’s roughly 22 percentage points of improvement on a scientific reasoning benchmark for a 1.7B parameter model — not a trivial result. The $1,000 GPU credit and Anthropic credits that Hugging Face is offering early users suggests the compute cost isn’t negligible either, but 10 hours of job time on Hugging Face’s infrastructure is well within a realistic experimentation budget.

The default orchestration model is Claude Sonnet 4.5, which is a reasonable choice — capable enough for the reasoning involved in literature review and training recipe design, but not overbuilt for a workflow where most of the real work is tool execution. The sensitive-operations approval workflow (the agent asks before submitting jobs or making destructive changes) is a sensible concession to the reality that unchecked resource spending is a real failure mode.

What ml-intern can’t do is equally worth naming. It operates within existing paradigms: SFT, DPO, GRPO, RLHF — all well-documented techniques with known implementations. If the right approach for a given task requires a novel training method that isn’t in the literature yet, the agent can’t invent it. The benchmark target also has to be something that can be measured automatically; the agent can’t optimize for “better customer service tone” without a way to score it. And the current setup is deeply coupled to Hugging Face’s own infrastructure: the Hub for datasets and models, HF Jobs for compute, TRL for training. That’s not a flaw, exactly, but it means the workflow is less portable than a human researcher who can stand up their own training environment anywhere.

Still, the more plausible critique of this kind of tool is that the knowledge it’s automating — “which papers describe the right technique, which datasets are high quality, how to tune learning rate for this scale” — is exactly the kind of accumulated expertise that takes ML engineers years to build. If ml-intern can reliably transfer a meaningful fraction of that expertise into a repeatable automated process, the implications for the cost of fine-tuning custom models are real. The space from “this benchmark says 10%” to “this benchmark says 32%” used to require an experienced ML engineer and a few weeks. Getting there autonomously in ten hours on commodity cloud compute is a meaningful compression.

The project is available on GitHub and as a Hugging Face Space, with early access compute credits for users who apply.