Physical AI development has a fragmentation problem. Building a synthetic data pipeline for robotics or autonomous vehicles has traditionally meant coordinating several models: a VLM to understand a scene, a video diffusion model to simulate plausible futures, and a separate policy network to generate action sequences. Each model has its own weights, its own API, its own failure modes, and they don’t necessarily agree on what’s physically plausible. NVIDIA’s Cosmos 3, announced at Computex June 1 and available on Hugging Face June 3, is a bet on collapsing that stack.
The architecture is a Mixture-of-Transformers (MoT) with two towers sharing a common backbone. The Reasoner tower is an autoregressive VLM: it processes images, video frames, and text through next-token prediction to understand physical scenes. The Generator tower is a diffusion model for world simulation and action generation. The interesting part is how they’re joined. Input sequences are split into an AR subsequence (reasoning path) and a DM subsequence (generation path), with separate parameter sets but joint attention — AR tokens and diffusion tokens attend to each other within the same forward pass. This is one model with 16B or 64B total parameters, not two models stitched together.
That joint attention is where the claim about coupling becomes concrete. Because the reasoner and generator share context, the reasoner’s understanding of object interactions and physics directly conditions what the generator produces — and generated trajectories can inform subsequent reasoning. In practice, this means you can flip the inference direction: given a video of a manipulation task, output an action sequence; or given an action sequence, predict the resulting video. Both operations use the same weights and the same attention graph.
What separates this from a video generator that also outputs robot coordinates is the action generation capability. Cosmos 3 can produce robot joint angles, gripper positions, and control trajectories directly from observations — this is the data that robotics labs actually need for training physical policies. Generating action-labeled video synthetically is one of the bottlenecks in physical AI; if the same model handles both the video generation and the action labeling coherently, the pipeline shortens considerably.
Two variants are available: Cosmos 3 Nano (16B total parameters, 8B per tower) for workstation inference, and Cosmos 3 Super (64B, 32B per tower) for datacenter-scale synthetic data generation. Both are released openly under OpenMDW 1.1, the Linux Foundation’s model-centric license covering weights, architecture, documentation, and datasets. A few notes on what “open” means here: NVIDIA is releasing five separate components — HuggingFace weights, Diffusers integration, post-training scripts, six synthetic datasets, and the Cosmos Framework — and these don’t necessarily share identical commercial terms. Anyone planning enterprise deployment should read the licenses separately rather than assuming blanket permissiveness.
The six synthetic datasets cover robotics manipulation, warehouse operations, autonomous driving, and a few adjacent domains. Applications outside those verticals still require custom data generation; the datasets are a starting point, not a general-purpose resource.
The open question is whether the joint training actually helps. NVIDIA’s benchmarks claim state-of-the-art on VANTAGE-Bench, PAI-Bench, and Physics-IQ, but those are largely NVIDIA-adjacent evaluations. The more informative comparison would be Cosmos 3 against the best specialized video generation model and the best specialized action prediction model run separately, on the same task. The value proposition is most plausible for synthetic data generation, where consistency between the video and action outputs matters and running three incoherent specialist models is the realistic alternative. Whether it’s also better at any individual subtask is a separate question the benchmarks don’t resolve.
For robotics researchers, this is worth paying attention to regardless of how the specialist comparison turns out. Having one openly-licensed model that understands physical scenes, generates physically-plausible video, and produces action trajectories in a single coherent system lowers the barrier to building synthetic training pipelines substantially. Whether the coupling improves quality or just simplifies engineering is something practitioners will start to find out.
