Understanding NVIDIA's world model platform Cosmos in one article

Written by
Silas Grey
Updated on: June 28, 2025

Core content:
1. The birth background and mission of NVIDIA Cosmos platform
2. The core components and functions of Cosmos platform
3. How Cosmos promotes the development of physical AI systems


In today's era of rapid AI development, new technologies and platforms appear constantly. At CES in January 2025, NVIDIA, a giant of the technology field, launched the Cosmos world foundation model platform. As soon as it was unveiled, it drew global attention and set off a new wave in artificial intelligence.

||   Background of the Cosmos platform

As artificial intelligence moves from theoretical research to practical application, physical AI systems such as robots and self-driving cars face major development challenges. Training these systems demands enormous resources: it typically requires collecting, labeling, and classifying millions of hours of real-world video data. For example, to train an autonomous car to drive safely in complex road conditions, developers must gather driving video across many scenes and weather conditions and accurately label each element in it. This process is costly, time-consuming, and labor-intensive.

To solve these problems, NVIDIA built the Cosmos platform, which aims to lower the development threshold for physical AI systems, accelerate their development, and let more developers focus on innovation and practice in physical AI.

Simply put, Cosmos is a world model platform comprising a series of open-source, open-weight video world models ranging from 4B to 14B parameters. Their purpose is clear: to generate massive amounts of photorealistic, physics-based synthetic data for AI systems that operate in the physical world, such as robots and self-driving cars, thereby addressing the extreme scarcity of data in this field.


||   Core components of the Cosmos platform

1. World Foundation Models

This is the core engine of the Cosmos platform. Built on advanced deep-learning techniques such as diffusion and autoregressive modeling, these models have powerful generation capabilities.

Type 1: Diffusion models
These generate realistic visual simulations from text and video prompts. For example, given the text prompt "vehicles and pedestrians pass in an orderly manner on a sunny city street," a diffusion model can generate a corresponding realistic scene video (the Text2World use case). Given a video prompt containing part of a scene, the model can generate a more complete, coherent continuation of it (the Video2World use case).
Type 2: Autoregressive models
These generate the future visual world from video prompts, optionally combined with text prompts. Given a video of a robot's current position and motion in a warehouse, an autoregressive model can predict the robot's next actions and position changes in the environment, generating a visual simulation of some period into the future.
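The autoregressive idea above can be illustrated with a toy sketch: treat each frame as a small state vector and roll the future out one step at a time, feeding each prediction back in as context. All names here are illustrative stand-ins, not the real Cosmos API, and the "model" is simple linear extrapolation rather than a learned network.

```python
def toy_world_model(history):
    """Predict the next state by linear extrapolation of the last two states.
    A real world model would be a learned neural network."""
    prev, last = history[-2], history[-1]
    return [2 * l - p for l, p in zip(last, prev)]

def rollout(history, steps):
    """Autoregressive rollout: each prediction is appended to the context
    and used to produce the next one."""
    frames = list(history)
    for _ in range(steps):
        frames.append(toy_world_model(frames))
    return frames[len(history):]

# A "robot" moving right at constant speed: (x, y) positions per frame.
past = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
future = rollout(past, 3)
print(future)  # [[3.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
```

The key property this captures is that errors compound: every predicted frame becomes input for the next prediction, which is why model fidelity matters so much for long rollouts.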

These models range from 4 billion to 14 billion parameters and come in three sizes: Nano, Super, and Ultra. Nano targets real-time, low-latency inference and edge deployment; with relatively few parameters, it runs quickly on resource-constrained devices. Super is the high-performance baseline, ready for fine-tuning and deployment out of the box. Ultra pursues maximum accuracy and output quality, providing the best-fidelity knowledge transfer for distilling custom models.
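The tier descriptions above amount to a simple selection rule, sketched below. The tier names come from the article; the constraint keys and the helper itself are illustrative.

```python
# Map deployment constraints (illustrative keys) to Cosmos model tiers.
TIERS = {
    "edge": "Nano",            # real-time, low-latency inference on constrained devices
    "baseline": "Super",       # high-performance default for fine-tuning and deployment
    "max_fidelity": "Ultra",   # best accuracy, e.g. for distilling custom models
}

def pick_tier(constraint):
    """Return the tier for a constraint, defaulting to the Super baseline."""
    return TIERS.get(constraint, "Super")

print(pick_tier("edge"))  # Nano
```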

2. Advanced Tokenizer

The Cosmos platform ships with advanced visual tokenizers, such as the Cosmos Tokenizer, which convert images and videos into tokens with a total compression rate 8x higher and processing speed 12x faster than today's leading tokenizers. This efficient conversion lets video data be processed and understood by the models far more quickly, greatly improving the efficiency and quality of data processing.
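What "converting video into tokens" means can be shown with a deliberately naive sketch: split a frame into patches and map each patch to one token id. The real Cosmos Tokenizer is a learned neural compressor, not a patch-averaging function; the patch size and quantization below are purely illustrative.

```python
def tokenize(frame, patch=4):
    """Split a 2D grayscale frame into patch x patch blocks and map each
    block to a single token id (here: the mean intensity, 0-255)."""
    h, w = len(frame), len(frame[0])
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            block = [frame[y][x] for y in range(i, i + patch)
                                 for x in range(j, j + patch)]
            tokens.append(int(sum(block) / len(block)))
    return tokens

frame = [[128] * 16 for _ in range(16)]  # a toy 16x16 "image", 256 pixels
tokens = tokenize(frame)
print(len(tokens))              # 16 tokens for 256 pixels
print(256 // len(tokens))       # 16x compression in element count
```

The point of the sketch: the higher the compression rate, the shorter the token sequence the world model has to process, which is exactly why tokenizer efficiency translates into faster training and inference.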

3. Guardrail

To ensure the platform operates safely, reliably, and in line with ethical and legal norms, Cosmos includes a guardrail model. It ensures the models are used responsibly, preventing harms such as privacy leakage and bias, and provides a solid safeguard for the platform's stable operation and wide adoption.
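A guardrail's role can be pictured as a filter stage placed between the user and the generative model. The sketch below uses simple keyword matching with a made-up blocklist; NVIDIA's actual guardrail is a trained model with its own policy, not this logic.

```python
# Hypothetical policy terms, for illustration only.
BLOCKED_TERMS = {"personal address", "license plate"}

def guardrail(prompt):
    """Screen a prompt before it reaches the model.
    Returns (allowed, reason). A real guardrail would use a trained
    classifier on both prompts and generated outputs, not keywords."""
    lowered = prompt.lower()
    for term in BLOCKED_TERMS:
        if term in lowered:
            return False, f"blocked: prompt contains '{term}'"
    return True, "ok"

print(guardrail("A sunny city street with pedestrians"))  # (True, 'ok')
print(guardrail("Zoom in on the license plate")[0])       # False
```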

4. Accelerated video processing pipeline

With an NVIDIA AI and CUDA® accelerated data processing pipeline driven by NVIDIA NeMo™ Curator, developers on the NVIDIA Blackwell platform can process, curate, and label 20 million hours of video in 14 days, a task that would take more than three years on CPUs alone. This acceleration dramatically shortens data processing time, letting developers extract valuable information from raw data faster and speeding up model training and optimization.
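The figures above imply roughly the following speedup (taking "more than 3 years" as a lower bound of 3 x 365 days):

```python
gpu_days = 14          # Blackwell-accelerated pipeline: 20M hours of video
cpu_days = 3 * 365     # lower bound for "more than 3 years" on CPUs alone

speedup = cpu_days / gpu_days
print(round(speedup))  # ~78x faster, as a lower bound
```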