Llama 4 released: 10M-token context, MoE, multimodality, and a nearly 2-trillion-parameter SOTA model are the highlights

Meta's Llama 4 series pushes the boundaries of AI capability and reaches a new height of multimodal intelligence.
Core content:
1. Overview of Llama 4 series models: Scout, Maverick and Behemoth
2. Llama 4 Scout: Performance and application of lightweight multimodal models
3. Llama 4 Maverick and Behemoth: stronger performance, up to nearly 2 trillion parameters
The Llama 4 family released by Meta this time includes three models:
1. Llama 4 Scout: a lightweight multimodal model with 17 billion active parameters and 16 experts. Designed for efficiency and ease of use, it runs on a single NVIDIA H100 GPU and supports a 10M-token context window, which makes it strong at tasks such as multi-document summarization and codebase reasoning.
2. Llama 4 Maverick: a more powerful multimodal model with the same 17 billion active parameters but 128 experts, for 400 billion total parameters. Maverick surpasses GPT-4o and Gemini 2.0 Flash on multiple benchmarks and matches DeepSeek v3 on reasoning and coding tasks.
3. Llama 4 Behemoth: a yet-to-be-released "giant" with 288 billion active parameters and nearly 2 trillion total parameters. Serving as a teacher model, it strengthens Scout and Maverick through knowledge distillation and performs strongly on multiple STEM benchmarks.
Llama 4 Scout: A small but mighty multimodal pioneer
Let's start with Llama 4 Scout. This model is the most lightweight member of the Llama 4 series, but small does not mean weak: its design is highly refined, and its performance even exceeds that of many larger models.
Scout has 17 billion active parameters and 16 experts, and adopts a mixture-of-experts (MoE) architecture. The core idea of MoE is to have each token activate only a subset of the parameters rather than running the entire network. This design reduces inference cost while improving the model's efficiency and performance. For example, Scout can run on a single NVIDIA H100 GPU (via Int4 quantization), which means its deployment cost is extremely low, making it suitable for developers and enterprises with limited resources.
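To make the MoE idea concrete, here is a minimal sketch of top-1 token routing in PyTorch. The layer sizes are illustrative, and Meta's actual design reportedly also includes a shared expert that every token passes through, which this sketch omits.

```python
# A minimal sketch of MoE top-1 token routing (PyTorch). Layer sizes are
# illustrative; Llama 4 reportedly also routes every token through a shared
# expert, which is omitted here for clarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=16):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)   # routing probabilities per token
        weight, idx = gate.max(dim=-1)             # pick the single best expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                        # tokens routed to expert e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

x = torch.randn(8, 512)
print(MoELayer()(x).shape)  # torch.Size([8, 512]); each token activated only one expert
```

Because only one expert's feed-forward weights run per token, the compute per token stays close to that of a 17B dense model even though the total parameter count is far larger.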
What's even more striking is that Scout supports a 10M-token context window, close to an industry record. The larger the context window, the more information the model can take in at once: Scout can process an entire codebase, multiple documents, or even massive amounts of user activity data in a single pass. This ability makes it perform very well on tasks such as multi-document summarization and codebase reasoning.
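As a rough illustration of what a 10M-token budget allows, the sketch below packs a whole directory of documents into a single summarization prompt. The 4-characters-per-token heuristic and the prompt wording are assumptions for illustration, not part of Meta's release.

```python
# Illustrative only: packing a directory of documents into one long-context
# summarization prompt. The 10M-token budget is the figure Meta quotes for
# Scout; the 4-characters-per-token heuristic is a rough assumption.
from pathlib import Path

CONTEXT_BUDGET_TOKENS = 10_000_000
APPROX_CHARS_PER_TOKEN = 4

def build_summary_prompt(doc_dir: str) -> str:
    parts = ["Summarize the key findings across all documents below.\n"]
    used = len(parts[0]) // APPROX_CHARS_PER_TOKEN
    for path in sorted(Path(doc_dir).glob("*.txt")):
        text = path.read_text()
        cost = len(text) // APPROX_CHARS_PER_TOKEN
        if used + cost > CONTEXT_BUDGET_TOKENS:
            break                       # stop before overflowing the window
        parts.append(f"\n--- {path.name} ---\n{text}")
        used += cost
    return "".join(parts)
```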
Scout's multimodal capabilities are also worth mentioning. It uses an Early Fusion design to seamlessly integrate text and visual tokens into a unified model framework. For example, in the Image Grounding task, Scout can accurately match the user's question with a specific area in the image. This capability makes it very accurate in tasks such as visual question answering and image description generation.
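Here is a minimal sketch of what early fusion means in code: patch features from a vision encoder and text token embeddings are projected to the same width and concatenated into one sequence before the transformer backbone. All dimensions are illustrative; Meta has not published Llama 4's internals in this form.

```python
# Sketch of "early fusion": image patches and text tokens are embedded
# separately, then concatenated into one sequence for a shared backbone.
# Dimensions are illustrative, not Llama 4's actual configuration.
import torch
import torch.nn as nn

d_model = 512
text_embed = nn.Embedding(32_000, d_model)     # token embeddings
vision_proj = nn.Linear(1024, d_model)         # projects vision-encoder patch features

text_ids = torch.randint(0, 32_000, (1, 12))   # 12 text tokens
patch_feats = torch.randn(1, 64, 1024)         # 64 image patches from a vision encoder

fused = torch.cat([vision_proj(patch_feats), text_embed(text_ids)], dim=1)
print(fused.shape)  # (1, 76, 512): one unified sequence for the transformer
```

Because both modalities live in one sequence from the first layer onward, attention can relate a question's words directly to specific image regions, which is what enables the grounding behavior described above.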
In addition, Scout's training data is very rich: more than 30 trillion tokens spanning text, image, and video. This large-scale data mixture ensures the model's broad applicability across multimodal tasks; for example, it can generate accurate answers or descriptions when given combined image and text inputs.
Llama 4 Maverick: The perfect balance of performance and efficiency
If Scout is a "lightweight player," then Maverick is an "all-rounder." It also has 17 billion active parameters, but the number of experts rises to 128, for a total of 400 billion parameters. This makes it perform better on complex tasks.
Maverick is also built on the MoE architecture, but with far more experts, which makes it more well-rounded on multimodal tasks: it handles image understanding, text generation, and reasoning with ease. It surpasses GPT-4o and Gemini 2.0 Flash on multiple benchmarks and is even comparable to DeepSeek v3 (a model with more parameters) on reasoning and coding tasks.
Maverick's training strategy is also very interesting. Meta uses a combination of lightweight supervised fine-tuning (SFT), online reinforcement learning (RL), and direct preference optimization (DPO). The core of this strategy is dynamically adjusting the difficulty of the training data so the model maintains high accuracy on reasoning, coding, and math tasks. In the online RL stage, for example, Maverick prioritizes medium-difficulty prompts and improves by continuously filtering the data, as sketched below. This not only improves computational efficiency but also makes Maverick's multimodal performance more balanced.
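The curriculum idea behind that filtering can be written in a few lines. The pass-rate thresholds below, and the pass_rate function itself, are hypothetical; Meta has not published its exact selection rule.

```python
# Hedged sketch of "medium difficulty" prompt selection for online RL:
# score prompts by the current policy's pass rate and keep those that are
# neither trivial nor hopeless. Thresholds and pass_rate() are assumptions.
def filter_prompts(prompts, pass_rate, low=0.2, high=0.8):
    """Keep prompts the current policy solves sometimes but not always."""
    return [p for p in prompts if low <= pass_rate(p) <= high]

# Toy pass-rate table standing in for rates estimated from sampled rollouts:
rates = {"easy math": 0.97, "medium proof": 0.55, "unsolved conjecture": 0.01}
kept = filter_prompts(rates.keys(), rates.get)
print(kept)  # ['medium proof']
```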
In addition, Maverick's performance-to-cost ratio is excellent. Its experimental chat model achieved an Elo score of 1417 on LMArena, which shows it can also deliver high-quality responses in dialogue tasks. For developers, this means higher performance at lower cost.
Maverick's multimodal capabilities are also very powerful. It can process up to 48 images and performs well in visual question-answering tasks. For example, it can combine images and questions to generate detailed explanations to help users better understand the content of the image.
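A multi-image visual question-answering request might look like the following. The OpenAI-style chat message schema shown here is an assumption (it is what many Llama serving stacks accept), so adapt the fields to whatever API you actually deploy behind.

```python
# Illustrative request shape for multi-image visual QA. The message schema
# follows the common OpenAI-style chat format; your serving stack's exact
# field names may differ, so treat this as a sketch.
import base64
from pathlib import Path

def image_part(path: str) -> dict:
    b64 = base64.b64encode(Path(path).read_bytes()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

messages = [{
    "role": "user",
    "content": [image_part(f"chart_{i}.png") for i in range(3)]
               + [{"type": "text", "text": "Compare the trends across these charts."}],
}]
```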
Llama 4 Behemoth: The Unreleased "Behemoth"
Behemoth is the "big brother" of the Llama 4 series, with 288 billion active parameters and nearly 2 trillion total parameters. Although it is still in training, it has already demonstrated remarkable performance.
Behemoth is designed to be the "teacher model" of the Llama 4 series, providing strong support to Scout and Maverick through knowledge distillation. The core idea of knowledge distillation is to have a larger teacher model guide the learning of a smaller student model to improve the latter's performance. Behemoth keeps the distillation process efficient through a dynamically weighted distillation loss function.
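Below is a hedged sketch of what a dynamically weighted distillation loss can look like. The soft-target/hard-target mix is the standard distillation formulation; the specific linear ramp on the teacher weight is an assumption, since Meta has not published Behemoth's actual weighting function.

```python
# Sketch of knowledge distillation with a dynamically weighted loss.
# The linear ramp on alpha (leaning more on the teacher over training)
# is an illustrative assumption, not Meta's published schedule.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, step, total_steps, T=2.0):
    hard = F.cross_entropy(student_logits, labels)     # ground-truth loss
    soft = F.kl_div(                                   # match the teacher's distribution
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T                                          # rescale for the temperature
    alpha = step / total_steps                         # dynamic weight on the soft target
    return (1 - alpha) * hard + alpha * soft
```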
In terms of training infrastructure, Behemoth uses a fully asynchronous online reinforcement learning framework. This design significantly improves training efficiency, roughly a 10x speedup over the previous generation of distributed RL training. In addition, Behemoth performs well on multiple STEM benchmarks, surpassing GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Pro on tasks such as MATH-500 and GPQA Diamond.
Behemoth's training data is also worth mentioning: like the rest of the family, it draws on more than 30 trillion tokens spanning text, image, and video. Although Behemoth has not yet been released, its potential has already generated high expectations.
In addition, Behemoth's training uses FP8 precision, which significantly improves compute efficiency. Meta reports sustaining 390 TFLOPs per GPU while training across 32K GPUs, a very high level for the industry.
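For readers who want to experiment with FP8 training themselves, NVIDIA's Transformer Engine is one common route on H100-class GPUs. Meta has not said which software stack it used, so treat this as an illustrative recipe rather than Behemoth's actual training code.

```python
# One common way to run linear layers in FP8 on H100-class GPUs is NVIDIA's
# Transformer Engine. This is an illustrative recipe; the article does not
# say which FP8 stack Meta used for Behemoth.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

model = te.Linear(4096, 4096).cuda()
recipe = DelayedScaling(fp8_format=Format.HYBRID)   # E4M3 forward, E5M2 backward

x = torch.randn(16, 4096, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = model(x)                                    # the matmul runs in FP8
y.sum().backward()
```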
Technical highlights of Llama 4
The technical highlights of Llama 4 are well worth exploring in depth. Here are a few key points:
1. Mixture-of-experts (MoE) architecture. MoE is one of Llama 4's core technologies. By activating only a subset of parameters for each token, it significantly reduces inference cost. For example, of Maverick's 400 billion total parameters, only 17 billion are active, allowing it to run on a single H100 host. Another advantage of MoE is that the model can dynamically allocate compute across different tasks, improving overall performance.
2. Long-context support. Llama 4 Scout supports a 10M-token context window, which lets it handle extremely demanding tasks such as multi-document summarization and codebase reasoning. This is achieved through interleaved attention layers and inference-time temperature scaling. The interleaved layers drop positional embeddings in part of the network, removing a key barrier to longer input sequences, while temperature scaling adjusts the attention distribution to improve generalization to lengths far beyond training (see the sketch after this list).
3. Multimodal capabilities. Llama 4's multimodality rests on its early-fusion design, which seamlessly integrates text and visual tokens into a unified model backbone so the model can process image and text inputs together. In visual question answering, for example, Llama 4 combines the image and the question to generate accurate answers.
4. Efficient training techniques. Meta developed a training technique called MetaP that dynamically adjusts learning rates and initialization parameters so the model retains strong performance across scales. Meta also trained in FP8 precision to significantly improve compute efficiency: training Llama 4 Behemoth across 32K GPUs, it sustained 390 TFLOPs per GPU, a very high level for the industry.
5. Data mixing and multilingual support. Llama 4 is pre-trained on 200 languages, including more than 100 languages with over 1 billion tokens each. This broad multilingual coverage lets the model cross language barriers and serve users worldwide more accurately.
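As promised above, here is a sketch of inference-time attention temperature scaling. The logarithmic schedule and its constants are assumptions chosen to illustrate the idea of keeping attention peaked at positions far beyond the training length; Meta's exact formula is not given in this article.

```python
# Sketch of inference-time attention temperature scaling for long-context
# generalization. The log-based schedule and constants are illustrative
# assumptions, not Meta's published formula.
import math
import torch

def scaled_attention_logits(q, k, pos, floor=8192, beta=0.1):
    # Standard scaled dot-product attention logits...
    logits = q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1])
    # ...sharpened by a factor that grows logarithmically with position, so
    # attention stays peaked on inputs far longer than those seen in training.
    # For pos < floor the factor is 1.0, i.e. short contexts are unchanged.
    temp = math.log(math.floor(pos / floor) + 1) * beta + 1.0
    return logits * temp
```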
Application scenarios of Llama 4
Llama 4's multimodal capabilities and efficient design give it great potential in many scenarios. Here are some specific application scenarios:
1. Personalized assistants. Llama 4 can draw on a user's historical activity data to provide more accurate suggestions. On social media, for example, it can analyze a user's interests and behavior to generate personalized recommendations.
2. Multilingual support. Llama 4's multilingual capability makes it perform very well on cross-language tasks. In international companies, for example, it can serve as a translation tool that helps employees communicate across language barriers.
3. Image understanding. In education, Llama 4 can help students understand complex charts or figures, combining an image with a question to generate a detailed explanation that helps students grasp the material.
4. Code generation and reasoning. Developers can use Llama 4 to quickly generate code or debug complex problems. In software development, for example, it can analyze a codebase and generate optimization suggestions, greatly improving productivity.
5. Long-document processing. Llama 4 Scout's 10M-token context window lets it take on extremely demanding long-document tasks. In law or academia, for example, it can analyze an entire document and produce a summary that helps users quickly extract key information.
Safety and bias issues
Finally, it is worth mentioning that Meta has added substantial safety mechanisms to Llama 4, such as Llama Guard and Prompt Guard, which detect and filter harmful inputs and outputs. Meta has also significantly reduced the model's bias on controversial topics by improving training data and algorithms: Llama 4's refusal rate on political and social topics has dropped from 7% in Llama 3.3 to less than 2%, a very large improvement.
Conclusion
The release of Llama 4 is not only a technological leap but also an important step toward making multimodal intelligence practical. Whether it is Scout's lightweight design, Maverick's all-around performance, or Behemoth's "giant" potential, these models give us a glimpse of AI's future possibilities.