Meta dropped the Llama 4 series late at night: it runs on a single H100, offers a 10M+ token context, and previews a 2-trillion-parameter "monster"

Meta's latest flagship, the Llama 4 series, has been released, ushering in a new era of multimodal AI.
Core content:
1. The Llama 4 series fully adopts the MoE architecture and native multimodal training
2. A detailed look at the three Llama 4 models: Scout, Maverick, and Behemoth
3. Technical highlights: native multimodality, training process optimization, and ultra-long context implementation
Zuckerberg finally got around to releasing Llama 4. A launch that should have happened long ago was reportedly knocked off schedule by DeepSeek R1, haha!
Meta has just released the first batch of Llama 4 models. According to the official Twitter account, this release is a complete redesign of the Llama series.
First, the key points:
Core changes: Llama 4 adopts a Mixture of Experts (MoE) architecture and native multimodal training; it is no longer a pure text model like Llama 3. This release ships Llama 4 Scout and Llama 4 Maverick, and previews the most powerful model, Llama 4 Behemoth.
Here is a quick summary for everyone:
Llama 4 Scout:
Positioning: the best-performing model in its small size class
Parameters: 17B active parameters, 16 experts, 109B total parameters
Highlights: extremely fast, natively multimodal, with an industry-leading 10 million+ token multimodal context window (enough to process more than 20 hours of video!); it can run on a single H100 GPU after Int4 quantization (see the loading sketch after this summary)
Llama 4 Maverick:
Positioning: best-in-class multimodal model
Performance: beats GPT-4o and Gemini 2.0 Flash on multiple mainstream benchmarks, with reasoning and coding capabilities comparable to the newly released DeepSeek V3 at less than half the active parameters
Parameters: 17B active parameters, 128 experts, 400B total parameters, 1M+ token context window
Cost-effectiveness: the best performance-to-cost ratio in its class; its experimental chat version scores an Elo of 1417 on LMArena, ranking second
Deployment: can run on a single host
Llama 4 Behemoth (preview, still in training):
Positioning: Meta's strongest model to date and one of the world's top LLMs
Performance: outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on multiple STEM benchmarks
Parameters: 288B active parameters, 16 experts, roughly 2 trillion (2T) total parameters
Training details: trained on 30 trillion multimodal tokens across 32,000 GPUs at FP8 precision
Role: serves as the teacher model for codistilling Maverick
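For a sense of what "runs on a single H100 after Int4 quantization" can look like in practice, here is a minimal, hedged sketch using Hugging Face transformers with bitsandbytes 4-bit weight quantization. The repo id is a placeholder (check Meta's official model cards for the real name and license gating), and the multimodal variants may require a different model class, so treat this as a generic text-generation loading sketch rather than Meta's reference setup.

```python
# Minimal sketch: load a large checkpoint with 4-bit weight quantization so it
# fits on a single 80GB GPU. The repo id below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-4-Scout"  # hypothetical repo id, for illustration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Int4-style weight quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU(s)
)

prompt = "Summarize the key changes in the Llama 4 release:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```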
Technical highlights
Native multimodality: all models use an early-fusion strategy that integrates text, image, and video tokens into a single unified model backbone
Training process optimization: post-training follows a lightweight SFT → online RL → lightweight DPO pipeline. The developers stress that heavy use of SFT/DPO over-constrains the model and limits exploration in the online RL stage, hence the emphasis on keeping those stages "lightweight".
The secret of the super-long context (10M+): the key to this breakthrough is the iRoPE architecture (the "i" stands for interleaved attention layers, and hints at "infinite")
Core idea: let the goal of unlimited context guide the architecture design, in particular by exploiting length extrapolation - training on short sequences and generalizing to far longer ones. The maximum training length is 256K.
Specific steps:
• Local attention layers use RoPE and handle short contexts (e.g. 8K), and can be parallelized
• Global attention layers handle long contexts (>8K) and use no positional encoding (the NoPE idea), which improves extrapolation
• Because attention weights tend to flatten as the context grows, hurting reasoning, temperature scaling is applied to the global layers at inference time to strengthen long-range reasoning while preserving short-context performance. The formula is roughly: xq *= 1 + log(floor(i / α) + 1) * β, where i is the position index
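As a concrete illustration, here is a minimal sketch of that inference-time query scaling. The tensor layout and the values of α and β are assumptions for illustration, not Meta's published settings.

```python
# Minimal sketch: position-dependent query scaling for the global (NoPE) layers,
# i.e. xq *= 1 + log(floor(i / α) + 1) * β applied at inference time.
# alpha and beta are placeholder values, not Llama 4's actual hyperparameters.
import torch

def scale_queries(xq: torch.Tensor, alpha: float = 8192.0, beta: float = 0.1) -> torch.Tensor:
    """xq: query states of shape (batch, seq_len, n_heads, head_dim)."""
    seq_len = xq.shape[1]
    i = torch.arange(seq_len, device=xq.device, dtype=xq.dtype)  # position index
    # The scale grows logarithmically with position, so attention over very long
    # contexts stays sharp instead of flattening out.
    scale = 1.0 + torch.log(torch.floor(i / alpha) + 1.0) * beta
    return xq * scale.view(1, seq_len, 1, 1)
```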
Commentary from the big names:
One regret (Jeremy Howard, former Kaggle president and fast.ai founder): while grateful for the open release, Jeremy Howard was also disappointed that Llama 4 Scout and Maverick are both large MoE models that cannot run on consumer-grade GPUs even after quantization, a real loss for the accessibility of the open-source community.
Jim Fan (Senior Research Manager at NVIDIA):
Deployment convenience first: Jim Fan argues that for open-source models, especially MoE architectures, ease of deployment is becoming more important than sheer model size. Meta's emphasis that Llama 4 Scout runs on a single H100 stands in contrast to Llama 3 405B (powerful, but with low adoption), suggesting MoE is a direction better suited to the current open-source strategy.
Intelligent hyperparameter tuning with MetaP: MetaP is a new technique for intelligently tuning training hyperparameters. Details are scarce, but he speculates it may resemble the Bayesian optimization in Meta's open-source Ax framework, which runs adaptive experiments (such as A/B tests) within a limited experiment budget.
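That is only speculation, but as a rough illustration of what adaptive experimentation with Ax's Service API looks like, a loop like the one below tunes a couple of hyperparameters under a fixed trial budget. The proxy objective and parameter ranges are made up for the example, and the exact create_experiment keyword arguments vary across Ax versions.

```python
# Illustrative only: adaptive hyperparameter search with Ax's Service API,
# in the spirit of what MetaP *might* do. The objective below is a dummy
# stand-in for a cheap proxy training run.
from ax.service.ax_client import AxClient

def train_proxy_model(lr: float, init_std: float) -> float:
    # Placeholder objective: pretend lr=3e-3, init_std=0.02 is optimal.
    return (lr - 3e-3) ** 2 + (init_std - 0.02) ** 2

ax_client = AxClient()
ax_client.create_experiment(
    name="metap_style_tuning",
    parameters=[
        {"name": "lr", "type": "range", "bounds": [1e-5, 1e-2], "log_scale": True},
        {"name": "init_std", "type": "range", "bounds": [0.005, 0.05]},
    ],
    objective_name="val_loss",
    minimize=True,
)

for _ in range(20):  # fixed experiment budget
    params, trial_index = ax_client.get_next_trial()
    ax_client.complete_trial(trial_index=trial_index, raw_data=train_proxy_model(**params))

print(ax_client.get_best_parameters())
```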
Post-training strategy, heavy RL and light SFT/DPO: Llama 4's post-training reduces the weight of SFT/DPO and leans harder on online RL, because too much SFT/DPO over-constrains the model and limits its exploration during the RL stage.
Self-critical data screening: an interesting technical point is that earlier checkpoints of the model can act as "critics" that evaluate later training data, helping filter out overly easy samples/prompts so the model keeps getting stronger through continual screening and learning (see the sketch below).
Behemoth training details and data challenges: Llama 4 Behemoth is enormous in scale (FP8 precision, 32K GPUs, 30T training tokens). Because the model is so capable, ordinary SFT data is too "easy" for it, so up to 95% of the SFT data had to be pruned, versus only about 50% for the smaller models.
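To make the "early checkpoint as critic" idea concrete, here is a conceptual sketch in which the critic scores each SFT example by its loss and only the hardest fraction is kept (for instance ~5% at Behemoth scale, per the figures above). The scoring function, dataset fields, and threshold are illustrative assumptions, not Meta's published recipe.

```python
# Conceptual sketch of critic-based SFT data pruning: an earlier checkpoint
# scores each example, and examples it already finds easy are dropped.
import torch

@torch.no_grad()
def difficulty(critic_model, tokenizer, prompt: str, answer: str) -> float:
    """Mean token loss of (prompt + answer) under the critic checkpoint; higher = harder."""
    ids = tokenizer(prompt + answer, return_tensors="pt").input_ids.to(critic_model.device)
    return critic_model(ids, labels=ids).loss.item()

def prune_easy_examples(dataset, critic_model, tokenizer, keep_fraction: float = 0.05):
    """Keep only the hardest `keep_fraction` of examples (e.g. 5% for a Behemoth-scale run)."""
    scored = [(difficulty(critic_model, tokenizer, ex["prompt"], ex["answer"]), ex) for ex in dataset]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # hardest first
    cutoff = max(1, int(len(scored) * keep_fraction))
    return [ex for _, ex in scored[:cutoff]]
```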
The technical means of achieving the 10M+ token context window look "quite simple":
1. Drop positional encoding in some layers: some attention layers (especially the global layers) use no positional encoding, following the idea of the NoPE (No Positional Embedding) paper
2. Adjust softmax attention: scale the softmax attention computation according to the context length
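Put together, a schematic of such an interleaved stack might look like the following. The layer count, interleaving ratio, and chunk size are assumptions for illustration, not Llama 4's actual configuration.

```python
# Schematic of an iRoPE-style interleaved stack: most layers use RoPE with
# chunked local attention, while every Nth layer is a global attention layer
# with no positional encoding (NoPE), temperature-scaled at inference time.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LayerSpec:
    use_rope: bool             # rotary position embedding on this layer?
    attention: str             # "local_chunked" or "global"
    chunk_size: Optional[int]  # attention window for local layers

def build_irope_stack(n_layers: int = 48, global_every: int = 4, chunk: int = 8192):
    layers = []
    for idx in range(n_layers):
        if (idx + 1) % global_every == 0:
            # Global layer: full-context attention, no positional encoding.
            layers.append(LayerSpec(use_rope=False, attention="global", chunk_size=None))
        else:
            # Local layer: RoPE plus attention restricted to fixed-size chunks.
            layers.append(LayerSpec(use_rope=True, attention="local_chunked", chunk_size=chunk))
    return layers

print(sum(1 for l in build_irope_stack() if l.attention == "global"), "global NoPE layers")
```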
Last words
A Llama 4 reasoning model is still missing, which feels like an odd omission for a company of Meta's size. What do you think? Meta says this is just the beginning, though: more models are on the way, the team is hard at work on them, and a Llama 4 Reasoning model was specifically mentioned.
In addition, compared with DeepSeek's MIT-licensed open-source approach, Llama 4's new license carries several restrictions:
- Companies with more than 700 million monthly active users must apply for special permission from Meta, which can be granted or denied at Meta's discretion.
- "Built with Llama" must be prominently displayed on the website, interface, documentation, etc.
- Any AI model created using Llama Materials must begin its name with "Llama"
- Specific attribution notices must be included in any distributed "notice" text file
- Usage must comply with Meta's separate Acceptable Use Policy (see http://llama.com/llama4/use-policy...)
- Limited permission to use the "Llama" name only for brand-compliant purposes