Llama 4 released: I see the shadow of DeepSeek

The release of Llama 4 marks a new stage for AI models, and it bears striking similarities to the approach laid out in the DeepSeek technical report.
Core content:
1. Llama 4 achieves task specialization through three different models
2. The transition from Dense to MoE architecture improves model efficiency and capabilities
3. The native multimodal architecture brings performance gains and cost reductions
Llama 4 is released.
https://huggingface.co/meta-llama
But this time, Meta did not loudly declare that its parameter count was "far ahead". Instead, it reorganized the lineup into three models:
• Scout: 109B total parameters, 17B activated, 16-expert MoE; deployable on a single H100; 10M-token context; suited to document analysis, multi-turn dialogue, code, and similar tasks
• Maverick: 400B total parameters, 17B activated, 128-expert MoE with only two experts active per token at inference; 1M-token context; performance on par with GPT-4o at roughly one-tenth the inference cost
• Behemoth: 2T total parameters, 288B activated, 16-expert MoE; not deployed and not open, used only during training to generate training data for Scout and Maverick
One serves users, one is the workhorse, and one does the teaching. They do not step on each other, and none of them tries to take on every task.
To be honest, reading this release I kept getting the same feeling I had reading the DeepSeek V3 technical report: embrace MoE, embrace synthetic data.
The architectural turn: MoE takes center stage
Llama 3 was dense, even the 405B model; Llama 4 is an MoE architecture.
(For background on this architecture question, see the earlier post "Is big smart?")
MoE used to be more of a laboratory option. After DeepSeek took off, many vendors began adopting it in their flagship models, and Meta has now followed. In Llama 4, Scout is configured with 16 experts and Maverick with 128; only two are activated per token at inference, for about 17B active parameters.
As a refresher, DeepSeek R1 and V3 follow a similar recipe: 671B total parameters with 37B activated, trading more controllable compute overhead for higher capability density.
Of course, MoE is not a fit for every workload, and it brings training headaches such as routing complexity and expert load balancing. But it at least opens up a practical design dimension: how parameters are used deserves as much design attention as how many there are.
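To make "only a few experts fire per token" concrete, here is a minimal top-2 MoE layer in PyTorch. It is a toy sketch, not Meta's or DeepSeek's implementation: the dimensions, the softmax-over-top-k gating, and the absence of shared experts and load-balancing losses are all simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy mixture-of-experts layer: each token is routed to its top-2 experts."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                    # x: (tokens, d_model)
        gate_logits = self.router(x)                         # (tokens, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)  # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)                 # renormalize their scores
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                        # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out

tokens = torch.randn(8, 512)
layer = ToyMoELayer()
print(layer(tokens).shape)  # torch.Size([8, 512]); only 2 of 16 experts ran per token
```

The point of the sketch is the ratio the article keeps citing: total parameters grow with the number of experts, while per-token compute grows only with top_k, which is how totals of 109B or 400B collapse to roughly 17B active.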
Multimodality: From plug-ins to native
In the Llama 3 era, image input relied on an external encoder whose outputs were bolted onto the language model. In Llama 4, images enter directly as tokens and take part in language context modeling.
This means images and text are not stitched together after the model is built; they are contextual units modeled jointly during training.
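A minimal way to picture "images as tokens" is early fusion: image patches are projected into the same embedding space as text tokens, and the transformer processes one interleaved sequence. The sketch below illustrates that idea only; the patch size, the Conv2d projection, and the tiny encoder are assumptions for illustration, not Llama 4's actual vision stack.

```python
import torch
import torch.nn as nn

d_model, patch = 512, 16

# Toy components standing in for the real tokenizer / vision projector.
text_embed = nn.Embedding(32000, d_model)                               # text token embeddings
patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)    # image patches -> embeddings
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
)

text_ids = torch.randint(0, 32000, (1, 12))        # a short text prompt
image = torch.randn(1, 3, 224, 224)                # one RGB image

text_tokens = text_embed(text_ids)                            # (1, 12, d_model)
image_tokens = patch_embed(image).flatten(2).transpose(1, 2)  # (1, 196, d_model)

# Early fusion: image and text tokens share one context window.
sequence = torch.cat([image_tokens, text_tokens], dim=1)      # (1, 208, d_model)
out = backbone(sequence)
print(out.shape)  # torch.Size([1, 208, 512])
```

Because the image patches sit in the same sequence as the text, attention can relate a chart region to the sentence asking about it, which is the behavior the benchmarks below are measuring.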
The gains from this structure show up directly in task performance:
• Maverick scores 94.4 on DocVQA, surpassing GPT-4o (92.8)
• It reaches 90.0 on ChartQA and 73.7 on MathVista, both higher than GPT-4o
• Its inference cost is only about one-tenth of GPT-4o's
The native multimodal architecture also shows up in Scout: lightweight as it is, Scout still holds its own on these tasks.
Worth noting here: DeepSeek V3/R1 have not yet introduced image tokens.
Training shift: Big models are a process
Behemoth is the largest Llama 4 model, and a very capable one, but it is not offered for external use.
Behemoth exists entirely to generate training data and provide capability demonstrations for Scout and Maverick, whose behavior is then further refined with lightweight DPO and RLHF. In other words, Meta is no longer fixated on shipping the "strongest model"; it chooses to pour the most resources into the training system itself.
This is a bit like:
• OpenAI developing Strawberry to train the next GPT
• DeepSeek developing DeepSeek-R1-Lite to train DeepSeek V3
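The "teacher generates, students learn" pattern can be sketched as a simple distillation loop: a large teacher answers prompts, and the smaller student is later fine-tuned on those answers. The sketch below uses Hugging Face transformers with "gpt2" purely as a lightweight stand-in for a Behemoth-scale teacher; it omits the supervised fine-tuning, DPO, and RLHF stages that would follow, and is not Meta's actual pipeline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is only a small stand-in so the sketch runs on a laptop;
# in the article's setting the teacher would be a Behemoth-scale model.
teacher_name = "gpt2"
tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)

prompts = ["Summarise this contract clause:", "Describe the trend in this chart:"]

# Step 1: the teacher generates demonstration answers for a prompt set.
demonstrations = []
for p in prompts:
    inputs = tok(p, return_tensors="pt")
    output_ids = teacher.generate(**inputs, max_new_tokens=64, do_sample=False,
                                  pad_token_id=tok.eos_token_id)
    demonstrations.append({"prompt": p,
                           "response": tok.decode(output_ids[0], skip_special_tokens=True)})

# Step 2 (not shown): fine-tune the smaller student on these (prompt, response)
# pairs with ordinary supervised training, then refine it with DPO / RLHF.
print(demonstrations[0]["response"][:120])
```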
This is not about deifying one model; it is a change of course.
In my view, Llama 4 does not deliver a single breakthrough of the largest-parameters, strongest-capability kind. Instead, it answers the shifts underway in model design with a more complete system and a clearer division of labor:
Scout is for deployment, Maverick is for delivery, and Behemoth is the source of understanding.
It reads less like a product launch and more like an announcement of a course correction.