Llama 4 released: I see the shadow of DeepSeek

The release of Llama 4 marks a new stage for AI models, and it bears striking similarities to the approach laid out in the DeepSeek technical report.
Core content:
1. Llama 4 achieves task specialization through three different models
2. The transition from Dense to MoE architecture improves model efficiency and capabilities
3. The native multimodal architecture brings performance gains and cost reductions
Llama 4 is released.
https://huggingface.co/meta-llama
But this time, Meta did not loudly declare that its parameter count was "far ahead". Instead, it reorganized the lineup into three models:
• Scout: 109B total parameters, 17B activated, 16-expert MoE; deployable on a single H100; 10M-token context; suited to document analysis, multi-turn dialogue, code, and similar tasks
• Maverick: 400B total parameters, 17B activated, 128-expert MoE with only two experts active per token at inference; 1M-token context; performance on par with GPT-4o at roughly one-tenth the inference cost
• Behemoth: 2T total parameters, 288B activated, 16-expert MoE; not deployed and not open, used only during training to generate training data for Scout and Maverick
One serves users, one is the workhorse, and one does the teaching. They do not step on each other, and none of them tries to take on every task.
To be honest, reading this release I kept getting the same feeling I had reading the DeepSeek V3 technical report: embrace MoE, embrace synthetic data.
The architectural turn: MoE takes center stage
Llama 3 was dense, even the 405B model; Llama 4 is an MoE architecture.
(For background on this architecture question, see the earlier post "Is big smart?")
MoE used to be more of a laboratory option. After DeepSeek took off, many vendors began adopting it in their flagship models, and Meta has now followed. In Llama 4, Scout is configured with 16 experts and Maverick with 128; only two are activated per token at inference, for about 17B active parameters.
As a refresher, DeepSeek R1 and V3 follow a similar recipe: 671B total parameters with 37B activated, trading more controllable compute overhead for higher capability density.
Of course, MoE is not a fit for every workload, and it brings training headaches such as routing complexity and expert load balancing. But it at least opens up a practical design dimension: how parameters are used deserves as much design attention as how many there are.
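To make "only a few experts fire per token" concrete, here is a minimal top-2 MoE layer in PyTorch. It is a toy sketch, not Meta's or DeepSeek's implementation: the dimensions, the softmax-over-top-k gating, and the absence of shared experts and load-balancing losses are all simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy mixture-of-experts layer: each token is routed to its top-2 experts."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                    # x: (tokens, d_model)
        gate_logits = self.router(x)                         # (tokens, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)  # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)                 # renormalize their scores
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                        # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out

tokens = torch.randn(8, 512)
layer = ToyMoELayer()
print(layer(tokens).shape)  # torch.Size([8, 512]); only 2 of 16 experts ran per token
```

The point of the sketch is the ratio the article keeps citing: total parameters grow with the number of experts, while per-token compute grows only with top_k, which is how totals of 109B or 400B collapse to roughly 17B active.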
Multimodality: From plug-ins to native
In the Llama 3 era, image input relied on an external encoder whose outputs were bolted onto the language model. In Llama 4, images enter directly as tokens and take part in language context modeling.
This means images and text are not stitched together after the model is built; they are contextual units modeled jointly during training.
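A minimal way to picture "images as tokens" is early fusion: image patches are projected into the same embedding space as text tokens, and the transformer processes one interleaved sequence. The sketch below illustrates that idea only; the patch size, the Conv2d projection, and the tiny encoder are assumptions for illustration, not Llama 4's actual vision stack.

```python
import torch
import torch.nn as nn

d_model, patch = 512, 16

# Toy components standing in for the real tokenizer / vision projector.
text_embed = nn.Embedding(32000, d_model)                               # text token embeddings
patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)    # image patches -> embeddings
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
)

text_ids = torch.randint(0, 32000, (1, 12))        # a short text prompt
image = torch.randn(1, 3, 224, 224)                # one RGB image

text_tokens = text_embed(text_ids)                            # (1, 12, d_model)
image_tokens = patch_embed(image).flatten(2).transpose(1, 2)  # (1, 196, d_model)

# Early fusion: image and text tokens share one context window.
sequence = torch.cat([image_tokens, text_tokens], dim=1)      # (1, 208, d_model)
out = backbone(sequence)
print(out.shape)  # torch.Size([1, 208, 512])
```

Because the image patches sit in the same sequence as the text, attention can relate a chart region to the sentence asking about it, which is the behavior the benchmarks below are measuring.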
The gains from this structure show up directly in task performance:
• Maverick scores 94.4 on DocVQA, surpassing GPT-4o (92.8)
• It reaches 90.0 on ChartQA and 73.7 on MathVista, both higher than GPT-4o
• Its inference cost is only about one-tenth of GPT-4o's
The native multimodal architecture also shows up in Scout: lightweight as it is, Scout still holds its own on these tasks.
Worth noting here: DeepSeek V3/R1 have not yet introduced image tokens.
Training shift: Big models are a process
Behemoth is the largest Llama 4 model, and a very capable one, but it is not offered for external use.
Behemoth exists entirely to generate training data and provide capability demonstrations for Scout and Maverick, whose behavior is then further refined with lightweight DPO and RLHF. In other words, Meta is no longer fixated on shipping the "strongest model"; it chooses to pour the most resources into the training system itself.
This is a bit like:
• OpenAI developing Strawberry to train the next GPT
• DeepSeek developing DeepSeek-R1-Lite to train DeepSeek V3
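The "teacher generates, students learn" pattern can be sketched as a simple distillation loop: a large teacher answers prompts, and the smaller student is later fine-tuned on those answers. The sketch below uses Hugging Face transformers with "gpt2" purely as a lightweight stand-in for a Behemoth-scale teacher; it omits the supervised fine-tuning, DPO, and RLHF stages that would follow, and is not Meta's actual pipeline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is only a small stand-in so the sketch runs on a laptop;
# in the article's setting the teacher would be a Behemoth-scale model.
teacher_name = "gpt2"
tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)

prompts = ["Summarise this contract clause:", "Describe the trend in this chart:"]

# Step 1: the teacher generates demonstration answers for a prompt set.
demonstrations = []
for p in prompts:
    inputs = tok(p, return_tensors="pt")
    output_ids = teacher.generate(**inputs, max_new_tokens=64, do_sample=False,
                                  pad_token_id=tok.eos_token_id)
    demonstrations.append({"prompt": p,
                           "response": tok.decode(output_ids[0], skip_special_tokens=True)})

# Step 2 (not shown): fine-tune the smaller student on these (prompt, response)
# pairs with ordinary supervised training, then refine it with DPO / RLHF.
print(demonstrations[0]["response"][:120])
```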
This is not about deifying one model; it is a change of course.
In my view, Llama 4 does not deliver a single breakthrough of the largest-parameters, strongest-capability kind. Instead, it answers the shifts underway in model design with a more complete system and a clearer division of labor:
Scout is for deployment, Maverick is for delivery, and Behemoth is the source of understanding.
It reads less like a product launch and more like an announcement of a course correction.