Llama 4: a native-FP8 teacher model with up to 2 trillion parameters; intelligent computing centers that do not support FP8 take another hit!!!
Updated on: July 8, 2025
Recommendation
Llama 4's technological innovations usher in a new era of AI computing power; native FP8 training and multimodal fusion kick off an efficiency revolution.
Core content:
1. Llama 4 uses native FP8 training, once again reinforcing the need to upgrade computing center technology
2. The 2-trillion-parameter teacher model leads the field; the scaling race is accelerating, and reducing computing costs has become the focus
3. Llama 4's core technical highlights: MoE architecture, long-context support, and native multimodal fusion
Yang Fangxian
Founder of 53AI / Tencent Cloud Most Valuable Expert (TVP)
1. Llama 4, like DeepSeek, uses native FP8 training. Computing centers that do not support FP8 take another hit, further confirming our technical judgment that the next generation of computing centers is headed toward FP4: "The next generation of intelligent computing centers must choose FP8, FP6, and FP4 AI chips." The conclusion remains the same: computing centers that do not support FP8 have already been eliminated and have entered the residual-value disposal stage! "Intelligent computing centers will not end up in surplus, they will end up obsolete!!!"

2. Llama 4's teacher model, with up to 2 trillion parameters, once again leads the world. The scaling law driving the teacher-model race has not failed: stacking parameters can still improve model capability. At present, applications of computing power are still limited by its high cost, and reducing the cost of advanced computing power remains the top priority (outdated AI chips are of no use; let them go bankrupt and be liquidated as soon as possible so the next round can start).

3. Llama 4 is an MoE model with a 10M-token context, 2 trillion total parameters, and multimodality. R2, GPT-5, Qwen3, and Wenxin-5 will likely look similar, all focusing on multimodality. Multimodal workloads require even more computation, which makes low-precision FP4/FP8 mixed training all the more important...

On April 6, 2025, Meta dropped a "technological nuclear bomb" late at night: the open-source multimodal large model Llama 4 series. The release not only excited the AI community, it also announced AI's paradigm shift from "stacking parameters" to "competing on efficiency" with three major breakthroughs: a mixture-of-experts (MoE) architecture, a ten-million-token context window, and native multimodal fusion!

To be honest, they basically copied DeepSeek's ideas. DeepSeek really has influenced how large models are trained all over the world; it is a successful engineering achievement by the Chinese team. Thumbs up for that!

A longer input length brings one concrete benefit: when you feed in a bidding document of 100,000 to 1 million words, you can skip slicing it and input it directly for understanding. This is of course very computationally intensive, but at least it is technically possible. The value of ultra-large context understanding is that the model can take in an entire document or book at once, with deeper understanding and less ambiguity.

Llama 4's five core technical highlights, built around three major innovations: sparse architecture design (MoE), native multimodal fusion, and long-context optimization.

1. Mixture-of-experts (MoE) architecture: sparse activation and dynamic routing. Dynamic routing uses a gating network to analyze the input content (such as code, images, and text) in real time and dynamically select the most relevant "expert subnetworks" for the task at hand. Multimodal experts, each focused on a vertical field such as programming, mathematics, or vision, improve multi-task performance. A minimal routing sketch follows below.
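To make the dynamic-routing idea concrete, here is a minimal sketch of a top-k gated MoE layer in PyTorch. The layer sizes, the number of experts, and top_k=2 are illustrative assumptions, not Llama 4's actual configuration.

```python
# Minimal sketch of MoE dynamic routing with top-k gating.
# Sizes, expert count, and top_k are illustrative, not Llama 4's real configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)   # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.gate(x)                    # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # each token is processed by top_k experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e         # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)                    # 16 token embeddings
print(TopKMoE()(tokens).shape)                   # torch.Size([16, 512])
```

Only the selected experts run for each token, which is the sparse-activation part: total parameters can grow with the number of experts while per-token compute stays roughly constant.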
2. Ultra-long context support: interleaved attention and temperature scaling. Interleaved attention layers, a variant of rotary position embedding (RoPE), use a hierarchical attention mechanism to break the position-embedding length limit of the traditional Transformer and support contexts on the order of ten million tokens. Temperature scaling at inference time dynamically adjusts the distribution of attention weights, alleviating the "over-smoothing" of the softmax over long sequences and improving the modeling of long-range dependencies (a temperature-scaling sketch appears at the end of this article). In particular, this enables complex tasks such as analyzing an entire movie script, integrating knowledge across documents, and reasoning globally over a code base of millions of lines.

3. Native multimodality: early fusion and cross-modal alignment. Early fusion maps text, images, and video into vectors in a shared semantic space directly at the model's input layer (rather than concatenating them at a later stage), achieving deep interaction between modalities (an early-fusion sketch appears at the end of this article).

4. Efficient training technology: the MetaP optimizer and low-precision computing. The MetaP hyperparameter optimizer, an adaptive algorithm based on Bayesian optimization, extrapolates the hyperparameters of trillion-parameter models (learning rate, weight decay, etc.) from small-scale experiments (such as 10-billion-parameter models), cutting tuning time by 90%. FP8 mixed-precision training uses 8-bit floating-point numbers in key layers (such as attention matrix computation), combined with dynamic scaling factors, improving training speed by 30% while maintaining model accuracy (an FP8 scaling sketch appears at the end of this article).

5. Multilingual data engineering: language coverage and quality filtering. Language-stratified sampling divides 200 languages into high/medium/low-resource groups by resource richness and dynamically adjusts each group's share of the training data (for example, downsampling high-resource languages to prevent overfitting); a sampling sketch appears at the end of this article.
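On highlight 2: a hedged sketch of inference-time attention temperature scaling. The log-length rule, the train_len reference length, and the alpha constant below are illustrative assumptions, not Meta's published formula.

```python
# Hedged sketch: scale attention logits up as the sequence grows beyond the
# training length, so softmax does not "over-smooth" across huge contexts.
# The formula and constants are assumptions for illustration, not Meta's spec.
import math
import torch
import torch.nn.functional as F

def attention_with_temperature(q, k, v, train_len=1024, alpha=0.1):
    """Single-head attention; q, k, v have shape (seq_len, head_dim)."""
    seq_len, head_dim = q.shape
    # Assumed rule: temperature grows with log(seq_len / train_len), floor of 1.0 below it.
    temp = 1.0 + alpha * math.log(max(seq_len / train_len, 1.0))
    logits = (q @ k.T) / math.sqrt(head_dim) * temp   # sharpened logits for long inputs
    return F.softmax(logits, dim=-1) @ v

q = k = v = torch.randn(4096, 64)                     # 4x longer than train_len
print(attention_with_temperature(q, k, v).shape)      # torch.Size([4096, 64])
```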
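On highlight 3: a minimal sketch of early fusion, i.e. projecting text tokens and image patches into one shared embedding space and interleaving them into a single sequence before any transformer layer runs. The vocabulary size, patch size, and linear patch encoder are assumptions for illustration.

```python
# Minimal sketch of early fusion: both modalities enter one shared embedding
# space at the input layer, so attention mixes them from the first layer on
# (instead of concatenating separately encoded features late in the model).
# Dimensions, vocab size, and the linear patch encoder are illustrative.
import torch
import torch.nn as nn

d_model = 512
text_embed = nn.Embedding(32000, d_model)        # token ids -> shared space
patch_embed = nn.Linear(3 * 16 * 16, d_model)    # flattened 16x16 RGB patches -> shared space

text_ids = torch.randint(0, 32000, (1, 20))      # a 20-token text prompt
patches = torch.randn(1, 64, 3 * 16 * 16)        # 64 image patches from one image

fused = torch.cat([text_embed(text_ids), patch_embed(patches)], dim=1)
print(fused.shape)                               # torch.Size([1, 84, 512]), fed to the transformer
```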
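On highlight 4: a hedged sketch of the dynamic-scaling idea behind FP8 mixed precision. Each tensor is rescaled so its largest magnitude fits the FP8 E4M3 range before the cast, and the scales are divided back out after the matmul. A production kernel would fuse the scaled FP8 matmul on the GPU; this sketch only illustrates the scaling arithmetic.

```python
# Hedged sketch of FP8 quantization with dynamic per-tensor scaling factors.
# Requires PyTorch >= 2.1 for the torch.float8_e4m3fn dtype; the matmul itself is
# done in float32 here because this illustrates the scaling, not a fused kernel.
import torch

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_quantize(t):
    """Dynamic scaling: stretch the tensor so its max magnitude uses the FP8 range."""
    scale = E4M3_MAX / t.abs().max().clamp(min=1e-12)
    return (t * scale).to(torch.float8_e4m3fn), scale

def fp8_matmul(a, b):
    """Quantize both operands to FP8, multiply, then divide the scales back out."""
    qa, sa = fp8_quantize(a)
    qb, sb = fp8_quantize(b)
    return (qa.float() @ qb.float()) / (sa * sb)

a, b = torch.randn(64, 128), torch.randn(128, 32)
err = (fp8_matmul(a, b) - a @ b).abs().max()
print(f"max abs error vs. fp32 matmul: {err.item():.4f}")  # small, from 8-bit rounding
```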
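On highlight 5: a sketch of resource-stratified sampling using the common temperature-based reweighting rule p_i ∝ n_i^alpha. The corpus sizes and the alpha value are made up; the section above only states that high-resource languages are downsampled.

```python
# Sketch of language-stratified sampling via temperature-based reweighting:
# sampling probability p_i is proportional to n_i ** alpha, with alpha < 1,
# which downsamples high-resource languages and upsamples low-resource ones.
# Corpus sizes and alpha are illustrative assumptions.
corpus_tokens = {
    "en": 5_000_000_000,   # high-resource
    "zh": 2_000_000_000,   # high-resource
    "sw": 50_000_000,      # medium-resource
    "yo": 5_000_000,       # low-resource
}
ALPHA = 0.5

def sampling_weights(sizes, alpha=ALPHA):
    scaled = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(scaled.values())
    return {lang: s / total for lang, s in scaled.items()}

raw_total = sum(corpus_tokens.values())
for lang, p in sampling_weights(corpus_tokens).items():
    raw_share = corpus_tokens[lang] / raw_total
    print(f"{lang}: raw share {raw_share:6.2%} -> sampled share {p:6.2%}")
```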