Where does the intelligence of large models come from?

Explore the mystery behind large model intelligence and reveal the technical logic behind AI milestones.
Core content:
1. The essence of large model intelligence: algorithms, computing power, and data
2. Transformer architecture and its application in AI
3. The source of emergent intelligence and the importance of self-supervised learning
Introduction
Why is ChatGPT an AI milestone? Where does emergent intelligence come from? How does DeepSeek achieve deep thinking and reasoning? Where is the next stop for AGI?
1. Transformer Architecture
https://arxiv.org/abs/1706.03762
The Transformer is a neural network architecture built on the self-attention mechanism. It processes sequence data efficiently through parallel computation and global dependency modeling, handling both the encoding and decoding of information, and it is widely used in natural language processing, computer vision, and other fields. Its core strengths are capturing long-range dependencies and scaling flexibly, which have made it the cornerstone of modern deep learning.
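To make the mechanism concrete, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. The shapes, variable names, and toy dimensions are illustrative assumptions, not code from any particular Transformer implementation:

```python
# Minimal sketch of scaled dot-product self-attention (single head).
# Shapes and names are illustrative only.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projection matrices."""
    q = x @ w_q                        # queries
    k = x @ w_k                        # keys
    v = x @ w_v                        # values
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)    # similarity between every pair of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ v                 # each position is a weighted mix of all values

# Toy usage: 5 tokens, 16-dimensional embeddings, 8-dimensional head.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
w_q, w_k, w_v = (rng.normal(size=(16, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)   # shape (5, 8)
```

Because every position attends to every other position in a single matrix product, the computation parallelizes well and long-distance dependencies are captured directly, which is exactly the property described above.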
2. Emergent Intelligence
https://arxiv.org/abs/2206.07682
Emergent intelligence: when a system reaches a certain scale, it exhibits complex behaviors or capabilities as a whole that cannot be observed in its individual components or in smaller systems. In large models (such as ChatGPT), emergence is mainly related to the following factors:
(1) Expansion of model scale
Increase in parameter count: As the number of parameters in neural network models grows (from millions to hundreds of billions), the expressive power of the model increases significantly, allowing it to capture more complex language patterns and knowledge.
Scale effect: When model size crosses a certain threshold, new capabilities (such as in-context learning and reasoning) appear rather abruptly. This phenomenon is called "emergence".
(2) Training with massive data
Diverse data: Large models are trained on massive, diverse data (books, web pages, conversation records, etc.), covering a wide range of knowledge areas and linguistic phenomena.
Data-driven learning: The model automatically extracts patterns from the data and gradually learns to handle complex tasks.
(3) Self-supervised learning and pre-training
Self-supervised tasks: Through self-supervised learning (such as predicting the next word or a masked word), the model learns the inherent regularities of language from unlabeled data.
Pre-training goal: During pre-training, the model learns general language representations, laying the foundation for later emergent capabilities.
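As a rough illustration of this self-supervised objective, the sketch below "trains" a toy bigram model by counting next-word co-occurrences and measures the same quantity pre-training minimizes, the negative log-likelihood of the next token. The tiny corpus and the bigram lookup table are stand-ins for real data and a real neural network:

```python
# Hedged sketch of next-word prediction as a self-supervised objective.
# A bigram count table plays the role of the "model"; no labels are needed,
# because the training signal comes from the text itself.
import math
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()

# Every adjacent pair (context word, next word) is a free training example.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_prob(prev, nxt):
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

# Average negative log-likelihood: the quantity pre-training drives down.
nll = -sum(math.log(next_word_prob(p, n) + 1e-9) for p, n in zip(corpus, corpus[1:]))
print(f"avg negative log-likelihood: {nll / (len(corpus) - 1):.3f}")
```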
(4) In-Context Learning
Few-shot learning: The ability of a model to complete a new task given only a few examples is called in-context learning.
Pattern matching: The model infers the rules of the task from patterns in the input and generates the corresponding output.
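The following sketch shows what in-context (few-shot) learning looks like from the user's side: the task is specified only by examples inside the prompt, and no parameters are updated. The review texts and labels are made up for illustration:

```python
# Illustrative few-shot prompt for in-context learning. No real model API is
# called here; the point is that the task is defined entirely by the examples.
examples = [
    ("I loved this movie", "positive"),
    ("The plot was boring", "negative"),
]
query = "The soundtrack was wonderful"

prompt = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
prompt += f"\nReview: {query}\nSentiment:"

# A sufficiently large model completes the pattern (here, with "positive")
# without any task-specific fine-tuning.
print(prompt)
```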
(5) Multi-task learning and generalization ability
Multi-task training: During training, the model is exposed to many tasks (translation, question answering, summarization, etc.) that share a common language representation.
Generalization ability: The model can transfer what it has learned to new tasks, showing strong generalization.
(6) Human Feedback and Alignment
Reinforcement learning from human feedback (RLHF): Using human feedback, the model learns to generate responses that better match human expectations.
Alignment techniques: Models are trained to be safer, more helpful, and better aligned with user needs; this alignment process further improves their practical performance.
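One concrete piece of RLHF is the reward model trained on human preference pairs. Below is a hedged sketch of the standard pairwise (Bradley-Terry style) preference loss; the numeric reward values are placeholders for the outputs of a real scoring network:

```python
# Hedged sketch of the reward-model objective in RLHF: given a human preference
# (chosen vs. rejected response), train the reward model so that
# r(chosen) > r(rejected) via a pairwise logistic loss.
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    # -log sigmoid(r_chosen - r_rejected): small when the model already
    # prefers the human-chosen response, large otherwise.
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy usage with made-up reward scores:
print(preference_loss(1.2, 0.3))   # ~0.34: already fairly consistent with the label
print(preference_loss(0.3, 1.2))   # ~1.24: the loss pushes the rewards apart
```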
(7) Decomposition and reasoning of complex tasks
Task decomposition: The model can break a complex task into several simple steps and solve them one by one.
Reasoning ability: Although the model's reasoning ability is limited, in some cases it can simulate reasoning-like behavior through pattern matching and probability computation.
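A simple way to see task decomposition in practice is a step-by-step prompt. The sketch below only builds such a prompt and shows, as a comment, the kind of stepwise completion a capable model tends to produce; no real model API is called:

```python
# Illustrative sketch of prompting a model to decompose a task into steps.
# The question and the expected completion are made up for illustration.
question = (
    "A shop sells pens at 3 yuan each. Tom buys 4 pens and pays with a "
    "20-yuan note. How much change does he get?"
)
prompt = f"{question}\nLet's solve this step by step:"
print(prompt)

# A capable model typically completes with something like:
#   Step 1: 4 pens cost 4 * 3 = 12 yuan.
#   Step 2: Change = 20 - 12 = 8 yuan.
#   Answer: 8 yuan.
```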
3. DeepSeek's Counterattack
https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
https://huggingface.co/deepseek-ai/DeepSeek-V3-Base
Recently, DeepSeek has been all over the news for its extremely high cost-effectiveness, open-source release, reasoning performance, and understanding of Chinese.
Taking DeepSeek-R1 as an example, its reasoning ability is built up mainly in the following ways:
- Reinforcement learning-based training
  - Adopting a reinforcement learning framework: DeepSeek-R1 uses the GRPO reinforcement learning framework, with DeepSeek-V3-Base as the base model, and relies on reinforcement learning to improve the model's performance on reasoning tasks. During reinforcement learning, the model continuously adjusts its policy through interaction with the environment to maximize the cumulative reward (a sketch of GRPO's group-relative advantage follows after this list).
  - Exploring a pure reinforcement learning path: DeepSeek-R1-Zero is DeepSeek's first attempt to improve the reasoning ability of a language model with pure reinforcement learning, focusing on the model's self-evolution through an RL-only process. It does not rely on supervised fine-tuning (SFT) in the initial stage, and during reinforcement learning it naturally exhibits many powerful and interesting reasoning behaviors, such as self-verification, reflection, and the generation of long reasoning chains.
- Multi-stage training optimization
  - Adding cold-start data fine-tuning: To address DeepSeek-R1-Zero's poor readability and language mixing, and to further improve reasoning performance, DeepSeek-R1 adds a small amount of cold-start data and a multi-stage training pipeline before reinforcement learning. First, a few thousand cold-start examples are collected to fine-tune the DeepSeek-V3-Base model.
  - Retraining with supervised data: When the reinforcement learning process is close to convergence, new SFT data is generated by rejection sampling on the RL checkpoint and combined with DeepSeek-V3's supervised data (covering writing, factual question answering, and self-awareness), and the model is retrained on it. The fine-tuned checkpoint then goes through a further round of reinforcement learning covering prompts from all scenarios, finally yielding DeepSeek-R1.
- Reasoning pattern distillation: DeepSeek-R1 also explores distilling its capabilities into small dense models, using Qwen2.5-32B as the base model and distilling directly from DeepSeek-R1. Transferring the large model's reasoning patterns into a small model gives the small model strong reasoning ability, with better performance than applying reinforcement learning directly to the small model.
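For readers curious about the GRPO step mentioned above, here is a minimal sketch of its group-relative advantage: several responses are sampled for one prompt, scored, and each response's advantage is its reward normalized by the group's mean and standard deviation, which removes the need for a separate value network. The reward values below are toy numbers, not outputs of any real reward function:

```python
# Hedged sketch of the group-relative advantage used in GRPO-style training.
import statistics

def group_relative_advantages(rewards):
    """Normalize each sampled response's reward against its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Toy example: 4 sampled answers to the same math prompt,
# reward 1.0 if the final answer is correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))   # [1.0, -1.0, -1.0, 1.0]
# Correct answers get positive advantage (reinforced), wrong ones negative.
```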