Where does the intelligence of large models come from?

Written by
Iris Vance
Updated on: July 15, 2025
Recommendation

Explore the mystery behind big model intelligence and reveal the technical logic behind AI milestones.

Core content:
1. The essence of large model intelligence: algorithms, computing power, and data
2. Transformer architecture and its application in AI
3. The source of emergent intelligence and the importance of self-supervised learning


Introduction


With the emergence of large models and the growing popularity of AI, general artificial intelligence is gradually becoming possible, and understanding the essence of machine intelligence has become a point of public curiosity. That essence is threefold: algorithm-driven, compute-driven, and data-driven. As models build on one another, the technology may continue to reshape people's expectations.
  • Why is ChatGPT an AI milestone?
  • Where does emergent intelligence come from?
  • How does DeepSeek achieve deep thinking and reasoning?
  • Where is the next stop for AGI?

1. Transformer Architecture

https://arxiv.org/abs/1706.03762

      Transformer is a neural network architecture based on the self-attention mechanism. It processes sequence data efficiently through parallel computation and global dependency modeling, and performs both encoding and decoding of information. It is widely used in natural language processing, computer vision, and other fields. Its core strengths are capturing long-range dependencies and scaling flexibly, which have made it the cornerstone of modern deep learning.
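The self-attention computation described above can be sketched in a few lines of NumPy. This is a simplified, illustrative version: it uses a single head and, for brevity, skips the learned query/key/value projection matrices that a real Transformer applies.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Single-head self-attention over a sequence X of shape (seq_len, d).
    Simplification: Q = K = V = X (no learned projections)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)        # pairwise similarity of all positions, scaled
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ X                   # every output mixes information from all tokens

X = np.random.randn(5, 8)                # 5 tokens, embedding dimension 8
out = self_attention(X)
print(out.shape)                         # (5, 8)
```

Because every position attends to every other position in one matrix product, long-range dependencies are captured directly and the whole sequence is processed in parallel, which is the property the paragraph above highlights.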

2. Emergent Intelligence

https://arxiv.org/abs/2206.07682

Emergent Intelligence: when a system reaches a certain scale, the system as a whole exhibits complex behaviors or capabilities that cannot be observed in its individual components or in smaller-scale systems. In large models (such as ChatGPT), emergent intelligence is mainly related to the following factors:

(1) Expansion of model scale

  • Increase in the number of parameters: As the number of parameters in neural network models grows (from millions to hundreds of billions), the model's expressive power is significantly enhanced, and it can capture more complex language patterns and knowledge.

  • Scale effect: When the model size crosses a certain threshold, it abruptly exhibits new capabilities (such as in-context learning and reasoning). This phenomenon is called "emergence".


(2) Training with massive data

  • Diverse data: By training on massive amounts of diverse data (such as books, web pages, and conversation records), the large model covers a wide range of knowledge areas and language phenomena.

  • Data-driven learning: The model automatically extracts patterns from the data and gradually learns to handle complex tasks.


(3) Self-supervised learning and pre-training

  • Self-supervised tasks: Through self-supervised learning (such as predicting the next word or a masked word), the model learns the underlying regularities of language from unlabeled data.

  • Pre-training goal: During pre-training, the model acquires general language representation capabilities, laying the foundation for subsequent emergent abilities.
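The next-word prediction objective above can be illustrated with a toy cross-entropy computation. The vocabulary and probabilities below are made up for illustration; a real model would produce a distribution over tens of thousands of tokens from a neural network.

```python
import math

# Toy vocabulary and a hypothetical model's predicted distribution
# for the next word after the context "the cat sat on the".
vocab = ["mat", "dog", "sky", "sat"]
predicted_probs = [0.7, 0.1, 0.1, 0.1]   # illustrative model output
target = "mat"                           # the actual next word in the corpus

# Self-supervised loss: negative log-likelihood of the true next word.
# Training minimizes this over billions of such predictions.
loss = -math.log(predicted_probs[vocab.index(target)])
print(round(loss, 4))
```

No human labels are needed: the "label" is simply the next word in the raw text, which is what makes the approach scale to massive unlabeled corpora.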


(4) In-Context Learning

  • Few-shot learning: The ability of a model to complete a new task given only a few examples is called "in-context learning".

  • Pattern matching: The model infers the task's rules by identifying patterns in the input and generates the corresponding output.
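Few-shot prompting can be illustrated by simply concatenating example input/output pairs into the prompt: no weights are updated, and the model infers the task (here, English-to-French translation) purely from the pattern at inference time. The prompt text below is illustrative.

```python
# A few demonstration pairs establish the task's pattern.
examples = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]
query = "peppermint"

# Build the few-shot prompt; the model is expected to continue it
# by translating the final query, following the demonstrated pattern.
prompt = "\n".join(f"English: {en}\nFrench: {fr}" for en, fr in examples)
prompt += f"\nEnglish: {query}\nFrench:"
print(prompt)
```

The striking point is that translation was never an explicit training objective: the capability is elicited from the prompt structure alone, which is why few-shot behavior is cited as an emergent ability.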


(5) Multi-task learning and generalization ability

  • Multi-task training: The model is exposed to multiple tasks during training (such as translation, question answering, and summarization), which share common language representations.

  • Generalization ability: The model can transfer learned knowledge to new tasks, showing strong generalization ability.


(6) Human Feedback and Alignment

  • Reinforcement Learning from Human Feedback (RLHF): Through human feedback, the model learns to generate responses that better match human expectations.

  • Alignment techniques: Models are trained to be safer, more useful, and more aligned with user needs; this alignment process further improves their performance.


(7) Decomposition and reasoning of complex tasks

  • Task decomposition: The model can decompose a complex task into multiple simple steps and solve the problem step by step.

  • Reasoning capabilities: Although the model's reasoning is limited, in some cases it can simulate reasoning-like behavior through pattern matching and probability computation.
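The decompose-then-solve pattern above can be made concrete with a toy example: a question like "what is the average of the squares of 1 through 4" is broken into simple sequential steps, each easy on its own. The function below is illustrative, not from any model's actual pipeline.

```python
def decompose_and_solve(numbers):
    """Solve 'average of squares' by explicit step-by-step decomposition,
    mirroring how a model breaks a complex task into simple sub-steps."""
    # Step 1: square each number.
    squares = [n * n for n in numbers]
    # Step 2: sum the squares.
    total = sum(squares)
    # Step 3: divide by the count.
    return total / len(numbers)

print(decompose_and_solve([1, 2, 3, 4]))  # (1 + 4 + 9 + 16) / 4 = 7.5
```

Chain-of-thought prompting exploits exactly this structure: asking the model to write out intermediate steps makes each step a simpler prediction than jumping straight to the answer.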


3. DeepSeek’s Counterattack

https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf

https://huggingface.co/deepseek-ai/DeepSeek-V3-Base

Recently, DeepSeek has been all over the news for its cost-effectiveness, open-source releases, reasoning performance, and strong handling of Chinese-language content.

Taking DeepSeek-R1 as an example, its reasoning is mainly implemented in the following ways:

  • Reinforcement learning-based training
    • Adopting a reinforcement learning framework
      DeepSeek-R1 uses the GRPO (Group Relative Policy Optimization) reinforcement learning framework, with DeepSeek-V3-Base as the base model, and applies reinforcement learning to improve the model's performance on reasoning tasks. During reinforcement learning, the model continuously adjusts its policy through interaction with the environment to maximize cumulative reward.
    • Exploring the Pure Reinforcement Learning Path
      DeepSeek-R1-Zero is DeepSeek's first attempt to improve a language model's reasoning ability with pure reinforcement learning, focusing on the model's self-evolution through an RL-only process. It does not rely on supervised fine-tuning (SFT) in the initial stage, and it naturally exhibits many powerful and interesting reasoning behaviors during reinforcement learning, such as self-verification, reflection, and the generation of long reasoning chains.
  • Multi-stage training optimization
    • Add cold start data fine-tuning
      To address DeepSeek-R1-Zero's poor readability and language mixing, and to further improve reasoning performance, DeepSeek-R1 adds a small amount of cold-start data and a multi-stage training pipeline before reinforcement learning. First, thousands of cold-start examples were collected to fine-tune the DeepSeek-V3-Base model.
    • Retraining with supervised data
      When the reinforcement learning process is close to convergence, new SFT data is generated by rejection sampling on the RL checkpoint, combined with DeepSeek-V3's supervised data (covering writing, factual question answering, and self-awareness), and the model is retrained on it. After this fine-tuning, the checkpoint undergoes further reinforcement learning covering prompts from all scenarios, finally producing DeepSeek-R1.
  • Reasoning Pattern Distillation
         DeepSeek-R1 explores distilling the large model's capabilities into small dense models, using Qwen2.5-32B as a base model and distilling directly from DeepSeek-R1. Transferring the large model's reasoning patterns into the small model gives it strong reasoning capabilities, outperforming what reinforcement learning applied directly to the small model achieves.
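The group-relative idea at the heart of GRPO can be sketched as follows: for each prompt, several responses are sampled, and each response's advantage is its reward normalized by the group's mean and standard deviation, so no separate value network is needed. This is a simplified sketch of the advantage computation only, not the full GRPO objective.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled response's reward
    against the mean and standard deviation of its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Rewards for 4 responses sampled from the policy for one prompt:
adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print([round(a, 3) for a in adv])  # above-average responses get positive advantage
```

Responses that beat their group's average are reinforced and below-average ones are suppressed, which is how reward signals from tasks like math or code verification shape the policy's reasoning behavior.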
        DeepSeek-R1 is a model that focuses on complex computation and logical reasoning. It is designed for tasks such as mathematics, code generation, and logical reasoning, and suits scenarios such as scientific research and algorithmic trading. DeepSeek-V3 is positioned as a general-purpose large language model, designed to handle tasks such as natural language processing, knowledge question answering, and content generation, and suits scenarios such as intelligent customer service and content creation.
        With the development of modern computing, as data-driven deep learning algorithms became dominant, the competition gradually shifted from algorithms toward computing power and data. The hit animated film Ne Zha contains more than 1,900 special-effects shots, and each of Ao Bing's 2.2 million dragon scales had to be rendered in detail. A single frame can carry many dynamic characters, which demands high-performance computing clusters, professional rendering engines and tools, elastic cloud computing, and AI and machine-learning techniques for computing support: large-scale GPU clusters combined with CPUs for physics simulation, a distributed computing architecture to allocate tasks, and possibly AI-accelerated rendering and denoising.
          In the field of large models, computing power and data matter even more, and so-called technological leadership is only temporary. AI may keep reshaping people's expectations as models build on one another, but data-driven intelligence is also limited by its data: when a model lacks up-to-date data or domain-specific data, it often appears insufficiently intelligent.