How Do Large Language Models "Think"? Non-Technical Notes on Lilian Weng's Why We Think

Written by
Audrey Miles
Updated on: June 16, 2025
Recommendation

An in-depth exploration of how large models "think", written to be accessible to non-technical readers.

Core content:
1. Language-level thinking: how the model explicitly writes out its reasoning path
2. Structure-level thinking: how the model processes information internally
3. Future development and challenges of large-model thinking

 

For non-technical people, this article by Lilian Weng is a little difficult to read.

Beyond the technical details I had to keep looking up, I kept asking myself: as a non-technical reader, what is my biggest takeaway?

I think if you take away only one point, it should be this: understanding how the model "thinks" along several different dimensions.

1. Thinking in Tokens

 

If you have ever said to ChatGPT:

- "Let's think step by step"

Then you are already using “Token-level thinking”.

This is a mechanism called Chain-of-Thought (CoT): asking the model to "write down its reasoning process" before giving the final answer.

For example:

- Q: Xiao Ming bought 3 apples, then bought 2 more. How many apples does he have?
- A: First he bought 3 apples, then he bought 2 more, so in total he has 3 + 2 = 5.

The essence of CoT is to allow the model to "explicitly express its thinking path" in language.

Research has found that adding such intermediate steps to math problems, logic problems, and coding tasks can significantly improve accuracy.
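
To make this concrete, here is a minimal sketch of chain-of-thought prompting in Python. The `generate` function is a hypothetical stand-in for whatever LLM API or local model you call; the only point is the shape of the prompt.

```python
def build_cot_prompt(question: str) -> str:
    """Wrap a question so the model writes out its reasoning before the final answer."""
    return (
        f"Q: {question}\n"
        "A: Let's think step by step."
    )

# `generate` is a placeholder for an actual LLM call (any provider or local model).
# prompt = build_cot_prompt("Xiao Ming bought 3 apples, then bought 2 more. How many does he have?")
# reasoning_and_answer = generate(prompt)
```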

Furthermore, the researchers developed:

- Parallel Sampling + Self-Consistency: generate several reasoning chains for the same question and take a majority vote over their final answers (see the sketch after this list).

- Sequential Revision: Let the model reflect on itself and revise its answers step by step, just like humans.

- Tool enhancement (such as ReAct, PAL): The model can call external tools such as calculators, search engines, code interpreters, etc., during the "thinking process".
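
As a concrete illustration of the first item above, here is a rough sketch of parallel sampling with self-consistency: sample several reasoning chains at a temperature above zero, pull out each final answer, and keep the majority vote. `sample_chain` is a hypothetical wrapper around an LLM call, and the answer extraction is a deliberately crude heuristic.

```python
import re
from collections import Counter

def extract_final_answer(chain: str) -> str:
    """Crude heuristic: take the last number mentioned in a reasoning chain."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else chain.strip()

def self_consistent_answer(question: str, sample_chain, n_samples: int = 5) -> str:
    """Sample several independent chains of thought and return the majority answer.

    sample_chain(question) -> str is assumed to call an LLM with temperature > 0
    and return one complete reasoning chain that ends with an answer.
    """
    answers = [extract_final_answer(sample_chain(question)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```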

It can be said that thinking in tokens is the model beginning to "say what it is thinking" out loud.

There is a very interesting discussion here: Is the "thinking process" written by the model what it is really thinking, or is it written for us to see?

Models can also deceive us and pretend to think.

2. Thinking in Continuous Space

 

But thinking doesn't always have to be spoken aloud.

Just like when we solve problems, sometimes we just silently deduce the solution in our minds instead of writing down every step on paper.

The counterpart in a large model is Thinking in Continuous Space: giving the model the ability to "think a few more rounds" inside its own internal representations, without writing anything out.

The researchers achieved this goal in several ways:

1. Recurrent Transformer Architecture

 

Structures like Universal Transformer and Block-Recurrent Transformer allow the model to process inputs in an internal loop and control whether to continue thinking about each token.
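
As a toy illustration (not the actual Universal Transformer, which also adds adaptive halting and per-step embeddings), here is a minimal PyTorch sketch of the core idea: one shared layer applied in a loop, so "depth" becomes a number of thinking steps rather than a stack of distinct layers.

```python
import torch
import torch.nn as nn

class RecurrentDepthEncoder(nn.Module):
    """Toy sketch: apply one shared Transformer layer several times.

    The loop count plays the role of "how many extra rounds of thinking"
    the model gets on its internal representations.
    """
    def __init__(self, d_model: int = 64, nhead: int = 4, thinking_steps: int = 4):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.thinking_steps = thinking_steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        for _ in range(self.thinking_steps):
            x = self.shared_layer(x)  # same weights, reused at each "thinking" step
        return x

# Example: 2 sequences of 10 tokens each, already embedded into 64 dimensions.
hidden = RecurrentDepthEncoder()(torch.randn(2, 10, 64))
```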

2. Thinking Tokens/Pause Tokens

 

Artificially insert some "meaningless tokens" to force the model to "do a little more calculation" before generating the next step.

These tokens are like “pauses” or “deep breaths” for the model, with the goal of achieving higher quality thinking results.
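
A toy illustration of the idea (the real methods learn special embeddings during training; this only shows where the filler tokens go):

```python
PAUSE_TOKEN = "<pause>"

def add_pause_tokens(prompt_tokens: list[str], n_pauses: int = 4) -> list[str]:
    """Append filler tokens after the prompt. The model's outputs at these
    positions are ignored, but each one buys an extra forward pass of
    computation before the model must commit to its next real token."""
    return prompt_tokens + [PAUSE_TOKEN] * n_pauses

print(add_pause_tokens(["Q:", "3", "+", "2", "=", "?"]))
# ['Q:', '3', '+', '2', '=', '?', '<pause>', '<pause>', '<pause>', '<pause>']
```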

3. Quiet-STaR

 

After each token it generates, the model also produces a short rationale that explains the text to come;

It’s like the model is explaining each of its steps as it’s written, forming a “token-level thinking chain”.

 

This type of method places more emphasis on structural depth, giving the model a more detailed and introspective computational path.

 

3. Unified Theoretical Framework: Thinking as Latent Variables

 

What is the nature of the phenomenon of 'thinking'?

How can we build a mathematical model to describe it?

How can we train AI based on this model to make its 'thinking' more effective and closer to the ideal state we expect?

The researchers proposed that the entire reasoning process can be modeled as a probability distribution:

P(y|x) = Σ_z P(z|x) · P(y|x,z)

where:

- x = the input problem
- y = the final answer
- z = the thinking path (a latent variable)

In other words: for the same problem (x), there can be multiple possible thinking paths (z), and we hope to find those paths that can lead to the correct answer (y).
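
One way to read this decomposition (a sketch, using the same notation): sampling several thinking paths and voting, as in self-consistency, can be seen as a Monte Carlo approximation of the sum over z:

```latex
\[
P(y \mid x) \;=\; \sum_{z} P(z \mid x)\, P(y \mid x, z)
\;\approx\; \frac{1}{N} \sum_{i=1}^{N} P(y \mid x, z_i),
\qquad z_i \sim P(z \mid x).
\]
```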

From a technical perspective, representative methods:

 

- STaR (Self-Taught Reasoner): even when the model first gets an answer wrong, it can work backwards to generate "what I should have thought in order to reach the correct answer" and learn from that reasoning (a rough sketch follows this list);

- The EM (Expectation-Maximization) algorithm: alternately infer plausible thinking paths z for each problem (E-step) and update the model so that those paths and their answers become more likely (M-step).
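
A rough sketch of the STaR-style loop mentioned above (simplified; `generate_rationale` and `fine_tune` are hypothetical stand-ins for an LLM sampling call and a training step):

```python
def star_iteration(model, dataset, generate_rationale, fine_tune):
    """One round of Self-Taught Reasoner-style training (a sketch, not the paper's exact recipe)."""
    training_examples = []
    for question, gold_answer in dataset:
        # 1. Try to reason forward from the question alone.
        rationale, answer = generate_rationale(model, question, hint=None)
        if answer != gold_answer:
            # 2. Rationalization: regenerate the reasoning with the correct answer
            #    given as a hint ("what should I have thought to get this right?").
            rationale, answer = generate_rationale(model, question, hint=gold_answer)
        if answer == gold_answer:
            # 3. Keep only reasoning paths that actually reach the correct answer.
            training_examples.append((question, rationale, gold_answer))
    # 4. Fine-tune on the successful reasoning paths, then repeat with the new model.
    return fine_tune(model, training_examples)
```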

 

4. Thinking time vs. model size: Which is more cost-effective?

 

A very practical question is: do we want a bigger model, or a smaller model that can think a bit longer?

The answer is: the two are not simply substitutes for each other.

The study found that:

- For tasks of medium difficulty, giving small models more "thinking time" can often make up for the size difference;

- But when faced with difficult tasks, thinking time cannot completely replace the "cognitive ability" gained through training.

The best strategy at present is to train a sufficiently strong base model and then let it "think slowly".

 

5. Future Challenges and Opportunities

 

The road to “letting the model think” is not easy, and there are still many problems to be solved, such as:

- How to train reasoning paths that are both reliable and faithful to the model's actual computation?

- How to make the model really "think" instead of "pretending to think" for the sake of rewards?

- How to adaptively allocate “thinking resources” according to task difficulty?

- How to achieve the best results under a realistic reasoning budget (such as time and computing power)?