In-depth feature | The future of large model reasoning: from "chain thinking" to "tree thinking"

Written by Caleb Hayes
Updated on: July 12, 2025
Recommendation

A new breakthrough in the reasoning ability of large AI models: the paradigm shift from chain thinking to tree thinking.

Core content:
1. The common mistakes large AI models make on reasoning problems, and the reasons behind them
2. From chain thinking to tree thinking: the evolution of large model reasoning ability
3. Why tree thinking matters for achieving artificial general intelligence (AGI)

— Yang Fangxian, Founder of 53AI, Tencent Cloud Most Valuable Expert (TVP)

Introduction: Can AI also do brain teasers?

Do you remember the "brain teasers" you played as a kid? Those seemingly simple questions often made you rack your brains before the answer finally clicked. Today, even large models with vast knowledge and enormous computing power still stumble on some seemingly simple reasoning problems. Is this a limitation of AI, or are our expectations simply too high? Or is there a deeper truth here that we have not yet touched?

Recently, researchers found that DeepSeek-R1 [1], one of the latest reasoning-optimized models, made a glaring mistake on a variant of a classic puzzle.

The puzzle goes like this: four people need to cross a bridge within 17 minutes. They take 1, 2, 5, and 10 minutes respectively. The bridge holds at most two people at a time, a pair moves at the slower person's pace, and the group shares a single flashlight that must accompany every crossing. The researchers slightly modified the problem, simplifying the rule to "only the slowest person determines the total time." Surprisingly, DeepSeek-R1 still generated the elaborate solution path for the original problem and arrived at the wrong answer of 17 minutes (the classic optimum: 1 and 2 cross, 1 returns, 5 and 10 cross, 2 returns, 1 and 2 cross, for 2+1+10+2+2 = 17), rather than the simplified version's correct answer of 10 minutes.
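To see why the two answers differ, here is a small brute-force solver, a minimal sketch of the classic puzzle's rules (people are identified by their crossing times, which works here because the times are distinct):

```python
from functools import lru_cache
from itertools import combinations

def min_crossing_time(times: frozenset) -> int:
    """Brute-force solver for the classic bridge-and-torch puzzle: at most two
    people cross at once, a pair moves at the slower walker's pace, and the
    flashlight must accompany every crossing. Searching only schedules of the
    form "two walk over, one walks back" is sufficient to find the optimum."""
    everyone = times

    @lru_cache(maxsize=None)
    def best(left: frozenset) -> int:
        # `left` = people still on the start side, flashlight also on the start side.
        if not left:
            return 0
        if len(left) == 1:
            return next(iter(left))                 # last person walks over alone
        options = []
        for a, b in combinations(sorted(left), 2):  # two people cross together
            remaining = left - {a, b}
            if not remaining:
                options.append(max(a, b))           # everyone is now across
                continue
            for p in everyone - remaining:          # someone carries the flashlight back
                options.append(max(a, b) + p + best(remaining | {p}))
        return min(options)

    return best(everyone)

print(min_crossing_time(frozenset({1, 2, 5, 10})))  # classic rules: 17 minutes
print(max({1, 2, 5, 10}))                           # simplified rule: 10 minutes
```

The classic rules force costly return trips; the simplified rule removes them, so the answer collapses to the slowest walker's 10 minutes.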

Economics professors have run into similar situations. They found that when GPT-4 [2] faces exam questions such as "free coffee leads to an increase in the number of people queuing," it often ignores the fixed-queue-size assumption implied in the question and draws conclusions that do not hold in the real world.

These cases seem to suggest that AI's "intelligence" is not as reliable as we thought. So where do these "smart" large models go wrong?

Perhaps the 2025 study by Algaba et al. [3] can help clear up the fog. They evaluated the o1-mini and o3-mini model series on Omni-MATH, a Mathematical-Olympiad-level dataset, and found that even a hundredfold increase in computing resources does not necessarily buy a matching performance gain. As shown in Figure 1 below, o3-mini (h) consumed more than 50,000 tokens (a hundred times the computing resources of o3-mini (m)) yet improved accuracy by only 4%. More worryingly, as the reasoning chain grows, accuracy generally declines across all models: for every 1,000 tokens added, o1-mini loses 3.16% in accuracy.

Why do ever more powerful AI models still make mistakes on seemingly simple problems? What lies behind these failures?

In this article, we draw on the latest research to trace the evolution of large model reasoning capabilities: the paradigm shift from "chain thinking" to "tree thinking." This shift matters not only for the development of AI technology but also as a key step toward artificial general intelligence (AGI). It gives us reason to believe that AI can be more than a tool for solving problems: a partner in exploring the unknown world, and even a mirror that helps us better understand our own way of thinking. As Alan Kay said, "The best way to predict the future is to invent it," and embracing "tree thinking" may be a key step in inventing the future of AGI.

Current situation: the dilemma of "chain thinking"

What is Chain-of-Thought (CoT)?

You can picture Chain-of-Thought (CoT) as a river flowing in one direction: the model is a small boat that can only drift downstream, unable to turn back upstream or branch into other tributaries.

The core idea of CoT is to have the model, like a human solving a problem, work through a series of intermediate reasoning steps and arrive at the final answer gradually. The method acts as a kind of "digital scratch paper" for the language model, letting it record and organize its own thinking process. CoT significantly improves model performance on mathematical calculation, common-sense reasoning, and complex problem solving.
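As a minimal sketch of the idea (the `call_llm` helper below is a hypothetical placeholder for whatever model API is actually used), a CoT prompt simply asks the model to write its intermediate steps down before committing to an answer:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a real model API call."""
    raise NotImplementedError

def chain_of_thought(question: str) -> str:
    # The "digital scratch paper": instruct the model to reason step by step
    # and only then state its final answer.
    prompt = (
        f"Question: {question}\n"
        "Let's think step by step. Write each intermediate step on its own line,\n"
        "then finish with a line that starts with 'Answer:'."
    )
    return call_llm(prompt)
```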

However, this seemingly powerful technology also has limitations that cannot be ignored.

First, CoT is a "one-way street." Once the model makes a mistake at one step of the reasoning process, every subsequent step is deduced on top of that error, producing a chain reaction of "one mistake to the end." It is like a navigation system that takes the wrong turn at the first intersection: no matter how precisely it navigates afterwards, it will not reach the correct destination.

The following Mermaid diagram shows the linear reasoning model of CoT:
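In simplified form, the chain looks something like this: each intermediate step feeds only into the next, with no branching and no way back.

```mermaid
flowchart LR
    Q[Problem] --> S1[Step 1]
    S1 --> S2[Step 2]
    S2 --> S3[Step 3]
    S3 --> A[Answer]
    %% One-way street: an error at any step is carried forward into the final answer
```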

Secondly, CoT lacks the ability to explore different paths to a solution. Faced with a problem that requires thinking from multiple angles or that has multiple solutions, the model can only advance along a single path. Unlike humans, it cannot try a different idea when it hits an obstacle; it is as if it were allowed to walk only one route through a maze.

Finally, CoT reasoning relies heavily on the model's own stored knowledge and cannot effectively draw on external knowledge. When it hits a blind spot, the model cannot, as humans do, recognize its own gap and actively seek out supplementary information.

Data and cases: CoT's "Waterloo" moment

Recent studies have revealed that CoT's benefit varies enormously across types of reasoning tasks. On mathematical tasks, CoT improves model accuracy by 39% on average, but on common-sense tasks that require multi-step reasoning the improvement is only 4-18%. More worryingly, on some complex tasks CoT can even reduce accuracy.

In a 2025 study [4], Algaba et al. used the o1-mini and o3-mini model series to evaluate Omni-MATH, a Mathematical-Olympiad-level dataset. The results showed that as the reasoning chain grew, the accuracy of all models generally declined. Researchers call this phenomenon "overfitting thinking": rather than an optimized solution path, the model generates an aimless pile of text.

In rule-based logic, CoT's performance is even more worrying. When a rule explicitly requires "terminate the evaluation if condition X is not met," the model keeps analyzing the remaining conditions anyway, producing erroneous risk assessments. In finance and law, this "over-thinking" can have serious consequences. Researchers attribute it to the model's probabilistic generation mechanism: the model tends to complete a full output rather than strictly execute a logical interruption.
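In code terms, the rule behaves like a guard clause: once the gating condition fails, nothing after it should be evaluated. The sketch below uses hypothetical field names purely for illustration:

```python
def assess_risk(application: dict) -> str:
    # Condition X is a hard gate: if it fails, the evaluation must terminate here.
    if not application.get("identity_verified", False):
        return "REJECT: identity not verified, evaluation terminated"
    # These factors should only ever be weighed after the gate passes; a CoT-style
    # model often keeps "reasoning" past the gate and scores them anyway.
    score = 0
    score += 2 if application.get("income_stable") else 0
    score += 1 if application.get("low_debt_ratio") else 0
    return "APPROVE" if score >= 2 else "MANUAL REVIEW"
```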

A deeper analysis found that about 30% of CoT steps could be identified as "pseudo-explanations": the model may be deducing answers from textual coherence rather than genuine logic. This "disconnect between explanation and reasoning" seriously undermines the model's reliability in high-stakes decision-making scenarios.

In programming tasks, although o3-mini ranks in the top 0.2% worldwide on Codeforces, it is fragile when dealing with unstructured or novel algorithmic problems. When asked to design a recursive algorithm over a fictitious data structure, the model failed nearly 100% of the time because no similar patterns exist in its training data. User feedback points in the same direction: LLMs are good at combining known code modules, but on creative tasks that require building logic from scratch they perform worse than ordinary developers.

These cases clearly show that CoT, as the mainstream method for large model reasoning, is facing severe challenges. We need a new reasoning paradigm to break through these limitations.

The solution: the rise of "Tree-of-Thoughts" (ToT)

What is Tree-of-Thoughts (ToT)?

If chain thinking is a one-way road, then Tree-of-Thoughts (ToT) is like a well-connected transportation network that allows the model to explore multiple possible paths at the same time and turn back to try other possibilities when encountering a dead end.

The core idea of ToT is to decompose a problem into multiple steps and, at each step, explore several possible options, forming a tree-like reasoning structure. This not only avoids the risk of "one mistake to the end" but also copes better with complex problems and uncertain situations.

The following Mermaid diagram shows the parallel reasoning mode of ToT:
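In simplified form: at each step the model branches into several candidate thoughts, scores them, expands the promising ones, and can back out of a dead end to continue from a sibling branch.

```mermaid
flowchart TD
    P[Problem] --> T1[Thought A]
    P --> T2[Thought B]
    P --> T3[Thought C]
    T2 --> T2a[Expansion B-1]
    T2 --> T2b[Expansion B-2]
    T1 -. pruned .-> X[Dead end]
    T2a --> A[Answer]
    %% Branches are scored as they grow; weak ones are pruned and the search backtracks to siblings
```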

The advantages of ToT lie in its parallel exploration capability, which can consider multiple solutions at the same time; its flexible adaptability, which can dynamically adjust strategies based on feedback from intermediate steps; and its stronger knowledge fusion capability, which makes it easier to integrate external knowledge to expand the model's thinking boundaries.

How does ToT "think"?

To get an intuitive sense of how ToT works, let us walk through a concrete example of how it handles a complex geometric proof.

In a Tier 4 difficulty question in the Omni-MATH dataset, the model needs to prove: "The perpendicular medians of any two skew edges of a regular octahedron must intersect on a straight line, and the straight line coincides with the symmetry axis of the regular octahedron."

The traditional CoT method reasons along a single path; it might choose a vector-calculation approach, but if an error occurs along the way (such as a poorly chosen coordinate system), the entire proof fails. ToT adopts a completely different strategy:

First, the state generator of ToT creates multiple possible proof paths:

  • Path A: Based on symmetry analysis, assuming that the axis of symmetry is the common intersection line
  • Path B: Establish a coordinate system and verify the plane intersection through vector calculation
  • Path C: Using topological methods to prove the existence of intersection lines

The evaluation function then performs a preliminary evaluation of these paths. In this case, path B received the highest score (88/100) because it provides the most direct method of verification.

Next, ToT explores further along Path B. It sets up a three-dimensional coordinate system, assigns coordinates to the vertices of the regular octahedron, selects the non-coplanar edges, and computes the equation of the perpendicular median. But when calculating the intersection line, ToT runs into a problem: the intersection line obtained under the initial coordinate system does not coincide with the axis of symmetry, contradicting what the question requires.

At this point, the advantages of ToT begin to emerge. Unlike CoT, which will continue to deduce along the wrong path, ToT's evaluation module triggers counterfactual correction and generates a new hypothesis: there is a deviation in the choice of coordinate system, and the symmetry axis of the standard regular octahedron should be the body diagonal.

ToT then adjusts its strategy, corrects the coordinate system, recalculates the intersection equation, and finally verifies the conclusion of the question. At the same time, it keeps Path C as an alternative, ready to switch to the topological approach and continue exploring if necessary.

This process demonstrates the core advantages of ToT: multi-path parallel exploration, instant error detection and correction, and dynamic strategy adjustment. It is these features that make ToT perform well in complex reasoning tasks.
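The mechanism above can be written down as a compact search skeleton. The sketch below is illustrative only: `propose`, `evaluate`, and `is_solution` are hypothetical stand-ins for the state generator, evaluation function, and goal check described in the walkthrough, and real ToT implementations differ in how they generate and score thoughts.

```python
import heapq
from typing import Callable, List, Tuple

def tree_of_thoughts(
    problem: str,
    propose: Callable[[str, List[str]], List[str]],   # state generator: candidate next steps
    evaluate: Callable[[str, List[str]], float],      # evaluation function: score a partial path
    is_solution: Callable[[str, List[str]], bool],    # goal check: is this path a complete answer?
    max_expansions: int = 100,
    beam_width: int = 3,
) -> List[str]:
    """Best-first search over partial reasoning paths. Backtracking comes for free:
    weaker branches stay on the frontier and are revisited when a leading branch fails."""
    # Frontier of (negative score, path); heapq pops the highest-scoring path first.
    frontier: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, path = heapq.heappop(frontier)
        if is_solution(problem, path):
            return path
        # Expand only the top-k candidate next steps (paths A/B/C in the worked example).
        for step in propose(problem, path)[:beam_width]:
            new_path = path + [step]
            score = evaluate(problem, new_path)       # e.g. Path B scoring 88/100
            heapq.heappush(frontier, (-score, new_path))
    return []  # no complete solution found within the budget
```

Counterfactual correction fits the same loop: when a contradiction is detected, the evaluator scores the offending branch low, and the next pop of the heap naturally resumes from a retained alternative such as Path C.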

Data and cases: ToT's "highlight" moments

Algaba et al. (2025) [5] showed that ToT significantly outperforms CoT on multiple benchmarks. On Omni-MATH Tier 4 geometry problems, ToT raised accuracy from CoT's 73.9% to 89.2% while cutting token consumption by 33.8%. The result is striking because it shows that higher accuracy does not necessarily require more computing resources.

The figure below compares the accuracy of OpenAI models (GPT-4o, o1-mini, o3-mini (m), and o3-mini (h)) on the Omni-MATH benchmark. It shows clearly that accuracy generally trends downward as problem difficulty increases, and the decline is especially pronounced at the Tier 3 and Tier 4 difficulty levels.

The following table shows the performance comparison between ToT and CoT on different tasks:

| Metric | ToT framework | CoT framework | Difference |
| --- | --- | --- | --- |
| Accuracy | 89.2% | 73.9% | +15.3% |
| Average token consumption | 23,800 | 35,200 | -32.4% |
| Counterfactual corrections triggered | 4.2 per question | 1.8 per question | +133% |
| Cross-domain knowledge citation rate | 68% | 41% | +65.9% |
| Proof completeness score | 92/100 | 76/100 | +21.1% |

Data source: Omni-MATH Tier 4 geometry problem test set (n=127)

In programming tasks, ToT's advantage is even clearer. Faced with recursive algorithm design over fictitious data structures, traditional CoT fails completely (0% pass rate), while ToT generates multiple hypothesis paths through Monte Carlo tree search, lifting the pass rate to 17%. That figure is still far below the level of human developers, but it is a qualitative breakthrough.
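The study does not spell out the search details, but Monte Carlo tree search over candidate thought paths typically selects which branch to expand using the standard UCT rule, which balances branches that already score well against branches that have rarely been tried:

$$
a^{*} = \arg\max_{a}\left[\,\bar{Q}(s,a) + c\,\sqrt{\frac{\ln N(s)}{N(s,a)}}\,\right]
$$

where $\bar{Q}(s,a)$ is the average evaluation score of thought $a$ expanded from state $s$, $N(s)$ and $N(s,a)$ are visit counts, and the constant $c$ trades exploration off against exploitation.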

ToT also performs well in logical reasoning tasks. In tasks that require strict logical interruption, such as compliance review, the traditional CoT ignores the rule of "terminating the evaluation if condition X is not met", resulting in an error rate of up to 40%. The ToT framework combined with Three-Hop Reasoning (THOR) reduces this error rate to 20%.

In multi-hop reading comprehension tasks, ToT uses a segmented verification mechanism to keep track of key information in long texts, improving accuracy by 19%. This points to a clear advantage in handling long-range dependencies.

Even more impressive is ToT's performance in resource utilization efficiency. Experiments with the Coconut framework show that 67% of tokens in traditional CoT are used to maintain text fluency rather than actual reasoning steps. However, ToT reduces the proportion of invalid tokens from 67% to 32% through continuous thinking space optimization, increases the average backtracking depth from 5.2 layers to 8.7 layers, and improves the critical path discovery rate by 41%.

These data clearly show that ToT not only surpasses CoT in accuracy but also achieves a qualitative leap in computational efficiency. As Algaba et al. revealed in their 2025 study [6], the success of o3-mini lies not in "thinking longer" but in "thinking deeper".

The figure below shows the inference token usage and accuracy of OpenAI models (GPT-4o, o1-mini, o3-mini (m) and o3-mini (h)) on the Omni-MATH benchmark across domains and difficulty levels.

Future Outlook: From ToT to AGI

Going a step further: the evolution of ToT

Although ToT represents a significant breakthrough, it is only one milestone in the evolution of large model reasoning capabilities, and there is still far broader room for development. Judging from the latest research trends, ToT's evolution is concentrated in the following directions:

The first is deep integration with reinforcement learning. Within a reinforcement learning framework, the model can learn from its own reasoning process and continuously optimize the state generator and evaluation function. DeepSeek-R1 [7] has begun to try this path, using Q-learning to adjust its token budget dynamically and cutting 38% of the redundant computation in geometric proof tasks. This lets the model "learn how to think better" rather than simply "think more".
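For reference, the tabular Q-learning update alluded to here has the standard form below; how DeepSeek-R1 actually defines states, token-budget actions, and rewards is not described in the source, so the reading in parentheses is only one plausible interpretation:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[\,r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\,\right]
$$

(with $s_t$ the current partial reasoning state, $a_t$ an adjustment to the remaining token budget, and $r_{t+1}$ rewarding a correct answer while penalizing wasted tokens).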

The second is the deep integration of external knowledge bases. Traditional ToT still mainly relies on the internal knowledge of the model, and the future development direction is to build a more powerful knowledge retrieval and fusion mechanism. The FrontierMath team maps discrete symbols to continuous probability distributions through graph neural networks, improving the quality of path generation in interdisciplinary problems by 34%. This approach enables the model to more effectively utilize external knowledge and overcome the limitations of knowledge boundaries.

The third direction is the exploration of neural symbolic reasoning. Pure neural network methods and pure symbolic reasoning each have their own advantages and disadvantages, and combining the two may be the best path for the future. Preliminary experiments show that differentiable symbolic reasoning (embedding discrete logic rules into continuous space) can improve the efficiency of geometric proofs by 39% and reduce the symbol-vector conversion loss by 28%. This method is expected to achieve a perfect combination of the flexibility of neural networks and the rigor of symbolic reasoning.

Challenges and Opportunities: The Path to AGI

Although ToT shows great potential, the road to AGI still faces many challenges. The first is the computational cost. ToT needs to explore a large number of reasoning paths, which is computationally expensive. Although studies have shown that ToT can reduce the overall computational cost through more efficient search strategies, its application in resource-constrained environments still faces challenges.

The second problem is the design of evaluation criteria. How to accurately evaluate the quality of each state is still an open question. Algaba et al.'s 2025 study [8] showed that there was a 6.8% difference in judgment between Omni-Judge automated scoring and human experts, which may cause the model to choose suboptimal paths in complex tasks. In the future, more accurate and general evaluation mechanisms need to be developed.

The third challenge is the problem of interpretability. The reasoning process of ToT is relatively complex and has poor interpretability, making it difficult for humans to understand the "thinking" process of the model. This is especially important in high-risk decision-making scenarios, such as medical diagnosis and financial risk assessment.

In the process of exploring these challenges, the research of Algaba et al. [9] also provides us with important references. The figure below shows the distribution of inference tokens of different models on problems of different difficulty levels. By analyzing this data, we can better understand the advantages and limitations of ToT, as well as the directions that need to be focused on in the future.

However, these challenges also bring huge opportunities. ToT is expected to promote the development of AGI and realize smarter AI systems. In the medical field, ToT can shorten the diagnosis time of rare diseases from an average of 3 weeks to 5 days by exploring multiple diagnostic hypotheses in parallel. In industrial design, the feasibility of ToT solutions combined with physical simulators has increased from 42% to 67%, greatly improving design efficiency. In terms of fairness in educational resources, ToT can provide a more personalized learning path, increasing the coverage of support for complex problem solving by 78%.

More importantly, studying the mechanism of ToT helps us understand the way humans think more deeply. When solving complex problems, humans often explore multiple ideas, evaluate different solutions, and go back and rethink when necessary. ToT simulates this process to some extent, providing a new perspective for the development of cognitive science and neuroscience. It makes us believe that AI can not only be a tool for solving problems, but also a partner in exploring the unknown world, and even a mirror to help us better understand our own way of thinking.

Conclusion: Embrace "tree-like thinking" to illuminate the future of AGI

The evolution from "chain thinking" to "tree thinking" is not only a change in the way technology is implemented, but also a fundamental change in the way AI thinks. This change has enabled a qualitative leap in the performance of large models in complex reasoning tasks, paving the way for the development of AGI.

As Algaba et al.’s 2025 study [10] revealed, the success of o3-mini lies not in “thinking longer” but in “thinking deeper.” This discovery overturns the traditional perception that “the bigger the model, the better” and points out a new direction for the development of AI—realizing real breakthroughs in intelligence through smarter algorithms and more efficient reasoning strategies, rather than simply piling up more computing resources.

In the future, the ToT framework will continue to integrate technologies such as reinforcement learning, external knowledge bases, and neural symbolic reasoning to further enhance the reasoning ability of the model. At the same time, we also need to pay attention to challenges such as computational costs, evaluation criteria, and interpretability to ensure that ToT technology can be safely and effectively applied in various fields.

By embracing "tree-like thinking," we can not only build smarter AI systems but also understand the essence of intelligence more deeply. This is not just technological progress; it is also a breakthrough for cognitive science. Every step on this road to AGI is full of challenges and of hope. Let us look forward to the continued evolution of AI reasoning capabilities: it promises human society more possibilities and opportunities, more convenient lives, more efficient work, and a broader horizon.

This is not just a simple pile of calculations. This is a revolution in thinking. And this revolution has just begun.