Why do large models struggle to reason their way through a junior high school math problem?

Written by
Clara Bennett
Updated on: July 11, 2025
Recommendation

Where large models fall short in mathematical reasoning: the challenge of a first-year junior high math problem.

Core content:
1. Why large models' performance in mathematical reasoning is still unsatisfactory
2. A detailed analysis of a first-year junior high math problem and how to approach it
3. A comparison of how different large models solved it, with a summary of their errors

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)

In recent years, the progress of large models in mathematical reasoning has drawn attention, and major vendors claim that their models lead in mathematical ability. However, when we gave several of them a junior high school math problem, their performance was uneven, and at times surprisingly "clumsy".

The problem

The following question comes from the winter vacation homework for the first year of junior high school. Try it yourself first and see whether you can solve it.

Problem: In a laboratory, three cylindrical containers A, B, and C stand on a horizontal table (the containers are tall enough), and the ratio of their base radii is 1:2:1. Two identical tubes connect the containers at a height of 5 cm (that is, the bottom of each tube is 5 cm above the bottom of the container). At the start, only container A holds water, with a water level of 1 cm, as shown in the figure. The same amount of water is poured into B and into C every minute, and the water level in B rises by 5/6 cm after 1 minute of filling. After how many minutes of filling is the difference between the water levels of A and B equal to 0.5 cm?

You probably have some ideas already. The main knowledge points are the area of a circle, the volume of a cylinder, and linear equations in one variable; the extra point is the principle of connected vessels. With a basic grasp of both, the problem is fairly easy. Even if you miss some of the cases, you should be able to find one or two of them.

The answer is at the end of the article.

After solving it myself, I gave the question to the large models.

How the models solved it

Here I test the models covered in my previous article, "Use one sentence to understand the basics of these large models". The prompt is simple, as shown below:

Please solve the following junior high school math problem and give a detailed analysis process. The problem is as follows:
----
Description of the topic: In the laboratory, there are three cylindrical containers A, B, and C on a horizontal table (the containers are high enough), and the ratio of their bottom radius is 1:2:1. Two identical tubes are connected at a height of 5 cm in the container (that is, the bottom of the tube is 5 cm away from the bottom of the container). Now, among the three containers, only container A has water, and the water level is 1 cm high. If the same amount of water is poured into B and C at the same time every minute, and the water level of B rises by 5/6 cm after 1 minute of water injection, how many minutes of water injection will it take for the difference in the height of the water levels of A and B to be 0.5 cm.
----

To keep the models from retrieving the original question from the Internet, online search was turned off in every chat interface that exposes a switch for it.

Because most models' reasoning is very long, long or segmented screenshots would be hard to read in the article, so the sessions are shown as screen recordings instead.

DeepSeek

I have been using the official chat lately, and it feels like being limited to single-turn conversation: in more than 80% of cases the first question goes through, but a follow-up question gets a "server busy" response.

So the Internet is full of jokes and pictures about DeepSeek's "server is busy, please try again later" message (Internet pictures, for reference and entertainment only).

It kept turning me away during the day, but when I tried again at night, on the 13th attempt, it finally answered…

Tongyi Qianwen

ChatGPT

iFlytek Spark


Tongyi Qianwen

Comparison

Judging by the results, DeepSeek found one correct answer and got a second one wrong, while the other models failed across the board.

| Model | Result | Summary of reasoning errors |
| --- | --- | --- |
| DeepSeek | Solved one case, got another wrong, and missed a third scenario entirely | Its reasoning was the longest and most time-consuming; it kept switching between conditions and scenarios and second-guessing itself. The reasoning is fairly detailed, though, which can help students follow the thought process. |
| ChatGPT | Wrong | Very fast. The main mistake is letting the water in container C flow into container B too early, ignoring that water does not flow from the higher container to the lower one until the water surface has reached the tube opening. As a result, B's level rises too quickly and the computed time is too short. |
| Kimi 1.5 | Wrong | Ignored the distinct stages of the water level change and the possible limits on the water levels. |
| Tongyi Qianwen | Wrong | Assumed that each minute's inflow raises container B's level by 5/6 cm and derived container A's rise from the volume relationship, ignoring the influence of container C. |
| iFlytek Spark | Wrong | Did not capture the dynamics of the water level changes correctly, especially across multiple containers and time periods, so its equations can yield solutions that do not match reality. |
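The connected-vessel rule behind several of these errors can be made concrete with a short sketch. This is an illustration, not the method any of the models used. It assumes that the two tubes join A to B and B to C (the figure is not reproduced here, so this layout is an assumption), and the names `rise_rates` and `find_times` are made up for the example. Areas are measured in units of pi times the square of A's radius, and exact fractions avoid rounding.

```python
from fractions import Fraction

TUBE = Fraction(5)                      # tube height above each base, in cm
AREA_A, AREA_B, AREA_C = Fraction(1), Fraction(4), Fraction(1)  # radii 1:2:1
INFLOW = AREA_B * Fraction(5, 6)        # volume per minute into B, and again into C


def rise_rates(a, b, c):
    """Rise rate in cm/min of (A, B, C) for the current water levels.

    ASSUMPTION: the tubes join A-B and B-C. Water crosses a tube only
    once the level on the supplying side has reached TUBE. C reaches the
    tube before B because it rises four times as fast.
    """
    if c < TUBE:                        # stage 1: nothing crosses a tube
        return Fraction(0), INFLOW / AREA_B, INFLOW / AREA_C
    if b < TUBE:                        # stage 2: C holds at 5 cm and spills into B
        return Fraction(0), 2 * INFLOW / AREA_B, Fraction(0)
    if a < TUBE:                        # stage 3: B and C hold, all inflow reaches A
        return 2 * INFLOW / AREA_A, Fraction(0), Fraction(0)
    shared = 2 * INFLOW / (AREA_A + AREA_B + AREA_C)
    return shared, shared, shared       # all three rise together


def find_times(gap=Fraction(1, 2)):
    """Times (in minutes) at which |level A - level B| equals gap."""
    t, levels = Fraction(0), [Fraction(1), Fraction(0), Fraction(0)]
    times = []
    while levels[0] < TUBE:             # once A reaches the tube the gap stays 0
        rates = rise_rates(*levels)
        # Length of this stage: until the next rising container reaches the tube.
        dt = min((TUBE - h) / r for h, r in zip(levels, rates) if r > 0)
        diff, slope = levels[0] - levels[1], rates[0] - rates[1]
        for target in (gap, -gap):      # A above B, or B above A
            if slope != 0:
                s = (target - diff) / slope
                if 0 <= s < dt:         # the crossing falls inside this stage
                    times.append(t + s)
        t += dt
        levels = [h + r * dt for h, r in zip(levels, rates)]
    return sorted(times)
```

Calling `find_times()` returns the qualifying times as exact fractions. Moving the `c < TUBE` check, so that C's water reaches B from the first minute, reproduces the kind of "accelerated" rise described for ChatGPT above.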

When I then gave each model the correct answer and asked it to analyze and compare its own solution, ChatGPT gave only a brief analysis and did not recompute the problem, which was closest to what my prompt asked for; iFlytek Spark and Kimi 1.5 reanalyzed the whole process and arrived at the correct answer; Tongyi Qianwen's answer matched DeepSeek's first answer; and DeepSeek went back to being busy, again and again...

Some thoughts

The cause may lie in the question itself, in the mix of mathematical calculation and physical phenomena, or in a lack of junior high school questions in these models' training data; any of these could lead to lengthy reasoning, calculation errors, and a faulty physical understanding. This probably does not mean that the mathematical ability of large models is worthless, but it does suggest that it still needs further optimization in specific scenarios.

For AI researchers, the open question is how to make large models more accurate and efficient in mathematical reasoning. DeepSeek's surge in popularity at the beginning of the year opened up more business possibilities and brought a lot of technical innovation, but once we set aside the overused data sets used to climb leaderboards, true reinforcement learning and autonomous reasoning may still have a long way to go.

Original answer

The three cylindrical containers A, B, and C (tall enough) have base radii in the ratio 1:2:1, and the water level in B rises by 5/6 cm after one minute of filling. B and C receive the same volume of water per minute, and B's base area is 2^2 = 4 times that of C, so in one minute the water level in C rises by:

5/6 * 2^2 = 10/3 cm
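As a quick check of this step, here is a minimal Python sketch. The variable names are illustrative and not part of the original answer. It works in units where A's radius is 1 and drops the common factor pi, since only the ratio of base areas matters.

```python
from fractions import Fraction

radius = {"A": 1, "B": 2, "C": 1}       # base radii in the ratio 1:2:1
rise_b = Fraction(5, 6)                 # cm per minute, given in the problem

# Volume poured into B per minute, in units of pi * (A's radius)^2 * cm.
volume_per_minute = radius["B"] ** 2 * rise_b

# C receives the same volume on a smaller base, so its level rises faster.
rise_c = volume_per_minute / radius["C"] ** 2
print(rise_c)  # 10/3
```

The same division with A's base area gives the rate at which A would rise if it received one container's inflow, which may be useful in the later stages of the problem.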

Suppose that after t minutes of water injection the difference in water level between A and B is 0.5 cm. There are three situations: