How should large models be used? Most of us get it wrong. Microsoft's latest research shows that the more turns a conversation has, the worse a large model performs

Written by Jasper Cole
Updated on: June 18, 2025

Microsoft's latest research overturns conventional wisdom about how large models should be used.

Core content:
1. The counterintuitive decline in large-model performance over multi-round dialogues
2. Microsoft's research confirms that AI models "get lost" in conversation
3. A joint Microsoft-Salesforce study reveals the dilemma AI models face when processing information delivered piecemeal

I was at home during the Spring Festival, when DeepSeek became wildly popular. Friends in my village knew I worked in the Internet industry, and they all asked me what these large models were and how to use them.
I told them to treat it like a knowledgeable friend: if you have a question, just ask it directly, and if you can't get an answer, ask it a few more times...
Later I even summarized a whole method for communicating with large models. Whether it was structured prompts or various condition settings, one key point was that, to get more accurate answers, you should proceed step by step and hold multiple rounds of dialogue with the model...
I treated this method as my golden rule and proudly shared it on all kinds of occasions, acting like a seasoned expert who had seen it all...
Then, recently, reality slapped me in the face...
Microsoft Research recently published a paper that revealed a counterintuitive phenomenon: when we have long, multi-round conversations with AI, they become more and more "confused" and the quality of the answers they give will drop significantly.
At first glance the conclusion seems unreasonable, yet the phenomenon feels familiar. I believe many of you have run into it: when you first start chatting with a large model, its answers may not be perfectly accurate, but they are basically on point. As the conversation deepens, though, the AI starts to repeat what it said before, gives inconsistent answers, or even drifts completely away from the problem you originally wanted to solve.
This is especially evident in reasoning models...
The Microsoft Research study used rigorous scientific methods to confirm the existence of this phenomenon and showed that this is not a problem with individual models, but a common problem with almost all large models. The research team tested 15 mainstream AI models including GPT-4, Claude, and Gemini, and found that their performance in multi-round conversations dropped by an average of 39%.
This means that if an AI can achieve a performance of 90 points in a single-round conversation, it may only be able to maintain a performance of around 55 points in multiple-round conversations.
It’s amazing, isn’t it? Why…
Fortunately, Microsoft's research not only discovered the problem, but also delved into the root cause of the problem.
Background
The research was jointly conducted by Microsoft Research and Salesforce Research, and the paper was published on the preprint platform arXiv in May 2025.
This combination itself speaks to the team's standing in the AI field. As a key partner of OpenAI, Microsoft has a deep understanding of how large language models are used in practice, while Salesforce, a leader in enterprise services, cares most about how AI performs in real business scenarios. Collaboration between a research lab and an enterprise software vendor like this often produces results that are both theoretically valuable and practically meaningful.
The scale of this study is quite large. The research team conducted more than 200,000 dialogue simulation experiments involving 15 different AI models and 6 different types of tasks. This scale of experiment is not common in the field of AI research, which also shows the importance the research team attaches to this issue and the rigor of the research.
Core Finding: The "Lost Phenomenon" in AI Conversations
The research team found that the AI model faces a dilemma when processing information: when users provide complete and clear instructions at the beginning of the conversation, AI can perform at its best, but when information is gradually revealed in multiple rounds of conversation, AI's performance will drop significantly.
Even the most advanced AI models are not immune to this problem, whether it is OpenAI's GPT-4 series, Anthropic's Claude series, or Google's Gemini series, they all show the same trend. This shows that this problem is not a defect of a particular model, but an inherent limitation of the current large language model architecture.
As we mentioned earlier, the research team ran more than 200,000 dialogue simulations on 15 top large language models (including Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, etc.): every model performed significantly worse in multi-round dialogues than in single-round ones, with an average decrease of 39%.
During the test, the research team also found an interesting phenomenon: the models that performed better in a single round of dialogue did not necessarily have a smaller drop in performance in multiple rounds of dialogue. In other words, there is no necessary connection between the "intelligence" of the model and its ability to maintain stable performance in complex dialogues.
In digging deeper into the reasons for the performance degradation, the research team found two key factors.
One is "capability decline", that is, the best performance of AI in multi-round conversations is lower than that in single-round conversations, but the decline is relatively small, averaging only about 15%.
The second is "reliability degradation", which is the main problem. The performance of AI in multi-round dialogues has become extremely unstable, and the same dialogue may produce completely different results. This instability has increased by more than 100%.
If we compare AI performance to test scores, then in a single round of dialogue an excellent AI may steadily score 90-95 points; but across multiple rounds, the same AI may swing anywhere from 30 to 85 points, with its average dropping to around 65. This instability is very bad for practical applications, because users cannot predict what kind of answer the AI will give.
Through a large amount of conversation analysis, the research team summarized the four main reasons why AI gets "lost" in multi-round conversations.
The first is the phenomenon of "premature answering". Like a student eager to perform, AI often tries to give a complete answer before it has collected enough information. These early answers based on incomplete information often contain wrong assumptions, which will affect the subsequent development of the dialogue.
The second is the "answer inflation" phenomenon. When AI senses that its earlier answer may not have been accurate enough, it does not discard it and start over; instead it keeps adding to and patching the original answer. The final answer ends up long and convoluted, drifting away from what the user actually needs. It is like someone whose explanation of a problem grows ever more complicated until they confuse even themselves.
The third reason is the "lost-in-the-middle" phenomenon. The research team found that when AI processes long conversations, it tends to pay too much attention to the beginning and end of the conversation and ignores important information in the middle. This is known as the "lost-in-the-middle" effect in the AI field, and it prevents AI from effectively integrating all the key information in the conversation.
The last reason is "redundant expression". AI often produces overly detailed responses in multi-round conversations. These lengthy responses not only waste computing resources, but may also contain unnecessary assumptions and speculations, which in turn affect the accuracy and efficiency of the conversation.
The ingenuity and limitations of research methods
It is not easy to scientifically verify the hypothesis that "AI performs worse in multi-round dialogues". After all, multi-round dialogues and single-round dialogues are essentially different tasks, and how to ensure the fairness of the comparison is a key challenge.
The research team designed an ingenious experimental framework that breaks down a complete single-round instruction into multiple "shards" to simulate the process of gradual information disclosure in multi-round conversations.
For example, a complete instruction is "Write a Python function that takes a list of integers as input and returns the difference between the largest and smallest values in the list".
The researchers broke it down into:
Round 1: "Write me a Python function"
Round 2: "This function needs to accept a list of integers as input"
Round 3: "The function should return the difference between the largest and smallest values in the list"
This "fragmentation" simulates the situation in which users gradually provide information in a real conversation.
To ensure the scale and repeatability of the experiment, the research team designed an automated conversation simulation system. This system can simulate multiple rounds of conversations between users and AI, and can control the rhythm and method of information disclosure. Through this automated approach, they can conduct large-scale experiments involving multiple different AI models and task types.
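As a rough illustration of what such a simulation loop might look like, the sketch below feeds shards to a model one turn at a time and records each reply. The `call_model` parameter is a placeholder for whichever chat API is being tested; it is not the researchers' actual harness.

```python
from typing import Callable

def simulate_sharded_conversation(
    shards: list[str],
    call_model: Callable[[list[dict]], str],
) -> list[str]:
    """Reveal one shard per turn and record the model's reply after each turn."""
    messages: list[dict] = []
    replies: list[str] = []
    for shard in shards:
        messages.append({"role": "user", "content": shard})
        reply = call_model(messages)          # hypothetical chat API call
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies

# Example with a stand-in "model" that just reports how many messages it has seen:
if __name__ == "__main__":
    dummy_model = lambda msgs: f"(reply after {len(msgs)} messages)"
    print(simulate_sharded_conversation(["shard 1", "shard 2"], dummy_model))
```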
The research team conducted tests on six different types of tasks, including programming, database query, API call, mathematical calculation, data description and document summary. This selection covers both technical tasks and language tasks, which can fully reflect the performance of AI in different fields. More importantly, these tasks have clear right and wrong standards, which is convenient for quantitative analysis.
To quantify the model performance, they defined three key metrics: average performance (P, overall success rate), ability (A, best case performance), and unreliability (U, the gap between the best and worst performance). These metrics help researchers accurately analyze the performance differences of the model in different dialogue settings.
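Computed over repeated runs of the same task, the three metrics could look roughly like the sketch below. It follows the simplified definitions given in this article; the paper's exact formulas (for example, whether percentiles are used) may differ.

```python
def conversation_metrics(scores: list[float]) -> dict[str, float]:
    """scores: success scores (0-100) from repeated runs of the same task."""
    return {
        "P": sum(scores) / len(scores),   # average performance
        "A": max(scores),                 # ability: best-case performance
        "U": max(scores) - min(scores),   # unreliability: best-to-worst gap
    }

# Example matching the test-score analogy used earlier in the article:
print(conversation_metrics([85, 30, 70, 60, 80]))  # {'P': 65.0, 'A': 85, 'U': 55}
```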
An important advantage of the research is its large scale: more than 200,000 simulated conversations, covering 15 top language models. This ensures the reliability and generalizability of the research results.
Whether it is open source models such as the Llama series, or closed source commercial models such as GPT-4.1, Claude 3.7, and Gemini 2.5 Pro, they all show similar "lost" patterns.
This study also has its limitations.
Although automated conversation simulation ensures the scale of the experiment, it may not fully reflect real human-computer conversations. Real users behave in more complex and varied ways, and situations the study did not consider may arise.
At the same time, the research mainly focused on analytical tasks, and further research is needed on the performance of creative tasks. After all, the evaluation criteria of creative tasks are more subjective and difficult to conduct large-scale automated testing.
In addition, the research is mainly based on the English environment, and it is not clear whether the same problem exists for AI performance in other languages. Considering the differences in expression and thinking patterns in different languages, this issue deserves further exploration.
The research also focuses on pure text conversations, but many AI systems now support multimodal interaction. How AI performs across multiple conversation turns when the inputs include images, audio, and other modalities is still an open question.
Despite these limitations, the value of this study is undeniable. It provides important insights into our understanding of the true capabilities of AI. More importantly, this study shows that when evaluating and using AI systems, we cannot rely solely on the results of a single round of testing, but must consider more complex practical application scenarios.
Conclusion: How to prevent AI from getting lost in conversation?
The significance of this research goes far beyond the discovery of a technical problem. It actually reveals a fundamental challenge in the current development of AI. Our understanding of AI capabilities has always been largely based on the performance of a single round of dialogue. Whether it is various AI benchmarks or AI's "magical performance" reported in the media, most of them are based on the results of a single round of interaction. But this research tells us that this evaluation method may seriously overestimate the performance of AI in actual applications.
For AI system developers, the research team tested two possible improvement methods. One is the "review" mechanism, which adds a round summarizing all previous information at the end of the conversation. The other is the "snowball" mechanism, which repeats all previous information in each new round. These methods can alleviate the problem to a certain extent and improve performance by 15-20%, but still cannot reach the level of single-round dialogue.
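One plausible reading of these two strategies, applied to the sharded instructions described earlier, looks roughly like this; the function names and wording are illustrative rather than the paper's implementation.

```python
def with_recap(shards: list[str]) -> list[str]:
    """'Review': add a final turn that restates everything asked for so far."""
    recap = "To recap, here is everything I have asked for so far: " + "; ".join(shards)
    return shards + [recap]

def with_snowball(shards: list[str]) -> list[str]:
    """'Snowball': each new turn repeats all information revealed up to that point."""
    return ["; ".join(shards[: i + 1]) for i in range(len(shards))]
```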
For model developers, the study shows that simply reducing the temperature parameter (making the output more deterministic) does not significantly improve the reliability problem in multi-round dialogues. The researchers call on LLM developers to prioritize the reliability of the model in multi-round dialogues in future iterations, rather than just improving single-round capabilities.
For ordinary users, the research team also provides two very practical suggestions:
First, if the conversation is not going as expected, it may be more effective to try to start a new conversation rather than continue the current one. This is because once the model gets "lost" in the conversation, continuing the conversation often does not help it find the right direction.
Second, before trying a new conversation, integrate the information from previous conversations. You can ask the AI: "Please help me integrate everything we have discussed so far", and then use this integrated information for the new conversation. This approach can significantly improve the performance of the AI.
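For readers who drive models through an API rather than a chat window, the same habit can be scripted. The sketch below reuses the hypothetical `call_model` placeholder from earlier; it is one possible way to apply the suggestion, not a prescribed workflow.

```python
def restart_with_summary(old_messages: list[dict], call_model) -> list[dict]:
    """Ask the model to consolidate the old conversation, then seed a fresh one."""
    ask = {"role": "user",
           "content": "Please help me integrate everything we have discussed so far."}
    summary = call_model(old_messages + [ask])   # hypothetical chat API call
    # A brand-new conversation that starts from the consolidated summary only.
    return [{"role": "user", "content": "Here is the context so far: " + summary}]
```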
These recommendations also explain why many professional users of AI tools (such as developers using the AI programming assistant Cursor) make a habit of "frequently starting new conversations" even when the tool allows a conversation to continue indefinitely.
In the future, solving the "lost problem" in multi-round dialogues may require technical breakthroughs in multiple aspects. This includes better attention mechanisms, stronger context understanding capabilities, more stable reasoning processes, and more effective dialogue state management. Solving these technical challenges will not only improve AI's performance in dialogue scenarios, but also promote the advancement of AI technology as a whole.
Insights from Zhiding AI Lab
Current large language models have made astonishing progress in single-round capabilities: they can solve increasingly complex problems and even surpass most humans on some demanding benchmarks in mathematics, logic, and programming.
But this research suggests that true conversational competence is not just about the ability to answer questions, but also about the ability to maintain consistency and reliability as information is revealed.
From the perspective of cognitive science, it is easy to see that current AI systems are fundamentally different from human cognition. Humans can naturally integrate scattered information in conversations, build coherent understandings, and constantly adjust their cognitive frameworks as new information arrives. Large language models largely lack this dynamic integration capability: they tend to keep appending new information rather than truly understanding and reconstructing what they know.
This is also an important reason why AI cannot currently replace many human jobs.
This study also reveals an important blind spot in the current AI evaluation system. Most evaluation benchmarks are conducted in idealized and simplified environments and cannot reflect the complexity of real usage scenarios, which leads to a disconnect between the model optimization direction and actual needs.
In fact, much of the time real capability has little to do with AI benchmark scores; what matters is the ability to solve real-world problems.
Real AI progress is not just about surpassing humans at specific tasks, but about being able to collaborate with humans in a more natural and reliable way, becoming a truly useful assistant in our daily lives and work.