Understanding the 'Lost' Phenomenon in Multi-Turn LLM Conversations | Microsoft's Latest Insights

Written by
Clara Bennett
Updated on: June 9, 2025
Recommendation

New research finds that LLM performance drops sharply in multi-turn conversations, a result with major implications for Agent system development. Core content: 1. Microsoft and Salesforce Research jointly reveal the "lost" phenomenon of LLMs in multi-turn conversations. 2. A comparison of 15 LLMs in single-turn versus multi-turn conversations, showing an average performance drop of up to 39%. 3. The sharded simulation framework proposed by the researchers, covering six task areas and simulating real-world conversation flows.

 

 

Introduction: Microsoft and Salesforce Research recently released a joint study titled "Lost in Conversation", showing that even today's most advanced LLMs suffer a significant performance drop in multi-turn conversations, averaging as much as 39%. The paper calls this phenomenon getting "lost" in conversation. This article compares how the major models (including Claude 3.7-Sonnet, DeepSeek-R1, and others) perform across multiple turns, analyzes the root causes of this "lost" behavior, and reviews effective mitigation strategies. The findings matter when selecting models for Agent development and are worth a careful read. Links to the researchers' open-source code and datasets appear in the second half of the article.

Multi-turn conversations: even AI's strongest models get "lost"

 

Comparison of 15 LLMs in single-turn (FULL) and multi-turn (SHARDED) conversations, showing a significant performance decline in the multi-turn setting.

State-of-the-art large language models (LLMs) show a sharp performance drop when faced with multi-turn conversations, averaging as much as 39%. "Lost in Conversation", a new research collaboration between Microsoft Research and Salesforce Research, reveals this common but rarely examined problem by simulating over 200,000 conversations across 15 top models. The study found that neither commercial closed-source models (such as GPT-4.1 and Gemini 2.5 Pro) nor open-source models (such as the Llama series) escape the "lost" dilemma, which poses a serious challenge for engineers building Agent systems.

Getting "lost" sends unreliability up by 112%

 

Comparative analysis of aptitude and reliability shows that the loss of reliability is the main problem in multi-turn conversations.

Using an innovative metric decomposition, the researchers split the multi-turn performance decline of LLMs into two components:

  • Aptitude decline: only about 16%
  • Unreliability increase: a jump of 112%

In other words, the gap between a model's best and worst performance more than doubles. This degree of unreliability explains why your AI assistant sometimes performs well and sometimes inexplicably gets "lost": even on the same problem, repeated attempts can produce completely different results.
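The article does not restate the exact formulas, but a decomposition of this kind can be reproduced from repeated simulations of the same instruction. The sketch below is a minimal illustration, assuming aptitude is taken as a high percentile of per-conversation scores and unreliability as the spread between a high and a low percentile; the percentile choices (90th/10th) and the toy numbers are assumptions for illustration, not values from the paper.

```python
import numpy as np

def decompose(scores_per_instruction):
    """Split repeated per-instruction scores (0-100) into aptitude
    (best-case capability) and unreliability (run-to-run spread)."""
    aptitudes, unreliabilities = [], []
    for scores in scores_per_instruction:
        s = np.asarray(scores, dtype=float)
        aptitudes.append(np.percentile(s, 90))                               # best-case capability
        unreliabilities.append(np.percentile(s, 90) - np.percentile(s, 10))  # spread across runs
    return float(np.mean(aptitudes)), float(np.mean(unreliabilities))

# Toy example: 10 simulated conversations for each of two instructions
full =    [[88, 90, 86, 91, 89, 87, 90, 92, 88, 90], [75, 78, 74, 77, 76, 75, 79, 74, 76, 77]]
sharded = [[85, 40, 88, 35, 82, 45, 86, 38, 80, 42], [70, 25, 72, 30, 68, 22, 71, 28, 69, 24]]

for name, runs in [("FULL", full), ("SHARDED", sharded)]:
    apt, unrel = decompose(runs)
    print(f"{name}: aptitude={apt:.1f}, unreliability={unrel:.1f}")
```

Run on data like this, the sharded setting keeps most of its aptitude while its unreliability balloons, which is the pattern the study reports.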

Sharded simulation: the experimental design behind the "lost" finding

 

The study covers six task types; the sharded-instruction examples show how a complete instruction is broken down into multiple pieces of information.

The researchers designed an innovative experimental framework called "sharded simulation", which decomposes a complete instruction into multiple information fragments (shards) that are then revealed gradually over several turns of conversation. This approach mimics real-world conversations in which users clarify their needs step by step, rather than the traditional evaluation scenario where complete information is provided all at once. The research covers six task areas:

  1. Programming (Code)
  2. Database queries (Database)
  3. API calls (Actions)
  4. Math problems (Math)
  5. Data-to-text generation (Data-to-text)
  6. Multi-document summarization (Summary)

The wide coverage of this study ensures that the research results are universally applicable.
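To make the sharding idea concrete, the hypothetical example below splits one fully specified coding request into shards that would be revealed one turn at a time. The field names and wording are illustrative, not the schema of the released dataset.

```python
# Hypothetical example of sharding a fully specified request into turn-sized pieces.
# Field names and wording are illustrative, not the released dataset's schema.
full_instruction = (
    "Write a Python function that parses a CSV file, drops rows with missing values, "
    "and returns the average of the 'price' column rounded to 2 decimals."
)

sharded_instruction = {
    "task": "code",
    "shards": [
        "I need a Python function that works with a CSV file.",        # turn 1: high-level intent
        "It should drop any rows that have missing values.",           # turn 2: cleaning requirement
        "Then compute the average of the 'price' column.",             # turn 3: core computation
        "Round the result to 2 decimal places before returning it.",   # turn 4: output format
    ],
}

# In the SHARDED setting, only one shard is revealed per user turn;
# in the FULL setting, full_instruction is sent in a single turn.
for turn, shard in enumerate(sharded_instruction["shards"], start=1):
    print(f"user turn {turn}: {shard}")
```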

Instruction Sharding and Dialogue Simulation Types

 

This figure shows the core experimental design of the study, which is divided into two parts:

  1. Upper part (instruction sharding):
    • Shows how the researchers split a complete single-turn instruction (blue squares) into multiple information segments (yellow squares)
    • This is the basis of the paper's "sharded simulation" experiments, simulating a scenario where users provide information gradually over multiple turns
  2. Lower part (dialogue simulation types):
    • Shows five experimental settings and how information flows in each:
      • FULL: the complete instruction is provided in the first turn (baseline scenario)
      • SHARDED: the instruction is split into multiple shards that are revealed gradually across turns (simulating a real multi-turn conversation)
      • CONCAT: all shards are provided in the first turn, but kept in sharded form
      • RECAP: sharded mode, plus a final turn that summarizes all previously provided information
      • SNOWBALL: every turn repeats all of the information revealed so far

This figure intuitively explains why multi-turn conversations lead to performance degradation and how strategies such as RECAP and SNOWBALL work.
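To make the five settings concrete, here is a minimal, hypothetical sketch of how a simulator might assemble the user side of the conversation under each mode. It mirrors the descriptions above rather than the exact logic of the released simulators.

```python
# Hypothetical sketch of how each simulation mode turns shards into user messages.
# Mirrors the descriptions above, not the released simulator code.
def build_user_turns(shards: list[str], mode: str) -> list[str]:
    if mode == "FULL":        # everything in one fully specified first turn
        return [" ".join(shards)]
    if mode == "CONCAT":      # all shards in the first turn, but kept as a list
        return ["\n".join(f"- {s}" for s in shards)]
    if mode == "SHARDED":     # one shard per turn, revealed gradually
        return list(shards)
    if mode == "RECAP":       # sharded, plus a final turn restating everything
        return list(shards) + ["To recap, here is everything so far:\n" +
                               "\n".join(f"- {s}" for s in shards)]
    if mode == "SNOWBALL":    # each turn repeats all shards revealed so far
        return ["\n".join(f"- {s}" for s in shards[: i + 1]) for i in range(len(shards))]
    raise ValueError(f"unknown mode: {mode}")

shards = ["Need a SQL query over the orders table.",
          "Only include orders from 2024.",
          "Group the totals by customer."]
for mode in ["FULL", "CONCAT", "SHARDED", "RECAP", "SNOWBALL"]:
    print(mode, "->", len(build_user_turns(shards, mode)), "user turn(s)")
```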

Tools to help you test and improve your Agent system

The Microsoft research team has open-sourced the complete code base and datasets from the Lost in Conversation study, giving you a powerful set of tools for testing and improving your own Agent system. The code base contains a complete dialogue simulation framework (simulator_full.py, simulator_sharded.py, etc.) covering single-turn complete instructions, multi-turn sharded instructions, and the RECAP/SNOWBALL strategy implementations.
Github: https://github.com/Microsoft/lost_in_conversation
HuggingFace: https://huggingface.co/datasets/microsoft/lost_in_conversation

Main features of the code base and dataset:

        • A complete dialogue simulation framework supporting tests in different scenarios
        • 600 high-quality, manually verified instructions and their sharded versions
        • Coverage of six practical scenarios, including programming, mathematics, and database queries

If you are an Agent developer, you can use these resources to investigate three things:

        1. Evaluate the real performance differences between base models in multi-turn conversations
        2. Verify the actual effect of an information consolidation strategy you have designed (such as RECAP)
        3. Diagnose which types of tasks your Agent system is most likely to get "lost" on

The researchers recommend running small-scale experiments first to confirm the setup is correct before launching large-scale tests, and remind you to respect your API provider's rate limits. This may be the most complete toolkit currently available for evaluating an LLM's information consolidation ability, and it is a valuable reference for building a truly reliable multi-turn dialogue system.
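If you want to inspect the released shards before running full simulations, the snippet below shows one way to pull the dataset with the Hugging Face `datasets` library. The split and column names are assumptions to verify against the actual dataset card.

```python
# Sketch: inspect the released sharded instructions.
# The split and column names below are assumptions; check the dataset card first.
from datasets import load_dataset

ds = load_dataset("microsoft/lost_in_conversation", split="train")
print(ds)                              # see the actual columns and number of rows

example = ds[0]
for key, value in example.items():     # print one record to learn the real schema
    print(f"{key}: {str(value)[:120]}")
```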

⚠️ Models start to falter after just two turns of conversation

 

Results of the progressive sharding experiment show that even in a two-turn conversation, model reliability drops significantly.

The most worrying finding is that even in the simplest two-turn conversation, LLM performance drops significantly. Through "progressive sharding" experiments, the researchers showed that whenever information is revealed gradually to any degree (even when split into just two fragments), model reliability collapses. This means your Agent system is at risk even in seemingly simple multi-turn conversations: users do not need to ask deliberately complex questions for the AI assistant to become "disoriented".

Why even the strongest models fall apart

Through in-depth analysis of the dialogue logs, the study identified four key factors that cause models to get "lost":

        1. Premature assumptions: the model tries to answer while information is still incomplete and makes many assumptions
        2. Answer bloat: over-reliance on earlier (possibly wrong) answers causes responses to swell rather than be rethought
        3. Uneven attention distribution: excessive focus on the first and last turns, while information from the middle turns is ignored
        4. Verbose replies: overly long answers introduce more irrelevant assumptions and distract the model from the task

Together, these factors cause even the most advanced models to gradually drift off track over multiple turns.

The effect of answer length on performance

 

This table reveals an important finding: short answers are usually more effective than lengthy answers.

        • The horizontal axis indicates answer length, from shortest (0-20%) to longest (80-100%)
        • The vertical axis shows the different task types (code, mathematics, database, etc.)
        • The values in the table are the model's performance scores on each task

Key findings:

        • For most tasks (especially Code, Database, and Summary), the shorter the answer, the better the performance
        • For example, on code tasks the shortest answers (0-20%) scored 55.3, while the longest (80-100%) scored only 42.5
        • Only the Actions task performs best at medium length (40-60%)
        • Overall, short answers (0-40%) perform significantly better than lengthy answers (60-100%)

This suggests that when a model generates overly long answers, it introduces more unnecessary assumptions and is more likely to get "lost".

Claude 3.7-Sonnet and DeepSeek-R1

Of all 15 models tested, Claude 3.7-Sonnet demonstrated the strongest multi-turn conversation reliability, with a performance retention rate of 65.9%, ahead of all other competitors. Although GPT-4.1 performed better in single-turn conversations, Claude lost the least when moving from single-turn to multi-turn, particularly on the Math (85.4 → 70.0) and Summary (29.3 → 23.6) tasks.

Practical suggestions:

        • If you are developing an Agent that requires complex multi-turn interaction, Claude 3.7-Sonnet may be the best choice at the moment
        • If you are limited to open-source models, Llama 3.3-70B (64.2% performance retention) is the most cost-effective option

As one of the two specialized reasoning models tested in the study, DeepSeek-R1 shows a distinctly two-sided profile.

Strengths in single-turn conversations:

        • Programming (Code) task: top score of 99.4 points
        • Actions task: 97.0 points
        • Math task: 95.5 points

Weaknesses in multi-turn conversations:

        • Multi-turn performance was only 31.5%
        • Retention rate was only 47.5%
        • Almost every task showed more than 60% capability loss

The researchers specifically pointed out that although DeepSeek-R1 spends extra test-time compute, this did not help it stay stable across multiple turns, indicating that "thinking" alone is not enough to solve the information consolidation problem.

Suggestions for Agent developers:

        • Single-turn interaction scenarios: DeepSeek-R1 is a highly competitive choice
        • Complex multi-turn dialogue scenarios: evaluate carefully, or consider using DeepSeek-V3 instead

🌡️ Lowering temperature doesn't help: randomness is not the culprit

Unreliability results under different temperature settings show that lowering the temperature does not effectively improve reliability in multi-turn conversations.

A common misconception is that lowering the model's temperature parameter improves consistency in multi-turn conversations. The researchers designed a dedicated temperature experiment, and the results showed:

        • Single-turn conversations: lowering temperature works (going from 1.0 to 0.0 cuts unreliability by about 50%)
        • Multi-turn conversations: lowering temperature is almost ineffective (even at temperature 0.0, unreliability remains around 30%)

This finding suggests that the root cause is not sampling randomness but an inherent flaw in how models process information spread across a multi-turn context. Engineers take note: simply adjusting generation parameters will not fix the "lost" problem in multi-turn conversations.
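If you want to verify this on your own stack, a simple approach is to re-run the same sharded conversation several times at each temperature and look at the score spread. The sketch below assumes you already have a `run_sharded_conversation(temperature=...)` function that returns a task score; that function is a placeholder, not part of the released code.

```python
import statistics

def unreliability(scores: list[float]) -> float:
    """Spread between the best and worst runs (max - min) as a rough unreliability proxy."""
    return max(scores) - min(scores)

def sweep_temperature(run_fn, temperatures=(0.0, 0.5, 1.0), trials=8):
    """Re-run the same sharded conversation several times at each temperature."""
    for t in temperatures:
        scores = [run_fn(temperature=t) for _ in range(trials)]
        print(f"T={t}: mean={statistics.mean(scores):.1f}, "
              f"unreliability={unreliability(scores):.1f}")

# sweep_temperature(run_sharded_conversation)  # plug in your own evaluation function
```

If the study's finding holds for your setup, the unreliability column will stay high at T=0.0 in the sharded setting even though it shrinks in the single-turn setting.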

The RECAP strategy: improving multi-turn performance

A performance comparison of the RECAP and SNOWBALL strategies shows that these methods can effectively mitigate performance degradation in multi-turn conversations.

In response to the "lost" problem, the researchers tested two possible mitigations:

        1. RECAP (final recap): add one extra turn at the end of the multi-turn conversation that summarizes all the information the user has provided
        2. SNOWBALL (cumulative recap): recap all previously provided information at every turn

The experimental results are striking: the RECAP strategy lifts GPT-4o's performance from 59.1% to 76.6%, recovering roughly 40% of the performance lost in multi-turn conversations.

Practical suggestion: when designing an Agent system, consider adding an information recap mechanism at key decision points (a minimal sketch follows). It does not completely solve the problem, but it can significantly reduce the risk.
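A lightweight way to apply the RECAP idea in your own Agent is to inject a consolidated restatement of the user's requirements right before the model produces its final answer. The sketch below is a minimal illustration using the OpenAI Python client; the model name and the recap wording are assumptions, and the same pattern works with any chat API.

```python
# Minimal RECAP sketch: restate all user-provided requirements before the final answer.
# The model name and recap wording are illustrative; adapt them to your own stack.
from openai import OpenAI

client = OpenAI()

def answer_with_recap(history: list[dict]) -> str:
    """history: prior chat messages, e.g. [{"role": "user", "content": "..."}, ...]"""
    user_points = [m["content"] for m in history if m["role"] == "user"]
    recap = ("Before answering, here is a recap of everything I asked for:\n" +
             "\n".join(f"- {p}" for p in user_points))
    messages = history + [{"role": "user", "content": recap}]
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content
```

In practice you would trigger this recap only at key decision points (for example, right before code generation or a database write), rather than on every turn.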

Five practical suggestions for Agent architecture design

Based on the research findings, the following five suggestions can help you design a more reliable Agent system:

        1. Delay answer generation: prevent the model from making assumptions too early by explicitly instructing it to hold back until sufficient information has been collected
        2. Control answer length: the research data show that short answers succeed significantly more often than long ones
        3. Implement an information recap mechanism: summarize the known information at key decision points
        4. Use a multi-model architecture: dedicate a specialized model to information consolidation and decision-making
        5. Encourage users to provide complete information: the research shows that a complete instruction given up front works far better than one scattered across turns

Combined, these strategies help you build a more reliable Agent system.
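Suggestions 1 and 2 can often be applied with nothing more than a carefully worded system prompt. The snippet below is one hedged example of such a prompt; the exact wording and limits are assumptions you should tune for your own Agent.

```python
# Hypothetical system prompt implementing "delay answers" and "keep answers short".
# Wording and limits are illustrative; tune them for your own tasks.
SYSTEM_PROMPT = """You are an assistant in a multi-turn conversation.
- If the user's request is missing information you need, ask one short clarifying
  question instead of guessing or making assumptions.
- Only produce a full solution once the requirements are complete.
- Keep answers concise: do not restate the whole conversation, and avoid adding
  requirements the user never mentioned."""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "I need a function that works with a CSV file."},
]
# Pass `messages` to the chat completion call of your choice.
```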

The researchers' recommendations

The results pose a serious challenge to LLM developers: mainstream evaluation methods focus too heavily on capability (Aptitude) in single-turn, fully specified scenarios, while ignoring reliability (Reliability) in multi-turn scenarios where information is clarified gradually.

The researchers call on LLM developers to give equal weight to both dimensions in future model iterations, and propose concrete targets:

        • An ideal LLM should maintain similar capability in single-turn and multi-turn settings
        • Unreliability in multi-turn conversations should be below 15%
        • These targets should be met at the default temperature (T = 1.0)

This transformation will make the next generation of LLM more suitable for building truly reliable conversational Agent systems.

Final thoughts

"Lost in Conversation" study reveals the key limitations of current LLM. By choosing the model that best suits you, combining information integration strategies such as RECAP, and following practical advice provided by the paper, you can significantly improve the reliability of the Agent system in multiple rounds of conversations.

Recognizing the problem and taking targeted measures is an important step toward building the next generation of reliable Agent systems. When users complain that "AI always forgets what I said halfway through", your system can be the exception that breaks the stereotype.