Why Enterprise RAG Systems Fail: Google Research Proposes "Sufficient Context" Solution
Updated on: June 19, 2025
Google's latest research reveals the reasons for the failure of enterprise RAG systems and proposes a "sufficient context" solution to help improve the reliability of AI applications.
Core content:
1. Google researchers propose a "sufficient context" framework to address the accuracy problems of RAG systems
2. Common defects of RAG systems: susceptibility to distraction by irrelevant information and difficulty processing long texts
3. Classifying context sufficiency enables reliable question answering without relying on ground-truth answers
Yang Fangxian, Founder of 53A, Tencent Cloud Most Valuable Expert (TVP)
Google researchers have proposed a "sufficient context" framework that offers a new lens for understanding and improving retrieval-augmented generation (RAG) in large language models (LLMs). The approach determines whether a model has enough information to answer a query accurately, which is crucial for building enterprise applications, where reliability and factual accuracy take the highest priority.
The persistent challenges of RAG systems
Retrieval-augmented generation (RAG) systems have become a core technology for building more trustworthy and verifiable AI applications. However, these systems still have significant shortcomings:
- They may confidently provide incorrect answers despite having retrieved evidence
- They are easily distracted by irrelevant information in the context
- They may fail to properly extract answers from long text snippets
The researchers state the goal clearly in the paper: "Ideally, when the provided context information combined with the model's parametric knowledge is sufficient to answer the question, the LLM should output the correct answer; otherwise, it should refuse to answer or request additional information."

Achieving this ideal requires a model that can judge on its own whether the context supports a correct answer to the question and use that information selectively. Previous studies have tried to address this by observing how LLMs behave under different amounts of information, but the Google team notes: "Although the goal (of this study) seems to be to understand the behavior of large language models (LLMs) when they have or lack sufficient information to answer queries, previous studies have failed to directly address this problem."
To address this, the researchers introduced the concept of "sufficient context." Input instances are divided into two cases depending on whether the provided context contains enough information to answer the query:
- Sufficient context: the context contains all the information needed to provide a definitive answer.
- Insufficient context: the context lacks necessary information, either because the query requires specialized knowledge the context does not cover, or because the information is incomplete, unclear, or contradictory.
Question: Who is Lya L.'s spouse?

| Example context | Notes |
| --- | --- |
| "Lya L. married Paul in 2020... They looked very much in love at a recent event." | spouse stated directly |
| "Lya L. – Wikipedia. Born: October 1, 1980. Spouse: Paul (married 2020)" | spouse stated directly |
| "Lya L. was married to Tom in 2006... divorced in 2014... dating Paul in 2018..." | conflicting information |
| "Lya L. is an astronaut, born in Ohio... has two children... and her parents are lawyers..." | no spouse information |
This classification is determined solely by analyzing the question and the retrieved context, without relying on the ground-truth answer. That matters in practice because ground-truth answers are rarely available at inference time.

The researchers developed an LLM-based "autorater" to automatically label instances as having sufficient or insufficient context. They found that Google's Gemini 1.5 Pro model performed best at classifying context sufficiency in the 1-shot setting, achieving high F1 scores and accuracy.
As the paper emphasizes: "When evaluating model performance in real-world scenarios, we cannot predetermine candidate answers. Therefore, methods that can operate solely on the query content and context are of practical value."
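As a rough illustration of how such an autorater could be wired up, the sketch below sends the query and context to an LLM and parses a sufficient/insufficient verdict. The prompt wording and the `call_llm` helper are assumptions for illustration; the paper's actual autorater prompt and setup may differ.

```python
AUTORATER_PROMPT = """You are given a question and a retrieved context.
Decide whether the context contains all the information needed to give
a definitive answer to the question.

Question: {query}
Context: {context}

Answer with exactly one word: SUFFICIENT or INSUFFICIENT."""


def rate_sufficiency(query: str, context: str, call_llm) -> str:
    """Label one query-context pair as 'sufficient' or 'insufficient'.

    `call_llm` is a placeholder for whatever LLM client you use: a function
    that takes a prompt string and returns the model's text response.
    """
    prompt = AUTORATER_PROMPT.format(query=query, context=context)
    verdict = call_llm(prompt).strip().upper()
    return "sufficient" if verdict.startswith("SUFFICIENT") else "insufficient"


# Example usage over the records sketched above:
# labels = [rate_sufficiency(ex.query, ex.context, my_llm_client) for ex in EXAMPLES]
# print(sum(label == "sufficient" for label in labels) / len(labels))
```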
Key findings on LLM behavior in RAG systems
Applying the "sufficient context" framework across multiple models and datasets, the study surfaced several important conclusions:

1. With sufficient context, models usually achieve higher accuracy. Even so, they still hallucinate more often than they abstain. With insufficient context the picture is more complicated: some models abstain more often, while others hallucinate more often. (A measurement sketch follows at the end of this section.)

2. Notably, although RAG improves overall performance, the additional contextual information can also reduce a model's willingness to abstain when the information is insufficient. The researchers point out: "This phenomenon may be due to the model's overconfidence in the face of any contextual information, resulting in a higher tendency to hallucinate rather than abstain."

3. A particularly interesting finding is that models can sometimes give the correct answer even when the context is judged insufficient. The common explanation is that the model already "knows" the answer from pre-training (parametric knowledge), but the researchers also found other contributing factors. For example, context can help disambiguate a query or fill gaps in the model's knowledge even when it does not contain the complete answer. This ability to succeed with limited external information has broader implications for RAG system design.

Additional perspective from a senior researcher at Google

Cyrus Rashtchian, a senior research scientist at Google and co-author of the study, further emphasized how much the quality of the underlying LLM matters:

1. Evaluating enterprise RAG systems: a truly strong enterprise RAG system should be evaluated on both retrieval and non-retrieval benchmarks.

2. Positioning retrieval: retrieval should be seen as an "enhancement" of model knowledge, not the only source of truth. The base model still needs to:
- Fill information gaps
- Use contextual clues (grounded in pre-trained knowledge) to make reasonable inferences about the retrieved content
- Recognize when a question is unclear or ambiguous, rather than blindly copying information from the context
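The findings above hinge on distinguishing three response outcomes: correct answers, abstentions, and hallucinations. As a minimal sketch (not the paper's evaluation code), here is one way to bucket a response when a ground-truth answer is available; the `ABSTAIN_PHRASES` list and the string-matching rules are illustrative assumptions.

```python
# Hypothetical outcome labels used in the findings above:
# "correct", "abstained", or "hallucinated".
ABSTAIN_PHRASES = ("i don't know", "i am not sure", "i'm not sure", "cannot answer")

def classify_response(response: str, ground_truth: str) -> str:
    """Bucket a model response for evaluation purposes (illustrative heuristic).

    A response counts as an abstention if it matches a refusal phrase,
    as correct if it contains the ground-truth answer, and otherwise
    as a hallucination (a confident but wrong answer).
    """
    text = response.strip().lower()
    if any(phrase in text for phrase in ABSTAIN_PHRASES):
        return "abstained"
    if ground_truth.strip().lower() in text:
        return "correct"
    return "hallucinated"

# The rates discussed in the findings are simply the fraction of responses
# in each bucket, computed separately for examples the autorater labeled
# "sufficient" vs "insufficient".
```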
Reducing hallucinations in RAG systems
The study found that models equipped with RAG were more likely to "hallucinate rather than refuse to answer" than models without RAG, so the researchers explored techniques to mitigate this problem.

They developed a new "selective generation" framework that uses a separate lightweight "intervention model" to decide whether the main LLM should generate an answer or abstain, allowing a controllable trade-off between accuracy and coverage (the percentage of questions answered); a sketch of this gating idea appears at the end of this section. The framework can be used with any LLM, including proprietary models such as Gemini and GPT. The study found that using "sufficient context" as an additional signal in the framework significantly improves answer accuracy across models and datasets, raising the accuracy of the Gemini, GPT, and Gemma models by 2%-10%.

To put the 2%-10% improvement in business terms, Rashtchian gave a customer service example: "Imagine a customer asking if they can get a discount," he said. "In some cases, the retrieved context is recent and explicitly describes a promotion that's ongoing, so the model can answer with confidence. But in other cases, the context might be 'outdated,' describing a discount from months ago, or there might be specific terms and conditions. The model would be better off answering 'I'm not sure,' or 'you should contact customer service to get more information specific to your situation.'"

The research team also explored encouraging abstention by fine-tuning the model. Specifically, during training the answers for instances with insufficient context were replaced with "I don't know" instead of the original gold answer, the idea being that explicitly training on such examples would steer the model toward abstaining rather than hallucinating.

The results were mixed: the fine-tuned models generally answered correctly more often, but they still hallucinated frequently, and hallucinations often outnumbered abstentions. The paper concludes that while fine-tuning can help, "more work is needed to develop reliable strategies that balance these objectives."
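As a rough sketch of the selective generation idea (not the paper's implementation), the gate below combines the main LLM's self-confidence with the autorater's sufficiency label and only releases the answer when the combined score clears a threshold. The scoring weights, the threshold value, and the function name are assumptions chosen for illustration.

```python
def selective_generation(
    answer: str,
    self_confidence: float,   # the main LLM's self-rated confidence in [0, 1]
    context_sufficient: bool, # label from the sufficiency autorater
    threshold: float = 0.6,   # coverage/accuracy trade-off knob (illustrative value)
) -> str:
    """Decide whether to return the LLM's answer or abstain.

    A lightweight "intervention" score blends the model's confidence with the
    sufficiency signal; raising `threshold` answers fewer questions but more
    accurately, while lowering it increases coverage at the cost of more risk.
    """
    score = 0.7 * self_confidence + 0.3 * (1.0 if context_sufficient else 0.0)
    if score >= threshold:
        return answer
    return "I'm not sure. Please provide more information or contact support."


# Example usage with made-up values:
print(selective_generation("Yes, the 10% spring promotion is still active.",
                           self_confidence=0.9, context_sufficient=True))
print(selective_generation("You qualify for a 25% discount.",
                           self_confidence=0.4, context_sufficient=False))
```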
Applying "Sufficient Context" to a Real RAG System

For enterprise teams looking to apply these findings to their own RAG systems, such as those powering internal knowledge bases or customer support AI, Rashtchian suggests a practical approach: first collect a dataset of query-context pairs representative of those encountered in production, then use an LLM-based autorater to label each example as having sufficient or insufficient context.

"That's a pretty good estimate of the percentage of sufficient context," Rashtchian said. "If it's below 80-90%, then there's probably a lot of room for improvement in retrieval or the knowledge base, and that's a good indicator to look at."

He suggested the team next "stratify the model responses based on examples with sufficient context and insufficient context." Examining metrics on these two slices separately helps the team understand where performance differs. "For example, we found that the model was more likely to provide an incorrect answer (relative to the ground truth) when given insufficient context. This is another observable metric," he noted, adding that "aggregating statistics over the entire dataset may mask a small number of important but poorly handled queries."

While the LLM-based autorater showed high accuracy, enterprise teams may worry about the extra computational cost. Rashtchian clarified that for diagnostic purposes the overhead is manageable. "I think running an LLM-based automatic evaluator on a small test set (like 500-1000 examples) should be relatively cheap, and can be done 'offline' so there's no need to worry about time-consuming issues," he said. For real-time applications, he acknowledged that "it's better to use heuristics, or at least smaller models." The key point, Rashtchian stressed, is that "engineers should focus on deeper metrics than, say, similarity scores of retrieved components. Getting additional signals from LLMs or heuristics can bring new insights."
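As a minimal sketch of the diagnostic workflow Rashtchian describes (not an official tool), the snippet below labels a small test set with a sufficiency autorater, reports the sufficient-context percentage, and stratifies outcome rates by sufficiency. The helper callables are placeholders, and the 80% threshold mirrors his rule of thumb.

```python
from collections import Counter

def diagnose_rag_dataset(examples, rate_sufficiency, answer_and_classify):
    """Offline diagnostic over a small test set (e.g. 500-1000 examples).

    `examples` is an iterable of (query, context) pairs from production traffic.
    `rate_sufficiency(query, context)` returns "sufficient" or "insufficient"
    (e.g. an LLM autorater like the one sketched earlier).
    `answer_and_classify(query, context)` runs the RAG pipeline and returns
    "correct", "abstained", or "hallucinated" (e.g. via the earlier heuristic).
    """
    outcomes = {"sufficient": Counter(), "insufficient": Counter()}
    for query, context in examples:
        label = rate_sufficiency(query, context)
        outcomes[label][answer_and_classify(query, context)] += 1

    total = sum(sum(c.values()) for c in outcomes.values())
    sufficient_share = sum(outcomes["sufficient"].values()) / max(total, 1)
    print(f"Sufficient context: {sufficient_share:.0%} of {total} examples")
    if sufficient_share < 0.8:  # rough rule of thumb from the interview
        print("Consider improving retrieval or the knowledge base.")

    # Stratified metrics: correct/abstained/hallucinated rates per slice.
    for label, counts in outcomes.items():
        n = max(sum(counts.values()), 1)
        print(label, {k: f"{v / n:.0%}" for k, v in counts.items()})
```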