Is your RAG becoming less and less accurate? You may have overlooked the power of "metadata"

Accurate retrieval is no longer difficult. Master the power of metadata!
Core content:
1. Why RAG accuracy declines as the number of documents grows
2. The key role of metadata in improving RAG performance
3. How to effectively use metadata to improve retrieval accuracy and efficiency
Do you have this kind of trouble too?
You ask the big model a very specific question: "Please tell me how to install Software A."
But it solemnly gives you the installation steps for Software B.
During this process, you may have spent a lot of time parsing and cleaning thousands of documents, feeding them into RAG, but the results are still not ideal.
Why is this happening?
An important reason is that we spend a lot of time building the knowledge base but ignore a seemingly insignificant part: metadata.
Simply put, metadata is "data that describes data". For example:
- Document metadata: author, title, document type, creation time, user permission level, ...
- Data metadata: field description, data source, update time, user permission level,...
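In code, this kind of metadata is often just a set of key-value pairs attached to a document or to each of its chunks. A hypothetical sketch (the field names here are purely illustrative):

```python
# A hypothetical document chunk with its metadata attached
chunk = {
    "text": "Step 1: Download the installer for Software A ...",
    "metadata": {
        "title": "Software A Installation Guide",   # document metadata
        "author": "Docs Team",
        "doc_type": "Installation Manual",
        "created_at": "2024-06-01",
        "permission_level": "internal",
        "software_name": "A",                       # custom business field
    },
}
```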
For example:
In the library, when you look for a book, you not only look at the title, but also the author, publisher, and even your own borrowing permissions.
Metadata is the carrier of this "extra information".
The same logic applies to RAG: once there are more documents, metadata is needed to locate, filter, and focus the content.
Let’s review RAG.
The full name is Retrieval-Augmented Generation, and it works in two steps:
1. First, retrieve content related to the question from the knowledge base
2. Then, provide the retrieved results as context to the big model to generate the answer
When there are fewer documents, the effect looks good.
But as you accumulate more and more documents, especially when documents with similar contents (such as installation manuals for multiple software) are mixed together, problems arise:
The more documents accumulate, the worse RAG performs.
For example, if the installation instructions for "Product A" and "Product B" are similar, RAG may hit the wrong document.
The root of the problem is that the document's paragraphs do not clearly indicate which product they belong to.
This is where you need to add metadata to the document:
- Label "Software Name: A" on "A Software Installation Guide"
- All paragraphs inherit this metadata field
When searching, just add one condition:"Software Name = A", you can directly filter out the content of software B.
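Here is a minimal sketch of what such metadata-filtered retrieval could look like, assuming a simple in-memory store; the function and field names are illustrative, not any particular library's API:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

def retrieve(chunks, query_vector, metadata_filter, top_k=5):
    # 1. Apply the metadata filter first: only chunks matching every condition survive
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in metadata_filter.items())
    ]
    # 2. Then rank the survivors by vector similarity
    candidates.sort(key=lambda c: cosine_similarity(query_vector, c["vector"]), reverse=True)
    return candidates[:top_k]

# Toy data: one chunk from Software A's guide, one from Software B's
chunks = [
    {"text": "To install Software A, run setup_a.exe ...", "vector": [0.9, 0.1],
     "metadata": {"software_name": "A", "doc_type": "Installation Manual"}},
    {"text": "To install Software B, run setup_b.exe ...", "vector": [0.88, 0.12],
     "metadata": {"software_name": "B", "doc_type": "Installation Manual"}},
]

# "Software Name = A" becomes a hard filter, so Software B's chunk never reaches the model
print(retrieve(chunks, query_vector=[1.0, 0.0], metadata_filter={"software_name": "A"}))
```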
Next, when the user asks a question, guide them to select metadata tags, just like @-mentioning people in chat software, to quickly narrow the document scope. For example: "@Software Name: A How do I install it?"
Inevitably, users sometimes select multiple metadata.
For example, the user asks:
“Compare the benefits of Product A and Product B.”
and selects these metadata tags:
- Product Name: A
- Product Name: B
- Document Type: Product Promotion
You might naturally think of connecting these conditions with AND, but the result is that no documents are hit.
That's because no single document carries "Product Name: A", "Product Name: B", and "Document Type: Product Promotion" at the same time.
A better approach is to split the metadata into two groups:
[["Product Name: A","Document Type: Product Promotion"],["Product Name: B","Document Type: Product Promotion"]]
This is what is called a "valid metadata combination".
For example, you can agree on a combination strategy like this:
1. Metadata dimensions within each group cannot be repeated (two product names cannot be put in the same group)
2. Each group must cover all dimensions explicitly mentioned in the user's question
Example code for building this combination strategy:
```python
import itertools
from collections import defaultdict

def calculate_tag_combinations(tags):
    # 1. Parse the tags and group them by dimension (the part before the colon)
    tag_groups = defaultdict(list)
    for tag in tags:
        tag = tag.strip().replace("：", ":")  # trim whitespace, normalize full-width colons
        # Keep only well-formed "dimension: value" tags
        if ":" in tag:
            tag_type, _ = tag.split(":", 1)
            tag_groups[tag_type].append(tag)

    # 2. Nothing to combine if no valid tags were found
    if not tag_groups:
        return []

    # 3. Generate all possible combinations as a Cartesian product across dimensions,
    #    taking exactly one tag from each dimension per combination
    tag_types = list(tag_groups.keys())
    tag_values_by_type = [tag_groups[tag_type] for tag_type in tag_types]
    combinations = itertools.product(*tag_values_by_type)

    # Return a two-dimensional list: one inner list per combination
    return [list(combo) for combo in combinations]
```
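A quick usage example with the metadata from the comparison question above:

```python
tags = ["Product Name: A", "Product Name: B", "Document Type: Product Promotion"]
print(calculate_tag_combinations(tags))
# [['Product Name: A', 'Document Type: Product Promotion'],
#  ['Product Name: B', 'Document Type: Product Promotion']]
```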
If you encounter more complex user problems, such as:
“Please compare the technical specifications of Product A with the promotional documentation of Product B.”
Then, the user selects the metadata: Product Name: A, Product Name: B, Document Type: Product Promotion, Document Type: Technical Specification.
The rules above will give you four combinations, two of which (such as "Product Name: A" + "Document Type: Product Promotion") do not match the question. In this case, it is recommended to have a large model extract the valid combinations:
[ ["Product Name: A","Document Type: Technical Specification"], ["Product Name: B","Document Type: Product Promotion"]]
Prompt example
System prompt

Role: Metadata combination extraction assistant

Goal: Based on the question and metadata list entered by the user, output a two-dimensional list of matching metadata combinations.

Constraints:
- The combinations must come entirely from the metadata list provided by the user; metadata items cannot be added or modified.
- Each combination must contain all relevant metadata dimensions (such as product name, document type) explicitly mentioned in the user's question.
- The number of combinations must match the number of entities to be compared or queried in the question (for example, comparing two products requires two combinations).

Workflow:
1. Read and understand the metadata list provided by the user
2. Analyze the user's question and identify the key entities and requirements
3. Extract the metadata keywords explicitly mentioned in the question
4. Match the keywords in the question against the metadata list
5. Determine the metadata dimensions that need to be combined (single or multiple dimensions)
6. Build the metadata combinations according to the question's requirements
7. Check whether the combinations fully cover the question's requirements
8. Verify that every combination matches the metadata list exactly
9. Output the final matching metadata combinations
10. Ensure the output format matches the examples

Example 1:
- Metadata list selected by the user: ["Product Name: X","Product Name: Y","Document Type: User Manual","Document Type: Quick Guide"]
- Question entered by the user: Please compare the content differences between the user manual of Product X and the quick guide of Product Y.
- Matching metadata combinations: [["Product Name: X","Document Type: User Manual"], ["Product Name: Y","Document Type: Quick Guide"]]

Example 2:
- Metadata list selected by the user: ["Region: East China","Region: South China","Report Type: Sales Analysis"]
- Question entered by the user: Please analyze the sales analysis reports for East China and South China.
- Matching metadata combinations: [["Region: East China","Report Type: Sales Analysis"], ["Region: South China","Report Type: Sales Analysis"]]

Output format: output only a two-dimensional list in JSON format, such as [[""]]; do not output any other explanatory content.

User prompt

**User-selected metadata list** {{metadatas}}
**User-entered question** {{query}}
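A minimal sketch of wiring this prompt into code, assuming an OpenAI-compatible chat API; the model name is a placeholder and the prompt text is abbreviated:

```python
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "..."  # the full system prompt shown above

def extract_metadata_combinations(metadatas, query, model="gpt-4o-mini"):
    user_prompt = (
        f"**User-selected metadata list** {json.dumps(metadatas, ensure_ascii=False)}\n"
        f"**User-entered question** {query}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
    )
    # The prompt constrains the model to return only a JSON two-dimensional list
    return json.loads(resp.choices[0].message.content)

combos = extract_metadata_combinations(
    ["Product Name: A", "Product Name: B",
     "Document Type: Product Promotion", "Document Type: Technical Specification"],
    "Please compare the technical specifications of Product A "
    "with the promotional documentation of Product B.",
)
# Expected: [["Product Name: A", "Document Type: Technical Specification"],
#            ["Product Name: B", "Document Type: Product Promotion"]]
```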
At this point, you should have a basic understanding of how to use metadata to improve RAG results.
Here are some additional tips to help you better maintain and use metadata.
Metadata Management Recommendations
- Manage field names and field values separately
It is recommended to manage metadata field names and field values separately: keep all field names globally unique, with each field name allowed to correspond to multiple field values. Specifically, field names can be maintained uniformly at the library level, while each file's specific field values are maintained at the file level.
- Distinguish between built-in and custom metadata
Built-in metadata is extracted or annotated automatically when a file is uploaded and cannot be deleted or modified; it includes the file name, file type (.docx, .jpg, .mp4, etc.), uploader, upload time, update time, file source, file size, word count, and so on.
Custom metadata can be added and modified as needed after upload, such as content summary, file category (contract, report, manual, etc.), applicable industry, applicable region, applicable period, and owning entity.
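One hypothetical way to model this split in code: field names (with a built-in/custom flag) are registered at the library level, while each file only stores values for registered fields:

```python
# Library-level registry: globally unique field names, marked built-in or custom
library_fields = {
    "file_name":    {"builtin": True},   # extracted automatically on upload, not editable
    "file_type":    {"builtin": True},
    "uploaded_at":  {"builtin": True},
    "doc_type":     {"builtin": False},  # custom, editable after upload
    "product_name": {"builtin": False},
    "region":       {"builtin": False},
}

# File-level values: each file fills in values only for registered field names
file_metadata = {
    "a_install_guide.docx": {
        "file_name": "a_install_guide.docx",
        "file_type": ".docx",
        "uploaded_at": "2024-06-01",
        "doc_type": "Installation Manual",
        "product_name": "A",
    },
}

def set_custom_field(file_id, field, value):
    """Allow edits only for registered, non-built-in fields."""
    spec = library_fields.get(field)
    if spec is None:
        raise KeyError(f"Unknown metadata field: {field}")
    if spec["builtin"]:
        raise PermissionError(f"Built-in field cannot be modified: {field}")
    file_metadata[file_id][field] = value

set_custom_field("a_install_guide.docx", "region", "East China")
```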
Recommendations for using metadata
In addition to letting a large model automatically extract valid metadata combinations, you can also give users the ability to customize the logical relationships between metadata (such as AND/OR).
When the user selects two or more metadata tags, the system prompts them to set AND/OR logic and combines the tags into a valid metadata search condition. For example:
(Product Name:A AND Document Type:Technical Specification) OR (Product Name:B AND Document Type:Product Promotion)
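In code, such an expression can be represented as a list of OR-connected groups, each group being a set of AND-connected conditions; a document matches if any group matches. A sketch with illustrative field names:

```python
# (Product Name: A AND Document Type: Technical Specification)
#   OR (Product Name: B AND Document Type: Product Promotion)
filter_groups = [
    {"product_name": "A", "doc_type": "Technical Specification"},
    {"product_name": "B", "doc_type": "Product Promotion"},
]

def matches(metadata, groups):
    """OR across groups, AND within each group."""
    return any(
        all(metadata.get(field) == value for field, value in group.items())
        for group in groups
    )

doc = {"product_name": "B", "doc_type": "Product Promotion", "region": "East China"}
print(matches(doc, filter_groups))  # True
```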
You can refer to the interaction design in dify.
For multiple metadata combinations in an OR relationship, it is recommended to retrieve each combination independently and run RAG within each combination's own result scope, rather than retrieving once over the union of all combinations.
This prevents some combinations' content from being dropped because of low overall similarity scores, improving both retrieval accuracy and content coverage.
For example, suppose the user enters: "Please compare the technical specifications of Product A with the technical specifications section of Product B's promotional document."
and selects the metadata: "Product Name: A, Product Name: B, Document Type: Technical Specification, Document Type: Product Promotion"
The RAG process with metadata then looks like this:
1. The user inputs the question plus the selected metadata
2. The LLM parses them and generates valid metadata combinations
3. The retrieval focus is refined to the knowledge point "technical specifications"
4. Each combination is searched within its own document scope, using a semantic vector search on the knowledge point to retrieve text segments
5. All retrieved text segments are aggregated, deduplicated, merged, and reordered
6. The result is provided as context to the big model to generate the final answer
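Here is a simplified sketch of that flow; `search` and `rerank` stand in for your own retriever and reranker, and the final LLM call is omitted:

```python
def rag_with_metadata(query, knowledge_point, combinations, search, rerank, top_k=8):
    """Retrieve once per metadata combination, then merge, deduplicate, and rerank."""
    all_hits = []
    for combo in combinations:
        # Each combination defines its own document scope; searching per scope
        # keeps one combination's results from drowning out another's
        all_hits.extend(search(query=knowledge_point, metadata_filter=combo))

    # Deduplicate by chunk id, keeping the first occurrence
    seen, unique_hits = set(), []
    for hit in all_hits:
        if hit["chunk_id"] not in seen:
            seen.add(hit["chunk_id"])
            unique_hits.append(hit)

    # Reorder across all combinations and keep the best chunks as context
    context = rerank(query, unique_hits)[:top_k]
    return context  # passed to the big model together with the original question
```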
Your RAG may be more complex than this, but the logic for adding metadata is the same.
Last words
Metadata is not only descriptive information, but also the cornerstone of knowledge governance in the era of big models.
More and more teams are racing to build knowledge bases, yet ignoring the importance of metadata.
You may have spent months building your document library, but a simple layer of "attribute annotation" is what makes your RAG truly smart.
Next time you encounter a big model that "does not answer the question", don't rush to criticize the model. Maybe it's the metadata that is not ready yet.
From today on, when you encounter a RAG problem, you might as well ask: "Has the metadata been created?"