Improving RAG recall accuracy: a look at similarity calculation and Rerank re-ranking

This article looks at the RAG recall accuracy problem and explores similarity calculation and Rerank re-ranking strategies.
Core content:
1. Why document processing and recall strategy are equally important in RAG
2. How document-format complexity driven by business requirements affects recall quality
3. The low-similarity problem and the Rerank re-ranking solution
“ The way documents are split in RAG directly affects recall quality, so document processing and recall strategy are equally important in RAG. ”
I recently ran into a problem with RAG recall. Our project uses the Milvus vector database; the specific requirement was to import data from an Excel sheet into Milvus, and then recall that data from the vector database with the search method.
Due to business requirements, the collection stores not only Excel data but also documents in Word, Markdown, and other formats, so the collection schema cannot be designed around the Excel columns. Each Excel row can only be vectorized as a whole into a single content field, and vector retrieval then runs against that field.
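As a concrete illustration, flattening a structured row into one text field before embedding might look like the sketch below (the `row_to_content` helper and field names are hypothetical, not the project's actual code):

```python
# Hypothetical helper: join every non-empty cell of one Excel row into a
# single string, which is then embedded into the collection's content field.

def row_to_content(row: dict) -> str:
    """Flatten one Excel row into one string for vectorization."""
    return " | ".join(f"{col}: {val}" for col, val in row.items() if val)

row = {
    "category": "attendance",
    "rule": "Employees late or leaving early without reason more than three "
            "times will have three days' salary deducted.",
    "note": "",  # empty cells are dropped rather than embedded
}
print(row_to_content(row))
```

Note that every column ends up inside the embedded text, which is exactly what makes extra, irrelevant columns a source of noise later.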
The problem is that even when the user's query uses essentially the same wording as the Excel data, differing only in a few keywords, the data recalled from Milvus scores only a little above 70% similarity, while our requirement is above 80%.
For example, suppose the Excel sheet contains your company's regulations, one regulation per row. One row concerns being late or leaving early, say: an employee found to be late or to leave early without reason more than three times will have three days' salary deducted as a penalty.
When we then query with "late" and "leave early", the matching regulation comes back with only a bit over 70% similarity.
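To see why extra text in the stored field dilutes similarity, here is a toy bag-of-words cosine example. Real embedding models behave differently, but the dilution effect is the same in spirit: padding the stored text with irrelevant tokens lowers its cosine similarity to the query.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

query = Counter("late early departure".split())
# The regulation text alone...
clean = Counter("late early departure three times salary deducted".split())
# ...versus the same text flattened together with unrelated row metadata.
noisy = Counter(("late early departure three times salary deducted "
                 "id 1024 department hr updated 2023 owner admin").split())

print(round(cosine(query, clean), 3))   # higher
print(round(cosine(query, noisy), 3))   # lower: extra tokens dilute the match
```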
So the question is: why is the similarity so low, and how do we fix it?
Similarity calculation and re-ranking
My current guess is that the low similarity is caused by document splitting. Because the documents in this project are relatively complex and come in multiple formats (Excel, Word, PDF), each field cannot be stored separately in the vector database as in a standard Excel schema; the table data can only be vectorized as a whole.
This introduces noise: an Excel row may contain data that is irrelevant to the company regulations, yet it can only be vectorized together with the relevant text, and the similarity ends up lower.
This again shows how important and necessary early-stage data processing is in RAG: the higher the quality of the document processing, the better the recall.
So, how should we solve this problem?
First, it is simply a fact that the data recalled from Milvus scores only a little above 70% similarity, so an 80% cutoff is not workable; the only option is to lower the similarity threshold.
But this raises a new question: will lowering the similarity threshold on our own hurt the final result?
It will, and that is where re-ranking comes in: first sort the data recalled from Milvus by similarity and keep the top results, then rerank those candidates so that irrelevant content is filtered out as much as possible.
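A minimal sketch of this pipeline, assuming the Milvus hits and the reranker are both replaced by stand-ins (in practice the recall scores would come from Milvus search and the rerank score from a cross-encoder model):

```python
# Sketch of "lower the threshold, then rerank". The hits list and
# rerank_score are placeholders, not a real Milvus result or reranker.

def rerank_score(query: str, doc: str) -> float:
    """Stand-in reranker: fraction of query terms found in the document."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q)

def recall_then_rerank(query, hits, recall_threshold=0.7, top_k=5, final_n=2):
    # Step 1: keep candidates above the *lowered* vector-similarity threshold.
    candidates = [h for h in hits if h["similarity"] >= recall_threshold]
    # Step 2: keep the top_k by vector similarity.
    candidates.sort(key=lambda h: h["similarity"], reverse=True)
    candidates = candidates[:top_k]
    # Step 3: rerank those candidates and keep the best final_n.
    for h in candidates:
        h["rerank"] = rerank_score(query, h["content"])
    candidates.sort(key=lambda h: h["rerank"], reverse=True)
    return candidates[:final_n]

hits = [
    {"content": "late early departure salary deducted", "similarity": 0.74},
    {"content": "annual leave application process", "similarity": 0.72},
    {"content": "office supplies procurement", "similarity": 0.65},
]
top = recall_then_rerank("late early departure", hits)
```

The lowered threshold keeps the relevant 0.74 hit in play, and the rerank step pushes it above the less relevant candidates instead of trusting the vector score alone.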
There are other approaches as well. For example, given the document complexity, you can store Word, Markdown, and Excel documents in separate collections, then use a multi-way recall strategy to query the different collections in parallel and finally rerank the merged results.
The idea is to minimize interference between different document types during pre-processing, optimize the recall strategy for each route during multi-way recall, and finally select among these higher-similarity candidates.
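The multi-way recall idea can be sketched as follows, assuming each collection's search function and the scoring function are stand-ins for Milvus search and a real reranker:

```python
# Sketch of multi-way recall: each document format lives in its own
# collection, each is queried independently, and the merged candidate
# pool is deduplicated and reranked with one shared scoring function.

def multi_way_recall(query, collections, score_fn, top_n=3):
    merged, seen = [], set()
    # Recall from every collection (Excel, Word, Markdown, ...) independently.
    for name, search in collections.items():
        for hit in search(query):
            if hit["content"] not in seen:  # dedupe across routes
                seen.add(hit["content"])
                merged.append({**hit, "source": name})
    # Rerank the merged pool so routes compete on one common score.
    merged.sort(key=lambda h: score_fn(query, h["content"]), reverse=True)
    return merged[:top_n]

# Stand-in "collections": each returns canned hits instead of querying Milvus.
collections = {
    "excel_rules":  lambda q: [{"content": "late early departure penalty"}],
    "word_docs":    lambda q: [{"content": "travel reimbursement policy"}],
    "markdown_faq": lambda q: [{"content": "late arrival grace period"}],
}
overlap = lambda q, d: len(set(q.split()) & set(d.split()))
top = multi_way_recall("late early departure", collections, overlap)
```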
Of course, there are many ways to improve recall quality in RAG, but they ultimately fall into two camps.
One is to work on early document processing, aiming to improve the quality of document splitting and minimize its impact on recall; the other is to work on the recall strategy itself, whether the recall strategies discussed in the previous article or more advanced recall methods, with one goal: recall data more accurately.
However, recall accuracy often trades off against recall speed; with large data volumes, it is necessary to balance speed against quality.