Unlocking LLM knowledge base search: the key factors behind a high recall rate

Explore knowledge retrieval technology and unlock the secrets of locating information efficiently.
Core content:
1. The importance of knowledge retrieval algorithms and their basic models
2. Comparative analysis of vector space models, Boolean models and probabilistic models
3. Applications of embedding models in knowledge retrieval and the key factors that influence retrieval results
Abstract: In this era of information explosion, knowledge retrieval is like a navigator in a vast ocean of data, helping us quickly find the information we need. Imagine a library: if you don't know how to search, then facing mountains of books, finding a specific one is like looking for a needle in a haystack. Knowledge retrieval algorithms are the key tools that help us pinpoint what we need within vast amounts of information.
At present, there are many kinds of knowledge retrieval algorithms, among which the Vector Space Model (VSM) is a common and basic algorithm. It represents text as a vector, and each dimension corresponds to the weight of a word or feature. For example, in an article introducing apples, the frequency of occurrence of words such as "apple", "fruit", and "nutrition" in the article and their distribution in the entire corpus will determine their weights in the vector. By calculating the similarity between different text vectors, the relevance between texts can be determined. For example, when we search for "nutritional value of apples", the system will also convert this query into a vector, and then compare it with all text vectors in the database to find the text with high similarity and return it to us.
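As a rough illustration of the vector space model, here is a minimal sketch using scikit-learn's TfidfVectorizer and cosine similarity; the documents and query are made up for the example, and this is only one of many possible implementations.

```python
# A minimal vector space model sketch using scikit-learn (illustrative corpus).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Apples are a fruit rich in vitamins and nutrition",
    "Apple released a new phone this year",
    "Bananas and apples are common fruits",
]
query = "nutritional value of apples"

# Fit TF-IDF weights on the corpus, then project the query into the same space.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# Cosine similarity between the query vector and every document vector.
scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc, score in sorted(zip(documents, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```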
The Boolean model is based on set theory and Boolean algebra. It represents queries as Boolean expressions, using keywords and logical operators (such as AND, OR, and NOT) to express the characteristics users want documents to have. For example, to retrieve "documents that contain apples but not bananas", we can use the Boolean expression "apple AND NOT banana". This model is simple and direct, but it has no notion of ranking: a document is either relevant or irrelevant, which may result in far too many or far too few results being returned.
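A toy sketch of Boolean retrieval, assuming a tiny hand-written document collection: an inverted index maps each term to the documents containing it, and the query "apple AND NOT banana" becomes a set difference.

```python
# A toy Boolean retrieval sketch: each document is reduced to a set of terms,
# and the query "apple AND NOT banana" becomes set operations.
documents = {
    1: "apple pie recipe with cinnamon",
    2: "banana and apple smoothie",
    3: "apple nutrition facts",
}

# Build an inverted index: term -> set of document ids containing it.
index = {}
for doc_id, text in documents.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# "apple AND NOT banana": documents containing "apple" minus those containing "banana".
result = index.get("apple", set()) - index.get("banana", set())
print(sorted(result))  # -> [1, 3]
```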
There is also the probabilistic model, which describes the distribution of random variables through probability distributions and the conditional relationships between events through probability rules. In knowledge retrieval, it ranks documents by their estimated probability of being relevant to the query. Given a user's query, the probabilistic model assumes there exists an ideal result set containing exactly the relevant documents. Although we cannot know the properties of this ideal set in advance, we can estimate them through index terms: an initial retrieval returns a first set of documents along with a preliminary probabilistic description, which can then be refined in later iterations. For example, in a news retrieval system, the probabilistic model can estimate, based on a user's browsing history and current query, the probability that each news article is relevant to the user's needs, and show the high-probability articles first. This article introduces common embedding models and the key factors that affect retrieval results.
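As one concrete instance of probabilistic ranking, BM25 is a widely used scoring function that grew out of the probabilistic retrieval framework. The sketch below uses an illustrative corpus and the common default parameters k1 = 1.5 and b = 0.75; it is a simplification, not a full probabilistic retrieval system.

```python
# A compact BM25 sketch: rank documents by an estimate of their relevance to the query.
import math
from collections import Counter

corpus = [
    "artificial intelligence helps doctors read medical images".split(),
    "new traffic rules for self driving cars".split(),
    "hospital uses artificial intelligence for diagnosis".split(),
]
query = "artificial intelligence medical".split()

k1, b = 1.5, 0.75
N = len(corpus)
avgdl = sum(len(doc) for doc in corpus) / N
df = Counter(term for doc in corpus for term in set(doc))  # document frequency

def bm25_score(query_terms, doc):
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

# Rank documents by their estimated relevance to the query.
ranked = sorted(corpus, key=lambda d: bm25_score(query, d), reverse=True)
print([" ".join(d) for d in ranked])
```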
Common embedding models
Several key factors affecting search results
Knowledge Graph: The “Hero Behind the Scenes” of Intelligent Retrieval
01
—
Common embedding models
An embedding model maps high-dimensional, discrete data (such as words, sentences, or images) to a low-dimensional, continuous vector space. Through an embedding model, discrete objects can be represented as dense vectors that carry semantic or structural information. These vectors usually have good mathematical properties: for example, the distance or direction between vectors can reflect the similarity or relationship between the objects. Embedding models are widely used in natural language processing (NLP), recommendation systems, computer vision, and other fields.
Common embedding models and their application scenarios
Word embedding: This is the most common type of embedding model in natural language processing. The goal is to represent each word as a fixed-dimension vector so that semantically similar words have similar vector representations. Common word-vector models include:
Word2Vec: A word-vector model proposed by Google, trained with one of two main methods: Skip-Gram or CBOW (Continuous Bag of Words). Word2Vec learns vector representations by predicting words from their context. Its advantages are fast training and the ability to capture semantic relationships between words (a minimal training sketch follows this list).
GloVe (Global Vectors for Word Representation): Proposed by Stanford University. GloVe learns word vectors by building a word co-occurrence matrix and optimizing a low-rank factorization of it, focusing on global statistical information between words.
FastText: Proposed by Facebook. It considers not only the word itself but also sub-word information (character n-grams), making it suitable for rare and out-of-vocabulary words.
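A minimal Word2Vec training sketch, assuming the gensim library is installed; the three toy sentences are only for illustration, so the learned vectors are not meaningful in themselves.

```python
# A minimal Word2Vec sketch with the gensim library (toy corpus, so the learned
# vectors are only illustrative -- real training needs far more text).
from gensim.models import Word2Vec

sentences = [
    ["apple", "is", "a", "fruit", "rich", "in", "vitamins"],
    ["banana", "is", "a", "sweet", "fruit"],
    ["the", "phone", "has", "a", "large", "screen"],
]

# sg=1 selects Skip-Gram; sg=0 would use CBOW.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

print(model.wv["apple"][:5])                   # first few dimensions of the word vector
print(model.wv.similarity("apple", "banana"))  # cosine similarity between two words
```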
Text embedding models: For example, Gemini Embedding, a text embedding model launched by Google, converts text into numerical representations (vectors) and supports semantic search, recommendation systems, document retrieval, and similar functions. Gemini Embedding performs well in many domains such as finance and science, and supports more than 100 languages and larger input texts.
BGE (BAAI General Embedding): Developed by the Beijing Academy of Artificial Intelligence (BAAI), it supports multiple languages (Chinese and English) and comes in multiple versions (such as bge-large-en, bge-base-en, and bge-small-en), suitable for retrieval, classification, clustering, and semantic search. The BGE models have ranked first on both the MTEB and C-MTEB benchmarks and are open source and free to use under the MIT license.
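A minimal semantic search sketch with a BGE model, assuming the sentence-transformers package and the BAAI/bge-small-en-v1.5 checkpoint are available; the documents and query are illustrative.

```python
# A minimal sketch of semantic search with a BGE checkpoint via sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

documents = [
    "Apple released a new phone this year.",
    "Apples are rich in vitamins and fiber.",
]
query = "nutritional value of apples"

# normalize_embeddings=True lets the dot product act as cosine similarity.
doc_emb = model.encode(documents, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

scores = util.dot_score(query_emb, doc_emb)[0]
best = int(scores.argmax())
print(documents[best], float(scores[best]))
```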
02
—
Several key factors affecting search results
Similarity threshold: the filter that screens results
In knowledge retrieval, the similarity threshold is like a "filter" for screening results, which determines the relevance and quantity of the retrieval results. Simply put, the similarity threshold is a pre-set value used to measure the similarity between the retrieval results and the query content. When the similarity score between the document and the query calculated by the system is higher than this threshold, the document may be returned to the user as a retrieval result; conversely, if the similarity score is lower than the threshold, the document will be filtered out.
For example, in a news retrieval system, we set the similarity threshold to 0.7. When a user queries "the application of artificial intelligence in the medical field", the system calculates the similarity between each news in the database and the query. If a news article reports on the application of artificial intelligence in medical imaging diagnosis, and the similarity calculation result with the query is 0.8, which is greater than the set threshold of 0.7, then this news article will be returned to the user. But if another news article only briefly mentions artificial intelligence, and the main content is about the application of artificial intelligence in the field of transportation, the similarity with the query is only 0.5, which is lower than the threshold, and it will not appear in the retrieval results.
The setting of the similarity threshold has an important impact on the search results. If the threshold is set too high, although the results returned will be highly relevant, some potentially useful information may be missed, resulting in too few results returned. For example, in the above news retrieval example, if the threshold is raised to 0.9, some news that mentions the application of artificial intelligence in the medical field but is not particularly accurate may not be returned, and the information obtained by users will be limited. On the contrary, if the threshold is set too low, the number of results returned may be large, but it will contain a lot of information with low relevance, which increases the difficulty for users to filter out valid information. Assuming that the threshold is lowered to 0.3, some news that only mentions the words "artificial intelligence" or "medical" but has no content related to the combined application of the two may also be returned, making the search results messy.
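A small sketch of how a similarity threshold acts as a filter; the documents and similarity scores below are placeholder values chosen to mirror the news example above.

```python
# A sketch of threshold filtering: only documents whose similarity to the query
# meets the threshold are returned (scores here are placeholder values).
scored_docs = [
    ("AI in medical imaging diagnosis", 0.8),
    ("AI in traffic management", 0.5),
    ("vitamins in apples", 0.1),
]

def filter_by_threshold(scored_docs, threshold=0.7):
    # Keep documents at or above the threshold, best matches first.
    hits = [(doc, score) for doc, score in scored_docs if score >= threshold]
    return sorted(hits, key=lambda x: x[1], reverse=True)

print(filter_by_threshold(scored_docs, threshold=0.7))  # only the 0.8 document
print(filter_by_threshold(scored_docs, threshold=0.3))  # the 0.8 and 0.5 documents
```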
Keyword similarity weight: the "compass" for precise matching
Keyword similarity weight is another important factor in knowledge retrieval. It is like a "compass" for precise matching, guiding the system to find the most relevant content for the query. Keyword similarity weight is used to measure the similarity between keywords and documents or queries. By assigning different weights to different keywords, their importance in retrieval is highlighted.
For example, in an e-commerce product retrieval system, when a user queries "Apple phone", the two keywords "Apple" and "phone" are both critical to determining which products the user wants. But if the user enters "red Apple phone", the keyword "red" also plays a role, though its importance is somewhat lower than that of "Apple" and "phone". In this case, we can assign a higher weight to "Apple" and "phone", say 0.4 each, and a relatively lower weight to "red", say 0.2. When the system searches the database, it will then favor products that contain both "Apple" and "phone", with "red" as an auxiliary description, rather than returning a product simply because it contains the word "red".
In practical applications, the setting of keyword similarity weights can be adjusted according to specific business scenarios and needs. For example, in academic literature retrieval, some professional terms and core concept keywords can be given higher weights, because these keywords can often more accurately reflect the subject and content of the document. Suppose we are searching for an academic paper on "artificial intelligence deep learning algorithm". Keywords such as "artificial intelligence", "deep learning" and "algorithm" are crucial to accurately find relevant documents, and their weights can be set higher. Some auxiliary descriptive words, such as "research" and "application", although they also have a certain effect, can have relatively lower weights. By reasonably setting the keyword similarity weights, the system can sort the search results more accurately, put the documents that best meet user needs in front, and greatly improve the quality and availability of the search results.
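A simple sketch of weighted keyword matching, with illustrative weights mirroring the "red Apple phone" example: each document's score is the sum of the weights of the query keywords it contains.

```python
# A sketch of weighted keyword matching: each query keyword carries a weight,
# and a document's score is the sum of the weights of the keywords it contains.
keyword_weights = {"apple": 0.4, "phone": 0.4, "red": 0.2}  # illustrative weights

products = [
    "red apple phone 128GB",
    "red t-shirt cotton",
    "apple phone case black",
]

def weighted_score(text, weights):
    terms = set(text.lower().split())
    return sum(w for kw, w in weights.items() if kw in terms)

# Rank products so that the ones matching the heavily weighted keywords come first.
for product in sorted(products, key=lambda p: weighted_score(p, keyword_weights), reverse=True):
    print(weighted_score(product, keyword_weights), product)
```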
TOP N: "Controller" of the number of results
In knowledge retrieval, TOP N is a parameter used to limit the number of results returned. It is like a "controller" that helps us accurately obtain the required amount of information. Simply put, TOP N means that only the top N search results are returned. For example, when we set TOP N to 5, the system will only return the top 5 results with the highest relevance to the query.
In an e-commerce search scenario, when a user searches for "sports shoes", if there are tens of thousands of relevant product records in the database, without setting TOP N, a large amount of product information may be returned, making it difficult for users to quickly find what they want among the many results. However, if TOP N is set to 10, the system will sort the products based on the relevance to the query "sports shoes", sales volume, price and other comprehensive factors, and only return the top 10 sports shoes products, greatly improving the efficiency of users' information search.
In different scenarios, the appropriate TOP N value has an important impact on the search results. In academic literature retrieval, if the user wants to quickly understand the core results of a research field, setting a smaller TOP N value, such as 3-5, can return the most cited and influential documents in the field, helping users to quickly grasp the research focus. However, if the user is conducting preliminary exploratory research and hopes to obtain more comprehensive information, a smaller TOP N value may result in missing important content. At this time, appropriately increasing the TOP N value, such as setting it to 20-30, can return more relevant documents and allow users to have a broader understanding of the field.
However, TOP N also has certain limitations. It can only return the top N results. When we need to obtain the results in the middle or lower ranking, we cannot directly use TOP N to achieve it. For example, in a database containing the scores of 100 students, we want to query the information of students ranked 21st to 30th. Simply using TOP N cannot meet the needs. To overcome this limitation, we can combine other methods, such as querying all results first and then filtering them at the application layer; or use the paging function of the database to obtain results in a specific range through multiple queries.
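A short sketch of TOP N and simple offset-based pagination over a ranked result list; the scores are placeholder values.

```python
# A sketch of TOP N and simple pagination over scored results (illustrative scores).
results = [(f"doc{i}", score) for i, score in
           enumerate([0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])]

ranked = sorted(results, key=lambda x: x[1], reverse=True)

top_n = 5
print(ranked[:top_n])  # TOP N: the 5 most relevant results

# Pagination: results ranked 6-10 (page 2 with a page size of 5), the kind of
# "middle of the ranking" slice that plain TOP N cannot express on its own.
page, page_size = 2, 5
offset = (page - 1) * page_size
print(ranked[offset:offset + page_size])
```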
03
—
Knowledge Graph: The “Hero Behind the Scenes” of Intelligent Retrieval
The knowledge graph is like an intelligent brain, silently providing powerful support for knowledge base retrieval, and can be called the "behind-the-scenes hero" of intelligent retrieval. The knowledge graph is essentially a semantic network that describes entities, concepts, and the relationships between them in the real world in a structured way. Simply put, it connects various information in the form of "entity-relationship-entity" triples to form a large and orderly knowledge network.
For example, in a knowledge graph about people, "Jack Ma" is an entity, and there is a "founded" relationship between him and the entity "Alibaba". At the same time, there is a "belongs to" relationship between "Jack Ma" and the concept of "entrepreneur". In this way, the knowledge graph can effectively organize and associate massive amounts of information, allowing computers to better understand and process this information.
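A toy sketch of a knowledge graph stored as triples, with a helper that answers one-hop pattern queries; the triples are illustrative.

```python
# A toy knowledge graph sketch: facts stored as (entity, relation, entity) triples.
triples = [
    ("Jack Ma", "founded", "Alibaba"),
    ("Jack Ma", "belongs_to", "entrepreneur"),
    ("Alibaba", "is_a", "company"),
]

def query(subject=None, relation=None, obj=None):
    # Return every triple matching the pattern; None acts as a wildcard.
    return [
        (s, r, o) for s, r, o in triples
        if (subject is None or s == subject)
        and (relation is None or r == relation)
        and (obj is None or o == obj)
    ]

print(query(subject="Jack Ma", relation="founded"))  # what did Jack Ma found?
print(query(relation="is_a", obj="company"))         # which entities are companies?
```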
In knowledge base retrieval, knowledge graphs have a wide range of application scenarios and significant advantages. When we enter a query in a search engine, the knowledge graph helps the search engine understand the user's intent, going beyond simple keyword matching to analysis at the semantic level. For example, when a user queries "apple", without a knowledge graph the search engine may return all kinds of web pages containing the word "apple", including introductions to the fruit and news about Apple Inc., and the results may be messy. With a knowledge graph, the search engine can judge whether the user more likely wants the fruit or the company, based on the various relationships of the entity "apple" in the graph, such as "apple - fruit - rich in vitamins" and "Apple - company - produces electronic products", and thus return more accurate results.
In intelligent question-answering systems, knowledge graphs are also indispensable. When a user asks a question, the system can use the knowledge graph to reason and analyze, find entities and relationships related to the question, and give an accurate answer. For example, when a user asks "Who is the founder of Apple?", the knowledge graph can quickly and accurately answer the question through relationships such as "Apple - Founder - Steve Jobs" and "Apple - Founder - Steve Wozniak".
Tips to improve the accuracy of knowledge base retrieval
To improve the accuracy of knowledge base retrieval, we can start from multiple aspects. In the data processing stage, it is crucial to clean and preprocess the data in the knowledge base. This is like tidying up a room and sorting out the messy items so that they can be found more easily. We need to remove duplicate, erroneous and irrelevant data, and standardize the data, such as unifying the date format and standardizing vocabulary. For example, in an e-commerce knowledge base, if there are inconsistent product names, such as "Apple mobile phone" and "iPhone", they need to be standardized so that they can be matched more accurately during retrieval.
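A small sketch of one such normalization step, assuming a hand-maintained alias table that maps product-name variants to a canonical form.

```python
# A sketch of simple normalization during data cleaning: map known aliases to a
# canonical name and drop exact duplicates. The alias table is illustrative.
aliases = {"apple mobile phone": "iPhone", "apple phone": "iPhone"}

raw_names = ["Apple mobile phone", "iPhone", "apple phone", "iPhone"]

def normalize(name):
    key = name.strip().lower()
    return aliases.get(key, name.strip())

cleaned = list(dict.fromkeys(normalize(n) for n in raw_names))  # dedupe, keep order
print(cleaned)  # -> ['iPhone']
```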
It is also key to select an appropriate retrieval algorithm and continuously optimize it. Different algorithms suit different scenarios, so we should choose according to the characteristics of the knowledge base and user needs. At the same time, tune the parameters in the algorithm, such as the similarity threshold and keyword similarity weights mentioned above, to find the most suitable values. You can also combine the strengths of multiple algorithms into a hybrid to improve accuracy: for example, combine the vector space model with the probabilistic model, first quickly filtering out a batch of potentially relevant documents with the vector space model, then ranking them more precisely with the probabilistic model.
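A rough sketch of such a two-stage combination: a vector-space score first filters candidates, then the two scores are normalized and blended for re-ranking. The scores and the 0.5/0.5 blend are assumptions for illustration.

```python
# A sketch of combining two retrieval signals: a fast vector-space pass selects
# candidates, then a second (probabilistic) score re-ranks them.
vsm_scores = {"doc1": 0.82, "doc2": 0.74, "doc3": 0.40, "doc4": 0.15}   # placeholder values
prob_scores = {"doc1": 6.1, "doc2": 7.3, "doc3": 2.0, "doc4": 0.4}      # placeholder values

# Step 1: vector-space filter keeps only promising candidates.
candidates = [d for d, s in vsm_scores.items() if s >= 0.5]

# Step 2: normalize each score list to [0, 1] and blend them for re-ranking.
def normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

v, p = normalize(vsm_scores), normalize(prob_scores)
fused = {d: 0.5 * v[d] + 0.5 * p[d] for d in candidates}
print(sorted(fused.items(), key=lambda x: x[1], reverse=True))
```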
It is also essential to use user feedback to continuously improve the retrieval system. We can collect users' evaluations of the retrieval results, analyze their search behaviors, and understand their needs and pain points. For example, if many users are dissatisfied with the results returned when searching for "applications of artificial intelligence in the medical field", we can analyze the reasons, such as inaccurate keyword extraction or problems with similarity calculation, and then make targeted improvements. Through continuous optimization and improvement, the knowledge base retrieval system can better meet the needs of users and provide users with more accurate and efficient services.
The recall of knowledge base retrieval is shaped by a combination of key factors: the basic principles of the retrieval algorithm, the specific settings of parameters such as the similarity threshold, keyword similarity weights, and TOP N, and the powerful support of knowledge graphs. Each factor plays a unique and important role in the retrieval process; these factors are interrelated and influence one another, and together they determine the quality and quantity of the retrieval results.