Developing LLM-based applications with RAG

Explore the new realm of artificial intelligence and see how RAG technology is transforming information-technology applications.
Core content:
1. The development of artificial intelligence technology and the rise of LLMs
2. How RAG works and its application in university informatization
3. A comparison of RAG with traditional retrieval technology and an analysis of its advantages
Artificial intelligence technology is developing rapidly and has undoubtedly become the most dazzling dark horse among the new quality productive forces. In particular, since OpenAI launched ChatGPT, artificial intelligence has brought together big data, massive computing power, and powerful algorithms, truly letting ordinary people feel the "magic" of artificial intelligence for themselves.
With the emergence and development of more and more general-purpose large language models (LLMs), how to use LLM capabilities to build artificial intelligence applications has become a major direction of effort across the industry.
Information system construction in colleges and universities involves a large number of application requirements. Combining LLM capabilities with campus information systems is therefore also an exploration in building smart campuses.
This article introduces an exploration of using Retrieval-Augmented Generation (RAG) to develop information applications on top of general-purpose LLMs. RAG is one solution for drawing on an LLM's capabilities while supplying it with up-to-date or internal knowledge.
As the name implies, RAG enhances the capabilities of an LLM through retrieval. To use a vivid metaphor, RAG lets the LLM take an open-book exam: when it encounters a question it does not know how to answer, it looks up the existing reference material and then answers based on what it finds. The reference material here consists of the up-to-date or internal knowledge documents prepared in advance.
In summary, RAG is a technique based on deep learning and natural language processing that organically combines the two tasks of retrieval and generation to achieve more intelligent information retrieval. Compared with traditional retrieval technology, RAG can return more accurate and more personalized results and can generate answers or explanations related to the question, improving both the efficiency and the accuracy of information retrieval.
Introduction to LLMs
A large language model (LLM) is a model with the ability to understand, generate, and process natural language, and it plays a role in various natural language processing tasks such as text summarization, machine translation, and question answering. This article does not introduce LLMs in detail, but it is worth distinguishing large models from dialogue products. Large models were briefly described above; a dialogue product, by contrast, is a service or piece of software built on a large model's capabilities and provided to users. Table 1 lists some dialogue products and large language models from China and the United States.
There are some limitations in applying LLMs.
The first is the freshness of an LLM's knowledge. The knowledge learned by many large language models is not up to date, ChatGPT being a typical example. When a question involves newer knowledge or recent events, the LLM cannot respond accurately.
Second, an LLM cannot know the internal knowledge of a given industry or organization. For example, an LLM does not know the management regulations inside a school or an industry, so it cannot answer questions that depend on them.
Table 1 Main large models and dialogue products
Using RAG to build an intelligent question-answering system
In the information systems of colleges and universities, intelligent question-answering systems are common applications for answering questions raised by teachers and students. The traditional approach matches keywords against a knowledge base. When a question happens to contain keywords that match, the system can answer it correctly; if the question is rephrased with words of similar meaning that no longer match any keywords, the system may fail to give a correct answer. The root cause is that such a system does not truly understand the semantics of the question. An LLM combined with RAG solves this problem well.
Building a RAG system on top of an LLM involves four main steps.
The first step is document loading and slicing. The knowledge-base documents must first be sliced into pieces of a certain size, because both keyword retrieval and vector retrieval operate on document slices. Many mature software packages can perform this task easily; a minimal sketch follows.
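To make the step concrete, here is a minimal sketch of fixed-size slicing with overlap; the chunk size, overlap, function name, and file name are illustrative assumptions rather than the exact tooling used in the project.

```python
def slice_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into fixed-size, overlapping slices.

    chunk_size and overlap are illustrative; in practice they are tuned to
    the embedding model and the structure of the knowledge-base documents.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip slices that are only whitespace
            chunks.append(chunk)
    return chunks

# Hypothetical knowledge-base file, used to carry the example through later steps.
chunks = slice_document(open("campus_regulations.txt", encoding="utf-8").read())
```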
The second step is text vectorization. A text embedding converts a piece of text into an array of floating-point numbers; the whole array corresponds to a point in multidimensional space, that is, a text vector. The purpose of vectorization is to make the relationships between texts computable with mathematical tools, namely vector similarity calculation. Text vectorization is generally done with the embedding tools provided alongside LLMs; in the actual development process, OpenAI's "text-embedding-ada-002" model was used, as in the sketch below.
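A minimal sketch of this call with the official openai Python package (v1-style client; exact SDK details vary by version, and the helper name is ours):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_texts(texts: list[str]) -> list[list[float]]:
    """Turn each text into a floating-point vector (1536 dimensions for ada-002)."""
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=texts,
    )
    return [item.embedding for item in response.data]

vectors = embed_texts(chunks)  # one vector per slice from the previous step
```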
The third step is to import the documents into a search engine or vector database. The sliced documents are imported into the corresponding engine for retrieval: importing into Elasticsearch enables keyword retrieval, while importing into a vector database enables vector retrieval. Here the RAG function is implemented with vector-database retrieval. Two points deserve attention: a vector database is specialized for fast retrieval of vector data and cannot replace a traditional database, and the vector database itself does not generate vector data, which is produced by the vectorization tool. A sketch of this step follows.
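The article does not commit to a specific vector database (Table 2 compares several), so this sketch uses FAISS, an in-process vector index, purely as an illustrative stand-in; it continues the chunks, vectors, and embed_texts names from the previous sketches, and the query string is hypothetical.

```python
import numpy as np
import faiss  # illustrative stand-in for the vector databases compared in Table 2

# Index the embeddings produced in step two, one vector per document slice.
matrix = np.array(vectors, dtype="float32")
index = faiss.IndexFlatL2(matrix.shape[1])  # exact search by Euclidean (L2) distance
index.add(matrix)

# Retrieve the slices closest to a (hypothetical) query.
query = np.array(embed_texts(["How do I reset my campus card password?"]), dtype="float32")
distances, ids = index.search(query, 3)
top_chunks = [chunks[i] for i in ids[0]]
```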
The fourth step is to encapsulate the retrieval interface. With the previous steps complete, the LLM API and a prompt template can be combined to finish the question-answering work. In the actual work, the GPT-3.5 API provided by OpenAI was used. The calling process is shown in Figure 1, and a sketch follows the figure.
Figure 1 RAG workflow
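A minimal sketch of the encapsulated interface, continuing the names defined above; the prompt wording is a hypothetical template, since the article does not publish its exact prompt.

```python
def answer_question(question: str, k: int = 3) -> str:
    """Retrieve the k most relevant slices and ask the LLM to answer from them."""
    query = np.array(embed_texts([question]), dtype="float32")
    _, ids = index.search(query, k)
    context = "\n\n".join(chunks[i] for i in ids[0])

    # Hypothetical prompt template: ground the model in the retrieved slices.
    prompt = (
        "Answer the question using only the reference material below. "
        "If the material does not contain the answer, say you do not know.\n\n"
        f"Reference material:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```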
Key technical points
Calculation of text vectors. As shown in Figure 2, text vectors are obtained through the vectorization tool. How, then, are text similarity and the distance between vectors calculated? The embedding model is trained on comparison samples of relevant sentence pairs (positive examples) and irrelevant pairs (negative examples), using a dual-tower model, so that the distance between positive pairs becomes small and the distance between negative pairs becomes large. The distance measures used here are Euclidean distance and cosine distance, sketched after Figure 2.
Figure 2 Text vectorization
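Both measures can be computed directly from the embedding vectors; a minimal NumPy sketch:

```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance between two points in embedding space."""
    return float(np.linalg.norm(a - b))

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 minus the cosine of the angle between the vectors; ignores magnitude."""
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - similarity)
```

For embeddings normalized to unit length, as ada-002's are, the two measures produce the same ranking of results.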
Selection of a vector database. The efficiency of an intelligent question-answering system hinges on the retrieval capability of its vector database. Currently, there are many vector databases to choose from in the industry, as shown in Table 2.
Table 2 Comparison of functions of mainstream vector databases
Disadvantages of RAG
The RAG model combines the advantages of the retrieval model and the generation model: it can generate answers grounded in a broader knowledge base, thereby improving the accuracy and informational richness of the generated results. However, some shortcomings remain.
First, reliance on external knowledge bases: RAG's performance depends heavily on the quality and coverage of the external knowledge base. If the information in the knowledge base is incomplete or outdated, the accuracy of the generated results is likely to suffer.
Second, difficulty of technical implementation: developing applications with the RAG model requires integrating the retrieval and generation components, in particular deploying a vector database, which may demand considerable technical skill and computing resources.
Finally, the shortcomings of vector retrieval: sometimes the most suitable answer is not ranked first in the vector search results; this can be solved by recalling a larger set of candidates and re-ranking them by score. And when documents contain long proper nouns, vector retrieval may be inaccurate; this can be solved with hybrid retrieval, that is, a combination of keyword search and vector search, as sketched below.
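The article names hybrid retrieval but not a fusion method; reciprocal rank fusion (RRF) is one common way to merge a keyword ranking with a vector ranking, sketched here under that assumption.

```python
def reciprocal_rank_fusion(keyword_ids: list[int],
                           vector_ids: list[int],
                           k: int = 60) -> list[int]:
    """Merge two ranked lists of document ids with reciprocal rank fusion.

    Each document scores 1 / (k + rank) in every list it appears in;
    k = 60 is the smoothing constant conventionally used with RRF.
    """
    scores: dict[int, float] = {}
    for ranking in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that rank near the top of both lists float to the top of the fused ranking, which is exactly the behavior hybrid retrieval is after.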
With the continuous improvement of computing power and AI technology, the scale and performance of LLMs will keep growing and their applications will become even more widespread, providing people with more intelligent and more personalized services and further improving how we live and work. In 2024, a batch of LLM-based applications will be put into use in industries across the country, so how to combine LLMs with the information systems of colleges and universities is something we must continue to develop and explore.