A Beginner's Guide to Large Model Applications for Technicians

Grasp the new trends in artificial intelligence: RAG technology opens a new chapter in information retrieval.
Core content:
1. The principles of RAG and its application in information retrieval
2. How RAG addresses the limitations of large models, and how to implement it
3. Analysis of key technologies and evaluation methods to help technical practitioners put them into practice
This article draws on the author's study of large models and AI technology, together with an understanding of many practical application scenarios of AI-driven transformation. It will not spend much space on algorithm fundamentals, but will instead focus on the core technical concepts behind AI applications.
When you first start reading articles or books about artificial intelligence, you constantly run into terms such as LLM, ChatGPT, RAG, and Agent, without knowing how the technical concepts behind them relate to one another. That's fine; let's start with their definitions:
AI: Abbreviation of Artificial Intelligence, which refers to computer systems or software that simulate human intelligence, enabling them to perform complex tasks such as learning, reasoning, problem solving, perception, and language understanding.
Generative AI: It is an artificial intelligence technology that can automatically generate new content, such as text, images, audio and video. Unlike traditional AI, generative AI can not only analyze and understand data, but also create new content based on the information it has learned.
AIGC: The abbreviation of AI Generated Content, which means content generated by artificial intelligence. In the field of algorithms and digital content production, AIGC involves the use of artificial intelligence technology to generate various forms of content, such as text, images, videos, music, etc.
NLP: The abbreviation of Natural Language Processing, which refers to "natural language processing". Natural language processing is a subfield of artificial intelligence that mainly studies how computers understand, interpret and generate human language. NLP technologies include text analysis, language generation, machine translation, sentiment analysis, dialogue systems, etc.
Transformer: A deep learning model for natural language processing (NLP) tasks, originally proposed by Vaswani et al. in a 2017 paper. It introduces a mechanism called "self-attention" that can effectively process sequence data and has achieved great success in many NLP tasks such as machine translation, text generation, and language modeling.
BERT: Bidirectional Encoder Representations from Transformers, is a pre-trained model for natural language processing (NLP). It was first proposed by the Google AI research team in 2018. The main innovation of BERT is that it uses a bidirectional (i.e. context-sensitive) Transformer model to encode text.
PEFT: Parameter-Efficient Fine-Tuning, which is a method of fine-tuning machine learning models to reduce the number of parameters that need to be updated, thereby reducing computing costs and storage requirements while maintaining model performance. PEFT technology is particularly important in downstream task adaptation of large pre-trained models (such as BERT, GPT, etc.), because directly fine-tuning these models may consume a lot of computing resources and time.
LoRA: Short for Low-Rank Adaptation, a technique for fine-tuning large-scale language models. It significantly reduces the number of trainable parameters and the computational overhead by representing weight updates as the product of two low-rank matrices, allowing the model to be adapted efficiently even in resource-constrained environments.
LLM: The abbreviation of Large Language Model, which refers to "large language model". This type of model is based on machine learning and deep learning technology, in particular natural language processing (NLP). Large language models are trained on large amounts of text data to generate, understand, and process natural language. Some well-known LLM examples include OpenAI's GPT (Generative Pre-trained Transformer) series of models, such as GPT-3 and GPT-4.
RAG: Retrieval-Augmented Generation, a framework that spans retrieval and generation tasks. It retrieves relevant information from a database or document collection, and then uses a generation model (such as a Transformer model) to produce the final output. Given current technology trends and how applications are being implemented, RAG is an area well worth exploring for engineers.
Agent: An entity that can independently perform tasks and make decisions. In artificial intelligence, an agent can be a robot, a virtual assistant, or an intelligent software system that completes complex tasks through learning and reasoning. In a multi-agent system, multiple independent agents collaborate or compete with each other to jointly solve problems or complete tasks.
GPT: The abbreviation of Generative Pre-trained Transformer, which refers to "Generative Pre-trained Transformer". The GPT model is pre-trained using a large amount of text data, and can then be fine-tuned to perform specific tasks such as language generation, question answering, translation, text summarization, etc.
LLaMA: Large Language Model Meta AI, a series of large natural language processing models developed by Meta. These models perform well on text generation and understanding tasks, similar to other well-known large language models such as GPT-3.
ChatGPT: An AI chatbot developed by OpenAI based on the GPT (Generative Pre-trained Transformer) architecture. It uses natural language processing technology to understand and generate human-like text replies. It can be regarded as an agent.
Prompt: refers to an initial text provided to the model to guide the model to generate subsequent content.
Embedding: It is a technique for mapping high-dimensional data into a low-dimensional space, while preserving the features and structure of the original data as much as possible. Embedding technology is often used to process and represent complex data such as text, images, music, and other high-dimensional data types.
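To make the Embedding term concrete, here is a minimal sketch of turning text into dense vectors. It assumes the open-source sentence-transformers library and the all-MiniLM-L6-v2 model, which are illustrative choices rather than the only options:

from sentence_transformers import SentenceTransformer   # assumed library, installed separately

model = SentenceTransformer("all-MiniLM-L6-v2")          # illustrative embedding model
sentences = [
    "RAG retrieves documents before generating an answer.",
    "Vector databases store embeddings for similarity search.",
]
embeddings = model.encode(sentences)   # dense float vectors, one row per sentence
print(embeddings.shape)                # e.g. (2, 384) for this particular model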
Development Trend of Large Model Application
▐ Vector Database
With the rapid development of Internet content, unstructured data represented by multimedia content such as audio and video has shown a trend of rapid growth. The storage and retrieval requirements of unstructured data such as pictures, audio, and video are also increasing.
IDC DataSphere data shows that by 2027, global unstructured data will account for 86.8% of the total data, reaching 246.9ZB; the total global data volume will increase from 103.67ZB to 284.30ZB, with a CAGR of 22.4%, showing a stable growth trend.
Link: https://www.idc.com/getdoc.jsp?containerId=prCHC51814824
In order to manage unstructured data more efficiently, it is common to convert it into vector representation and store it in a vector database. This conversion process is often called vectorization or embedding. By mapping text, images, or other unstructured data to a high-dimensional vector space, we can capture the semantic features and potential relationships of the data. Vector databases enable fast similarity searches by building indexes on "vector representations."
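Under the hood, the similarity search that such an index accelerates is conceptually just "find the stored vectors closest to the query vector." A minimal brute-force sketch with NumPy, assuming the embeddings have already been produced elsewhere:

import numpy as np

def top_k_cosine(query_vec, doc_vecs, k=3):
    # Normalize so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                    # similarity of every stored vector to the query
    idx = np.argsort(-scores)[:k]     # indices of the k most similar vectors
    return idx, scores[idx]

doc_vecs = np.random.rand(1000, 384).astype("float32")   # stand-in for stored embeddings
query_vec = np.random.rand(384).astype("float32")        # stand-in for a query embedding
print(top_k_cosine(query_vec, doc_vecs))

A real vector database replaces this linear scan with an index (HNSW, IVF, and so on) so that search stays fast as the collection grows.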
Vector databases are databases used to store and query high-dimensional vector data, and are widely used in search, recommendation systems, image recognition, natural language processing, and more. With the continuous emergence of innovative AI applications, the demand for vector databases has also increased significantly. Commonly used vector retrieval libraries and vector databases include the following:
1. Faiss (Facebook AI Similarity Search):
Developer: Facebook AI Research
Features: Efficient similarity search and clustering of dense vectors, with support for CPU and GPU acceleration.
Applicable scenarios: image similarity search, large-scale recommendation systems, etc. (a minimal usage sketch follows after this list).
2. Annoy (Approximate Nearest Neighbors Oh Yeah):
Developer: Spotify
Features: Efficient memory-based nearest neighbor search, using a built-in persistent tree data structure.
Applicable scenarios: music recommendation, quick search, etc.
3. HNSW (Hierarchical Navigable Small World):
Developer: Yury Malkov (and other community contributors)
Features: Small-world graph algorithm, efficient approximate nearest neighbor search, support for dynamic insertion and deletion.
Applicable scenarios: real-time search and recommendation systems.
4. Elasticsearch with k-NN Plugin:
Developer: Elastic
Features: Add k-NN search function on top of Elasticsearch, combining full-text search and vector search.
Applicable scenarios: Comprehensive search engines that need to support both text and vector queries.
5. Milvus:
Developer: Zilliz
Features: Distributed, high-performance vector database that supports large-scale data management and retrieval.
Applicable scenarios: Storage and retrieval of large-scale vector data such as images, videos, and text.
6. Pinecone:
Developer: Pinecone
Features: Vector database dedicated to machine learning applications, easy to integrate and extend.
Applicable scenarios: personalized recommendations, semantic search, real-time machine learning applications, etc.
7. Weaviate:
Developer: SeMI Technologies
Features: Open-source vector search engine that supports context-aware semantic search and is highly scalable.
Applicable scenarios: knowledge graph construction, semantic search, and recommendation systems.
8. Vectara:
Developer: Vectara, Inc.
Features: A fully managed vector-based search service focused on semantic search and relevance.
Applicable scenarios: search engine optimization, natural language processing applications.
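As promised in the Faiss entry above, here is a minimal usage sketch. It assumes the faiss-cpu and numpy packages and uses random vectors as stand-ins for real embeddings:

import faiss
import numpy as np

d = 384                                            # embedding dimension; must match your encoder
xb = np.random.rand(10000, d).astype("float32")    # stand-in for document embeddings
xq = np.random.rand(5, d).astype("float32")        # stand-in for query embeddings

index = faiss.IndexFlatL2(d)                       # exact (brute-force) L2 index
index.add(xb)                                      # store all document vectors
D, I = index.search(xq, 4)                         # distances and ids of the 4 nearest neighbors
print(I)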
The mainstream vector database solutions mentioned above face great challenges in terms of storage cost and recall rate. As unstructured data continues to grow, these challenges will only become harder. Current development trends for vector databases include the following:
1. Storage and index optimization
Quantization technology: Using vector quantization (VQ) techniques such as product quantization (PQ) or additive quantization (AQ) can significantly reduce storage and computing resources while largely preserving accuracy (a sketch follows after this list).
Compressed vectors: Use hashing methods such as Locality-Sensitive Hashing (LSH) to reduce storage consumption and speed up similarity searches.
Distributed storage: Using distributed file systems and databases (such as Apache Hadoop and Cassandra) can optimize the storage and query of large-scale vector data.
Tiered storage: Use solid-state drives (SSDs) or even emerging persistent memory (PMEM) to strike a balance between memory and disk and keep storage costs under control.
2. Recall rate optimization
Hybrid search technology: Combining coarse-grained and fine-grained indexes, such as using coarse filtering technology to quickly narrow the search scope and then perform precise search.
Approximate nearest neighbor search (ANN) algorithm: ANN algorithms such as those used in HNSW (Hierarchical Navigable Small World) graphs and FAISS can optimize search speed while ensuring high recall rate.
Multi-level retrieval: A hierarchical retrieval method that proceeds from coarse to fine, gradually improving the recall rate and precision.
3. System architecture and infrastructure
Cloud computing and elastic expansion: Use cloud computing platforms (such as AWS, Azure, and GCP) to expand computing and storage resources on demand.
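To illustrate the quantization trend mentioned in point 1, here is a sketch of a product-quantized Faiss index (IndexIVFPQ), which stores compressed codes instead of full vectors and therefore uses far less memory than the flat index shown earlier. It again assumes faiss-cpu and uses random data:

import faiss
import numpy as np

d, nlist, m = 384, 100, 8                            # dimension, coarse clusters, PQ sub-vectors
xb = np.random.rand(20000, d).astype("float32")      # stand-in for stored embeddings

quantizer = faiss.IndexFlatL2(d)                     # coarse quantizer for the inverted lists
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 bits per PQ code
index.train(xb)                                      # learn centroids and PQ codebooks
index.add(xb)
index.nprobe = 10                                    # clusters to visit at query time (recall/speed trade-off)
D, I = index.search(xb[:3], 5)                       # approximate 5 nearest neighbors
print(I)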
▐ RAG
▐ Workflow
The RAG workflow involves three main phases: data preparation, data recall, and answer generation. The data preparation phase involves identifying data sources, extracting data from data sources, cleaning data, and storing it in a database. The data recall phase involves retrieving relevant data from the database based on the query conditions entered by the user. The answer generation phase uses the retrieved data and the query conditions entered by the user to generate output results. The quality of the output depends on the data quality and the retrieval strategy.
Data preparation
Depending on the type of task that LLM needs to handle, data preparation usually includes identifying data sources, extracting data from data sources, cleaning data, and storing it in a database. The type of database used to store data and the steps to prepare data may vary depending on the application scenario and retrieval method. For example, if you use a vector repository like Faiss, you need to create embeddings for the data and store them in the vector repository; if you use a search engine like Elasticsearch, you need to index the data into the search engine; if you use a graph database like Neo4j, you need to create nodes and edges for the data and store them in the graph database.
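A minimal data-preparation sketch for the vector-store case, assuming the sentence-transformers and faiss-cpu packages; the chunking function and document texts below are placeholders for illustration:

from sentence_transformers import SentenceTransformer
import faiss

def chunk(text, size=512, overlap=64):
    # Naive character-based chunking; real systems usually split by tokens,
    # sentences, or document structure instead.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

docs = ["...full text of document 1...", "...full text of document 2..."]   # placeholders
chunks = [c for doc in docs for c in chunk(doc)]

encoder = SentenceTransformer("all-MiniLM-L6-v2")      # illustrative embedding model
vectors = encoder.encode(chunks).astype("float32")

faiss.normalize_L2(vectors)                            # normalize so inner product == cosine
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)                                     # row i of the index corresponds to chunks[i]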
Data Recall
The main task of the data recall part is to retrieve information related to the input from a large text database. To ensure that, as far as possible, the correct answer is passed on to the generator, the recall rate of this part is very important. Generally speaking, the more fragments are recalled, the higher the probability that the correct answer is among them, but this also runs into the context-length limits of large models.
Many open source blogs or frameworks use vector search to find the k most similar candidates in this part of the process. For example, if we are building a question-answering system and use a vector database to store relevant data blocks, we can generate vectors for the user's questions, perform similarity searches on the vectors in the vector database, and retrieve the most similar data blocks. In addition, we can also perform hybrid searches on the same database or use multiple databases for searches based on user questions, and combine the results to pass as the context of the generator.
For the retrieval part, there are many techniques to improve retrieval quality. These introduce additional modules, such as candidate re-ranking and LLM-assisted recall, all of which fall under the category of data retrieval.
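As one illustration of hybrid search, the ranked id lists returned by a vector search and a keyword search can be merged with reciprocal rank fusion (RRF), a common fusion method; the id lists below are hypothetical:

def reciprocal_rank_fusion(result_lists, k=60):
    # Merge several ranked lists of chunk ids into a single ranking;
    # k=60 is the constant commonly used for RRF.
    scores = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = [12, 7, 3, 44, 9]        # ids from the vector retriever
keyword_hits = [7, 101, 12, 5, 3]      # ids from the keyword retriever
print(reciprocal_rank_fusion([vector_hits, keyword_hits])[:5])

The fused list can then be cut to the top k fragments, or passed to a re-ranker, before being handed to the generator.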
Answer Generation
Once the data snippet relevant to the user’s question is retrieved, the RAG system passes it along with the user’s question and relevant data to the generator (LLM). The LLM generates output using the retrieved data and the user’s query or task. The quality of the output depends on the quality of the data and the retrieval strategy, while the instructions for generating the output also greatly affect the quality of the output.
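A minimal answer-generation sketch, assuming the OpenAI Python SDK (version 1.x) with an API key in the environment; the model name is illustrative and any deployed LLM could be substituted:

from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

def answer(question, retrieved_chunks):
    # Put the retrieved fragments into the prompt as background context and
    # instruct the model to answer only from that context.
    context = "\n\n".join(retrieved_chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content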
Advantages and Disadvantages of RAG
Advantages of RAG
The basic content of RAG was introduced above. Now let’s sort out the advantages of RAG in detail.
High-quality answer generation with fewer hallucinations
One of the advantages of RAG is that it can generate high-quality answers. This is because during the generation process, the retriever can retrieve information related to the question from a large number of documents and then generate answers based on this information. This allows the entire system to make full use of existing knowledge to generate more accurate and in-depth answers, which also means that the model is less likely to hallucinate answers.
Scalability
RAG demonstrates excellent scalability, which means it can easily adapt to new data and tasks. Using RAG's retrieval-generation framework, the model can adapt to new knowledge domains by simply updating the data in the retrieval part. This allows RAG to remain highly adaptable when facing new domains or changing knowledge bases.
Model Interpretability
RAG has a certain degree of interpretability, which means that we can understand how the model generates responses. Due to the characteristics of RAG, we can easily trace back which documents the model extracts information from. This allows us to evaluate whether the model's responses are based on reliable data sources, thereby improving the credibility of the model.
Cost-effectiveness
Since RAG's knowledge base can be decoupled from the generative model, as long as a certain amount of data is available, enterprises can use RAG as an alternative to fine-tuning, which may require far more resources. This approach is very friendly to small and medium-sized enterprises. From another perspective, since enterprise data is private, providing relevant documents as background information makes the generated results more accurate and practical, meeting the specific task needs of the enterprise.
Disadvantages of RAG
Dependence on the retrieval module
The answers given by a RAG system depend heavily on retrieval quality. If the retrieved documents are irrelevant to the question or of low quality, the generated answers will likely also be of low quality; if the retrieved documents do not contain the answer at all, the model will basically be unable to answer the user's question. Therefore, in practical applications, many strategies are used to improve the recall rate of document fragments. In many scenarios, the timeliness of document fragments also has to be considered. For example, in a financial scenario, a user asks for the brokerages' recommended "gold stocks" for October; if the recalled fragments do not include the October gold-stock research reports, or instead consist of many outdated ones, the final generation by the large model will be severely disturbed. Many other recall failures can likewise affect the model's output. Therefore, if you want to build a good RAG system, the retrieval part is extremely important and requires a lot of time to polish.
Reliance on an existing knowledge base
RAG relies on an existing document database for retrieval. First, without a large-scale knowledge base, the advantages of RAG cannot come into play. Second, if the knowledge base's coverage is insufficient and the relevant knowledge chunks cannot be recalled, the model will be unable to give an answer, because it must follow the constraints of its instructions; this limits the question coverage of the entire system.
Inference time
Since a RAG system must first retrieve documents and then generate an answer, end-to-end inference takes longer than inference with the large model alone, which makes it unsuitable for scenarios with strict latency requirements. That said, latency is a common problem for large models in general: the web version of ChatGPT streams its output word by word in a typewriter style, so users may not feel it is slow, but measured from the question to the fully generated answer, it still takes quite a long time.
Context Window Limits
The number of document fragments passed on by the recall module has to respect the maximum context length the generation model can handle. For example, the earliest ChatGPT model (GPT-3.5-turbo) has a maximum context length of 4096 tokens; if each document fragment is 512 tokens long, you can fit at most 8 fragments (512 × 8 = 4096), so the recall part has to consider how to maximize recall within those 8 fragments. There are, however, workarounds that allow more document fragments to be used: for example, you can compress the retrieved fragments and have a large model summarize their key points, or apply length-extrapolation techniques to the generation-side model. Existing length-extrapolation strategies are relatively mature, and many excellent ones allow inference lengths far beyond those seen during training.
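A small sketch of packing fragments under such a token budget, assuming the tiktoken tokenizer; the budget and reserve values below are illustrative:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def pack_fragments(fragments, budget=4096, reserve=1024):
    # Add retrieved fragments (already sorted by relevance) until the context
    # window, minus a reserve for the question and the generated answer, is full.
    packed, used = [], 0
    for frag in fragments:
        n = len(enc.encode(frag))
        if used + n > budget - reserve:
            break
        packed.append(frag)
        used += n
    return packed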
▐ Prompt Engineering
Prompt Engineering is a method of developing and designing prompts in the field of artificial intelligence and natural language processing to guide large language models (such as GPT-3, etc.) to produce specific outputs. By carefully constructing and optimizing prompts, users can more efficiently obtain the desired answers, generate text, or perform other natural language processing tasks.
The key to prompt engineering is to find the right language and structure to clearly express the question or task so that the model can understand it more accurately and give relevant responses. This may involve repeated trials, adjusting the details of the prompt, and using the understanding of the model's behavior to optimize the results. Next, we will use some basic cases to introduce how to optimize the prompt so that the large model can better answer our questions.
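As a basic first case, compare a vague prompt with a structured one that spells out role, task, constraints, and output format (both prompts are illustrative):

# A vague prompt: the model has to guess the scope, audience, and format.
vague_prompt = "Tell me about vector databases."

# A structured prompt: role, task, constraints, and output format are explicit.
structured_prompt = """You are a database engineer writing for junior developers.
Task: explain what a vector database is and when to use one instead of a relational database.
Constraints:
- No more than 200 words.
- Give one concrete use case (for example, semantic search).
Output format: a short paragraph followed by a 3-item bullet list."""

The structured version usually produces answers that need far less manual rework, and iterating on details like these is exactly what prompt engineering is about.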
▐ Model fine-tuning
QLoRA (Quantized LoRA)
QLoRA combines model quantization with LoRA. In addition to adding a LoRA bypass, QLoRA quantizes the large model's weights to 4-bit or 8-bit when loading them, and dequantizes the needed parameters back to 16-bit only at computation time. This optimizes how model parameters are stored while not in use and, compared with LoRA, further reduces GPU memory consumption during training.
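A minimal QLoRA-style sketch, assuming the Hugging Face transformers, peft, and bitsandbytes libraries; the base model name is illustrative:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with 4-bit (NF4) quantized weights; computation runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",        # illustrative base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach trainable low-rank LoRA adapters on top of the frozen quantized weights.
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()     # only the LoRA parameters are trainable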