Making Big Models “Remember” More: RAG and Long-term Memory

Written by
Iris Vance
Updated on: June 24, 2025

Learn how RAG technology enhances the memory capacity of large models and breaks through the limitations of traditional dialogue.

Core content:
1. Introduction to RAG technology and its application in large models
2. The core process of RAG technology: data retrieval, information enhancement and answer generation
3. The necessity of long-term memory and its combination with RAG technology


OpenAI's recently released update enhanced ChatGPT's original memory feature. When answering user questions, the new version of the GPT model can not only remember the user's past chat history but also retrieve memories across conversations to generate more accurate answers.



Before this, when we used ChatGPT and similar AI chat applications, information was not shared between conversations, which caused a problem: every time we started a new conversation, we had to repeat the same "background information" to the large model. When you run into this, you naturally wonder: can the large model "remember" what I said before?

This is the problem that long-term memory aims to solve. Today we will talk about how to give large models "long-term memory".


Before introducing long-term memory, let's first look at a closely related technique: RAG.


What is RAG?


RAG, short for Retrieval-Augmented Generation, is a framework originally proposed by researchers at Meta (Facebook AI Research) and since adopted widely across the industry to enhance the knowledge capabilities of language models. It was designed to address the "hallucination" problem of large models.


In simple terms, the large model learns a lot during training, but its knowledge is static: after training, the model cannot automatically absorb new knowledge or track dynamically changing information. The consequence is that the large model performs well in general scenarios, but once asked about private or specialized data sources, it generates inaccurate responses that sound like "serious" nonsense. This is the large model hallucination.


Therefore, large models need to be adjusted and optimized for the knowledge of specific professional fields, which is called supervised fine-tuning. However, supervised fine-tuning is very costly. Is there a way to reduce the cost of this kind of customization? This is where RAG comes in.


The principle of RAG is to combine information retrieval with natural language generation to improve the quality and accuracy of generated text. Retrieval is used to augment the generation model, especially for knowledge-intensive tasks. By incorporating information from external knowledge bases and corpora, RAG can generate more accurate, detailed, and useful answers.


Therefore, the emergence of RAG is like adding a "search engine" to the model. The model does not just rely on its own "thinking"; it first searches for information and then answers the question.



RAG's core process consists of three steps:


1. Data Retrieval


A retrieval model searches a large corpus or database for text snippets or other information relevant to the input query.


2. Information Enhancement


The retrieved relevant information fragments are further processed to provide useful context or knowledge for the subsequent text generation stage. (In some cases, RAG may also filter, sort, or reorganize the retrieved information to ensure its quality and relevance to the input query.)


3. Answer Generation


The retrieved information is combined with the input query, and a natural language generation model produces the new text content as the answer.


In this way, the model can use information from the "external brain" and no longer work in isolation.
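
To make the three steps concrete, here is a minimal sketch in Python. The knowledge base, the word-overlap retriever, and the call_llm placeholder are all illustrative stand-ins (a real system would use an embedding model, a vector index, and an actual LLM API), so treat it as a demonstration of the retrieve-enhance-generate flow rather than a production implementation.

```python
# Minimal RAG flow: retrieve relevant snippets, add them to the prompt,
# then generate. The retriever and LLM are stubbed so the sketch runs
# without any external services.

KNOWLEDGE_BASE = [
    "The company's annual leave policy allows 15 paid days per year.",
    "RAG combines a retriever with a generator to ground answers in documents.",
    "FAISS and Milvus are popular vector databases for similarity search.",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Step 1 - Data retrieval: score documents by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query: str, snippets: list[str]) -> str:
    """Step 2 - Information enhancement: put retrieved context into the prompt."""
    context = "\n".join(f"- {s}" for s in snippets)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

def call_llm(prompt: str) -> str:
    """Step 3 - Answer generation: placeholder for a real LLM API call."""
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"

if __name__ == "__main__":
    question = "How many paid leave days do employees get?"
    answer = call_llm(build_prompt(question, retrieve(question)))
    print(answer)
```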

Why do we need "long-term memory"?


After introducing RAG, let’s take a look at long-term memory.


Although RAG lets the large model look up information in real time, it does not "remember" what you have said before. Each conversation starts almost from scratch, which creates a problem: the conversation cannot build up a continuous context.


The goal of a long-term memory system is to build a "memory bank" that can be accessed and recalled at any time, just as humans do.


For example


Suppose you once told an intelligent assistant that you love traveling, like spicy food, and are from the north. The next time you meet, it can proactively ask you:


"Last time you mentioned that you like the beach in Dalian. Where are you planning to go this time?"


This is where long-term memory comes in: it allows the AI to truly understand you and continually accumulate conversation context and preferences.



How to achieve "long-term memory"?


Now that we have explained the concept, let's see how to implement this process.


1. Vector database + search (most common)


This approach reuses the RAG method to save your historical conversations and personal information. Concretely, conversation snippets are converted into vectors and stored in a vector database such as FAISS or Milvus. When the model needs to "recall" something, relevant entries are retrieved from the database and appended to the model's input prompt.


The advantage of this approach is that it is relatively simple to deploy, since it builds on an existing RAG system; and thanks to the vector database it scales well, supporting up to millions of memory entries. However, because memories are not stored in a structured, classified form, retrieval accuracy can be low, and the storage step may not capture "temporal order" or "context dependency", so that part of the information is lost from memory.
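
As a rough illustration of this approach, the sketch below stores conversation snippets in a FAISS index and retrieves them into the prompt. It assumes the faiss-cpu and numpy packages are installed, and the embed() function is a toy deterministic embedder standing in for a real embedding model, so only the plumbing (store, search, inject into the prompt) is demonstrated, not genuine semantic recall.

```python
# Sketch of "vector database + retrieval" memory using FAISS.
# Requires faiss-cpu and numpy; embed() is a toy stand-in for a real
# embedding model, so similarity here is not semantic.
import faiss
import numpy as np

DIM = 64

def embed(text: str) -> np.ndarray:
    """Toy deterministic embedding; replace with a real embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(DIM)
    return (vec / np.linalg.norm(vec)).astype("float32")

index = faiss.IndexFlatIP(DIM)   # inner product = cosine on normalized vectors
memories: list[str] = []

def remember(text: str) -> None:
    """Store one piece of conversation history as a memory."""
    index.add(embed(text).reshape(1, -1))
    memories.append(text)

def recall(query: str, top_k: int = 2) -> list[str]:
    """Retrieve the memories most similar to the current query."""
    _, ids = index.search(embed(query).reshape(1, -1), top_k)
    return [memories[i] for i in ids[0] if i != -1]

remember("User mentioned they like the beach in Dalian.")
remember("User said they are planning a trip in October.")

context = recall("Where does the user want to travel?")
prompt = "Known about the user:\n" + "\n".join(context) + "\n\nUser: Any suggestions?"
print(prompt)
```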


2. Slot-based memory management


Vector-database memory has low accuracy precisely because it lacks structured storage. To address this, the memory can be structured, that is, split into multiple "slots". The model selects which slots to activate based on the context and dynamically assembles them into the prompt to generate more accurate answers. For example, suppose the large model receives a user_input saying: "Xiao Ming is from Chongqing and likes to eat chili peppers". Its structured storage could look like this:


●User name: Xiao Ming

●Hobby: likes spicy food

●Background: from Chongqing


The advantage is that memory storage is more structured, which makes both storing and recalling memories easier. For scenarios where the business process is relatively clear and fixed, managing memory becomes more convenient. However, developers must design the slots manually, which limits flexibility and compatibility, and when there are too many slots, management becomes complicated.
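
A minimal sketch of the idea follows, assuming a hand-designed schema of slots. The slot names and values are illustrative; in practice an LLM or a rules engine would extract them from user_input.

```python
# Sketch of slot-based memory: a fixed schema of named slots, filled from
# user input and assembled into the prompt on demand.
from dataclasses import dataclass, field

@dataclass
class SlotMemory:
    slots: dict[str, str] = field(default_factory=dict)

    def update(self, slot: str, value: str) -> None:
        """Write (or overwrite) one structured slot."""
        self.slots[slot] = value

    def to_prompt(self, active_slots: list[str]) -> str:
        """Assemble only the activated slots into prompt context."""
        lines = [f"{name}: {self.slots[name]}" for name in active_slots if name in self.slots]
        return "Known user profile:\n" + "\n".join(lines)

memory = SlotMemory()
# In practice these values would be extracted from the user_input
# "Xiao Ming is from Chongqing and likes to eat chili peppers".
memory.update("user_name", "Xiao Ming")
memory.update("hobby", "likes spicy food")
memory.update("background", "from Chongqing")

# The model (or a router) decides which slots are relevant to the question.
print(memory.to_prompt(["user_name", "hobby"]))
```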


3. Multi-round dialogue chain + automatic summary (summary memory)


Since slot-based memory requires manually designing the memory structure, can we let the large model do this work itself? Summary memory offers a solution: have the model "keep a diary", compressing the conversation history through summarization. That is, at the end of each conversation the model automatically writes a summary as a memory, or periodically "reflects" on past conversations, storing highly abstract information rather than the raw content.


For example, if a user asks for travel tips, the large model might summarize the conversation as: "The user plans to travel to XX and is interested in food and transportation information."


This storage method saves tokens, keeps memory compact, and is closer to "human memory". However, because summarization is delegated to the model, an inaccurate summary causes the model to "misremember", and the loss of detail in summarized memories reduces the accuracy of recall.
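
Here is a small sketch of summary memory. The summarize() method is a placeholder for a real LLM call (e.g. "summarize the key facts about the user in one sentence"); only the write-a-diary-entry-instead-of-raw-history flow is shown.

```python
# Sketch of summary ("diary") memory: after each conversation, the raw
# turns are compressed into a short note instead of being stored verbatim.
class SummaryMemory:
    def __init__(self) -> None:
        self.summaries: list[str] = []

    def summarize(self, turns: list[str]) -> str:
        """Placeholder for an LLM call that compresses the conversation."""
        return f"Summary of {len(turns)} turns: {turns[-1][:60]}"

    def end_of_conversation(self, turns: list[str]) -> None:
        """Write one compressed 'diary entry' instead of the raw history."""
        self.summaries.append(self.summarize(turns))

    def context(self) -> str:
        """Return all past notes for injection into the next prompt."""
        return "Past notes:\n" + "\n".join(self.summaries)

memory = SummaryMemory()
memory.end_of_conversation([
    "User: I'm planning a trip and want tips on food and transport.",
    "Assistant: Sure, which city are you visiting?",
    "User: Haven't decided, somewhere with good street food.",
])
print(memory.context())
```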


Hybrid: The most mainstream solution at present


In fact, many advanced long-term memory systems combine the above methods:


For example:


●Use a vector database to store raw memory fragments

●Use slots to store structured long-term information (such as role settings, interest preferences)

●Use summary mechanism to compress context and improve efficiency


Companies such as OpenAI, Meta, Anthropic, and Mistral almost all use some form of this "hybrid memory architecture" when building agent systems.
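
A toy sketch of such a hybrid layer is shown below: raw fragments for retrieval, slots for stable profile facts, and summaries for compressed history. The keyword-overlap retrieval stands in for a real vector search, so this only illustrates how the three layers combine into one prompt context.

```python
# Sketch of a hybrid memory layer combining the three approaches above.
class HybridMemory:
    def __init__(self) -> None:
        self.fragments: list[str] = []      # raw memory fragments (vector DB in practice)
        self.slots: dict[str, str] = {}     # structured long-term facts
        self.summaries: list[str] = []      # compressed "diary" entries

    def build_context(self, query: str, top_k: int = 2) -> str:
        """Assemble profile slots, summaries, and retrieved fragments."""
        q = set(query.lower().split())
        hits = sorted(self.fragments,
                      key=lambda f: len(q & set(f.lower().split())),
                      reverse=True)[:top_k]
        parts = [
            "Profile: " + "; ".join(f"{k}={v}" for k, v in self.slots.items()),
            "Summaries: " + " | ".join(self.summaries),
            "Relevant fragments: " + " | ".join(hits),
        ]
        return "\n".join(parts)

mem = HybridMemory()
mem.slots["hobby"] = "spicy food"
mem.summaries.append("User is planning an autumn trip.")
mem.fragments.append("User liked the beach in Dalian last year.")
print(mem.build_context("Where should I travel this time?"))
```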


Some representative practical projects


Beyond these implementation approaches, let's look at two widely used long-term memory systems: mem0 and MemGPT.


mem0: A lightweight, practical, use-first memory system


mem0 is a lightweight long-term memory framework built by community developers, well suited for practical deployment in AI assistants, agents, and applications.


Its core design concepts:


✅ Memory is searchable and manageable: hybrid retrieval via natural language indexing + vectorization.

✅ Supports multiple memory hierarchy levels: such as "person profile", "event record", and "topic tags".

✅ Support automatic summary and reflection mechanism: The model regularly summarizes recent conversations to form a more solid memory foundation.

✅ Support "memory trigger" mechanism: when the conversation triggers a keyword or semantic clue, the relevant memory is automatically retrieved.


mem0 connects easily to frameworks such as LangChain and AutoGPT, and is the preferred solution for many teams building "memory-enabled agents".
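
For a feel of how it is used, here is a minimal usage sketch (install with pip install mem0ai). The add/search calls follow mem0's documented pattern, but exact signatures, required backend configuration, and return formats vary between versions, so verify against the current mem0 documentation before relying on this.

```python
# Minimal mem0 usage sketch. Note: the default Memory() setup expects an
# LLM/embedding backend to be configured (e.g. an OpenAI API key), and the
# add/search signatures below may differ slightly between mem0 versions.
from mem0 import Memory

memory = Memory()

# Store a piece of personal information scoped to one user.
memory.add("I'm from Chongqing and I love spicy food.", user_id="xiao_ming")

# Later, retrieve memories relevant to the current conversation and
# feed them into the model's prompt.
results = memory.search("What food does this user like?", user_id="xiao_ming")
print(results)
```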


MemGPT: A simulator of human-like memory


MemGPT is a human-like memory architecture proposed by researchers at UC Berkeley. It introduces two concepts:


1. Working Memory: Immediate information used for current conversations and tasks, similar to human short-term memory.


2. Long-Term Memory: Stores important historical information and can be retrieved at any time, similar to the human memory system.


Its biggest feature is that memory is not inserted in a fixed way; instead, the model itself decides when to "write" and when to "read".


●For example, when a user says an important piece of information, MemGPT will recognize “this is worth remembering” and automatically store it in long-term memory.


●In future conversations, if relevant clues are triggered, the model will actively "recall" relevant content and apply it to the answer.


This mechanism makes the AI behave more like an agent that "can reflect, has preferences, and remembers selectively".
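
The sketch below is not MemGPT's actual API; it only illustrates the core idea of the model deciding when to write to and read from long-term memory, with trivial heuristics standing in for those model-made decisions.

```python
# Illustration of the MemGPT idea (not its real API): the agent decides
# whether an incoming message is worth writing to long-term memory, and
# whether the current query should trigger a read.
working_memory: list[str] = []    # short-term: the current conversation
long_term_memory: list[str] = []  # persistent store

def worth_remembering(message: str) -> bool:
    """Stand-in for the model judging 'this is worth remembering'."""
    return any(kw in message.lower() for kw in ("i like", "i am", "my name"))

def handle_user_message(message: str) -> str:
    working_memory.append(message)
    if worth_remembering(message):            # model-initiated WRITE
        long_term_memory.append(message)
    recalled = [m for m in long_term_memory   # model-initiated READ
                if set(m.lower().split()) & set(message.lower().split())]
    context = "\n".join(recalled + working_memory[-3:])
    return f"[answer generated from context:\n{context}]"

print(handle_user_message("I like hiking and I am from the north."))
print(handle_user_message("Any weekend trip ideas for someone from the north?"))
```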


To summarize:


RAG + long-term memory, a powerful combination!


Although both RAG (Retrieval-Augmented Generation) and long-term memory are designed to improve the response quality of large models, their focuses differ: RAG focuses on retrieving factual content from external knowledge bases, such as documents, web pages, and databases;


Long-term memory focuses more on the user's historical information and conversation context, such as what you said in the past, the preferences you mentioned, or your behavioral habits.


In other words, RAG and memory are not mutually exclusive, but complementary tools. RAG solves a wide range of knowledge retrieval problems, while the goal of memory is to enable AI to have intimate and personalized interactive capabilities.


For example:


●RAG is responsible for answering general questions such as “current weather, company policies, product documentation”.


●Long-term memory is responsible for remembering personalized information such as “who you are, what you said before, and what you like”.


A truly intelligent agent should be able to both look up information and remember who you are.
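
A toy sketch of this division of labor: a RAG lookup supplies general knowledge, a memory lookup supplies personal context, and both are assembled into one prompt. Both lookup functions are illustrative stand-ins for real retrieval components.

```python
# Sketch of combining RAG (general knowledge) with long-term memory
# (personal context) in a single prompt.
def rag_lookup(query: str) -> list[str]:
    """Stand-in for retrieving factual content from an external knowledge base."""
    return ["Product docs: the premium plan includes priority support."]

def memory_lookup(user_id: str, query: str) -> list[str]:
    """Stand-in for retrieving the user's personal long-term memory."""
    return ["User previously said they manage a small design team."]

def build_prompt(user_id: str, query: str) -> str:
    facts = "\n".join(rag_lookup(query))
    personal = "\n".join(memory_lookup(user_id, query))
    return (
        "General knowledge:\n" + facts + "\n\n"
        "What we know about this user:\n" + personal + "\n\n"
        "Question: " + query
    )

print(build_prompt("alice", "Which plan should I pick for my team?"))
```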


Application scenarios: Making AI smarter and more human

RAG + long-term memory is not just a technical upgrade; it reshapes the role of AI: no longer a cold tool, but a "digital individual" that can accompany, understand, and grow.


1. Corporate “super employees”


Efficient, stable, and never-leaving digital employees are quietly reshaping the way organizations operate.


RAG + long-term memory can enable AI to become a "super employee" within the enterprise:


●Remember each customer's historical communications and preferences to avoid repeated explanations.


●Understand company processes, project backgrounds, and internal knowledge to bring more context to decisions.


●Support multi-role collaboration: from HR to customer service, from sales to product, every role accesses a unified memory base, enabling knowledge sharing across departments.


AI with memory is no longer just an "answer whatever you ask" service, but a "virtual colleague" that continuously follows projects, learns, and grows.


2. Intelligent customer service: a thoughtful and caring assistant


It’s not just about simply answering FAQs, but about truly remembering your last request.


One drawback of traditional customer service is the "memory gap": every consultation feels like a first meeting. With long-term memory, AI customer service can:


●Remember the user's historical questions and processing progress, and automatically pick up the unfinished conversation from last time.


●Understand the user's habits and emotional changes, and automatically adjust tone and style.


●Use the RAG system to look up the latest policies, delivering efficient, accurate, and personalized responses.


It is more than just customer service; it is more like a personal assistant who understands your needs and is always online.


3. Learning Assistant: AI personal tutor who understands you


We no longer make one-size-fits-all recommendations, but instead teach students according to their aptitude and provide continuous follow-up.


Long-term memory gives AI a “teaching mindset”:


●Remember your knowledge structure, weak links and learning pace.


●Customize a personalized learning path by combining online teaching materials, records of incorrect answers, and learning goals.


●Track your learning progress and review knowledge points at appropriate times instead of doing the same questions mechanically over and over again.


It understands you better than any app and is more timely than any teacher.


Summary: Will AI have “memory” like humans in the future?


The answer is: getting closer!


RAG gives large models the ability to look up knowledge, letting them become "professionals" in various fields. The goal of long-term memory is to make the model truly human-like: able to understand the past, the present, and you.


Future large-scale intelligent agents should have the following memory capabilities:


●Remember “who you are”


●Remember “what you said”


●Remember “what you did”


●More importantly: know when to recall them


And this is the key step towards "general intelligence".