Explore the technical implementation of building AI Agents with scalable long-term memory

Written by
Jasper Cole
Updated on: June 25, 2025
Recommendation

In-depth analysis of AI Agents' long-term memory technology, how to break through contextual limitations and achieve coherent dialogue.

Core content:
1. Context window limitation problem of large language models
2. Key technologies for dynamic extraction, integration and retrieval of information
3. Graph-based memory representation method and application

Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)


The core of this study is to explore the inherent limitations of the fixed context window of large language models (LLMs) and how to overcome them, so that AI agents can maintain coherence and consistency in long-term, multi-round conversations. Without this persistent memory capability, AI agents forget user preferences, repeat information, and contradict previously established facts.

Building a robust AI memory system that goes beyond a limited context window requires selectively storing important information, integrating related concepts, and retrieving relevant details when needed, which mimics the human cognitive process.

(The underlying mechanism is a two-stage memory pipeline that extracts, integrates, and retrieves the most salient conversation facts, enabling scalable long-term reasoning.)

Several key technical implementation methods:

1. Dynamically extract, integrate, and retrieve information:
  • This approach does not rely on keeping the entire conversation history within a fixed context window.
  • It manages salient information through dedicated modules, usually in two stages: extraction and updating.
  • Extraction phase: When processing a new message pair (e.g., a user message and an assistant reply), the system uses the conversation summary (which captures the overall semantic content) and the most recent messages (which provide granular temporal context) as background information. An LLM then identifies and extracts a set of salient memories or facts from the conversation based on this context.
  • Update phase: The extracted candidate memories are evaluated against existing memories. The system retrieves existing memories that are semantically similar to each candidate, then uses the reasoning capabilities of the LLM and a tool-call mechanism (or similar function-call interface) to decide on the appropriate memory management operation:
  • ADD: Add a new memory when no semantically equivalent memory exists.
  • UPDATE: Enhance an existing memory with supplementary information.
  • DELETE: Remove memories that are contradicted by new information.
  • NOOP: Leave the store unchanged when the candidate fact requires no modification.
  • This approach encodes complete conversation turns into concise natural-language representations that capture only the most salient facts, thereby reducing noise and providing more precise prompts for the LLM.
  • Its advantage is that it manages memory dynamically through selective extraction, maintaining consistent performance regardless of session length. The approach performs well on single-hop and multi-hop queries, and generally achieves lower search and total latency with significantly reduced token consumption.
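The extraction-and-update loop described above can be sketched in a few lines. This is an illustrative toy, not Mem0's implementation: a real system would call an LLM for both extraction and the operation decision, and would use embedding similarity rather than string matching. The DELETE branch is omitted because detecting contradictions requires LLM reasoning; the function names and thresholds here are assumptions.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Cheap stand-in for semantic similarity (real systems use embeddings)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def decide_operation(candidate: str, store: list, threshold: float = 0.8):
    """Return (op, index), mimicking the ADD/UPDATE/NOOP tool-call decision.
    DELETE is omitted: spotting contradictions needs LLM reasoning."""
    for i, existing in enumerate(store):
        score = similarity(candidate, existing)
        if score >= 0.95:
            return "NOOP", i      # near-duplicate: nothing to change
        if score >= threshold:
            return "UPDATE", i    # close match: treat as supplementary detail
    return "ADD", None            # no semantically close memory exists

store = ["User lives in Berlin"]
print(decide_operation("User lives in Berlin, Germany", store))  # ('UPDATE', 0)
```

In a production pipeline the tuple returned here would instead be a tool call emitted by the LLM, which the host application executes against the memory store.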

2. Graph-based Memory Representation:
  • This approach enhances the above infrastructure by storing memory as a directed labeled graph.
  • In this representation, entities are nodes (e.g., people, places, events), and relationships are edges connecting these nodes (e.g., 'lives_in', 'prefers').
  • This structure captures complex relational structures between conversation elements, leading to a better understanding of the connections between entities.
  • Extract and Update:
  • The extraction process uses an LLM to transform unstructured text into a structured graph representation. This typically includes entity extraction (identifying key information elements in a conversation) and relationship generation (determining semantic connections between entities).
  • The update phase employs conflict detection and resolution mechanisms, in which the LLM decides whether to mark existing relationships as invalid when integrating new information.
  • This structure enables higher-level reasoning across interconnected facts and is particularly useful for queries that need to navigate complex relational paths.
  • Retrieval mechanism: A variety of approaches can be used, such as entity-centric approaches (identifying entities in the query and then exploring related nodes and edges in the graph database) and semantic-triple approaches (encoding the query as a vector and matching it against text-encoded relation triples).
  • The underlying implementation can use a graph database such as Neo4j.
  • This approach performs well in temporal reasoning and open-domain tasks, verifying the advantages of structured relational graphs in capturing temporal relationships and integrating external knowledge. However, building and querying graph structures may introduce moderate latency overhead compared to simple natural-language memory, and may require more tokens to represent the graph structure.
    (This keeps the memory store consistent, non-redundant, and immediately ready for the next query.)
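A directed labeled graph memory and its entity-centric retrieval can be sketched as follows. This is a minimal in-process stand-in, assuming the (subject, predicate, object) triples have already been produced by LLM-based entity extraction and relation generation; a real deployment would use a graph database such as Neo4j.

```python
from collections import defaultdict

class GraphMemory:
    """Toy directed labeled graph: triples plus an entity index."""

    def __init__(self):
        self.edges = []                     # list of (subject, predicate, object)
        self.by_entity = defaultdict(list)  # entity -> indices of touching edges

    def add_triple(self, subj: str, pred: str, obj: str) -> None:
        idx = len(self.edges)
        self.edges.append((subj, pred, obj))
        self.by_entity[subj].append(idx)
        self.by_entity[obj].append(idx)

    def neighbors(self, entity: str) -> list:
        """Entity-centric retrieval: every edge touching the given entity."""
        return [self.edges[i] for i in self.by_entity.get(entity, [])]

g = GraphMemory()
g.add_triple("Alice", "lives_in", "Berlin")
g.add_triple("Alice", "prefers", "tea")
print(g.neighbors("Alice"))  # both Alice edges
```

The semantic-triple retrieval variant would instead embed each triple as text and rank triples by vector similarity to the embedded query.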

3. Technical concepts reflected in other approaches:
  • Hierarchical memory system: Divide memory into different levels (e.g., short-term memory/context window and long-term memory/external storage) and use mechanisms to "page" information between these levels.
  • Short-term/long-term memory components: For example, using conversation summaries as short-term memory, converting conversation turns into factual "observations" as long-term memory, and combining them with temporal event graphs.
  • Gisting (gist extraction): Distill text passages into concise summaries that retain the core meaning while reducing token count, retrieving the original text when needed.
  • Memory decay mechanism: Simulating human forgetting, memories are strengthened when retrieved, while unused ones decay over time.
  • Note-based memory: Memories are dynamically constructed and evolved through interconnected notes, which contain structured attributes (keywords, descriptions, tags) and are updated as new memories are incorporated.
  • Temporal knowledge graph: Specialized for managing conversation content with timestamp information to handle time-sensitive queries.
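The memory decay mechanism can be illustrated with a simple exponential model. This is a generic sketch under assumed parameters (a seven-day half-life, a fixed reinforcement bonus), not a mechanism from the paper: strength falls off exponentially since the last access, and retrieving a memory resets its clock and boosts its strength.

```python
import math

class DecayingMemory:
    """Toy forgetting curve: exponential decay, reinforced on retrieval."""

    def __init__(self, half_life_days: float = 7.0):
        self.rate = math.log(2) / half_life_days  # decay constant per day
        self.items = {}                           # text -> (strength, last_access_t)

    def add(self, text: str, t: float) -> None:
        self.items[text] = (1.0, t)

    def strength(self, text: str, t: float) -> float:
        s, last = self.items[text]
        return s * math.exp(-self.rate * (t - last))

    def retrieve(self, text: str, t: float) -> float:
        """Return current strength, then reinforce the memory (capped at 1.0)."""
        s = self.strength(text, t)
        self.items[text] = (min(1.0, s + 0.5), t)
        return s

m = DecayingMemory()
m.add("User prefers tea", t=0.0)
print(round(m.strength("User prefers tea", t=7.0), 3))  # ~0.5 after one half-life
```

Memories whose strength falls below some floor could then be evicted or demoted to cheaper storage; the cap and bonus values here are arbitrary illustration choices.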

These technical approaches differ in how they capture information, represent knowledge, and retrieve relevant content, and they show different trade-offs in performance (such as accuracy, latency, and token consumption). Evaluating their effectiveness usually requires specialized benchmarks (such as LOCOMO) and metrics that can assess factual accuracy and contextual appropriateness (such as LLM-as-a-Judge), because traditional lexical-similarity metrics have limitations.

Future research directions include optimizing the operations of these structured memories to reduce latency, exploring hierarchical memory architectures that combine different memory types, and developing more sophisticated memory integration mechanisms.

--- The following is the original content of the paper ---

Paper:  https://arxiv.org/abs/2504.19413

Long-term memory challenges

While recent advances have expanded the context window in models such as GPT-4, Claude 3.7 Sonnet, and Gemini, simply increasing the window size does not fully solve the long-term memory problem. Real-world conversations rarely remain consistent in topic, making it difficult to retrieve relevant information from a wide context window. In addition, larger context windows lead to increased computational cost and slower response time, making them impractical in many deployment scenarios.

Several approaches have been proposed to address this challenge:

  1. Retrieval-augmented Generation (RAG): Store the conversation history as documents and retrieve relevant chunks when needed.
  2. Memory-augmented models: Create dedicated architectural modifications to support persistent memory.
  3. Hierarchical memory system: Organizes memory in a hierarchical structure similar to the human memory system.

However, these approaches often struggle with scalability, efficiency, or the ability to maintain coherent reasoning over extended conversations.
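The RAG baseline from the list above can be sketched as follows. This is a deliberately bare-bones illustration: conversation turns are stored as chunks and ranked by word overlap with the query. Real RAG systems embed chunks with a vector model and use approximate nearest-neighbor search; the helper names here are hypothetical.

```python
import re

def tokenize(text: str) -> set:
    """Lowercase word set; a stand-in for a real embedding model."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, chunks: list, k: int = 2) -> list:
    """Rank stored chunks by token overlap with the query, return the top k."""
    q = tokenize(query)
    return sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)[:k]

history = [
    "User: I moved to Lisbon last spring.",
    "Assistant: Noted, Lisbon has great weather.",
    "User: My sister is allergic to peanuts.",
]
print(retrieve("Did I move to Lisbon?", history, k=1))
```

The scalability problem the text mentions is visible even here: as `history` grows, lexical overlap retrieves topically similar but stale turns, which is one motivation for the selective extraction pipeline Mem0 proposes.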

Mem0: A memory-centric architecture

Mem0 is a novel memory-centric architecture designed to dynamically capture, integrate, and retrieve salient information from an ongoing conversation. The system operates in two main phases:

Figure 3: Mem0 architecture, showing the extraction and update phases of the memory system.

Extraction phase: When processing a new message pair (user message and assistant response), Mem0:

  1. Retrieves conversation summaries and recent messages from the conversation history
  2. Uses LLM-based extraction to identify salient memories in the new exchange
  3. Considers the broader context of the conversation and extracts only relevant and important information

The extraction process is designed to be selective, capturing only information that may be needed for future interactions while filtering out trivial or redundant details.

Update phase: For each extracted fact, Mem0:

  1. Evaluates it against existing memories to maintain consistency and avoid redundancy
  2. Retrieves semantically similar memories from the database
  3. Presents these memories to the LLM through a function-call (tool-call) interface
  4. Determines the appropriate memory management action:
  • ADD: Create new memory
  • UPDATE: Enhance existing memory
  • DELETE: Remove outdated or incorrect memory
  • NOOP: No modification required

This approach allows for dynamic memory management that evolves as a conversation proceeds, similar to how humans consolidate and update their understanding over time.
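The function-call interface in step 3 can be sketched as a small dispatcher. The tool names and argument fields below are illustrative assumptions in the general function-calling style, not Mem0's actual schema: the LLM emits one of the four calls, and the host application applies it to the store.

```python
# Hypothetical tool schema; names and fields are illustrative, not Mem0's API.
MEMORY_TOOLS = [
    {"name": "add_memory",    "parameters": {"text": "string"}},
    {"name": "update_memory", "parameters": {"memory_id": "string", "text": "string"}},
    {"name": "delete_memory", "parameters": {"memory_id": "string"}},
    {"name": "noop",          "parameters": {}},
]

def apply_tool_call(call: dict, store: dict) -> dict:
    """Apply a model-issued memory tool call to the store (keyed by id)."""
    name, args = call["name"], call.get("arguments", {})
    if name == "add_memory":
        store[str(len(store))] = args["text"]   # naive id assignment
    elif name == "update_memory":
        store[args["memory_id"]] = args["text"]
    elif name == "delete_memory":
        store.pop(args["memory_id"], None)
    return store                                 # "noop" leaves store untouched

store = {}
apply_tool_call({"name": "add_memory", "arguments": {"text": "Prefers tea"}}, store)
print(store)  # {'0': 'Prefers tea'}
```

Keeping the operations behind an explicit tool interface is what lets the LLM's reasoning decide *which* operation applies while the application retains control of *how* the store is mutated.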

Mem0g: Graph-based memory representation

Based on the Mem0 architecture, Mem0g introduces a graph-based memory representation to capture complex relational structures. In this enhanced system:

Figure 4: Mem0g architecture with graph-based memory representation.

Graph structure: Memory is stored as a directed labeled graph with:

  • Entities as nodes (each node has a type classification, embedding vector, and metadata)
  • Relations as edges connecting nodes

Two-stage extraction process:

  1. Entity Extractor: Identifies key entities and their types from the input text
  2. Relation Generator: Infers meaningful connections between entities and builds relation triples (subject, predicate, object) to capture semantic structure

Conflict detection and resolution: When integrating new information, the system:

  1. Identifies potentially conflicting relationships in the existing graph
  2. Uses an LLM-based update resolver to determine whether certain relations should be marked as obsolete
  3. Maintains temporal consistency in the knowledge graph

This structured approach enables more sophisticated reasoning about complex, interconnected information than flat memory representations.
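The conflict-resolution idea can be sketched with a rule-of-thumb heuristic: a new triple with the same (subject, predicate) but a different object marks the old edge as obsolete rather than deleting it, preserving temporal history. In Mem0g this judgment is delegated to the LLM, so the equality check below is a simplifying assumption.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    subj: str
    pred: str
    obj: str
    valid: bool = True   # False once superseded, kept for temporal queries

def integrate(edges: list, new: Edge) -> list:
    """Add a triple, invalidating any live edge it supersedes.
    Heuristic stand-in for the LLM-based update resolver."""
    for e in edges:
        if e.valid and e.subj == new.subj and e.pred == new.pred and e.obj != new.obj:
            e.valid = False          # mark obsolete instead of deleting
    edges.append(new)
    return edges

graph = [Edge("Alice", "lives_in", "Berlin")]
integrate(graph, Edge("Alice", "lives_in", "Lisbon"))
print([(e.obj, e.valid) for e in graph])  # [('Berlin', False), ('Lisbon', True)]
```

Because superseded edges remain in the graph with a validity flag, time-scoped queries ("where did Alice live before?") stay answerable, which is the temporal-consistency property the text describes.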

--- END ---