Pre-generated context: rebuilding the core engineering of RAG and laying the foundation for AI programming

Explore new trends in AI programming and learn how pre-generated context can optimize the RAG model.
Core content:
1. Application and advantages of pre-generated context in AI coding
2. Analysis of uncertainty factors in the RAG technical process
3. Challenges and optimization ideas faced in the indexing and retrieval stages
In the previous article "AI-friendly architecture: platform engineering empowers AI automatic programming", we mentioned that DevOps platforms should pre-generate a large amount of projects, templates, contexts and other information. In this article, we will elaborate on one of the core practices: pre-generating contexts.
In recent months, pre-generated context has become a hot topic, even a technical trend, in the field of AI coding. Developers benefit from AI-generated wikis to quickly understand the purpose and technical architecture of open-source projects. In terms of depth of understanding, current RAG- and document-retrieval-based methods still need improvement, and AST-based context analysis should be added to improve accuracy and coverage; in terms of efficiency, however, the pre-generated context approach has clearly become a trend.
Therefore, we started to develop a new feature for AutoDev, Context Worker, which aims to improve the effect of RAG by pre-generating context. If you are interested in this feature, please join our GitHub: https://github.com/unit-mesh/autodev-work
Introduction: Everyone knows how to do RAG, but no one can do it well
In 2023, the emergence of the LangChain framework both simplified and, in some ways, complicated the development of AI applications, and RAG (Retrieval-Augmented Generation) became a popular concept in application development. The core idea of RAG is to combine retrieval with generation: the generation model is enhanced by retrieving relevant information.
Technical factors: uncertain RAG, random model generation
Each link in the RAG process may introduce variables, which accumulate to form an "uncertainty chain" that affects the final effect.
Typically, mainstream RAG implementations consist of two main phases:
Indexing stage: split structured and unstructured data (which cannot be consumed directly) in some way, optionally vectorize the chunks, and store them in a database.
Retrieval stage: the user input is transformed and optionally vectorized, matched against the database, and the relevant context information is returned; after optional post-processing, it is passed to the generation model.
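The two stages above can be sketched minimally as follows. The bag-of-words "embedding" here is a stand-in for a real embedding model, and the documents are made up for illustration:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words term-frequency vector.
    # A real pipeline would call an embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Indexing stage: split data into chunks, vectorize, and store.
documents = [
    "RAG combines retrieval and generation.",
    "Pre-generated context is built offline before any query.",
]
index = [(doc, embed(doc)) for doc in documents]

# Retrieval stage: vectorize the query, rank stored chunks, and return
# the top hits as context for the generation model.
def retrieve(query: str, top_k: int = 1) -> list[str]:
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```

Every step marked "optional" in the text (vectorization, post-processing) is a point where implementations diverge, which is exactly where the uncertainty discussed below creeps in.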
There are a lot of uncertainties in this process, which leads to the fact that the effect of RAG is often not as good as expected.
Indexing stage: the quality of knowledge-base construction directly determines the upper limit of later retrieval. The core difficulties include: how to chunk reasonably while preserving semantic integrity (especially code boundaries), how to select a suitable embedding model for vectorization, and how to pre-process multi-source data effectively to avoid "garbage in, garbage out".
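As a sketch of boundary-aware chunking, Python's own `ast` module can split source code at function and class boundaries so that no chunk cuts a definition in half (production tools such as tree-sitter do this across many languages and nested scopes):

```python
import ast

def chunk_by_definition(source: str) -> list[str]:
    """Split Python source at top-level function/class boundaries so each
    chunk keeps a semantically complete unit."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive.
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

code = '''
def add(a, b):
    return a + b

def sub(a, b):
    return a - b
'''
chunks = chunk_by_definition(code)
```

Compared with fixed-size splitting, each chunk here is a complete definition, so its embedding represents one coherent unit of meaning.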
Retrieval stage: even when the user's query intent is clear, finding relevant information from the index efficiently and accurately is not easy. Accurate retrieval requires resolving query ambiguity, improving search capabilities (e.g., hybrid search, HyDE), re-ranking results, and controlling context length to prevent information loss or interference with generation.
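Hybrid search is commonly implemented by fusing a keyword ranking with a vector ranking; Reciprocal Rank Fusion (RRF) is one widely used, model-free way to merge them. The file names below are illustrative:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge several ranked result lists.
    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical result lists from a keyword engine and a vector index:
keyword_hits = ["parser.py", "lexer.py", "ast.py"]
vector_hits  = ["ast.py", "parser.py", "docs.md"]
fused = rrf([keyword_hits, vector_hits])
```

A document ranked well by both retrievers (here `parser.py`) rises to the top, which is precisely the behavior hybrid search is after.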
Finally, because the LLM itself introduces randomness when generating responses, its understanding of the query intent and its interpretation of the retrieved context may be biased, producing inconsistent or inaccurate output even for identical input. RAG systems depend heavily on the quality and relevance of external data; if retrieval is biased, the quality of the generation stage inevitably suffers.
Business factors: A large amount of junk knowledge is affecting the effectiveness of RAG
Last year, in conversations with many companies, we found that the sheer volume of documents and code ("too much documentation") posed a major challenge: code and documentation were versioned inconsistently. Generally speaking, code that passes the pipeline works, but for documents it is difficult to tell whether "04 - 03.docx" or "Revision 12.docx" is newer. Even if you can tell which is the latest version, it is difficult to tell whether it is the correct one.
The problem is further compounded by uneven quality: at query time, the model may retrieve and learn from suboptimal or erroneous examples. Without proper guidance, generative AI may produce harmful, false, or plagiarized content, and these risks are amplified when the input data is large and uncurated (not pre-processed or filtered).
Against this data backdrop, even the best text segmentation, vector embedding, and retrieval algorithms can only operate on noisy, outdated, or erroneous data. As a result, RAG systems not only fail to solve an enterprise's original information chaos, they may perpetuate or even amplify it.
Code retrieval: from structured retrieval to DeepWiki
Generative AI handles creating new projects well, but understanding and modifying existing code remains challenging. Software engineers and AI researchers think differently: engineers rely on classic engineering discipline, while AI embodies a new generation of "with enough force, even a brick can fly" brute-force power. The two mindsets have been steadily blending over the past few years.
Current mainstream AI programming tools adopt a variety of strategies for code retrieval. They usually combine traditional keyword- and structure-based methods with emerging AI technologies in order to strike a balance between speed, accuracy, and contextual understanding.
Key Information Retrieval: Engineering Thinking and Structured Indexing
When your search needs to run on the client, you must consider performance and efficiency: how to search efficiently on the local machine instead of relying on cloud computing power.
The code itself is structured, but that does not mean that the mainstream code retrieval method is structured. We mentioned in the previous article that there are two mainstream code retrieval methods:
Retrieval based on key information: such as Cline, Copilot, Cursor, etc.
Code and text based retrieval: such as Bloop, SourceGraph Cody, etc.
From the existing methods, the mainstream is the retrieval method based on key information. Such as:
Cline: AST (Abstract Syntax Tree) + Regular Expression Search
Copilot (2024): Generate keywords + TreeSitter AST (abstract syntax tree) to search for key information (class, method name, etc.)
Cursor: Ripgrep text search + cloud-based vectorization
Continue: SQLite-based text search + LanceDB local vectorization
…
In short, these tools all retrieve by key information: only the class names, method names, variable names, and so on need to be indexed. The user's input, a keyword or a sentence, is converted into query conditions by the model, and the results are finally summarized by the model.
Keyword search is fast, but its understanding of user intent is limited and it easily returns irrelevant results; retrieval based on the abstract syntax tree understands code structure but falls short on semantic similarity.
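A minimal illustration of key-information indexing, in the spirit of (not reproducing) the tools above: a regular expression that extracts class and function names from source code, which is the kind of lightweight symbol index these clients search first:

```python
import re

# Match top-level and nested `class`/`def` declarations and capture the
# identifier. Real tools (e.g. TreeSitter-based ones) parse a proper AST
# instead of relying on regex alone.
SYMBOL_RE = re.compile(r"^\s*(?:class|def)\s+([A-Za-z_]\w*)", re.MULTILINE)

def extract_symbols(source: str) -> list[str]:
    return SYMBOL_RE.findall(source)

# Hypothetical source file for demonstration:
src = """
class OrderService:
    def create_order(self, item):
        pass

def audit_log(event):
    pass
"""
symbols = extract_symbols(src)
```

Such an index answers "where is `create_order` defined?" instantly, but it cannot answer "where is order validation handled?", which is exactly the semantic gap described above.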
Key Documentation Generation: Pre-Generated Codebase Documentation
When your search can run in the cloud, performance and efficiency still matter. Pre-generating documentation becomes a very good solution: it addresses the problem of large volumes of outdated documents and improves retrieval efficiency.
The following is our research, produced with DeepResearch: "Comparative Analysis of Pre-Generation Strategies in Code Documentation Tools".
As we said at the beginning, although the current RAG-based document retrieval method still needs to be improved in terms of the depth of understanding, once it becomes a technological trend, a large number of tools and platforms will begin to support this trend.
However, due to the lack of accurate code associations, these tools are very limited when faced with complex code or code containing implicit logic, especially in sparsely documented or very large projects.
Similarly, after the model capabilities are further enhanced, AI-friendly documentation is also a very important topic. The accuracy of the code in the documentation becomes even more important. Incorrect documentation and knowledge will make your AI assistant difficult to trust.
Pre-build context
Pre-generated context means that before the user initiates a query or generates a request, the system builds a set of structured context data offline for a specific code repository, document, or SDK. These contexts are understood, processed, and organized so that they can be quickly retrieved and referenced at runtime, thereby improving the accuracy, relevance, and response speed of the code agent when generating, interpreting, or retrieving code.
Its core elements include:
Extraction and parsing of documents and code: including API documentation, source-code comments, sample code, change logs, etc.;
Semantic understanding and summary generation: extracting meta-information such as key capabilities, uses, and limitations;
Vectorization and index construction: building embedding indexes for fast semantic retrieval;
Version binding and content-update strategy: ensuring the context always stays consistent with a specific version.
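These elements can be pictured as a simple record type. The field names below are illustrative, not AutoDev's actual schema:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class ContextRecord:
    """One pre-generated context entry for a source artifact."""
    source_path: str      # file or document this context was built from
    summary: str          # LLM-generated summary / intent description
    embedding: list       # vector for fast semantic retrieval
    version: str          # bound to a specific commit tag or release

    content_hash: str = ""

    def bind(self, content: str) -> None:
        # Hash the source content so stale records are detectable
        # when the underlying file changes between versions.
        self.content_hash = hashlib.sha256(content.encode()).hexdigest()

rec = ContextRecord("src/api.py", "HTTP handlers for orders", [], "v1.2.0")
rec.bind("def create_order(): ...")
```

At query time, a record whose hash no longer matches the current file can be refreshed or excluded, which is one way to implement the version-binding element above.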
AutoDev Explore: AutoDev Context Worker
Based on the above ideas and research, we started to build AutoDev Context Worker. Combined with the capabilities of our AutoDev IDE plugin, the following are our basic ideas:
Deep project analysis and structured AST construction: Context Worker performs deep analysis on the entire project (or a specified module scope). This includes building a complete AST; identifying all functions, classes, and interfaces along with their signatures and comments (docstrings); analyzing project dependencies (internal modules and external libraries); and building a preliminary dependency graph.
Automatic code summarization and "intent" annotation: for code blocks (functions, complex logic sections) that lack good comments, use an LLM to pre-generate concise summaries or "intent descriptions". Key architectural components or core algorithms can be pre-tagged with specific labels or metadata.
Project-level knowledge graph construction: parse code entities (classes, functions, variables, etc.) and their relationships (calls, inheritance, implementation, references, etc.), build a knowledge graph around the domain model, and annotate entities with semantic and contextual information.
…
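A project-level knowledge graph of the kind just described can be sketched as an adjacency list with typed edges. The entity names here are hypothetical, not from a real project:

```python
from collections import defaultdict

# Nodes are code entities; edges are typed relations such as
# "calls", "inherits", "implements", "references".
graph: defaultdict = defaultdict(list)

def add_relation(src: str, relation: str, dst: str) -> None:
    graph[src].append((relation, dst))

add_relation("OrderController", "calls", "OrderService")
add_relation("OrderService", "calls", "OrderRepository")
add_relation("OrderService", "implements", "IOrderService")

def callees(entity: str) -> list[str]:
    # Answer a typical context query: what does this entity call?
    return [dst for rel, dst in graph[entity] if rel == "calls"]
```

Queries over such a graph ("what calls `OrderService`?", "what does it implement?") give an AI assistant precise, structural context instead of a bag of text chunks.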
Through this pre-computation and pre-organization, Context Worker aims to provide high-quality, immediately usable, deeply structured context for AI-assisted development functions (code generation, defect repair, code refactoring, requirements understanding, etc.). It addresses the incomplete or inaccurate context and low retrieval efficiency that traditional RAG exhibits when processing complex code, thereby better realizing the concept of "AI-friendly architecture" and enabling higher-level AI automatic programming capabilities.
Summary
This article explores pre-generated context as a key mechanism for enhancing AI programming capabilities. The uncertainty and knowledge-quality problems of traditional RAG make pre-generated context a more reliable alternative. Comparing current code retrieval methods shows that key-information-based retrieval is fast but shallow in understanding, while pre-generated documentation tools such as DeepWiki have made progress yet still fall short on complex code logic.
Pre-generated context represents an important practice of AI-friendly architecture, which combines the structured thinking of traditional software engineering with the generation ability of AI large models, paving the way for the next generation of intelligent programming tools. By actively building rather than passively retrieving code context, we can better enable AI to understand and modify existing code bases, allowing developers to focus more on creative work rather than repetitive tasks.