Data Engineering: The Cornerstone of the RAG System

Written by
Clara Bennett
Updated on: June 19, 2025

Explore the core role of data engineering in the RAG system and gain an in-depth understanding of how this layer is built and applied.

Core content:
1. How the RAG system combines external knowledge sources with the generative capability of large language models
2. The evolution of the RAG system from Naive to Agentic
3. Technology selection and code examples for building a RAG question-answering system


Preface

Retrieval-Augmented Generation (RAG) systems aim to provide more accurate and context-aware answers by combining external knowledge sources with the generative power of large language models (LLMs). We detailed the evolution of RAG from Naive RAG to Agentic RAG in our previous article “Agentic RAG”.

An excellent RAG system is necessarily a complex system involving many design decisions. At the macro level, every link in the chain matters: indexing, retrieval, augmentation, and finally the output of the large model. Each link has its own technology selection, such as the embedding model, the vector database, the rerank model, and finally the LLM itself. Building a simple RAG question-answering system, by contrast, is trivial; with the help of AI, a single sentence can generate one for you:

import os
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# --- 1. Set up your OpenAI API key ---
# Make sure the OPENAI_API_KEY environment variable is set, or set it here:
# os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

# --- 2. Prepare and load the document ---
# Create a small sample file called my_document.txt
with open("my_document.txt", "w", encoding="utf-8") as f:
    f.write("LangChain is a powerful framework for building applications based on large language models.\n")
    f.write("A RAG system combines the power of retrieval and generation.\n")
    f.write("FAISS is an efficient similarity search library.\n")
    f.write("OpenAI provides advanced language models and embedding models.\n")

loader = TextLoader("my_document.txt", encoding="utf-8")
documents = loader.load()

# --- 3. Split documents ---
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(documents)

# --- 4. Create the text embedding model ---
# Requires the langchain-openai package and a valid OPENAI_API_KEY
embeddings = OpenAIEmbeddings()

# --- 5. Create the vector store ---
# FAISS is an in-memory vector store; alternatives include Chroma, Pinecone, etc.
try:
    vectorstore = FAISS.from_documents(texts, embeddings)
except Exception as e:
    print(f"Error creating vector store: {e}")
    print("This may be due to an invalid OpenAI API key or network problems.")
    raise SystemExit(1)

# --- 6. Create the retriever ---
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})  # retrieve the 2 most relevant chunks

# --- 7. Create the prompt template ---
template = """Answer the question based on the following retrieved context.
If you don't know the answer, say you don't know; don't try to make up an answer.
Use at most three sentences and keep your answer concise.

Context: {context}

Question: {question}

Helpful answer:"""
prompt = ChatPromptTemplate.from_template(template)

# --- 8. Create the large language model ---
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)

# --- 9. Create the RAG chain (LCEL, LangChain Expression Language) ---
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# --- 10. Run the RAG chain and ask questions ---
if __name__ == "__main__":
    print("RAG system is ready. Please enter your question:")
    while True:
        user_question = input("You: ")
        if user_question.lower() in ("exit", "quit"):
            break
        if user_question:
            try:
                response = rag_chain.invoke(user_question)
                print(f"AI: {response}")
                # To inspect the retrieved context:
                # retrieved_docs = retriever.invoke(user_question)
                # for i, doc in enumerate(retrieved_docs):
                #     print(f"Document {i+1}:\n{doc.page_content}\n")
            except Exception as e:
                print(f"An error occurred while processing the request: {e}")
        else:
            print("Please enter a question.")

However, this simple RAG system is limited to demos. In real personal and corporate scenarios, processing information, data, and knowledge, and empowering it with AI, is far more complicated. When the LLM produces its final output, it essentially uses the retrieved knowledge fragments as part of its prompt. The quality and organization of the data therefore directly determine the relevance and accuracy of the retrieved information, which in turn has a decisive impact on the quality of the content the LLM generates. Today we will break down the first and most critical step of RAG in detail: data engineering.

1. Data Loading

The first step in the RAG indexing process is to load data from various sources. This data may be unstructured, semi-structured or structured. Ensuring that the data is ingested accurately and completely is critical to the performance of RAG.

Challenges of unstructured data. For unstructured data formats, the term “data loading” often masks a complex, format-specific sub-process involving multiple specialized tools (e.g., OCR, layout analysis, table extraction, image caption generation). The engineering effort here should not be underestimated, and a generic “loader” is not sufficient to address these challenges (a loading sketch follows the list):

  • PDF files: PDFs are diverse: they may be text-based or image-based (scanned), or contain complex layouts such as tables, charts, formulas, and multi-column text. Simple text extraction often fails for scanned PDFs (OCR is required) or for PDFs with complex layouts;

  • DOCX/Office files: while generally more structured than PDFs, they may contain complex elements such as embedded objects, tables, and revision history;

  • HTML files: require robust parsing to remove boilerplate (navigation bars, footers, ads) and extract meaningful content, usually retaining some structural tags (such as titles and lists) as metadata or within the slice itself;

  • Multimodal documents: processing images in RAG usually means extracting embedded text (OCR) or generating text descriptions/captions using vision models.
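
As a minimal loading sketch (not a prescribed stack), the snippet below tries plain text extraction with pypdf and falls back to OCR when a page yields no text. It assumes pypdf, pdf2image (which needs a local Poppler install), and pytesseract (which needs a Tesseract binary) are available; the file name is hypothetical.

from pypdf import PdfReader
from pdf2image import convert_from_path  # requires Poppler installed locally
import pytesseract                       # requires a Tesseract binary

def load_pdf_text(path: str) -> list[str]:
    pages = []
    reader = PdfReader(path)
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        if not text.strip():
            # Likely a scanned page: render it to an image and run OCR instead.
            image = convert_from_path(path, first_page=i + 1, last_page=i + 1)[0]
            text = pytesseract.image_to_string(image)
        pages.append(text)
    return pages

pages = load_pdf_text("report.pdf")  # hypothetical input file
print(f"Loaded {len(pages)} pages")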

Handling semi-structured and structured data. For structured and semi-structured data, the core difficulty is not parsing the format itself, but converting its inherent structure (rows, columns, keys, nesting) into a linear text representation (slices) while retaining semantic meaning and not losing key relational information, so that the LLM can understand these data formats and their characteristics (a serialization sketch follows the list):

  • JSON/CSV files: although structured, the challenge is to meaningfully transform tabular or nested data into text slices that the LLM can understand. This may involve serializing rows, aggregating data, or extracting specific fields based on patterns;

  • Tables (standalone or in-document): require specialized parsing to maintain row/column relationships, and effectively slicing tables is a unique challenge in itself;

  • Codebases: indexing code requires parsing the abstract syntax tree (AST) or using a language-specific parser to create meaningful slices (e.g., functions, classes) rather than arbitrary splits. Metadata such as function signatures, dependencies, and docstrings are crucial.
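
As one way to make tabular data LLM-readable, the sketch below serializes each CSV row into a short natural-language sentence and keeps the source row number as metadata. The file name and column names ("name", "role", "team") are hypothetical assumptions.

import csv
from langchain_core.documents import Document

# Serialize each CSV row into a sentence-like slice, keeping provenance as metadata.
docs = []
with open("employees.csv", newline="", encoding="utf-8") as f:  # hypothetical file
    for row_num, row in enumerate(csv.DictReader(f), start=1):
        text = f"{row['name']} works as a {row['role']} on the {row['team']} team."
        docs.append(Document(page_content=text,
                             metadata={"source": "employees.csv", "row": row_num}))

print(docs[0].page_content if docs else "no rows")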

2. Data Cleaning

Data cleaning is not a one-time universal step but an iterative process that must be tuned to the specific data source and the goals of the RAG system. Overly aggressive cleaning may remove important contextual clues, and different data types and sources exhibit different "noise" characteristics. Data cleaning mainly includes the following parts (a combined sketch follows the list):

  • Cleaning: removing irrelevant content (boilerplate, ads, navigation bars, special characters, extra spaces) and correcting errors (spelling mistakes, grammatical errors);

  • Normalization: standardizing text (e.g., capitalization, date formats, units of measure) and data formats to ensure consistency;

  • Deduplication: identifying and removing duplicate or near-duplicate documents/slices, which prevents bias in retrieval results and reduces storage and processing costs;

  • Noise reduction: filtering out sections or documents that are not relevant to the purpose of the knowledge base.
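
A minimal cleaning-and-deduplication sketch, assuming whitespace noise and exact duplicates are the dominant problems; a real pipeline would add source-specific rules and near-duplicate detection (e.g., MinHash).

import hashlib
import re
import unicodedata

def clean_text(text: str) -> str:
    # Normalize Unicode forms, then collapse runs of whitespace into single spaces.
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def deduplicate(chunks: list[str]) -> list[str]:
    # Drop exact duplicates by hashing the cleaned text; near-duplicate
    # detection (MinHash/SimHash) would be a further step.
    seen, unique = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

raw = ["Hello   world.\n\n", "Hello world.", "Another   slice."]
print(deduplicate([clean_text(c) for c in raw]))  # ['Hello world.', 'Another slice.']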

3. Data Segmentation

Large language models (LLMs) have context window limits, and smaller, more focused slices can improve retrieval accuracy. Data segmentation is therefore crucial: the choice of segmentation strategy profoundly affects retrieval results and generation quality.

Basic segmentation methods. Basic segmentation methods are easy to implement, but they involve a trade-off between computational efficiency and semantic integrity; their main disadvantage is that they may cause semantic fragmentation (see the splitter sketch after this list):

  • Fixed-size segmentation: split the text into segments of a predetermined length, usually with overlapping parts;

  • Recursive character/text segmentation: parameterized by an ordered set of separator characters (such as line breaks), trying each in turn until the slices are small enough. The goal is to preserve the integrity of paragraphs, sentences, and words as much as possible;

  • Sentence-based segmentation: split the text on sentence boundaries, with each slice containing one or more complete sentences.
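
The snippet below contrasts the first two methods using LangChain's splitters (the same family used in the demo above); the chunk sizes are illustrative, not recommendations.

from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter

sample = ("Paragraph one talks about indexing.\n\n"
          "Paragraph two talks about retrieval. It has two sentences.")

# Fixed-size: with an empty separator, text is cut purely by character count.
fixed = CharacterTextSplitter(separator="", chunk_size=40, chunk_overlap=10)
# Recursive: tries "\n\n", then "\n", then " ", then "", so paragraph and
# sentence boundaries survive whenever possible.
recursive = RecursiveCharacterTextSplitter(chunk_size=40, chunk_overlap=10)

print(fixed.split_text(sample))      # may cut mid-word
print(recursive.split_text(sample))  # respects paragraph breaks where it can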

Advanced and context-aware segmentation methods. Advanced segmentation techniques reflect a fundamental shift from reliance on surface-level syntactic clues (fixed length, sentence terminators) to a deeper semantic understanding of the content. This requires more complex data engineering (integrating embedding models, NLP libraries, or even LLMs into the segmentation process), but is expected to produce higher-quality slices (a semantic-breakpoint sketch follows the list):

  • Semantic segmentation: use natural language processing techniques, usually embedding models, to identify natural breakpoints in meaning or topic and group semantically related sentences/content together;

  • Document structure-aware segmentation: use explicit structural elements in the document, such as Markdown headings, HTML tags, chapters, paragraphs, lists, or tables, as segmentation boundaries;

  • Agentic segmentation: an LLM determines appropriate segmentation points based on semantic meaning and content structure (paragraph type, chapter headings, instruction steps, etc.);

  • Hybrid methods: combine multiple strategies, for example performing structural segmentation first and then semantic segmentation within large chapters, or fixed-size segmentation with semantic analysis for boundary adjustment.
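
A minimal semantic-segmentation sketch: embed consecutive sentences and start a new slice wherever cosine similarity between neighbors drops below a threshold. The embedding model and the 0.75 threshold are illustrative assumptions, and the input is assumed to be a non-empty list of sentences.

import numpy as np
from langchain_openai import OpenAIEmbeddings

def semantic_chunks(sentences: list[str], threshold: float = 0.75) -> list[str]:
    # Embed every sentence, then break wherever adjacent sentences diverge in meaning.
    vectors = np.array(OpenAIEmbeddings().embed_documents(sentences))
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit-normalize rows
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(vectors[i - 1] @ vectors[i])  # cosine of neighbors
        if similarity < threshold:   # topic shift: close the current slice
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks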

Slice size vs. accuracy trade-off. There is no universal "optimal" slice size or overlap. The specific method must be tuned to the data characteristics, the embedding model, the retrieval strategy, the LLM context window, and the task requirements, usually through iterative evaluation. The main considerations are:

  • Smaller slices: higher retrieval accuracy, easier pinpointing of specific facts, better signal-to-noise ratio, and faster processing, but they may lack sufficient context for the LLM to fully understand the material or generate a comprehensive answer;

  • Larger slices: provide more context, which may lead to better generation quality and coherence, but they may dilute the retrieval relevance signal, contain more noise, and consume more of the LLM context window;

  • Slice overlap: repeating a certain amount of text from the end of one slice at the beginning of the next ensures contextual continuity across slice boundaries and prevents loss of meaning when a concept spans multiple slices, but too much overlap increases redundancy and storage costs.

4. Data Augmentation

In addition to basic segmentation, a variety of techniques can enrich the data to improve retrieval accuracy and provide better context for the LLM. These techniques usually involve generating additional information from or about the slices themselves, at the cost of increased preprocessing.

Enrich metadata. Metadata is additional information describing a document/slice, such as source, author, creation/modification date, keywords, subject, chapter title, page number, URL, or content-type tags. Metadata can be extracted manually, with rules, or automatically using NLP techniques such as named entity recognition (NER) and topic modeling. A small enrichment sketch follows.
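
As a minimal rule-based enrichment sketch, the snippet below attaches source, section, and date metadata to a slice; the regex and the field names are illustrative assumptions.

import re
from langchain_core.documents import Document

def enrich(chunk: str, source: str, section: str) -> Document:
    # Rule-based extraction: pull the first ISO-style date out of the text, if any.
    date_match = re.search(r"\b\d{4}-\d{2}-\d{2}\b", chunk)
    metadata = {
        "source": source,
        "section": section,
        "date": date_match.group() if date_match else None,
    }
    return Document(page_content=chunk, metadata=metadata)

doc = enrich("Policy updated on 2024-03-01 for all teams.", "policies.md", "Updates")
print(doc.metadata)  # {'source': 'policies.md', 'section': 'Updates', 'date': '2024-03-01'}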

Semantic enhancement. Generating concise summaries for each slice, and hypothetical question-answer pairs about each slice, creates richer data representations that better match query intent, potentially improving retrieval of subtle information or of information expressed in atypical query terms. A generation sketch follows.
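
A hedged sketch of generating hypothetical questions for a slice, reusing the ChatOpenAI model from the demo above; the prompt wording and model choice are assumptions. The generated questions would typically be embedded alongside (or instead of) the slice text, pointing back to the original slice.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
prompt = ChatPromptTemplate.from_template(
    "Write three short questions that the following passage answers, one per line:\n\n{chunk}"
)
question_chain = prompt | llm | StrOutputParser()

chunk = "FAISS is an efficient similarity search library."
questions = question_chain.invoke({"chunk": chunk}).splitlines()
# Embed and index these questions with a reference back to the original slice.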

Knowledge graphs. Represent information as entities and relationships: identify the entities (people, places, organizations, concepts) mentioned in text slices and build a graph of the relationships among them, allowing queries to target specific entities and improving retrieval accuracy. A small extraction sketch follows.
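
A minimal entity-graph sketch using spaCy NER and networkx; linking entities that merely co-occur in the same slice is a crude stand-in for real relation extraction and is stated here as an assumption.

from itertools import combinations
import spacy
import networkx as nx

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
graph = nx.Graph()

slices = ["OpenAI released GPT-4 in San Francisco.",
          "FAISS was developed at Meta."]
for i, text in enumerate(slices):
    entities = {ent.text for ent in nlp(text).ents}
    # Crude relation extraction: connect entities that co-occur in one slice.
    for a, b in combinations(sorted(entities), 2):
        graph.add_edge(a, b, slice_id=i)

print(graph.edges(data=True))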

Contextual retrieval. Prepend a short, slice-specific explanatory context to each slice before embedding, preserving the relationship between the slice and its broader document context. A sketch follows.
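
A hedged sketch of contextual retrieval in the style popularized by Anthropic: ask an LLM to situate each slice within the full document and prepend the result before embedding. The prompt wording is an illustrative assumption.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
prompt = ChatPromptTemplate.from_template(
    "Document:\n{document}\n\nChunk:\n{chunk}\n\n"
    "In one sentence, describe where this chunk fits in the document, "
    "so the chunk can be understood on its own."
)
contextualize = prompt | llm | StrOutputParser()

def contextualized_chunk(document: str, chunk: str) -> str:
    context = contextualize.invoke({"document": document, "chunk": chunk})
    return f"{context}\n{chunk}"  # embed this combined text instead of the bare chunk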

These advanced techniques represent a shift from viewing slices as isolated bags of words to representing them as components of a larger, structured knowledge network. This enables more nuanced and precise retrieval based on relationships and canonical entities, rather than just keywords or surface-level semantic similarity.

5. System Considerations

RAG data pipelines are not static, one-time builds. They are "living systems" that must adapt to new data, changing data formats, and scaling requirements. This calls for an architectural approach that prioritizes adaptability and evolutionary design from the outset: knowledge bases are rarely static, with new documents constantly added and existing documents updated or becoming outdated.

  • Scalability: the data pipeline must be able to handle growing data volumes and update frequencies;

  • Maintainability: clean code, modular design, good documentation, and robust error handling are essential as the pipeline grows more complex;

  • Modularity: designing loading, cleaning, slicing, and enhancement as independent, interchangeable modules makes it easier to update, test, and experiment with different components;

  • Operational efficiency: automating the pipeline, implementing robust logging and monitoring, and optimizing resource utilization help improve efficiency and control costs.

The “black-box” nature of many core indexing steps (such as embedding quality or the “retrievability” produced by a segmentation strategy) means their effectiveness must be evaluated indirectly, usually through the performance of downstream retrieval tasks. This makes evaluation complex and iterative, so it is critical to establish a framework for monitoring, evaluating, and iteratively optimizing the data engineering:

  • Monitoring: tracking pipeline health, processing time, resource usage, and error rates;

  • Data quality indicators: noise level (often measured implicitly), deduplication rate, and normalization consistency;

  • Slicing effectiveness: context relevance, answer relevance, faithfulness, semantic coherence within slices, and slice size distribution;

  • Metadata impact: assessing whether metadata filters improve search accuracy, and whether metadata enrichment leads to better contextual association;

  • Iterative optimization: using evaluation results to improve the data strategy (segmentation parameters, metadata extraction rules) in a continuous feedback loop.

6. Conclusion

The root of many RAG system challenges is treating data preprocessing as a simple sequence of steps rather than as a complex software engineering and data management problem requiring rigorous design, testing, and maintenance. It is necessary to identify and avoid common pitfalls in RAG index design and implementation:

  • Ignoring data quality: the most common pitfall. Failing to rigorously clean, normalize, and deduplicate data leads to a noisy, unreliable index;

  • Improper segmentation strategy: choosing a segmentation strategy unsuited to the data type or query patterns;

  • Insufficient or incorrect metadata: missing valuable metadata, or inaccurate/inconsistent metadata, hinders filtering and contextualization;

  • Lack of iterative evaluation and optimization: treating data processing as a one-time project rather than a process of continuous improvement;

  • Scalability bottlenecks: designing a pipeline that cannot scale with growing data volume or update frequency;

  • Ignoring cost implications: implementing overly complex or computationally intensive data processing without considering cost-benefit trade-offs.

Finally, returning to the indexing module: the future of RAG indexing is more dynamic, multimodal, and intelligent. Current RAG indexing is mainly an offline batch process; future trends point to "real-time RAG", which requires dynamic index updates. "Multimodal content" means the index must handle images, video, and audio, not just text. "Hybrid models" that combine keyword, semantic, and graph search imply a more complex, multi-faceted index. Together, these trends point to an indexing stage that is more deeply integrated with the retrieval and even generation stages, more automated, and able to handle greater complexity in data types and semantic relationships. This will further amplify the importance and difficulty of data engineering in RAG.

END