Under 1B parameters yet rivaling large models? A dark horse in RAG arrives

The Pleias-RAG series of models breaks the mold with small parameter size and excellent performance!
Core content:
1. The main contributions and features of the Pleias-RAG series of models
2. Innovative workflows in the model inference process
3. Mid-training strategies and the generation and application of synthetic data
❝In a nutshell, this paper is like a "Swiss Army knife of academia": it can search and reason, switches between languages as smoothly as silk, yet its parameter count is toy-sized.
1. Paper Analysis: Contributions, Difficulties, and Concept Dependencies (Stage 1)
1.1 Analysis of main contribution points
The paper proposes a new family of "small reasoning models", the Pleias-RAG series, and presents two small models for retrieval-augmented generation (RAG) tasks: Pleias-RAG-350M and Pleias-RAG-1B. Despite their relatively small parameter scale (hundreds of millions to about a billion), they still provide strong retrieval and reasoning capabilities and remain highly usable in multilingual environments. The main contributions are:

Built-in citations with verbatim excerpts. Unlike the common approach of attaching citations in a post-processing step, these models generate citation tags for documents or source texts directly while generating the answer (for example, in the form <ref name=...>), and can automatically insert the corresponding original-text fragment into the same answer. This mechanism greatly improves the traceability and verifiability of the answer.

Tight integration with the RAG workflow. The paper designs a structured "reasoning sequence" (a multi-step reasoning template) for the model, running from analyzing the user question, deciding whether retrieval is needed, and switching languages, to integrating citations, drafting the answer, and citing and rearranging information. This whole sequence is carried out end to end inside the model's reasoning process.

Improving model performance with large-scale synthetic data mid-training. The authors "mid-train" the models on a large body of synthetic data (millions of RAG samples with corresponding reasoning-process examples), so that the models learn complex retrieval, multi-hop reasoning, and automatic citation. The data cover multi-language, multi-domain information fragments with matching questions and answers, simulating retrieval and answering in real scenarios.

Multilingual capability and lightweight deployment. Compared with open models of equal or larger parameter size, the Pleias-RAG series loses little accuracy on RAG tasks in major European languages such as French, Italian, and German. And because the models are small, they are easy to deploy in constrained hardware environments (e.g., mobile devices, a Raspberry Pi).
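To make the built-in citation format concrete, an answer might look roughly like the following; the wording mirrors the Spanish example used later in this article, and the exact tag syntax beyond <ref name=...> is an illustrative assumption:

Respuesta: La Segunda República Española comenzó el 14 de abril de 1931 <ref name="source_1">...La Segunda República se proclamó el 14 de abril de 1931...</ref>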
1.2 Identification of Difficulties in Understanding
RAG (retrieval-augmented generation) principles and details. Although RAG has been a very popular technique in recent years, the paper designs its own RAG implementation details, including multi-round reasoning, citation formatting, content rearrangement, and "rejection logic". To truly understand the paper, you first need to understand how it integrates multiple retrieved document fragments (sources) into the final answer and includes citations while the answer is being generated.

Mid-training methods and synthetic data generation. The paper does not simply fine-tune or post-train; it adopts a large-scale "mid-training" strategy (between base-model training and downstream task fine-tuning). How to build, filter, and use massive amounts of synthetic data so that the model learns multi-step reasoning and citation generation is a fairly complex process. Understanding this part requires grasping the structure of the synthetic data, how multi-language and multi-domain samples simulate real retrieval scenarios, and how training ensures that the model learns citations, passage excerpting, and multi-step reasoning (a sketch of one possible sample layout follows).
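The article does not spell out a sample schema, so the following is only a minimal sketch, assuming a JSON-like layout, of what one synthetic RAG training example might contain: a query, several sources, a structured reasoning trace, and an answer with in-text citations. All field names and contents are illustrative assumptions.

    sample = {
        "query": "¿Cuándo comenzó la Segunda República Española?",  # user question (Spanish)
        "sources": [
            {"id": "source_1", "text": "...la Segunda República se proclamó el 14 de abril de 1931..."},
            {"id": "source_2", "text": "Background passage on the period's main leaders."},
        ],
        # structured reasoning trace: query analysis, then source analysis
        "reasoning": [
            "query_analysis: historical-date question, answerable from the sources",
            "source_analysis: source_1 gives the date; source_2 is background only",
        ],
        # answer with a built-in citation tag and a verbatim excerpt
        "answer": 'La Segunda República Española comenzó el 14 de abril de 1931 '
                  '<ref name="source_1">...la Segunda República se proclamó el 14 de abril de 1931...</ref>',
    }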
In-text citation mechanism. Search-based dialogue systems usually do "post-processing", matching citations to the answer after the model has generated it, whereas the approach proposed in the paper generates the citation information and some original-text fragments during the reasoning process itself. To understand this mechanism, you need to see how the model attends to "text coherence" and "correctness of the citation format" at the same time while generating, and how it excerpts parts of the original text.

Structured multi-step reasoning framework. The paper designs a "standardized workflow" for model reasoning: query analysis, source analysis, draft plus citation generation, and the final answer. How this chained reasoning is implemented in a small model is a technical highlight of the paper; understanding the structured process helps explain why a small model can achieve high accuracy and reliability in retrieval scenarios.

Multilingualism and language switching. The paper specifically points out that Pleias-RAG is stable in multilingual environments (especially major European languages) and can retrieve or answer across languages. This means understanding how the model learns the tokenization and syntax of each language from the training data, and how languages are mixed or switched during reasoning and citation.

The concept dependencies run roughly as follows. Basic concepts of RAG → structured multi-step reasoning framework → built-in citations and original-text excerpts: you must first understand what retrieval-augmented generation is and its common problems, such as how to integrate retrieval results into answers and how to avoid hallucinated answers, before you can see why the multi-step reasoning and citation mechanisms proposed in the paper matter. Synthetic data generation method → mid-training strategy → reasoning and citation ability of small models: the paper spends considerable space explaining how to generate high-quality, diverse synthetic data so that the model masters retrieval-and-answer patterns across languages and scenarios; understanding data generation and mid-training helps explain why "small models can also have strong reasoning ability". Multilingual support → rearrangement and citation → accuracy and traceability of the final answer: while supporting multilingual answers, the model must insert citations and original-text fragments correctly as it generates and keep the language context consistent, which are demanding generation tasks that build on all the preceding elements. Two questions run through all of this: how "mid-training" on large-scale synthetic data yields multi-step reasoning and citation learning, and how "structured multi-step reasoning" and "built-in citations" are designed and implemented together in a small model, within a retrieval-and-answer process that supports multiple languages.

As a preview of the metaphor developed in Stage 2: you (the model) need to answer "how to get from the current city to a more scenic destination", which corresponds to the question raised by the user. You have at your disposal a collection of transport information, maps, tourist guides, and so on, which correspond to the multiple documents or sources retrieved. You need to check back and forth between these materials (comparing ticket prices, route feasibility, accommodation information, etc.), which corresponds to multi-step reasoning and information integration.
In your final travel plan, you must "mark where each idea or piece of data comes from" and quote key route sections, fares, and other details verbatim in the itinerary, which corresponds to generating an answer with citation marks and original-text excerpts.

The elements of the metaphor map to the technology as follows. "Travel goal to be planned" ↔ "user's question (query)": in the travel scenario you are heading to a specific city or attraction; for the model, the user's question is the goal that ultimately has to be solved. "Available transport information, maps, and guides" ↔ "retrieved documents or sources (sources)": these materials are the pieces of available information, some highly relevant and some pure noise; RAG selects the most useful ones. "Comparing routes, accommodation, costs, etc." ↔ "structured reasoning sequence": the model first analyzes the question, then filters the information fragments, then organizes feasible solutions, and finally integrates them into an answer. "Including references in the itinerary and quoting key fares or timetables" ↔ "in-text citations and quotes": when generating the answer, the RAG model likewise marks the source of the information (e.g. <ref name="source_x">) and includes the original-text fragments behind the core conclusion. "Writing a complete travel plan for your travel companions" ↔ "finally generating a text with both answer and citations": the model's output is not just a one-sentence conclusion; it includes precise citation tags and excerpted content, making it easy for "peers" (users or reviewers) to check.

Original mathematical form:
P(a | q) = Σ_{d ∈ D} P(d | q) · P(a | q, d)

Here D represents the retrieved document collection; P(d | q) is the probability that the retrieval model selects document d for the question q; and P(a | q, d) is the probability of generating the answer a given the question q and the document d.

Symbol replacement version (natural language interpretation): after the user asks a question, the model retrieves multiple documents for it; these documents may contain clues to the answer or may be distractors, and "the credibility with which the retrieval subsystem selects a document" describes this step, corresponding to P(d | q). Given a specific document, the model formulates an answer from the document and the question, which gives "the probability of generating the answer based on that document", corresponding to P(a | q, d). Adding up the contributions of all documents gives the overall probability of the answer, corresponding to the summation over d ∈ D.
Key implementation steps:
Search/Filter: based on the user question, use vector search, BM25, or another retrieval algorithm to select the most relevant documents from a large corpus.
Multi-step reasoning: the model analyzes and judges the question and the documents in turn (called "query analysis", "source analysis", etc. in the paper), and internally scores, sorts, or selects citations for each document.
Answer and citation generation: when the answer is output, the corresponding citation tags (such as <ref name="source_x">) and original-text excerpts are automatically inserted into the generated text.

Mapping these steps back to the travel metaphor:
Search/Filter → finding travel information: you first search the web for guides and ticket information, which corresponds to retrieving the document collection D for the question q.
Multi-step reasoning → comparing routes, accommodation, and prices: you might first work through "Route 1" and then compare it with "Route 2", eliminating unsuitable options along the way, which corresponds to the model internally analyzing multiple documents, scoring them, and selecting the useful information.
Answer and citation generation → writing the itinerary and indicating sources: when you share the final itinerary with friends, you note "from the XX transport official website" or "according to the railway station website", which corresponds to the model embedding <ref> tags in the answer and excerpting the original document passages.
The summation in the formula → weighting different routes: you may borrow part of the itinerary from "Route 1" and part of the hotel plan from "Route 2" before making an overall decision, which corresponds to summing the contributions of multiple documents.
Limitation: in the metaphor a traveler usually checks only a few transport routes and accommodation reviews, whereas a real RAG scenario requires quickly searching and filtering massive document collections and also involves complex dimensions such as multiple languages. Using "travel" to understand multi-step reasoning and citation is therefore figurative; the actual model runs at a larger scale with a more complex process.

Core connections: user question ↔ travel goal; retrieved documents ↔ transport/accommodation/tourism information; multi-step reasoning and integration ↔ repeated comparison of routes; built-in citations and original excerpts ↔ indicating where the information came from and quoting key fares or timetables.
Key takeaways: this correspondence makes it easier to see why the model cites while it generates, and why multi-step reasoning is necessary: with a large number of documents, the accuracy and traceability of answers cannot be guaranteed without structured processing. At the formula level, P(a | q) = Σ_{d ∈ D} P(d | q) · P(a | q, d) expresses the common idea of RAG: first retrieve documents, then integrate their information to generate the answer. Built-in citations make the answer more "verifiable": once a reader wants to check whether the fare for a certain route is really that cheap, they can go back to the original text (or the official website); the model learns this process during training, which keeps the citations accurate and the overall answer coherent.
At inference time, the workflow from input to output runs roughly as follows.
Input reception: the model receives the user's query (query) and, optionally, the documents/segments already retrieved or still to be retrieved (sources).
Multi-language identification & query analysis: the model first determines the query language, then performs "query analysis" internally to understand the question type and decide whether more information needs to be retrieved or whether it can answer directly.
Retrieval/document selection (optional): if retrieval is required, the model or the surrounding system issues the query and obtains several document sources from a knowledge base or search engine.
Source analysis: the retrieved documents are evaluated (are they highly relevant to the question, do they complement each other, etc.) and a preliminary internal arrangement of them is produced, such as chronological ordering, paragraph marking, and citable fragments.
Draft & citation: the model formally generates a draft answer and simultaneously inserts citation tags (e.g. <ref name="source_x">) and the necessary original-text excerpts. If it finds the information insufficient during this process, it may refuse to answer or prompt for further inquiry.
Final answer generation: the output contains not only the answer text but also traceable references; in multilingual scenarios, the answer is presented in the same language as the input (unless the policy requires otherwise).

The following walkthrough, revisited in Section 3.2, shows these stages on a concrete multilingual example: a history question asked in Spanish.
Input reception: the user question (query) is "¿Cuándo comenzó la Segunda República Española?". Since the deployment environment already has a preliminary retrieval subsystem, the model knows internally that it can request more documents if the provided material is insufficient, or proceed directly to analysis if it is enough.
Multi-language recognition & query analysis: the model automatically detects that this is a Spanish question. After "query analysis", it concludes that it needs to answer the date of a historical event and judges that this information can be found in the indexed historical documents.
Search/document selection (optional): if the system determines that a search is needed, it submits the keyword "Segunda República Española" to the historical document index and obtains several paragraphs. Assume three relevant texts are retrieved
(source_1, source_2, source_3), plus some distracting text.
Source analysis:
source_1: mentions that the Second Spanish Republic began on April 14, 1931;
source_2: briefly describes the main leaders of the period;
source_3: discusses contemporary European political developments without giving dates.
The model "reads" source_1, source_2, and source_3 in turn, filtering out or marking irrelevant information. For example, it treats source_1 as the core citation source and internally records the original-text fragments that need to be quoted.
Drafting & citation: the model first writes in its answer "The Second Spanish Republic began on April 14, 1931..." and also inserts a citation tag, e.g. <ref name="source_1">. Since the source text contains the sentence "...la Segunda República se proclamó el 14 de abril de 1931...", the model selects the key part and inserts it as an "original-text excerpt", such as <ref name="source_1">...La Segunda República se proclamó el 14 de abril de 1931...</ref>. While drafting, if the model finds that the remaining two sources do not help establish the date or the conclusion, it mentions them only briefly, or cites them only as supplementary content.
Final answer generation: the final answer output by the model might be:
❝ Respuesta: La Segunda República Española comenzó el 14 de abril de 1931.
This completes a full multi-step reasoning pass and yields a cited answer.
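For concreteness, the three retrieved sources in this walkthrough could be represented as a simple list of records like the one below; the texts paraphrase the descriptions above, and the field names are assumptions chosen to line up with the pseudocode in Section 3.3:

    sources = [
        {"id": "source_1", "text": "...la Segunda República se proclamó el 14 de abril de 1931..."},
        {"id": "source_2", "text": "Brief account of the period's main leaders."},              # background only
        {"id": "source_3", "text": "Contemporary European political developments, no dates."},  # distractor here
    ]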
1.3 Concept Dependencies
To sum up, the concepts and modules that most need careful explanation are the ones identified above: the basic RAG mechanism, the structured multi-step reasoning framework, the built-in citation and excerpting mechanism, mid-training on synthetic data, and multilingual support.
In the in-depth explanation of the next stage (Stage 2), it is recommended to focus on the core concept of "a retrieval-augmented generation mechanism with built-in citations" (which involves several key points such as structured reasoning, citation insertion, and multi-language support), because it best demonstrates the innovative characteristics and technical difficulties of the paper, and also best reflects the value of mid-training on synthetic data.
Next, we can use this core concept ("RAG multi-step reasoning mechanism with built-in references") as a starting point and make metaphors and in-depth explanations according to the framework of the second stage.
2. Detailed explanation of the core mechanism: built-in reference RAG multi-step reasoning (Stage 2)
This stage focuses on the core concept selected in the first stage, the "RAG multi-step reasoning mechanism with built-in citations", and explains it in depth, using an everyday metaphor to make the mechanism easier to grasp together with an appropriate mathematical formula. Since the original paper does not explicitly give mathematical formulas, the formula used here reflects the general principle of retrieval-augmented generation (RAG); readers can treat it as a typical mathematical representation of the paper's "RAG process with built-in citations".
2.1 Designing life metaphors
To understand the "RAG multi-step reasoning mechanism with built-in citations", you can compare it to planning a trip across cities, as sketched at the end of Section 1.2:
In this process, you must decide where to go (solve the core problem), explain "why you chose this route", and provide the corresponding references, so that readers can verify reliability and trace the sources.
2.2 Establishing the correspondence between metaphor and actual technology
The key elements of the travel metaphor match the technical points of the "RAG multi-step reasoning mechanism with built-in citations" as laid out in Section 1.2: the travel goal corresponds to the user's query, the guides and maps to the retrieved sources, the back-and-forth comparison of routes to the structured reasoning sequence, and the annotated itinerary to in-text citations with verbatim excerpts.
2.3 In-depth technical details (including formula examples)
When the technology is described formally, the retrieval and generation process of RAG can usually be abstracted into a typical formula. Although the paper does not give an explicit mathematical form, the following formula reflects the general probabilistic structure of the "built-in citation RAG" it describes. Assume the user input is q (the query), the answer is a, and the retrieved candidate document set is D; then the core formula usually used in RAG can be written as:

P(a | q) = Σ_{d ∈ D} P(d | q) · P(a | q, d)

In natural language: the retrieval component assigns each document d a selection probability P(d | q); the generator produces the answer a with probability P(a | q, d) given that document and the question; and summing over all retrieved documents gives the overall probability of the answer.
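To make the summation concrete, here is a minimal Python sketch with made-up probabilities; the values and the source_1…source_3 names are purely illustrative and not taken from the paper:

    # P(a|q) = sum over documents d of P(d|q) * P(a|q,d)
    retrieval_probs  = {"source_1": 0.6, "source_2": 0.3, "source_3": 0.1}   # P(d|q), assumed
    generation_probs = {"source_1": 0.9, "source_2": 0.2, "source_3": 0.05}  # P(a|q,d), assumed

    p_answer = sum(retrieval_probs[d] * generation_probs[d] for d in retrieval_probs)
    print(p_answer)  # 0.6*0.9 + 0.3*0.2 + 0.1*0.05 = 0.605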
2.4 Mapping Technical Details to Metaphors
Going back to the "cross-city travel" metaphor, the technical steps map onto everyday actions as described in Section 1.2: retrieval corresponds to gathering guides and ticket information, multi-step reasoning to comparing routes and accommodation, answer-plus-citation generation to writing the itinerary with its sources noted, and the probability summation to weighing the contributions of different routes.
2.5 Summary: Combining Metaphor and Technology
Through the metaphor of "cross-city travel", we can intuitively understand the "built-in citation RAG multi-step reasoning mechanism" proposed in the paper: the model gathers sources, reasons over them step by step, and cites verbatim excerpts while writing its answer.
At this point, readers should have a more vivid picture of the technical principle. In the third stage below, we explain in detail how the method goes from input to output, following the concrete process provided by the paper, with an example and a pseudo-code style description; readers will then see more clearly how the complete method is implemented as a "pipeline".
3. Implementation process analysis: from input to output (Stage 3)
This stage aims to describe in detail the processing flow of the complete model or solution proposed in the paper, from input to output. Based on these descriptions, readers should be able to reproduce the specific steps of the method and even write corresponding pseudocode.
3.1 Process Overview
The "RAG multi-step reasoning mechanism with built-in citations" described in the paper can be treated as a complete pipeline when actually deployed, usually with the following main stages: (1) input reception; (2) language identification and query analysis; (3) retrieval/document selection (optional); (4) source analysis; (5) draft and citation generation; (6) final answer generation (each stage is described in detail in Section 1.2 above).
In most practical applications, step 3 (retrieval/document selection) may be performed by external components (such as BM25 or a vector index), with the search results then fed into the model; a toy sketch of such an external retrieval step follows below. The outstanding innovation in the paper is the close coordination between steps 4 and 5: the model does not wait until the answer is generated and then do "post-processing", but "inserts citations while reasoning".
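As a stand-in for step 3, here is a deliberately naive Python retriever based on keyword overlap. A real deployment would use BM25 or a vector index as noted above; treat this purely as an illustrative placeholder that assumes a corpus of {"id": ..., "text": ...} records.

    def fetch_documents(query, corpus, top_k=3):
        """Toy retrieval: rank documents by how many query words they share."""
        query_words = set(query.lower().split())
        scored = []
        for doc in corpus:  # corpus: list of {"id": ..., "text": ...} records
            overlap = len(query_words & set(doc["text"].lower().split()))
            scored.append((overlap, doc))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for overlap, doc in scored[:top_k] if overlap > 0]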
3.2 Example Demonstration: From Input to Output
The Spanish-language walkthrough at the end of Section 1.2 demonstrates how this RAG multi-step reasoning mechanism works when a user asks a question; the multilingual scenario comes from the paper, with the user entering a history question in Spanish. In brief: the question "¿Cuándo comenzó la Segunda República Española?" is analyzed, three sources are retrieved, source_1 supplies the date, and the model answers in Spanish with a <ref name="source_1"> citation containing a verbatim excerpt.
3.3 Pseudocode description of the process
The following pseudo code illustrates a typical "built-in reference RAG multi-step reasoning process". This process separates retrieval from model reasoning, but implements reference insertion within the model:
# Input: user question Q; (optional) external retrieval component fetchDocuments()
def RAG_with_intext_citation(Q):
    # Step 1: detect the query language
    lang = detectLanguage(Q)

    # Step 2: query analysis
    queryAnalysis = analyzeQuery(Q)
    if queryAnalysis == "need_more_info":
        # Step 3: retrieval (external or built-in)
        candidateDocs = fetchDocuments(Q)
    else:
        candidateDocs = alreadyProvidedDocs(Q)  # use the documents already supplied

    # Step 4: source analysis
    relevantDocs = []
    for doc in candidateDocs:
        relevanceScore = estimateRelevance(doc, Q)
        if relevanceScore > threshold:
            relevantDocs.append(doc)
    relevantDocs = sortByScore(relevantDocs)  # order the kept sources by relevance

    # Step 5: draft & citation
    # Let the model generate "answer + citations" in one or several passes
    draft = generateDraftAndCitations(Q, relevantDocs, lang)
    # If generation needs several reasoning rounds, loop here:
    # while not satisfied(draft):
    #     draft = refineDraftWithCitations(draft, relevantDocs)

    # Step 6: finalize and output
    finalAnswer = finalizeAnswer(draft)
    return finalAnswer

# Example of a function that generates a draft with citations
def generateDraftAndCitations(Q, docs, lang):
    # Citation-insertion idea: walk through the highly relevant fragments in docs
    # and pull the most central original text into the answer.
    # Prepare the model input: question Q, related documents docs, and the generation instruction
    model_input = prepare_model_input(Q, docs, lang,
                                      instruction="Generate answer with in-text citations.")

    # The model generates its output (assuming it can emit cited text as instructed)
    raw_output = model.generate(model_input)

    # Parse and format the model output if necessary
    answerText = parse_and_format_output(raw_output)

    # If the model cannot do everything in one pass, more complex interaction or
    # post-processing may be needed. The simplified sketch below shows the idea;
    # in practice it may be integrated into the model's own generation logic:
    # for doc in docs:
    #     neededSnippet = extractRelevantSnippet(doc, Q)  # extract the relevant snippet from the document
    #     if neededSnippet:
    #         # insert the citation tag and part of the original text into the answer
    #         refTag = '<ref name="' + doc.id + '">' + neededSnippet + '</ref>'
    #         # answerText += ...  (normally already included when the model generates)

    # Rejection logic: refuse or ask for clarification if the answer is missing or weak
    if not answerText or lacks_confidence(answerText):
        answerText = generate_refusal_or_clarification(lang)
    return answerText

# Helper functions such as detectLanguage, analyzeQuery, fetchDocuments,
# estimateRelevance, sortByScore, finalizeAnswer, prepare_model_input,
# parse_and_format_output, and generate_refusal_or_clarification still need
# to be implemented.
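An illustrative call, assuming the helper functions above have been wired up to your own retriever and model API (the question string simply reuses the earlier example):

    answer = RAG_with_intext_citation("¿Cuándo comenzó la Segunda República Española?")
    print(answer)  # expected: a Spanish answer containing a <ref name="source_1"> excerpt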
Readers can adapt this pseudocode to real projects based on their own retrieval systems and model APIs. The key point is that the model does not splice citations onto the "answer text" in a post-processing stage: the <ref> tags and the quoted text are written together during the generation stage.
3.4 Key Notes and Summary
Multi-language compatibility: in the process above, detectLanguage(Q) and the other language-related parts are very important. If the languages of the answer and the quotations are mixed (for example, the user query is in Spanish while the document is in English), the model needs to handle cross-language understanding or translation.
Citation fidelity: inside generateDraftAndCitations(), the model should match the quoted fragments of the cited document as exactly as possible; once the quoted text no longer matches the original, the core "verifiability" value of this feature is destroyed (see the verification sketch after this list).
Context window limits: small models usually have less spacious context windows than large ones. If the retrieved documents are too numerous or too long, how to segment, summarize, or read and write in several passes is itself a training challenge mentioned in the paper.
Rejection and iterative reasoning: if none of the documents can answer the question, the model should return appropriate rejection information; and when a first draft is incomplete, the model can "reason multiple times" to supplement the citation fragments. Both cases need extra handling in actual deployment.
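Because verbatim excerpts are what make the answers checkable, a deployment may want to verify them automatically. The following is a minimal sketch, not from the paper, that uses a regular expression to pull out <ref name="...">...</ref> spans and checks that each quoted excerpt (with surrounding ellipsis dots stripped) really occurs in the named source:

    import re

    def verify_citations(answer_text, sources):
        """Return (source_id, ok) pairs, one per <ref> span found in the answer."""
        docs = {doc["id"]: doc["text"] for doc in sources}
        results = []
        for source_id, excerpt in re.findall(r'<ref name="([^"]+)">(.*?)</ref>', answer_text, re.S):
            snippet = excerpt.strip().strip(".").strip()  # drop surrounding "..." markers
            ok = source_id in docs and snippet in docs[source_id]
            results.append((source_id, ok))
        return results

A citation whose excerpt was paraphrased rather than copied verbatim would come back with ok == False, flagging it for review.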
The above process fully presents the basic operations from "receiving user input" to "outputting answers with citations". Each step is consistent with the ideas of the paper: relying on retrieval, using multi-step reasoning, and using built-in citation tags to ensure traceability and multi-language support. Readers can roughly reproduce the "built-in reference RAG multi-step reasoning mechanism" mentioned in the paper by simply following this process description.
At this point, combined with the "concept analysis" and "metaphor + formula" explanations in the previous two stages, as well as the process description in this stage, readers can gain a more comprehensive understanding and operational guidance on the "RAG multi-step reasoning mechanism with built-in references" proposed in the paper at the theoretical and practical levels.