How to use existing question-answer data to build RAG

Written by
Silas Grey
Updated on: June 19, 2025
Recommendation

An analysis of efficient strategies for building RAG from question-and-answer data.

Core content:
1. What makes question-and-answer data special, and its structural advantages
2. Question-and-answer data storage strategy: the trade-off between completeness and granularity
3. Question-centered index construction strategy and practical suggestions

Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)


Today I saw friends discussing this question in a group chat. I have run into similar QA knowledge bases in past implementation projects, and my impression is that the problem comes up quite often. So let's pick up the topic and briefly summarize how to use existing question-answer data to build RAG.

Peculiarities of Question-Answering Data

First, consider what makes question-and-answer data special. Unlike ordinary documents, it has its own structure and value. Each entry contains a question and its corresponding answer, which together form a complete unit of information. This structure gives question-and-answer data three advantages when building a RAG system:

  1. The question usually reflects the user's actual need directly
  2. The answer is often high-quality, refined information
  3. Each question maps clearly to its answer, which makes retrieval and matching easier

Key strategies for building RAGs using question-answering data

1. Data storage strategy: completeness vs. granularity

In practice, there are different opinions on whether question-answering data needs to be split:

Complete retention strategy: store each question-answer pair as one complete unit, without segmentation. This preserves the integrity of the QA pair and suits standardized FAQ scenarios.

Document 1:
{
    "Question": "How do I reset my password?",
    "Answer": "You can reset your password by following these steps: 1. Click the 'Forgot Password' link on the login page..."
}

Fine-grained segmentation strategy: split longer Q&A content into smaller segments. This may make retrieval more sensitive to specific details, but it can damage the integrity of the QA pair.

Document 1-1:
{
    "Question snippet": "How do I reset my password?",
    "Answer snippet": "You can reset your password using the 'Forgot Password' link"
}

Document 1-2:
{
    "Question snippet": "Steps to reset your password",
    "Answer snippet": "1. Click on the 'Forgot password' link 2. Enter your registered email address..."
}

Practical suggestions:

  • For short, clear FAQs, store each pair in the database whole
  • For complex, lengthy QA, consider splitting, but make sure the segmentation does not destroy semantic integrity
  • Run A/B tests in production to compare the effectiveness of the two strategies
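The first two suggestions can be combined into a single ingestion rule. The sketch below is only an illustration under stated assumptions: the function name `to_documents` and the 200-character threshold are hypothetical and not from the original article. Short pairs are stored whole; long answers are split at sentence boundaries, and every segment keeps the full question so it can still be matched on its own.

```python
import re

MAX_ANSWER_CHARS = 200  # hypothetical threshold; tune it on your own data


def to_documents(qa_id, question, answer, max_chars=MAX_ANSWER_CHARS):
    """Store a short QA pair whole; split a long answer at sentence boundaries."""
    if len(answer) <= max_chars:
        return [{"id": qa_id, "question": question, "answer": answer}]

    # Split after sentence-ending punctuation so no sentence is cut in half.
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)

    # Every segment carries the full question so it remains matchable alone.
    return [
        {"id": f"{qa_id}-{i}", "question": question, "answer": chunk}
        for i, chunk in enumerate(chunks, start=1)
    ]
```

A single sentence longer than the threshold is kept intact rather than cut, which favors semantic integrity over a strict size limit.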

2. Index building strategy: question-centered

Unlike RAG over ordinary documents, a RAG system for question-answer data should build its index around the questions:

  1. Question vectorization: vectorize the question part and use it as the main index content

    # Pseudocode example
    for qa_pair in qa_dataset:
        question_embedding = embedding_model.encode(qa_pair["question"])
        doc_id = vector_db.add_document(
            embedding=question_embedding,
            metadata={
                "question": qa_pair["question"],
                "answer": qa_pair["answer"]
            }
        )

  2. Dual indexing: index both questions and answers, but rely mainly on question similarity when searching

    # Pseudocode example
    question_embedding = embedding_model.encode(user_query)
    similar_docs = vector_db.search(
        embedding=question_embedding,
        search_field="question",  # Restrict the search to the question field
        top_k=5
    )

  3. Hybrid search: combine vector search and keyword search to improve recall quality

    # Pseudocode example
    vector_results = vector_db.vector_search(user_query, top_k=3)
    keyword_results = vector_db.keyword_search(user_query, top_k=3)
    final_results = merge_results(vector_results, keyword_results)
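To make the pseudocode above concrete, here is a minimal runnable sketch that uses only the Python standard library. Everything in it is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, `QuestionIndex` is a hypothetical in-memory replacement for a vector database, and `merge_results` uses reciprocal rank fusion, which is one possible merging method and not necessarily the one the author had in mind.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    # Toy stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class QuestionIndex:
    """Hypothetical in-memory index keyed on the question text only."""

    def __init__(self, qa_dataset):
        # Only the question is embedded; the answer rides along as payload.
        self.entries = [(embed(qa["question"]), qa) for qa in qa_dataset]

    def vector_search(self, query, top_k=3):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [qa for _, qa in ranked[:top_k]]

    def keyword_search(self, query, top_k=3):
        terms = set(tokenize(query))
        hits = [(len(terms & set(vec)), qa) for vec, qa in self.entries]
        hits = [h for h in hits if h[0] > 0]  # drop entries with no shared term
        hits.sort(key=lambda h: h[0], reverse=True)
        return [qa for _, qa in hits[:top_k]]


def merge_results(*result_lists, k=60):
    # Reciprocal rank fusion: each list contributes 1 / (k + rank) per item.
    scores, by_question = {}, {}
    for results in result_lists:
        for rank, qa in enumerate(results, start=1):
            key = qa["question"]
            scores[key] = scores.get(key, 0.0) + 1.0 / (k + rank)
            by_question[key] = qa
    ordered = sorted(scores, key=scores.get, reverse=True)
    return [by_question[key] for key in ordered]
```

With a real embedding model, the vector search would catch paraphrases that share no words with the stored question, while the keyword search would still catch exact terms such as product names; in this toy version both signals are lexical, so it only demonstrates the plumbing.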

3. Retrieval and Generation Strategies

For RAG systems based on question-answering data, retrieval and generation strategies also require special design:

  1. Similar question retrieval: match the user query against the questions in the question library by similarity
  2. Context assembly: organize the retrieved question-answer pairs into a context the LLM can use
  3. Flexible generation: let the quality of the retrieval results determine how much freedom the LLM gets when generating

    # Pseudocode example
    def generate_answer(user_query):
        # Search for similar questions (assumed to be sorted by descending
        # similarity, each result carrying a "score" field)
        similar_qas = retrieve_similar_questions(user_query)
        max_similarity_score = similar_qas[0]["score"] if similar_qas else 0.0

        # Determine strategy based on similarity score
        if max_similarity_score > 0.85:
            # High similarity: use the existing answer directly
            return format_existing_answer(similar_qas[0])
        elif max_similarity_score > 0.6:
            # Medium similarity: generate based on existing answers
            context = format_context(similar_qas)
            return llm.generate(prompt=f"Answer question based on: {context}\nQuestion: {user_query}")
        else:
            # Low similarity: give the LLM more creative freedom
            context = format_context(similar_qas)
            return llm.generate(prompt=f"Refer to the following content that may be related and answer the question creatively: {context}\nQuestion: {user_query}")
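The pseudocode above leaves its helper functions undefined. One way they might look is sketched below; the output layout of `format_context`, the `max_pairs` limit, and the separate `choose_strategy` function are assumptions added for illustration, while the 0.85 and 0.6 thresholds are taken from the pseudocode.

```python
def format_existing_answer(qa):
    """Return a stored answer as-is for a high-similarity match."""
    return qa["answer"]


def format_context(similar_qas, max_pairs=3):
    """Render retrieved QA pairs as a numbered block for the LLM prompt."""
    blocks = []
    for i, qa in enumerate(similar_qas[:max_pairs], start=1):
        blocks.append(f"[{i}] Question: {qa['question']}\n    Answer: {qa['answer']}")
    return "\n\n".join(blocks)


def choose_strategy(max_similarity_score, high=0.85, medium=0.6):
    """Map the best similarity score to one of the three generation modes."""
    if max_similarity_score > high:
        return "direct"      # return the stored answer unchanged
    if max_similarity_score > medium:
        return "grounded"    # generate strictly from the retrieved pairs
    return "creative"        # retrieved pairs are only loose hints
```

Keeping the threshold logic in its own function makes it easy to unit-test and to tune the two cut-offs against real traffic without touching the prompt code.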

Optimization techniques in practical applications

1. Data quality over quantity

In a RAG system, data quality matters far more than quantity. For question-and-answer data, the following measures can improve quality:

  • Standardize questions to reduce expression differences
  • Make sure your answers are accurate, concise, and comprehensive
  • Regularly update old Q&A content
  • Remove duplicate or highly similar question-answer pairs
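The first and last measures, standardizing questions and removing near-duplicates, can be sketched with the standard library alone. This is a simple illustration, not a production approach: the 0.9 threshold is an assumed value, and `difflib.SequenceMatcher` compares characters rather than meaning, so paraphrases with different wording would need an embedding-based comparison instead.

```python
import re
from difflib import SequenceMatcher


def normalize(question):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", question.lower())
    return " ".join(text.split())


def deduplicate(qa_pairs, threshold=0.9):
    """Keep the first pair of any group whose normalized questions nearly match."""
    kept = []
    for qa in qa_pairs:
        candidate = normalize(qa["question"])
        is_duplicate = any(
            SequenceMatcher(None, candidate, normalize(other["question"])).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(qa)
    return kept
```

The pairwise comparison is quadratic in the number of pairs, which is acceptable for a few thousand FAQs but would need blocking or an approximate index for larger collections.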

2. Metadata Enhancement

Adding rich metadata to question-answer pairs can significantly improve retrieval results:

{
    "Question": "How do I apply for a refund?",
    "Answer": "You can click the 'Apply for Refund' button on the order details page...",
    "metadata": {
        "Category": ["After-sales Service", "Refund"],
        "Applicable Products": ["Physical Goods", "Digital Products"],
        "Update time": "2023-12-01",
        "Question alias": ["How to refund", "Refund process", "How to refund"]
    }
}

This metadata can be used to:

  • Question expansion and enhancement
  • Multi-dimensional filtering of search results
  • Sorting and re-ranking results
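As a rough sketch of these three uses, the functions below operate on records shaped like the example above and reuse its key names ("Question alias", "Category", "Applicable Products", "Update time"). The function names themselves are hypothetical.

```python
def expand_index_entries(qa):
    """Question expansion: one index entry per phrasing (main question plus aliases)."""
    phrasings = [qa["Question"], *qa["metadata"].get("Question alias", [])]
    unique = list(dict.fromkeys(phrasings))  # drop repeated aliases, keep order
    return [{"text": text, "qa": qa} for text in unique]


def filter_by_metadata(qa_pairs, category=None, product=None):
    """Filtering: keep only pairs whose metadata matches every filter given."""
    results = []
    for qa in qa_pairs:
        meta = qa["metadata"]
        if category is not None and category not in meta.get("Category", []):
            continue
        if product is not None and product not in meta.get("Applicable Products", []):
            continue
        results.append(qa)
    return results


def newest_first(qa_pairs):
    """Sorting: ISO dates (YYYY-MM-DD) sort correctly as plain strings."""
    return sorted(qa_pairs, key=lambda qa: qa["metadata"]["Update time"], reverse=True)
```

Indexing every alias as its own entry that points back to the same pair lets differently worded user queries land on the same answer without duplicating the answer text.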

3. Closed-loop user feedback

Establish an effective user feedback mechanism and continuously optimize the system:

  • Record whether the user adopts the system's answer
  • Collect user comments on answers
  • Analyze the questions that cannot be answered effectively and add relevant QA in a timely manner
  • Build new question-answer pairs based on actual user queries
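A feedback loop of this kind needs very little machinery to get started. The sketch below is a hypothetical in-memory log (the class name, field names, and `min_count` default are all assumptions); a real system would persist the records to a database.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FeedbackLog:
    records: list = field(default_factory=list)

    def record(self, query, matched_question, adopted, comment=""):
        """Store whether the user adopted the answer, plus any comment."""
        self.records.append({
            "query": query,
            "matched_question": matched_question,
            "adopted": adopted,
            "comment": comment,
        })

    def adoption_rate(self):
        """Share of answers the users accepted."""
        if not self.records:
            return 0.0
        return sum(r["adopted"] for r in self.records) / len(self.records)

    def unanswered_queries(self, min_count=2):
        """Queries rejected at least min_count times: candidates for new QA pairs."""
        misses = Counter(
            r["query"].strip().lower() for r in self.records if not r["adopted"]
        )
        return [query for query, n in misses.most_common() if n >= min_count]
```

Reviewing the output of `unanswered_queries` on a regular schedule gives a concrete work list for the last two bullets: each recurring miss is either a gap to fill with a new pair or a new alias for an existing one.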

Common Problems and Solutions

Question: How should a question that has multiple sub-questions be handled?

Solution: organize the question-answer data hierarchically and link the main question to its sub-questions. When searching, match the main question first, then bring in related sub-questions as needed.

{
    "Main question": "How to use membership points?",
    "Main answer": "Member points can be used for various purposes such as commodity discounts, gift redemption, etc...",
    "Sub-questions": [
        {
            "Question": "How to redeem points for goods?",
            "Answer": "Select the 'Points Payment' option on the product page..."
        },
        {
            "Question": "How long are the points valid for?",
            "Answer": "Ordinary member points are valid for one year, while gold card member points are valid forever."
        }
    ]
}
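One way to make such a hierarchy searchable is to flatten it into individually indexable documents that point back to their parent. The sketch below reuses the key names from the example above; the `id` and `parent_id` fields and both function names are assumptions introduced for illustration.

```python
def flatten_hierarchy(entry, entry_id):
    """Turn one hierarchical entry into flat, individually indexable documents."""
    docs = [{
        "id": entry_id,
        "parent_id": None,
        "question": entry["Main question"],
        "answer": entry["Main answer"],
    }]
    for i, sub in enumerate(entry.get("Sub-questions", []), start=1):
        docs.append({
            "id": f"{entry_id}.{i}",
            "parent_id": entry_id,
            "question": sub["Question"],
            "answer": sub["Answer"],
        })
    return docs


def expand_with_children(matched_doc, all_docs):
    """After matching a main question, pull in its sub-questions as extra context."""
    children = [d for d in all_docs if d["parent_id"] == matched_doc["id"]]
    return [matched_doc, *children]
```

Because sub-questions are also indexed on their own, a narrow query such as one about point validity can match the sub-question directly, while a broad query matches the main question and picks up the children through `expand_with_children`.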

Question: What should be done when there is a large amount of Q&A data but its quality is uneven?

Solution: stratify the data into two layers, a core Q&A database and an extended Q&A database. The core database holds high-quality, high-frequency Q&A; the extended database holds low-frequency or average-quality Q&A. When searching, query the core database first, and fall back to the extended database only if the core database returns no satisfactory result.
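The fallback logic can be written independently of any particular search backend. In the sketch below, `core_search` and `extended_search` are hypothetical callables that take a query and `top_k` and return results with a `score` field; the 0.6 cut-off for a "satisfactory" result is an assumed value.

```python
def tiered_search(query, core_search, extended_search, min_score=0.6, top_k=3):
    """Query the core tier first; fall back to the extended tier if nothing is good enough."""
    core_hits = [hit for hit in core_search(query, top_k) if hit["score"] >= min_score]
    if core_hits:
        return "core", core_hits

    extended_hits = [hit for hit in extended_search(query, top_k) if hit["score"] >= min_score]
    return "extended", extended_hits
```

Returning the tier name alongside the hits lets the generation step treat extended-tier answers more cautiously, and lets you log how often the core database fails to answer.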

Technology selection recommendations

To build a RAG system based on question-answering data, you can consider the following technology combinations:

  1. Vector database: Milvus, Marqo, Weaviate, etc.
  2. Embedding model: choose an embedding model optimized for question answering, such as the BGE Chinese embedding models or BERT-QA series models
  3. Large language model: choose an LLM that fits your specific needs. Domestic models such as Wenxin Yiyan and those from Zhipu AI perform well in Chinese question-answering scenarios.
  4. Retrieval framework: LangChain, LlamaIndex, etc. provide a rich set of retrieval tools

Conclusion

Question-and-answer data is high-quality material for building RAG systems. Its inherent question-answer structure is a natural fit for retrieval-augmented generation. With sound data processing, indexing, and retrieval-and-generation strategies, the value of question-and-answer data can be fully realized in an intelligent question-answering system that responds quickly and answers accurately.

Remember that there is no one-size-fits-all solution for RAG. A system has to be adjusted and optimized for its specific business scenario. Only by continuously collecting user feedback and iterating on indexing and retrieval strategies can you build an intelligent question-answering system that is genuinely useful in practice.