The same knowledge base shows clear differences in retrieval accuracy between DS and QwQ

Written by
Audrey Miles
Updated on: July 10, 2025

An in-depth analysis of how different retrieval models perform on the same knowledge base.

Core content:
1. The impact of model architecture differences on knowledge representation and reasoning
2. The impact of pre-training data distribution on knowledge transfer capabilities
3. The impact of fine-tuning strategies and hyperparameters on model performance


1. Model architecture differences

The structural design of different models directly affects how they represent knowledge and how they reason:

• Autoencoding vs. autoregressive:
    • BERT (bidirectional Transformer): good at understanding contextual semantics (such as classification and entity recognition).
    • GPT (unidirectional Transformer): good at generating coherent text, but weaker at global understanding of context.

• Attention mechanism: sparse attention (such as Longformer) is better suited to long texts, while standard attention (such as RoBERTa) performs better on short texts.

• Model depth and width: larger models with more parameters (such as GPT-4) can capture more complex knowledge associations, but require more training resources.
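
To make the autoencoding/autoregressive distinction concrete, the sketch below (an illustration added here, not from the original article) contrasts the attention masks the two families use: a bidirectional encoder lets every token attend to every other token, while a causal decoder only lets a token attend to earlier positions.

```python
import torch

seq_len = 5

# BERT-style (autoencoding): every position attends to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len)

# GPT-style (autoregressive): lower-triangular mask, position i only sees positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

print(bidirectional_mask)
print(causal_mask)
```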


2. Pre-training data distribution

Even if the knowledge base content is the same, differences in the model's pre-training data will lead to different knowledge transfer capabilities:

• Domain bias:
    • BioBERT, pre-trained on medical literature, understands medical terminology better than general models (such as BERT-base).
    • CodeBERT, pre-trained on code data, performs better on programming knowledge bases.

• Language and multimodal coverage:
    • Multilingual models (such as XLM-R) perform stably on multilingual knowledge bases, while monolingual models (such as BERT-zh) are more accurate in Chinese scenarios.
    • Multimodal models (such as CLIP) can associate text and image knowledge, but text-only models (such as T5) cannot handle non-text content.
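
As a rough way to observe domain bias in practice, one can compare how two checkpoints embed a domain-specific paraphrase pair. The checkpoint IDs, pooling choice, and sentence pair below are illustrative assumptions, not something the article prescribes.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def embed(text: str, model_name: str) -> torch.Tensor:
    """Mean-pooled sentence embedding from a given checkpoint."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden_dim)
    return hidden.mean(dim=1).squeeze(0)

query, doc = "myocardial infarction", "heart attack"   # a medical paraphrase pair
for name in ["bert-base-uncased", "dmis-lab/biobert-base-cased-v1.1"]:  # illustrative checkpoint IDs
    sim = torch.cosine_similarity(embed(query, name), embed(doc, name), dim=0)
    print(f"{name}: cosine similarity = {sim.item():.3f}")
```

A domain-adapted encoder would typically rate such a pair as closer, though the exact numbers depend on the pooling strategy and checkpoint.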


3. Fine-tuning strategies and hyperparameters

The same knowledge base shows significant performance differences under different fine-tuning methods:

• Learning rate and optimizer:
    • Too high a learning rate may cause the model to forget pre-trained knowledge (catastrophic forgetting).
    • Models trained with the AdamW optimizer generally converge faster than with SGD, but may generalize slightly worse.

• Task adaptation design: adding a domain adaptation layer (such as an Adapter) can retain pre-trained knowledge, while direct full-parameter fine-tuning may be more suitable for small knowledge bases.

• Data augmentation and regularization: using Dropout or Mixout can prevent overfitting, but excessive regularization weakens the model's capture of knowledge details.
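
A minimal sketch of the knobs listed above (learning rate, optimizer choice, dropout), applied to a toy classification head; the dimensions and hyperparameter values are illustrative defaults, not recommendations from the article.

```python
import torch
from torch import nn

# Toy classification head on top of 768-dim encoder features (dimensions are illustrative).
head = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Dropout(p=0.1),        # regularization to curb overfitting on a small knowledge base
    nn.Linear(256, 2),
)

# A small learning rate reduces the risk of catastrophic forgetting of pre-trained knowledge.
optimizer = torch.optim.AdamW(head.parameters(), lr=2e-5, weight_decay=0.01)

# SGD alternative: often slower to converge, sometimes generalizes a bit better.
# optimizer = torch.optim.SGD(head.parameters(), lr=1e-3, momentum=0.9)
```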


4. Knowledge representation and retrieval methods

The models have different mechanisms for encoding and retrieving knowledge:

• Dense retrieval vs. sparse retrieval:
    • Dense retrieval (such as DPR) relies on vector similarity and is suitable for semantic matching.
    • Sparse retrieval (such as BM25) relies on keyword frequency and is suitable for exact term matching.

• Hierarchical knowledge processing: some models (such as RAG) explicitly separate knowledge storage from the reasoning module, while end-to-end models (such as T5) implicitly encode knowledge in their parameters.
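
The following sketch contrasts the two scoring styles on a toy corpus. BM25 scoring uses the rank_bm25 package as one readily available implementation, and the dense side uses placeholder embedding vectors standing in for a bi-encoder such as DPR; both choices are assumptions for illustration.

```python
import numpy as np
from rank_bm25 import BM25Okapi   # one common BM25 implementation; any sparse scorer works

corpus = [
    "transformers use attention to encode context",
    "the mitochondria is the powerhouse of the cell",
]
tokenized_corpus = [doc.split() for doc in corpus]

# Sparse retrieval: keyword-frequency scoring.
bm25 = BM25Okapi(tokenized_corpus)
sparse_scores = bm25.get_scores("attention encoder".split())

# Dense retrieval: cosine similarity between embeddings
# (placeholders here; in practice they come from a bi-encoder such as DPR).
doc_vecs = np.random.rand(len(corpus), 768)
query_vec = np.random.rand(768)
dense_scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)

print("sparse:", sparse_scores)
print("dense: ", dense_scores)
```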


5. Evaluation metrics and task objectives

The objective functions and evaluation metrics that different models optimize lead to different results:

• Generation tasks:
    • Models optimized for BLEU scores tend to generate fluent but conservative text.
    • Models optimized for ROUGE scores focus more on keyword coverage, possibly at the cost of fluency.

• Retrieval tasks: models that emphasize Recall@K improve retrieval breadth, while models that optimize MRR focus more on ranking quality.
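
To pin down what these two retrieval metrics measure, here is a small self-contained implementation; the ranking and relevance sets are toy data.

```python
def recall_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    hits = len(set(ranked_doc_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr(ranked_doc_ids, relevant_ids):
    """Reciprocal rank of the first relevant document (0 if none is retrieved)."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

ranked = ["d3", "d1", "d7", "d2"]     # toy ranking, purely illustrative
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, k=3))  # 0.5: only d1 appears in the top 3
print(mrr(ranked, relevant))               # 0.5: first relevant hit is at rank 2
```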


6. Hardware and inference efficiency limitations

Resource constraints indirectly affect knowledge utilization capabilities:

• Memory limitations: large models (such as GPT-3) must reduce batch size or context length under limited memory, resulting in incomplete knowledge processing.

• Quantization and compression: 8-bit quantized models (such as GPTQ) lose some knowledge detail, which affects complex reasoning.
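
A back-of-the-envelope sketch of why quantization matters for memory: weight storage alone scales linearly with bytes per parameter (activations and the KV cache are ignored here). The 175B figure is GPT-3's published parameter count; the rest follows from arithmetic.

```python
# Weights-only memory estimate; activations and the KV cache add further overhead.
params = 175e9                 # GPT-3 parameter count

bytes_fp16 = params * 2        # 16-bit weights
bytes_int8 = params * 1        # 8-bit quantized weights

print(f"fp16 weights: ~{bytes_fp16 / 1e9:.0f} GB")
print(f"int8 weights: ~{bytes_int8 / 1e9:.0f} GB")
```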


Typical scenario comparison

| Model type | Knowledge base type | Advantageous scenarios | Limitations |
| --- | --- | --- | --- |
| BERT | Short-text encyclopedia | Entity linking and relation extraction | Weak at handling long texts |
| GPT-3 | Open-domain generation | Creative knowledge expansion | Low factual accuracy |
| T5 | Structured knowledge | Multi-task conversion (such as knowledge-to-text) | Task format must be designed explicitly |
| DPR | Large-scale retrieval | Precise semantic matching | Relies on high-quality vector indexes |
| F | Multi-document Q&A | Cross-document reasoning | High computing-resource consumption |

Optimization suggestions

  1. Domain adaptation: select a pre-trained model that matches the knowledge base's domain (such as Legal-BERT for legal text).
  2. Hybrid retrieval: combine dense retrieval (semantics) with sparse retrieval (keywords), e.g.
     hybrid_score = 0.7 * dense_similarity + 0.3 * bm25_score
     (a runnable sketch follows this list).
  3. Knowledge injection: inject domain knowledge into general models, e.g.
     python train.py --model bert-base --knowledge_augment_method entity_retrieval
  4. Evaluation consistency: use multiple metrics (such as Accuracy + F1 + ROUGE-L) to avoid single-metric bias.
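
Below is a runnable version of the blend from suggestion 2. The min-max normalization step and the example score values are assumptions added so the two score scales become comparable; the 0.7/0.3 weighting is the starting point from the suggestion and should be tuned per knowledge base.

```python
import numpy as np

def min_max(scores):
    """Scale a score list to [0, 1] so dense and BM25 scores are comparable."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-9)

def hybrid_score(dense_similarity, bm25_score, alpha=0.7):
    """Weighted blend of dense (semantic) and sparse (keyword) scores."""
    return alpha * min_max(dense_similarity) + (1 - alpha) * min_max(bm25_score)

# Toy scores for three candidate documents.
print(hybrid_score(dense_similarity=[0.82, 0.40, 0.77], bm25_score=[3.1, 7.5, 0.2]))
```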

Summary

Performance differences on the same knowledge base ultimately come down to how well the model's priors, training objectives, and task requirements match. The best practice is to:

  1. Analyze the knowledge base's characteristics (structured vs. unstructured, long vs. short text).
  2. Select a matching model architecture (generative vs. discriminative, dense vs. sparse).
  3. Optimize the fine-tuning strategy accordingly (domain adaptation, hybrid retrieval).
  4. Compare results under the same evaluation framework.