Embedding model fine-tuning: quickly build training and evaluation datasets based on existing data

Written by Caleb Hayes
Updated on: June 20th, 2025
Recommendation

Master the fine-tuning of Embedding models to improve the performance of NLP tasks.

Core content:
1. The core concepts and data dependencies of Embedding model fine-tuning
2. Practical steps to build high-quality fine-tuning training sets and evaluation sets
3. Application of key technologies such as positive and negative sample construction and data enhancement


Objectives of this article

This article is mainly for beginners who want to improve the performance of Embedding models in specific fields or tasks. After reading it, you should be able to:

  • Accurately understand the core concepts of Embedding model fine-tuning and its dependence on datasets.
  • Initially master the practical steps and strategies for building high-quality fine-tuning training sets and evaluation sets based on existing data.
  • Gain a preliminary understanding of the application and value of key technologies such as positive and negative sample construction and data enhancement in dataset construction.

Table of contents

  • Objectives of this article
  • Preface
  • Core concept analysis: Embedding fine-tuning and data elements
    • Embedding model fine-tuning explained
    • The necessity of building a dedicated dataset
    • Key Components of Fine-tuning a Dataset
  • Practical Guide: Building a Fine-tuning Training Dataset
    • Data source assessment
    • Standardized training sample structure definition
    • Construction and optimization of positive samples (`pos`)
    • Negative sample (`neg`) construction strategy: improving model discrimination
    • Application of Contrastive Learning Method in Embedding Fine-tuning
    • Principles and Practice of Contrastive Learning
    • Unsupervised SimCSE implementation example
    • Data Augmentation – Enriching the diversity of training samples
    • Adding auxiliary information and dividing the data set
  • Summary and Outlook
    • Core content review
  • Previous selections


Preface

Embedding models play a fundamental role in semantics-related NLP tasks by mapping text into vectors. However, general pre-trained models often perform poorly in specific fields because they lack domain expertise.

To solve this problem, model fine-tuning is performed, which adapts the model to specific semantic features by further training on domain data. High-quality datasets are the key to successful fine-tuning. This article aims to explore how to efficiently build training and evaluation datasets for Embedding model fine-tuning based on existing data to improve the performance of the model in specific scenarios.



Core Concept Analysis: Embedding Fine-tuning and Data Elements

Before discussing how to build the dataset, a few core concepts need to be understood first.

Embedding model fine-tuning explained

Principle explanation: Embedding model fine-tuning refers to further training an Embedding model that has been pre-trained on a large-scale general corpus, using task-specific or domain-specific datasets, thereby adjusting the model parameters so that it better captures and expresses the semantic information in the target data.

Core objectives:

The core goal of fine-tuning can be understood from the perspective of metric learning. It aims to optimize the representation of text in the vector space so that semantically similar or related text pairs (e.g., a question and its high-quality answer) are closer together, while semantically dissimilar or unrelated text pairs are pushed farther apart. This can be formalized by a loss function such as the Triplet Loss:

$$\mathcal{L}(a, p, n) = \max\big(d(a, p) - d(a, n) + \text{margin},\ 0\big)$$

where $a$ is the anchor vector (such as a query), $p$ is the positive-example vector, $n$ is the negative-example vector, $d(\cdot, \cdot)$ is a distance measure between two vectors (such as the Euclidean distance or the negative of the cosine similarity), and $\text{margin}$ is a hyperparameter that enforces a gap between positive and negative pairs. The goal is to minimize this loss.

This optimization makes the model more accurate when performing tasks such as similarity calculation and information retrieval.
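To make this objective concrete, here is a minimal PyTorch sketch of the triplet objective above. The vectors are random stand-ins for real model outputs, and torch.nn.TripletMarginLoss implements exactly this max-margin form (with Euclidean distance by default):

import torch

# Stand-in vectors for the anchor (query), positive, and negative texts;
# in practice these come from the Embedding model being fine-tuned.
anchor = torch.randn(4, 768)
positive = torch.randn(4, 768)
negative = torch.randn(4, 768)

# TripletMarginLoss computes max(d(a, p) - d(a, n) + margin, 0)
# with Euclidean distance (p=2) by default.
loss_fn = torch.nn.TripletMarginLoss(margin=1.0)
loss = loss_fn(anchor, positive, negative)
print(loss.item())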

The necessity of building a dedicated dataset

Constructing a dataset specifically for fine-tuning is mainly based on the following two considerations:

First, bridging the domain gap. General pre-trained models learn a wide range of language knowledge, while specific application scenarios often have their own unique language patterns, terminology, and knowledge structures. For example, the language style and core vocabulary of financial texts differ greatly from those of daily conversation. Fine-tuning datasets carry this domain-specific information, helping the model bridge the gap between general knowledge and specific needs.

Second, the data-driven learning paradigm. The model's learning process is data-driven. By showing the model carefully constructed samples that reflect the requirements of the target task (for example, which texts should be considered relevant and which should not), the model can gradually learn effective patterns for distinguishing text semantics in this specific scenario.

Key Components of Fine-tuning a Dataset

A typical Embedding model fine-tuning dataset for contrastive learning or metric learning usually contains the following elements:

Query (query):

A representation of an information need: a question posed by a user, a search keyword, or any text for which the model needs to find relevant information.

Positive sample (pos):

Text that is highly relevant to or semantically consistent with a given query. During training, the model learns to narrow the distance between the query vector and the positive example vector.

Negative sample (neg):

Text that is irrelevant, weakly relevant, or semantically inconsistent with a given query. During training, the model learns to push the query vector and the negative example vector apart. The quality and selection strategy of negative samples are critical for teaching the model to distinguish subtle semantic differences.

Understanding these basic components and their role in model training is the basis for efficient dataset construction later on.

Practical Guide: Building a Fine-tuning Training Dataset

Next, we will walk step by step through how to build a high-quality fine-tuning training dataset, using a concrete case (the financial-qa-10K dataset processing example).

Data source assessment

The first prerequisite for fine-tuning is to have data. You need to evaluate and select existing data resources that reflect the target application scenario. These may include user behavior logs, existing question-answer pairs, document libraries, knowledge base content, and so on. Taking financial question answering as an example, a dataset such as financial-qa-10K contains questions in the financial field, the corresponding answers, and the context in which the answers appear, making it a very suitable data source for fine-tuning.

After selecting the data source, you need to have a preliminary understanding of the data.

This includes analyzing its raw data structure, such as what fields it contains and what each field means.

At the same time, necessary preliminary data cleaning is carried out, such as removing invalid characters, processing missing values, unifying text encoding, etc., to ensure data quality.

In the financial-qa-10K example, the original data contains columns such as 'question', 'answer', 'context', 'ticker', and 'filing'. We need to understand how these columns serve our fine-tuning goals.

The fields in the original data format have the following meanings:

  • question: a specific inquiry about the company, such as "Which field did NVIDIA initially focus on, and which fields did it expand into later?"
  • answer: a direct answer to the question, usually extracted from the company's financial filings, such as "NVIDIA initially focused on PC graphics"
  • context: the original passage with full context, e.g., "From our initial focus on PC graphics, we have expanded into a number of other important compute-intensive areas"
  • ticker: the company's stock ticker, such as "NVDA" for NVIDIA
  • filing: the source filing, such as "2023_10K" for the 2023 annual report

Standardized training sample structure definition

Determining the target format

Different fine-tuning frameworks or models may have specific requirements for the input data format. A common and effective structure is the JSON Lines format, where each line is a JSON object representing one training sample. The object usually contains fields such as the query, positive examples, and negative examples. For example:

{
    "query" : str,            // query text
    "pos" : List[str],       // positive sample list
    "neg" : List[str],       // Negative sample list
    "pos_scores" : List[int],  // Positive sample score list 
    "neg_scores" : List[int],  // Negative sample score list
    "prompt" : str,          // prompt information
    "type" : str            // data type
}

Here, query is the query sentence, pos is a list of one or more positive texts, and neg is likewise a list of one or more negative texts.

If knowledge distillation is used, pos_scores and neg_scores hold the scores for the corresponding samples; without knowledge distillation, these two fields can be omitted.

prompt is a hint that tells the model how to handle this query, and type marks the data type of the sample.

The following is a specific example of a single training data sample:

{
  "query": "What is the price-to-earnings ratio and how does it help investors assess the value of a stock?",
  "pos": [
    "The Price-to-Earnings Ratio (P/E Ratio) is an indicator that measures the stock price relative to earnings per share. The calculation formula is: P/E Ratio = Current Stock Price / Earnings Per Share (EPS). It reflects how much investors are willing to pay for each dollar of earnings.",
    "Investors often use the P/E ratio to determine whether a stock's valuation is reasonable. A low P/E ratio may mean that the stock is undervalued, while a high P/E ratio may indicate that the stock is overvalued or that the market expects its future earnings to grow rapidly. However, industry characteristics and company growth stage should be considered when comparing P/E ratios."
  ],
  "neg": [
    "The Price-to-Book Ratio (P/B Ratio) is the ratio of the stock price to the net asset value per share, and is often used to evaluate the value of asset-intensive companies such as banks and insurance companies.",
    "The Dividend Yield refers to the ratio of a company's total annual dividend to its current market price, and is one of the indicators for measuring stock investment returns.",
    "Technical analysis focuses on historical data on stock prices and trading volumes, using chart patterns to predict future price movements, unlike fundamental analysis methods."
  ],
  "prompt": "Represent this sentence for searching relevant documents: ",
  "type": "normal"
}

In this example:

  • "query" This is a question raised by a user regarding the price-to-earnings ratio.
  • "pos" The list contains two positive answers/explanations that are highly relevant to the query.
  • "neg" The list contains some text related to the financial field but not directly relevant or wrong to the query "P/E ratio", such as definitions of price-to-book ratio, dividend yield, or completely unrelated technical analysis concepts.
  • "prompt" is an optional directive that instructs the model on how to handle queries.
  • "type" It is an optional field used for bge-en-icl, including normal, symmetric_class, symmetric_clustering and other types.

Field mapping and type conversion

According to the defined target format, it is necessary to select the core information from the original data and convert it into the corresponding fields in the target structure.

In the financial-qa-10K example, we select the original 'question' column as the query and the 'context' column (or 'answer', depending on the specific task goal) as pos. This step is usually accompanied by renaming fields and converting data types.
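As a minimal sketch of this mapping step with the Hugging Face datasets library (the Hub id "virattt/financial-qa-10K" is an assumption here; substitute your own copy of the data if it lives elsewhere):

from datasets import load_dataset

# Assumption: the data is available on the Hub under this id; a local file
# could be loaded instead with load_dataset("json", data_files="data.jsonl").
ds = load_dataset("virattt/financial-qa-10K", split="train")

# Keep only the columns we need and map them onto the target schema:
# 'question' -> query, 'context' -> pos (wrapped in a list).
ds = ds.select_columns(["question", "context"])
ds = ds.rename_column("question", "query")
ds = ds.map(lambda row: {"pos": [row["context"]]}, remove_columns=["context"])

print(ds[0])  # {'query': '...', 'pos': ['...']}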

Construction and optimization of positive samples (pos)

A positive sample (pos) is text that is highly semantically relevant to, or matches, the query we care about. These samples teach the model what counts as "similar", i.e., which content is the correct answer or a relevant document for the query. The quality of positive samples directly shapes the model's understanding of "relevance".

The core principle of constructing positive samples is to ensure a strong correlation or semantic consistency with the query. This means accurately extracting or matching, from the original data, content that truly answers the query or is highly consistent with its semantics.

The text granularity of the positive sample (whether to choose a complete sentence, a paragraph, or an entire document) needs to be determined according to the specific application scenario and model capabilities. If context is critical to understanding, then choosing a paragraph with a more complete context may be better than a single sentence.

When extracting positive samples, the irrelevant information or noise contained in them should be minimized. A "clean" positive sample allows the model to learn the core semantic associations more efficiently and avoid being disturbed by irrelevant text fragments.

The positive sample field pos is usually processed into a list format (List[str]). This supports the case where one query corresponds to multiple high-quality positive examples, and it also keeps the data processing flow uniform: even when there is only one positive example, it is wrapped in a list.

Negative sample (neg) construction strategy: improving model discrimination

Correspondingly, negative samples (neg) are texts that are irrelevant, weakly relevant, or semantically inconsistent with the query. Their role is to help the model learn what counts as "dissimilar", i.e., which content is not what the query wants. Negative samples play a vital role in fine-tuning the Embedding model: they help the model learn to distinguish texts that look similar but are actually irrelevant, thereby shaping an effective decision boundary and preventing the model from treating all content as similar (i.e., model collapse). Model collapse usually refers to the situation where, during training, the vector representations the model learns for all or most texts become very similar and lose distinguishability, so the model can no longer identify the semantic differences between different texts.

High-quality negative samples can significantly improve the fine-grained semantic recognition ability of the model. They force the model to learn not only "what is relevant", but also "what is irrelevant and why it is irrelevant".

There are many technical paths to construct negative samples, each with its own characteristics and applicable scenarios. Below we will introduce several common and effective negative sample construction strategies in detail.

In-batch Negatives

Principle: during training, for each query in the current batch:

  • If the task is asymmetric (such as question answering): treat the positive examples of the other samples in the batch as negative examples for the current query.
  • If the task is symmetric (such as sentence similarity matching): both the queries and the positive examples of the other samples in the batch can be treated as negative examples for the current query.

Example of constructing in-batch negative samples

Let's use a specific example to understand the process of constructing negative samples in a batch. Suppose we have a batch containing 4 training samples:

batch = [
    {
        "query": "What is the price-to-earnings ratio",
        "pos": ["The price-to-earnings ratio is a measure of a stock's price relative to its earnings per share. It is calculated by dividing the stock's price by its earnings per share."]
    },
    {
        "query": "How does Bitcoin work",
        "pos": ["Bitcoin is a cryptocurrency based on blockchain technology. It is generated through mining and transactions are recorded on a public distributed ledger."]
    },
    {
        "query": "How to calculate compound interest",
        "pos": ["The compound interest formula is A = P(1 + r)^n, where A is the final amount, P is the principal, r is the interest rate, and n is the number of periods."]
    },
    {
        "query": "What is inflation",
        "pos": ["Inflation is an economic phenomenon in which the general price level continues to rise, resulting in a decrease in the purchasing power of money."]
    }
]

Construction of negative samples in batches for asymmetric task scenarios (such as question answering):

For each query, we use the positive examples of other samples in the batch as negative examples for that query:

# Build in-batch negative samples for the first query, "What is the price-to-earnings ratio"
query_1 = batch[0]["query"]
pos_1 = batch[0]["pos"][0]
neg_1 = [batch[1]["pos"][0], batch[2]["pos"][0], batch[3]["pos"][0]]

# The final training data structure for the first query
sample_1 = {
    "query": "What is the price-to-earnings ratio",
    "pos": ["The price-to-earnings ratio is a measure of a stock's price relative to its earnings per share. It is calculated by dividing the stock's price by its earnings per share."],
    "neg": [
        "Bitcoin is a cryptocurrency based on blockchain technology. It is generated through mining and transactions are recorded on a public distributed ledger.",
        "The compound interest formula is A = P(1 + r)^n, where A is the final amount, P is the principal, r is the interest rate, and n is the number of periods.",
        "Inflation is an economic phenomenon in which the general price level continues to rise, resulting in a decrease in the purchasing power of money."
    ]
}

In-batch negative sample construction in symmetric task scenarios (such as semantic similarity matching):

In symmetric tasks, not only can the positive examples of the other samples be used as negatives, but the queries of the other samples themselves can also serve as negatives:

# Build in-batch negative samples for the first query, "What is the price-to-earnings ratio"
query_1 = batch[0]["query"]
pos_1 = batch[0]["pos"][0]
neg_1 = [
    # Other samples' queries
    batch[1]["query"],
    batch[2]["query"],
    batch[3]["query"],
    # Other samples' positive examples
    batch[1]["pos"][0],
    batch[2]["pos"][0],
    batch[3]["pos"][0]
]

# The final training data structure for the first query
sample_1 = {
    "query": "What is the price-to-earnings ratio",
    "pos": ["The price-to-earnings ratio is a measure of a stock's price relative to its earnings per share. It is calculated by dividing the stock's price by its earnings per share."],
    "neg": [
        # Other queries as negative examples
        "How does Bitcoin work",
        "How to calculate compound interest",
        "What is inflation",
        # Other positive examples as negative examples
        "Bitcoin is a cryptocurrency based on blockchain technology. It is generated through mining and transactions are recorded on a public distributed ledger.",
        "The compound interest formula is A = P(1 + r)^n, where A is the final amount, P is the principal, r is the interest rate, and n is the number of periods.",
        "Inflation is an economic phenomenon in which the general price level continues to rise, resulting in a decrease in the purchasing power of money."
    ]
}

Implementation flow for in-batch negatives:

In actual training, negative samples in a batch are usually constructed dynamically in the data loader or training loop rather than prepared in advance. Here is a simplified implementation flow:

def construct_in_batch_negatives(batch, symmetric=False):
    """
    Construct in-batch negative samples for each sample in the batch.

    Args:
        batch: a batch containing multiple training samples
        symmetric: whether the task is symmetric

    Returns:
        The augmented batch with in-batch negatives attached to each sample.
    """
    enhanced_batch = []

    for i, sample in enumerate(batch):
        query = sample["query"]
        pos = sample["pos"]
        neg = []

        # Collect negative examples from the other samples in the batch
        for j, other_sample in enumerate(batch):
            if i != j:  # exclude the sample itself
                # For symmetric tasks, other samples' queries also serve as negatives
                if symmetric:
                    neg.append(other_sample["query"])

                # Other samples' positive examples serve as negatives
                neg.extend(other_sample["pos"])

        # Build the augmented sample
        enhanced_sample = {
            "query": query,
            "pos": pos,
            "neg": neg
        }

        enhanced_batch.append(enhanced_sample)

    return enhanced_batch

The advantage of in-batch negatives is that they are cheap to add and require no extra data preparation; at the same time, because they come from the same batch and carry some correlation, they provide more challenging counterexamples than purely random negatives. Their quality and difficulty, however, generally fall short of deliberately mined hard negatives.

Hard Negatives

Hard negative samples pose more sophisticated recognition challenges to the model, forcing it to learn more subtle semantic differences, thereby significantly improving the accuracy and robustness of the model in real application scenarios.

Definition: samples that are highly similar to the query in semantics or surface form and are easily misclassified as positives by the model, but are actually irrelevant or only weakly relevant to the query.

Mining techniques:

  • Sparse retrieval recall: using a traditional frequency-weighted retrieval algorithm such as BM25, recall a batch of highly similar documents for each query; after removing the true positives, the remaining documents form the candidate set of hard negatives.
  • Screening with other models: a pre-trained or already fine-tuned Embedding model (or an even more precise ranking model, such as a Cross-Encoder) can score candidate negatives; select the samples the model scores as highly relevant but that are actually negative. The relevance score is usually the cosine similarity between the query vector $q$ and the document vector $d$:

    $$\text{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}$$

    The closer the score is to 1, the more similar the semantics.
  • Combined with domain knowledge or rules : In a specific domain, hard negative samples can be constructed based on existing domain knowledge or business rules. For example, in e-commerce product recommendations, products of the same category but different brands, or products with similar functions but significantly different attributes such as price range and target users, may all constitute effective hard negative samples.

Hard negative sample construction example

Let's use a specific example to illustrate how to use BM25 and the pre-trained Embedding model to mine hard negative samples:

from rank_bm25 import BM25Okapi
import numpy as np
from sentence_transformers import SentenceTransformer
import jieba  # jieba is used for Chinese word segmentation

# 1. Prepare the corpus
corpus = [
    "The price-to-earnings ratio is a measure of a stock's price relative to its earnings per share. It is calculated by dividing the stock's price by its earnings per share.",
    "The price-to-book ratio is the ratio of the share price to the net asset value per share and is often used to assess the value of asset-intensive companies such as banks.",
    "The dividend yield is the ratio of a company's total annual dividends to its current stock price, a measure of investment return.",
    "The price-to-sales ratio is the ratio of a stock's price to its sales per share and is used to evaluate growth companies that are not yet profitable.",
    "The enterprise value multiple is the ratio of enterprise value to EBITDA, a valuation metric that takes into account a company's debt level.",
    "The discounted cash flow model estimates the intrinsic value of a company by predicting future cash flows and discounting them to the present.",
    "Technical analysis focuses on historical data of stock prices and trading volumes to predict future trends.",
    "Fundamental analysis focuses on factors such as a company's financial condition, the quality of its management, and its market position.",
    "Portfolio theory advocates diversifying risk and optimizing the risk-return ratio through asset diversification.",
    "Passive investing strategies track the performance of a specific market index by purchasing index funds or ETFs."
]

# Query and known positive example
query = "What is the P/E ratio and how to use it to evaluate stock value"
true_positive = corpus[0]  # the first text, about the P/E ratio, is the true positive

# 2. BM25 retrieval with jieba tokenization (sparse retrieval stage)
tokenized_corpus = [list(jieba.cut(doc)) for doc in corpus]  # tokenize the corpus
tokenized_query = list(jieba.cut(query))  # tokenize the query

bm25 = BM25Okapi(tokenized_corpus)
bm25_scores = bm25.get_scores(tokenized_query)

# Document indices sorted by BM25 score (high to low)
sorted_indices = np.argsort(bm25_scores)[::-1]  # descending order
print("BM25 search results sorting:")
for idx in sorted_indices[:5]:  # top 5
    print(f"Document {idx} (score: {bm25_scores[idx]:.4f}): {corpus[idx][:50]}...")

# 3. Re-rank with the Embedding model (dense retrieval stage)
# Load a pre-trained Embedding model
model = SentenceTransformer("BAAI/bge-m3")  # example model (the original used a local copy of bge-m3)

# Compute embeddings for the query and all documents
query_embedding = model.encode([query])[0]
corpus_embeddings = model.encode(corpus)

# Compute cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity([query_embedding], corpus_embeddings)[0]

# Document indices sorted by embedding similarity
sorted_indices_emb = np.argsort(similarities)[::-1]  # descending order
print("\nEmbedding model reranking results:")
for idx in sorted_indices_emb[:5]:  # top 5
    print(f"Document {idx} (similarity: {similarities[idx]:.4f}): {corpus[idx][:50]}...")

# 4. Identify hard negatives (highly similar but not actually relevant)
# Remove the true positive
hard_negatives_candidates = [idx for idx in sorted_indices_emb if corpus[idx] != true_positive]

# Take the top N candidates as hard negatives
hard_negatives = [corpus[idx] for idx in hard_negatives_candidates[:2]]  # top 2

# 5. Final training sample structure
training_sample = {
    "query": query,
    "pos": [true_positive],
    "neg": hard_negatives
}

print("\nThe final constructed training data containing hard negative samples:")
print(f"Query: {training_sample['query']}")
print(f"Positive example: {training_sample['pos'][0]}")
print("Hard negative examples:")
for i, neg in enumerate(training_sample['neg']):
    print(f"  {i+1}. {neg}")

Actual execution results:

BM25 search results sorting:
Document 0 (Score: 1.7982): The price-to-earnings ratio is an indicator that measures the stock price relative to earnings per share. The calculation formula is the stock price divided by...
Document 5 (Score: 0.7829): The discounted cash flow model estimates the intrinsic value of a company by predicting future cash flows and discounting them to the present...
Document 3 (Score: 0.7425): The price-to-sales ratio is the ratio of stock price to sales per share, and is used to evaluate companies that have not yet made a profit.
Document 1 (score: 0.7238): The price-to-book ratio is the ratio of the stock price to the net asset value per share, and is often used to evaluate assets such as banks...
Document 9 (Score: 0.0000): Passive investment strategies track the performance of a specific market index by purchasing index funds or ETFs...

Embedding model reranking results:
Document 0 (Similarity: 0.8059): The price-to-earnings ratio is an indicator that measures the stock price relative to earnings per share. The calculation formula is the stock price divided by...
Document 1 (Similarity: 0.7493): The price-to-book ratio is the ratio of the stock price to the net asset value per share, and is often used to evaluate assets such as banks...
Document 3 (Similarity: 0.7044): The price-to-sales ratio is the ratio of stock price to sales per share, and is used to evaluate unprofitable companies.
Document 2 (Similarity: 0.5833): The dividend yield is the ratio of a company's total annual dividends to its current stock price, measuring the return on investment...
Document 4 (Similarity: 0.5568): The enterprise value multiple is the ratio of enterprise value to EBITDA, taking into account the company's...

The final constructed training data contains hard negative samples:
Query: What is the P/E ratio and how to use it to evaluate stock value
Positive example: The P/E ratio is a measure of a stock's price relative to its earnings per share and is calculated by dividing the stock price by its earnings per share.
Hard negative examples:
  1. The price-to-book ratio is the ratio of the share price to the net asset value per share, and is often used to evaluate the value of asset-intensive companies such as banks.
  2. The price-to-sales ratio is the ratio of stock price to sales per share and is suitable for evaluating growth companies that are not yet profitable.

From these actual results we can see that hard negatives are often texts that share surface features with the query (here, all discuss financial valuation indicators) and may even share keywords ("ratio", "stock price"), yet are not the information the query actually wants. Typical examples are the price-to-book ratio and price-to-sales ratio: they resemble the price-to-earnings ratio in form and are likewise stock valuation indicators, but differ in concept and usage. Such highly similar yet irrelevant texts are exactly what most easily confuses the model, so training with hard negatives of this kind effectively improves the model's fine-grained semantic discrimination.

The number of negative samples is usually larger than the number of positives (for example, each positive is paired with several negatives). Ensuring the diversity of negatives is just as important: they should cover different types of irrelevance rather than a single, easily distinguishable type. A trade-off between quantity and diversity is needed to reach the best training effect and efficiency.
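As a rough sketch of this trade-off, the snippet below keeps the mined hard negatives and pads the list up to a target count with random negatives drawn from the corpus; all names are illustrative, not from any particular framework.

import random

def build_neg_list(query, pos_texts, hard_negatives, corpus, num_neg=7, seed=42):
    """Mix mined hard negatives with random negatives up to num_neg items."""
    rng = random.Random(seed)
    neg = list(hard_negatives)[:num_neg]
    # Random candidates: anything that is neither the query, nor a positive,
    # nor already chosen as a hard negative.
    candidates = [t for t in corpus
                  if t != query and t not in pos_texts and t not in neg]
    rng.shuffle(candidates)
    neg.extend(candidates[:max(0, num_neg - len(neg))])
    return neg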

Application of Contrastive Learning Method in Embedding Fine-tuning

Contrastive learning is a representation learning method that constructs positive and negative sample pairs to allow the model to learn to bring semantically similar samples closer in the representation space and push dissimilar samples farther apart. In embedding fine-tuning, such methods are particularly effective because they directly optimize the distribution of text representations in the vector space and improve the accuracy of semantic similarity.

Principles and Practice of Contrastive Learning

Mathematical principles

The core of contrastive learning is the InfoNCE loss function, which optimizes the model by maximizing the similarity of positive pairs while minimizing the similarity of negative pairs:

$$\mathcal{L} = -\log \frac{\exp\!\big(\text{sim}(q, p^{+}) / \tau\big)}{\exp\!\big(\text{sim}(q, p^{+}) / \tau\big) + \sum_{n_i \in N} \exp\!\big(\text{sim}(q, n_i) / \tau\big)}$$

where $q$ is the query vector, $p^{+}$ is a positive sample vector, the $n_i$ are negative sample vectors from the negative set $N$, $\text{sim}(\cdot,\cdot)$ is the similarity function, and $\tau$ is the temperature parameter.

Key parameters:

  • Similarity function $\text{sim}(\cdot,\cdot)$: cosine similarity is usually used.
  • Temperature parameter $\tau$: controls the smoothness of the probability distribution; a smaller value makes the model separate sample pairs more sharply.
  • Negative sample set $N$: contains samples that are not relevant to the query.

Optimization goals

By minimizing the InfoNCE loss, the model can (a minimal code sketch follows this list):

  1. Positive sample optimization : increase the similarity of related sample pairs (numerator)
  2. Negative sample optimization : reduce the similarity of unrelated sample pairs (denominator)
  3. Distribution Adjustment : Adjusting Learning Difficulty via Temperature Parameters
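
The following is a minimal PyTorch sketch of this InfoNCE computation, assuming one positive and a fixed number of negatives per query; the vectors are random stand-ins for real model outputs.

import torch
import torch.nn.functional as F

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Minimal InfoNCE sketch: one positive and num_neg negatives per query.

    query:     (batch, dim)
    positive:  (batch, dim)
    negatives: (batch, num_neg, dim)
    """
    q = F.normalize(query, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)

    # Cosine similarity with the positive: (batch, 1)
    pos_sim = (q * p).sum(dim=-1, keepdim=True)
    # Cosine similarity with each negative: (batch, num_neg)
    neg_sim = torch.bmm(n, q.unsqueeze(-1)).squeeze(-1)

    # Column 0 holds the positive; cross-entropy against label 0 is exactly
    # the -log softmax fraction in the InfoNCE formula above.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Toy usage with random stand-in vectors
loss = info_nce_loss(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 7, 768))
print(loss.item())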

Mainstream technical implementations

  1. SimCSE
     • Unsupervised version: uses dropout to construct positive sample pairs
     • Supervised version: uses NLI data to construct positive and negative samples
  2. DiffCSE
     • Builds on SimCSE by adding joint training of contrastive learning and difference prediction
     • Uses a generator-discriminator structure to sharpen the model's sensitivity to subtle differences

Unsupervised SimCSE implementation example

import torch
from transformers import AutoModel, AutoTokenizer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

"""
SimCSE (Simple Contrastive Sentence Embedding) is a method for improving sentence embeddings through contrastive learning.
Core idea:

1. Unsupervised learning: Use two representations of the same sentence generated by different dropout masks as positive sample pairs
2. Other sentences in the same batch are used as negative samples
3. The training goal is to make the representation of positive sample pairs similar, but dissimilar to the representation of negative samples
"""


# Load the model and tokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

# Prepare input sentences - intentionally mixing semantically similar and dissimilar pairs

sentences = [
    "The price-to-earnings ratio is a measure of a stock's price relative to its earnings per share.",  # financial indicator
    "The P/E ratio is used to assess the rationality of stock valuations.",  # semantically similar to the first sentence
    "Inflation is an economic phenomenon in which prices continue to rise.",  # economic phenomenon, related but a different concept
    "Earnings per share is a company's net income divided by the number of shares outstanding."  # related to the first sentence via earnings per share
]

def get_sentence_embeddings(model, tokenizer, sentences, use_simcse=False):
    """Get sentence embeddings, optionally using the SimCSE method."""
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

    if use_simcse:
        # SimCSE method: use different dropout masks to generate two
        # representations of each sentence, then average them
        model.train()  # activate dropout
        # Run twice to obtain two different representations
        outputs1 = model(**inputs, output_hidden_states=True)
        outputs2 = model(**inputs, output_hidden_states=True)
        # Take the CLS token representation
        embeddings1 = outputs1.last_hidden_state[:, 0]
        embeddings2 = outputs2.last_hidden_state[:, 0]
        # Average the two as the final representation
        embeddings = (embeddings1 + embeddings2) / 2
    else:
        # Traditional method: obtain the sentence representation directly
        model.eval()  # turn off dropout
        with torch.no_grad():
            outputs = model(**inputs, output_hidden_states=True)
            embeddings = outputs.last_hidden_state[:, 0]

    return embeddings

# Demonstrate SimCSE training process

def demonstrate_simcse_training():
    print("=== SimCSE training process demonstration ===")
    # Tokenize the sentences and feed the same batch twice
    # (so that two different dropout masks are applied)
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    inputs_repeated = {k: torch.cat([v, v]) for k, v in inputs.items()}

    # Forward pass to obtain CLS representations
    model.train()  # ensure dropout is active
    outputs = model(**inputs_repeated, output_hidden_states=True)
    last_hidden = outputs.last_hidden_state
    cls_embeds = last_hidden[:, 0]  # CLS token representations

    # Split into the representations of the original and repeated samples
    batch_size = len(sentences)
    z1, z2 = torch.split(cls_embeds, batch_size)

    # Compute pairwise cosine similarity
    cosine_sim = torch.nn.functional.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=2)

    # Compute the contrastive loss (InfoNCE/NT-Xent)
    # Diagonal elements: similarity between two dropout views of the same sentence (positives)
    # Off-diagonal elements: similarity between different sentences (negatives)
    # The training objective is to maximize the diagonal elements
    labels = torch.arange(batch_size).to(cosine_sim.device)
    temperature = 0.05  # temperature parameter; controls the smoothness of the distribution
    loss = torch.nn.CrossEntropyLoss()(cosine_sim / temperature, labels)

    print(f"SimCSE contrast loss: {loss.item():.4f}")
    print("Cosine similarity matrix (in training):")
    print(cosine_sim.detach().numpy())
    print("Average similarity of diagonal elements (positive sample pairs): ", torch.mean(torch.diag(cosine_sim)).item())
    print("Average similarity of off-diagonal elements (negative sample pairs):",
          (torch.sum(cosine_sim) - torch.sum(torch.diag(cosine_sim))) / (batch_size * batch_size - batch_size))

# Compare sentence embedding effects with and without SimCSE

def compare_embeddings():
    print("\n=== Comparison of traditional embedding and SimCSE embedding effects ===")

    # Traditional sentence embeddings
    traditional_embeddings = get_sentence_embeddings(model, tokenizer, sentences, use_simcse=False)
    traditional_embeddings = traditional_embeddings.detach().numpy()

    # SimCSE-enhanced sentence embeddings
    simcse_embeddings = get_sentence_embeddings(model, tokenizer, sentences, use_simcse=True)
    simcse_embeddings = simcse_embeddings.detach().numpy()

    # Compute the similarity matrices
    traditional_sim = cosine_similarity(traditional_embeddings)
    simcse_sim = cosine_similarity(simcse_embeddings)

    # Display the results
    print("Similarity matrix of traditional method:")
    print(np.round(traditional_sim, 3))

    print("\nSimilarity matrix of SimCSE method:")
    print(np.round(simcse_sim, 3))

    print("\nSemantic relationship of sentence pairs:")
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            print(f"Sentence {i+1} and Sentence {j+1}:")
            print(f"  - Traditional similarity: {traditional_sim[i, j]:.3f}")
            print(f"  - SimCSE similarity: {simcse_sim[i, j]:.3f}")
            print(f"  - Sentence {i+1}: {sentences[i]}")
            print(f"  - Sentence {j+1}: {sentences[j]}")
            print()

# Run the demo
if __name__ == "__main__":
    demonstrate_simcse_training()
    compare_embeddings()

Output:

=== SimCSE training process demonstration ===
SimCSE contrast loss: 0.1850
Cosine similarity matrix (in training):
[[0.8296242 0.79532236 0.7019348 0.736534 ]
 [ 0.7617904 0.9060811 0.683814 0.7251259 ]
 [ 0.5913595 0.67814845 0.8314158 0.6420494 ]
 [0.6655865 0.657691 0.528461 0.886128 ]]
Average similarity of diagonal elements (positive sample pairs): 0.8633122444152832
Average similarity of off-diagonal elements (negative sample pairs): tensor(0.6807, grad_fn=<DivBackward0>)

=== Comparison of traditional embedding and SimCSE embedding effects ===
Similarity matrix of traditional method:
[[1. 0.882 0.774 0.826]
 [0.882 1. 0.736 0.778]
 [0.774 0.736 1. 0.661]
 [0.826 0.778 0.661 1. ]]

Similarity matrix of SimCSE method:
[[1. 0.835 0.747 0.82 ]
 [0.835 1. 0.737 0.804]
 [0.747 0.737 1. 0.676]
 [0.82 0.804 0.676 1. ]]

Semantic relationship of sentence pairs:
Sentence 1 and Sentence 2:
  - Traditional similarity: 0.882
  - SimCSE similarity: 0.835
  - Sentence 1: The price-to-earnings ratio is a measure of a stock's price relative to its earnings per share.
  - Sentence 2: The P/E ratio is used to assess the rationality of a stock's valuation.

Sentence 1 and Sentence 3:
  - Traditional similarity: 0.774
  - SimCSE similarity: 0.747
  - Sentence 1: The price-to-earnings ratio is a measure of a stock's price relative to its earnings per share.
  - Sentence 3: Inflation is an economic phenomenon in which prices continue to rise.

Sentence 1 and Sentence 4:
  - Traditional similarity: 0.826
  - SimCSE similarity: 0.820
  - Sentence 1: The price-to-earnings ratio is a measure of a stock's price relative to its earnings per share.
  - Sentence 4: Earnings per share is a company's net profit divided by the number of outstanding shares.

Sentence 2 and Sentence 3:
  - Traditional similarity: 0.736
  - SimCSE similarity: 0.737
  - Sentence 2: The P/E ratio is used to assess the rationality of a stock's valuation.
  - Sentence 3: Inflation is an economic phenomenon in which prices continue to rise.

Sentence 2 and Sentence 4:
  - Traditional similarity: 0.778
  - SimCSE similarity: 0.804
  - Sentence 2: The P/E ratio is used to assess the rationality of a stock's valuation.
  - Sentence 4: Earnings per share is a company's net profit divided by the number of outstanding shares.

Sentence 3 and Sentence 4:
  - Traditional similarity: 0.661
  - SimCSE similarity: 0.676
  - Sentence 3: Inflation is an economic phenomenon in which prices continue to rise.
  - Sentence 4: Earnings per share is a company's net profit divided by the number of outstanding shares.

Data Augmentation – Enriching the diversity of training samples

When the amount of existing labeled data is limited, data augmentation is an effective technical means to expand the size of the training set and improve the generalization ability of the model without significantly increasing the cost of manual labeling.

Data augmentation generates new and reasonable training samples by performing a series of transformations on existing training samples. This helps the model learn to be robust to more diverse changes in the input text and reduce the risk of overfitting, especially in small sample scenarios.

Simple enhancements based on vocabulary and grammar

Compared with augmentation generated by large language models, simple techniques based on vocabulary and grammar are lighter and cheaper to run, and are suitable for rapidly expanding training data. Several commonly used simple augmentation methods follow; a minimal code sketch of the first one comes after the list:

  1. Synonym Replacement
     • Use resources such as WordNet or a thesaurus to replace some words in the text with their synonyms.
     • Example:
       • Original: "The price-to-earnings ratio is an important indicator for measuring stock prices"
       • Augmented: "The price-to-earnings ratio is a key indicator for evaluating stock prices"
  2. Back Translation
     • Translate the text into another language and then back to the original language to produce text with different wording but similar semantics.
     • Example:
       • Original: "How to calculate the price-to-earnings ratio of a stock"
       • Intermediate translation: "How to calculate the P/E ratio of stocks?"
       • Back-translated: "How to calculate the price-earnings ratio of stocks"
  3. Random Insertion
     • Insert synonyms or related descriptive words at random positions in the sentence.
     • Example:
       • Original: "The price-to-earnings ratio reflects the stock valuation level"
       • Augmented: "The P/E ratio accurately reflects the current stock valuation level"
  4. Word Order Adjustment (Random Swap)
     • Adjust the order of words or phrases in a sentence while preserving the meaning.
     • Example:
       • Original: "Investors often use the price-to-earnings ratio to value stocks"
       • Augmented: "When valuing stocks, investors often use the price-to-earnings ratio"
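As promised above, here is a minimal sketch of the first technique, synonym replacement; the synonym table is a toy stand-in for WordNet or a domain thesaurus.

import random

# Illustrative synonym table; in practice this would come from WordNet,
# a domain thesaurus, or an embedding-based nearest-neighbour lookup.
SYNONYMS = {
    "important": ["key", "crucial"],
    "measuring": ["evaluating", "gauging"],
    "indicator": ["metric", "measure"],
}

def synonym_replace(sentence, p=0.5, seed=None):
    """Replace each word that has known synonyms with probability p."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        key = word.lower().strip(".,")
        if key in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[key]))
        else:
            out.append(word)
    return " ".join(out)

print(synonym_replace("The price-to-earnings ratio is an important indicator for measuring stock prices", seed=1))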
Data augmentation based on large language model

Sentence rewriting/paraphrasing: use pre-trained large language models such as the Qwen or GPT series to generate new sentences with the same semantics as the original text but different wording. The strong language capabilities of large models make the generated text more natural and diverse.

Diversified query generation: for existing positive documents (pos), a large language model can generate queries phrased in different ways (a generation sketch follows this list). For example, for a text about "the definition and calculation of the price-to-earnings ratio", the LLM can generate phrasings such as "What is P/E Ratio?", "How to calculate the price-earnings ratio of a stock", and "What is the use of the price-earnings ratio indicator".

Diversified positive sample generation: for a given query, the large language model can generate multiple semantically related but differently worded positive texts, helping the model learn different ways of expressing the same concept.

Hard negative candidate generation: through carefully designed prompts (prompt engineering), the large language model can be guided to generate text that is related to the query topic but wrong in its details, or that belongs to the same field but discusses a different subtopic, as candidates for high-quality hard negatives.

Instruction-driven text rewriting: large language models can rewrite text according to specific instructions (such as "simplify professional content into language that ordinary people can understand" or "keep the core information but make the expression more concise") to create training samples of different styles and complexities.
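As a sketch of the diversified-query-generation idea, assuming access to an OpenAI-compatible chat API (the model name and prompt wording below are illustrative, not prescriptive):

from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment; any OpenAI-compatible endpoint works

def generate_queries(passage, n=3, model="gpt-4o-mini"):
    """Ask an LLM to write n differently-phrased queries answered by the passage."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} different user questions, one per line, "
                f"that the following passage answers:\n\n{passage}"
            ),
        }],
    )
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]

Generated queries should be spot-checked before entering the training set, since an LLM occasionally produces questions the passage does not actually answer.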

Adding auxiliary information and dividing the data set

The function of prompt

In some fine-tuning frameworks, a prompt (or instruction prefix) can be attached to the query. The financial-qa-10K example adds instruction = "Represent this sentence for searching relevant passages: ". This prompt guides the model in how to understand and process the query, for instance indicating that this is a sentence intended for passage retrieval, or a question that needs summarizing. It helps the model produce more appropriate vector representations for different tasks or intents. Such a prompt is typically used as query_instruction_for_retrieval.

Furthermore, we can define different prompts for different task types and select the corresponding prompt by task type during training and inference. For example, a configuration file may contain the following prompt definitions. Note that not every configuration file looks like this; jina-clip-v2, for instance, defines only one prompt instead of three:

{
  "prompts": {
    "retrieval.query": "Represent the query for retrieving evidence documents: ",
    "retrieval.document": "Represent the document for retrieval: ",
    "classification": "Classify the text: "
  },
  "default_prompt_name": "retrieval.document"  // example: the default prompt name
}

In this example:

  • "retrieval.query": When processing a query for retrieving relevant documents, this hint can be added before the query text to guide the model to generate query vectors suitable for retrieval.
  • "retrieval.document": When processing documents for building a retrieval library, this hint can be added before the document text to guide the model to generate document vectors suitable for retrieval.
  • "classification": When the task is text classification, this hint can be used to guide the model to generate vector representations that help distinguish text categories.
  • "default_prompt_name": You can specify a default prompt to use when no task type is explicitly specified or no corresponding prompt is found.

In this way, the same base Embedding model can be adapted to multiple downstream tasks through different prompts, which enhances its versatility and flexibility. During training and inference, the appropriate prompt can be selected based on the type field (if present) or the nature of the task itself. One final reminder: not every model supports these prompt words. Some models set no prompt by default, while others default to retrieval.document or similar; check the model's configuration file for details.
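A minimal sketch of applying such prompts by simple prefix concatenation before encoding (using the same bge-m3 model as in the earlier example; whether a given model benefits from a prefix depends on how it was trained, as noted above):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

prompts = {
    "retrieval.query": "Represent the query for retrieving evidence documents: ",
    "retrieval.document": "Represent the document for retrieval: ",
}

# Queries and documents get different prefixes so the model can produce
# task-adapted vectors for each side of the retrieval task.
query_vec = model.encode(prompts["retrieval.query"] + "What is the price-to-earnings ratio?")
doc_vec = model.encode(prompts["retrieval.document"] + "The P/E ratio measures price against earnings per share.")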

Reasonable division of data sets

Finally, divide the complete constructed dataset into a training set, a validation set (optional), and a test set at a fixed ratio (e.g., 8:1:1 or 9:1). The training set is used to learn the model parameters. The validation set is used to monitor performance during training, tune hyperparameters, and prevent overfitting. After training, the test set is used for the final evaluation of the model's generalization on unseen data. When splitting, pay attention to randomization and to stratified sampling (if categories are imbalanced) so that the data distribution of each split stays as consistent as possible.
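A minimal sketch of an 8:1:1 split over a JSON Lines file (the filename is illustrative; stratified sampling would need extra logic on top of this):

import json
import random

# Load all constructed samples (one JSON object per line)
with open("train_samples.jsonl", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f]

random.Random(42).shuffle(samples)  # fixed seed for reproducibility

n = len(samples)
train = samples[: int(0.8 * n)]
valid = samples[int(0.8 * n): int(0.9 * n)]
test = samples[int(0.9 * n):]

for name, split in [("train", train), ("valid", valid), ("test", test)]:
    with open(f"{name}.jsonl", "w", encoding="utf-8") as f:
        for s in split:
            f.write(json.dumps(s, ensure_ascii=False) + "\n")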



Summary and Outlook

Core content review

This article briefly discusses how to build training and evaluation datasets for fine-tuning the Embedding model based on existing data. The core steps include:

  • Clarify the fine-tuning objectives and dataset composition.
  • Carefully construct training data: covering data source selection, structure definition, positive and negative sample construction, data enhancement, and auxiliary information addition.
  • Build a standardized evaluation system: a standard dataset should include training, validation, and test splits.


Previous selections

  • My RAG Pitfalls and Advancement Path: Sharing of a Meta-Cognition-Driven Exploration Experience
  • Getting rid of Faiss: How sqlite-vec implements lightweight native vector retrieval of SQLite in RAG
  • Lightweight database vector search DuckDB VSS Getting Started Guide: From installation to HNSW index optimization, just read this article
  • Train a tokenizer to understand GPT/BERT text processing and the working mechanism of BPE tokenizer (2)
  • Train a tokenizer to understand GPT/BERT text processing and the working mechanism of BPE tokenizer

Supplementary information and references

The content of this article is mainly based on the financial-qa-10K dataset processing example (the dataset used in the bge tutorial), combined with general academic knowledge and public technical literature on Embedding model fine-tuning, dataset construction, and data augmentation.

References:

  1. Hugging Face Datasets Library Documentation: https://huggingface.co/docs/datasets/ — the official documentation, covering how to load, process, and manipulate datasets.

  2. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of EMNLP-IJCNLP 2019. (arXiv:1908.10084) A classic paper on learning sentence representations in a supervised manner; very helpful for understanding how to train models on paired samples.

  3. Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., ... & Yih, W. T. (2020). Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of EMNLP 2020. (arXiv:2004.04906) Discusses how important the selection of good positive and negative samples is in dense retrieval, especially for open-domain question answering.

  4. Settles, B. (2009). Active Learning Literature Survey. University of Wisconsin-Madison, Department of Computer Sciences, Technical Report 1648. A classic survey of active learning; although somewhat dated, its core ideas remain valuable.

  5. Gao, L., Ma, X., Lin, J., & Callan, J. (2021). Complementing Lexical Retrieval with Semantic Residual Embedding. Proceedings of SIGIR 2021. (arXiv:2109.04770) Focuses on combining sparse and dense retrieval, and also touches on obtaining high-quality training data.

  6. Gao, T., Yao, X., & Chen, D. (2021). SimCSE: Simple Contrastive Learning of Sentence Embeddings. Proceedings of EMNLP 2021. (arXiv:2104.08821) A highly influential work proposing a simple yet effective contrastive learning method for sentence representations: positive pairs are constructed via the dropout mechanism and trained against in-batch negatives. Good explainers also exist on Zhihu, such as this article by Maple Xiaoqi (https://zhuanlan.zhihu.com/p/368353121).

  7. Yoon, S., Kim, G., & Park, K. (2021). SSMix: Saliency-based Span Mixup for Text Classification. Findings of ACL-IJCNLP 2021. (arXiv:2106.08062) Proposes SSMix, an innovative data augmentation technique that improves model robustness and generalization by intelligently replacing spans in text.

  8. Chuang, Y. S., Li, R., Torralba, A., & Jegelka, S. (2022). DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings. arXiv preprint arXiv:2204.10298. An interesting improvement on SimCSE that learns better sentence embeddings by distinguishing the differences between positive and negative pairs.

  9. Wang, Z., Wu, W., Wang, H., Wu, H., & Wang, W. (2020). CLEAR: Contrastive Learning for Sentence Representation. arXiv preprint arXiv:2012.15466. Another important work on contrastive learning for sentence representations, exploring word-level perturbations combined with in-batch negative sampling.

  8. Wang, Z., Wu, W., Wang, H., Wu, H., & Wang, W. (2020).  CLEAR: Contrastive Learning for Sentence Representation. arXiv preprint arXiv:2012.15466. CLEAR is also an important work in the field of contrastive learning in sentence representation. It explores how to combine word-level perturbations and intra-batch negative sampling to enhance representation learning.