A brief introduction to text embedding generation and the process of fine-tuning the model

Written by
Silas Grey
Updated on: June 18, 2025
Recommendation

Explore the secrets of text embedding generation and master the key process of fine-tuning the model.

Core content:
1. The basic principles and steps of text embedding generation
2. Commonly used embedding models and their characteristics
3. Fine-tuning model process and technical intervention methods


How is an embedding generated from text?

Overall process:


Input text  → Process it with the "Embedding model"  → Output a vector (embedding)


For example:

You input a sentence: "Artificial intelligence changes the world", and the Embedding model will output a vector like this:

[0.432, -0.115, ..., 0.981]

Each comma-separated value is one dimension. The vector typically has several hundred dimensions, such as 384 or 768, and represents the "position" of the sentence in semantic space.
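As a minimal sketch of this process, the snippet below uses the sentence-transformers library with the all-MiniLM-L6-v2 model (one common choice, assumed here) to turn a sentence into such a vector:

```python
# Minimal sketch: text in, vector out (assumes `pip install sentence-transformers`).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embedding = model.encode("Artificial intelligence changes the world")

print(embedding.shape)  # (384,) -- this model produces 384-dimensional vectors
print(embedding[:5])    # the first few dimensions of the vector
```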

What are the commonly used general technologies/models?


General pre-trained Embedding models (can be used directly):


| Model | Framework | Features |
| --- | --- | --- |
| Sentence-BERT (SBERT) | PyTorch / HuggingFace | Converts sentences/paragraphs into semantic vectors; very commonly used |
| OpenAI Embedding models (such as text-embedding-3-small) | OpenAI API | High precision, simple deployment, good commercial support |
| Cohere embeddings | Cohere API | Multi-language support, commercial API |
| FastText | Facebook | Word-level embeddings with subword support |
| Word2Vec / GloVe | Classic word embeddings | Fast, but not suitable for sentence-level tasks |
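For the hosted option in the table, here is a minimal sketch of calling the OpenAI embeddings API with text-embedding-3-small; it assumes the openai Python package is installed and an OPENAI_API_KEY environment variable is set:

```python
# Minimal sketch of the OpenAI embeddings API (assumes `pip install openai`
# and an OPENAI_API_KEY environment variable).
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Artificial intelligence changes the world",
)

vector = response.data[0].embedding  # a plain list of floats
print(len(vector))                   # 1536 dimensions by default for this model
```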


Human intervention in the text vector generation process


1️⃣ Choose or train different models (model selection)


  • Different Embedding models favor different domains and language styles

  • For example, in fields such as law, medicine, or coding, you can use a model trained on domain-specific data ("domain embeddings")


2️⃣ Modify the input (prompt engineering)


You can add prompt words before or after the text to guide the model to "better understand" it:

Original text :

“An apple is a fruit.”

After transformation :

“That’s the definition of a fruit: an apple is a fruit.”

The resulting Embedding may be more in line with the "knowledge type" semantics you want.
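As a rough sketch of this idea (the prompt wording here is only an illustration; the most effective phrasing depends on the model), you can compare the embedding of the raw text with that of the prompt-augmented text:

```python
# Sketch: comparing the raw text with a prompt-augmented version.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

raw = "An apple is a fruit."
prompted = "That's the definition of a fruit: an apple is a fruit."

emb_raw = model.encode(raw)
emb_prompted = model.encode(prompted)

# The two vectors are close but not identical: the added prompt words shift
# the sentence's position in semantic space.
print(util.cos_sim(emb_raw, emb_prompted))
```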

3️⃣ Fine-tuning


  • If you have domain-specific data (such as company documents, contract corpus), you can fine-tune a pre-trained model.

  • The embedding obtained in this way is more consistent with the content of your knowledge base.


⚠️ Fine-tuning is costly: it usually requires GPU resources and a certain level of technical expertise.

4️⃣ Normalization/pooling strategy (technical intervention)


The final sentence embedding is usually obtained by aggregating the multiple token vectors output by the model (for example, by mean pooling).

You can choose:

  • Mean pooling (average over all token vectors)

  • CLS token (the first position in BERT-style models)

  • Max pooling (element-wise maximum)


Different strategies affect vector quality and can be compared experimentally, as in the sketch below.
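As an illustrative sketch (using the transformers library directly, with the all-MiniLM-L6-v2 checkpoint as an assumed example), the three strategies can be computed from the same token vectors and compared:

```python
# Sketch: mean pooling, CLS pooling, and max pooling over the token vectors.
import torch
from transformers import AutoTokenizer, AutoModel

name = "sentence-transformers/all-MiniLM-L6-v2"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("Artificial intelligence changes the world", return_tensors="pt")
with torch.no_grad():
    token_vectors = model(**inputs).last_hidden_state          # (1, seq_len, hidden)

mask = inputs["attention_mask"].unsqueeze(-1)                   # ignore padding positions
mean_pooled = (token_vectors * mask).sum(1) / mask.sum(1)       # mean pooling
cls_pooled = token_vectors[:, 0]                                # CLS token (first position)
max_pooled = token_vectors.masked_fill(mask == 0, -1e9).max(dim=1).values  # max pooling

print(mean_pooled.shape, cls_pooled.shape, max_pooled.shape)    # each (1, 384)
```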



Fine-tuning the Embedding model lets you obtain more targeted semantic representations of your own data. This is especially useful for specific industries (legal, financial, medical) or proprietary corporate documents (customer service chat logs, product documentation, etc.).

The overall process of fine-tuning the Embedding model


Prepare data → Select model → Build training set (positive and negative samples) → Configure training parameters → Start training → Verify → Deploy

1. Preparation

To prepare training data, you need to construct semantic matching samples like this:


| Query | Positive sample | Negative sample (optional) |
| --- | --- | --- |
| "How does the return process work?" | "Users must submit a return application within 7 days..." | "Product Manual Introduction" |
| "What are the opening hours?" | "Store hours are from 9am to 8pm" | "Recruitment Information" |

The data format is usually:

  • Sentence pairs (query and positive/negative documents)

  • or triples (query, positive, negative)


Recommended format: JSONL (each line is a training sample)

{"query": "Return process", "positive": "Please complete the return application within seven days", "negative": "The customer service number is 400..."}

2. Choose the model architecture


Recommended models (Embedding models suitable for fine-tuning):


| Model | Advantages | Framework |
| --- | --- | --- |
| Sentence-BERT (SBERT) | Designed specifically for sentence embeddings; quick to fine-tune | PyTorch / HuggingFace |
| MiniLM / BERT base | High precision and fast | HuggingFace |
| OpenAI Embedding models | Strong commercial offering, but cannot be fine-tuned | OpenAI API (closed source) |


3. Construct the fine-tuning process (using Sentence-BERT as an example)


Use the sentence-transformers library (built on HuggingFace Transformers):

1. Install necessary libraries

pip install sentence-transformers
2. Construct the training data loader
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Example data: (query, document) pairs with a similarity label
# (1.0 = relevant, 0.0 = irrelevant)
train_examples = [
    InputExample(texts=["Return process", "Please complete the return application within seven days"], label=1.0),
    InputExample(texts=["Return process", "Contact customer service number is 400..."], label=0.0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)
3. Load a pre-trained model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
4. Choose a loss function (CosineSimilarityLoss is most commonly used)
train_loss = losses.CosineSimilarityLoss(model=model)
5. Run the fine-tuning
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
6. Save the model
model.save('my-custom-embedding-model')
You can then use this model to generate your own embedding vector:
model = SentenceTransformer('my-custom-embedding-model')
embedding = model.encode("What are your business hours?")
4. Tuning and Verification

The effect of fine-tuning can be verified in the following ways:

  • Does cosine-similarity ranking become more reasonable? (a minimal check is sketched after this list)

  • Is more relevant content recalled in RAG retrieval?

  • Are the embedding clusters clearer when visualized (using t-SNE/UMAP)?
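A minimal sketch of the first check (the query and documents below are illustrative):

```python
# Sketch: does the fine-tuned model rank the relevant document higher?
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("my-custom-embedding-model")

query = "What are your business hours?"
docs = [
    "Store hours are from 9am to 8pm",  # relevant
    "Recruitment Information",          # irrelevant
]

scores = util.cos_sim(model.encode(query), model.encode(docs))  # shape (1, 2)
print(scores)  # after fine-tuning, the first score should be clearly higher
```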