A brief introduction to text embedding generation and the process of fine-tuning the model

Explore how text embeddings are generated and master the key process of fine-tuning an embedding model.
Core content:
1. The basic principles and steps of text embedding generation
2. Commonly used embedding models and their characteristics
3. The model fine-tuning process and technical intervention methods
How is an embedding generated from text?
Overall process:
Input text → Process it with the "Embedding model" → Output a vector (embedding)
For example:
You input a sentence: "Artificial intelligence changes the world", and the Embedding model will output a vector like this:
[0.432, -0.115, ..., 0.981]
Each comma-separated number is one dimension. The vector usually has several hundred dimensions (for example 384 or 768) and represents the "position" of the sentence in semantic space.
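As a minimal sketch (assuming the sentence-transformers library and the all-MiniLM-L6-v2 model that is also used later in this article), generating an embedding looks like this:
from sentence_transformers import SentenceTransformer

# Load a general-purpose pre-trained embedding model (outputs 384-dimensional vectors)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Encode one sentence into one vector
embedding = model.encode("Artificial intelligence changes the world")
print(embedding.shape)  # (384,)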
What are the commonly used general technologies/models?
General pre-trained Embedding model (you can use it directly):
Sentence-BERT (SBERT)
OpenAI Embedding models (such as text-embedding-3-small; see the API sketch after this list)
Cohere embeddings
FastText
Word2Vec / GloVe
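As an illustration, a hedged sketch of calling the OpenAI Embedding API with text-embedding-3-small (assuming the official openai Python SDK v1+ and an OPENAI_API_KEY in the environment):
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Artificial intelligence changes the world",
)
vector = response.data[0].embedding  # a plain Python list of floats
print(len(vector))  # 1536 dimensions by default for this model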
Human intervention in the text vector generation process
1️⃣ Choose or train different models (model selection)
Different Embedding models are suited to different domains and language styles.
For example, in fields such as law, medicine, or coding, you can use a model trained on domain-specific data to get "domain embeddings".
2️⃣ Modify the input (prompt engineering)
You can add some prompt words before and after the text to guide the model to "better understand the text":
Original text:
"An apple is a fruit."
After transformation:
"This is the definition of a fruit: an apple is a fruit."
The resulting Embedding may be more in line with the "knowledge type" semantics you want.
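A minimal sketch of this kind of input-level intervention (the prefix text is only an illustrative example; the model name is the same general-purpose one used elsewhere in this article):
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

original = "An apple is a fruit."
# Prepend a guiding phrase so the embedding leans toward "definition / knowledge" semantics
prompted = "This is the definition of a fruit: " + original

emb_original = model.encode(original)
emb_prompted = model.encode(prompted)  # same model, different input framing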
3️⃣ Fine-tuning
If you have domain-specific data (such as company documents, contract corpus), you can fine-tune a pre-trained model.
The embedding obtained in this way is more consistent with the content of your knowledge base.
⚠️ Fine-tuning is costly: it usually requires GPU resources and a certain level of technical expertise.
4️⃣ Normalization/pooling strategy (technical intervention)
The final vector of the sentence embedding is usually obtained by aggregating multiple token vectors output by the model (such as mean pooling).
You can choose:
Mean pooling (average of all token vectors)
CLS token (the vector at BERT's first position)
Max pooling
Different strategies affect vector quality, and can be tuned experimentally.
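A sketch of choosing the pooling strategy explicitly when assembling a SentenceTransformer from a base encoder (the module composition follows the sentence-transformers convention; 'bert-base-uncased' is just an illustrative base model):
from sentence_transformers import SentenceTransformer, models

# Base transformer: outputs one vector per token
word_embedding_model = models.Transformer('bert-base-uncased')

# Aggregate token vectors into a single sentence vector; try 'mean', 'cls', or 'max'
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode='mean',
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])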
Fine-tuning the Embedding model gives you more targeted semantic representations on your own data, which is especially valuable for specific industries (legal, financial, medical) or proprietary corporate documents (customer service chat records, product documentation, etc.).
The overall process of fine-tuning the Embedding model
Prepare data → Select model → Build training set (positive and negative samples) → Configure training parameters → Start training → Verify → Deploy
1. Preparation
To prepare training data, you need to construct semantic matching samples like this:
The data format is usually:
Sentence pairs (query and positive/negative documents)
or triples (query, positive, negative)
Recommended format: JSONL (each line is one training sample)
{"query": "Return process", "positive": "Please complete the return application within seven days", "negative": "The customer service number is 400..."}
2. Choose the model architecture
Recommended model (Embedding model suitable for fine-tuning):
Sentence-BERT (SBERT)
MiniLM / BERT base
OpenAI Embedding Model
3. Constructing the fine-tuning process (taking Sentence-BERT as an example)
Use the sentence-transformers library (from Hugging Face):
1. Install necessary libraries
pip install sentence-transformers
2. Construct a training data loader
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Example data: label 1.0 = semantically related, label 0.0 = unrelated
train_examples = [
    InputExample(texts=["Return process", "Please complete the return application within seven days"], label=1.0),
    InputExample(texts=["Return process", "Contact customer service number is 400..."], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)
3. Load the pre-trained model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
4. Choose a loss function (CosineSimilarityLoss is the most common choice)
train_loss = losses.CosineSimilarityLoss(model=model)
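If your data is in the triple format mentioned above (query, positive, negative) without similarity labels, a common alternative in sentence-transformers is MultipleNegativesRankingLoss; a hedged sketch:
# Alternative for unlabeled triples: (query, positive, negative)
triple_examples = [
    InputExample(texts=["Return process",
                        "Please complete the return application within seven days",
                        "The customer service number is 400..."]),
]
triple_dataloader = DataLoader(triple_examples, shuffle=True, batch_size=8)
triple_loss = losses.MultipleNegativesRankingLoss(model=model)
# To use it, pass (triple_dataloader, triple_loss) to model.fit instead of the pair-based objective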
5. Run the fine-tuning training
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
6. Save the model
model.save('my-custom-embedding-model')
You can then use this model to generate your own embedding vector:
model = SentenceTransformer('my-custom-embedding-model')
embedding = model.encode("What are your business hours?")
4. Tuning and Verification
The effect of fine-tuning can be verified in the following ways:
Does cosine-similarity ranking look more reasonable? (see the sketch below)
Is more relevant content recalled in RAG retrieval?
Is the embedding clustering visualization (using t-SNE/UMAP) clearer?
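For example, a minimal sketch of the cosine-similarity check (the query and documents below are illustrative):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('my-custom-embedding-model')

query_emb = model.encode("Return process", convert_to_tensor=True)
doc_embs = model.encode(
    ["Please complete the return application within seven days",
     "The customer service number is 400..."],
    convert_to_tensor=True,
)

# Shape (1, 2): similarity of the query to each document; after fine-tuning,
# the relevant document should score noticeably higher than the irrelevant one
scores = util.cos_sim(query_emb, doc_embs)
print(scores)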