How the Embedding Vector Model is Used in the RAG Local Knowledge Base

Written by Iris Vance
Updated on: July 12, 2025
Explore how RAG technology and Embedding models combine to open a new chapter for the intelligent local knowledge base.

Core content:
1. The role and practical application of the Embedding model in a local knowledge base
2. How to choose Embedding vector dimensions, with a comparative analysis of common models
3. How word vectorization and Embedding models work

1. Background

When we use retrieval-augmented generation (RAG) to build a local knowledge base, the Embedding model acts like an intelligent indexing system for that library, letting us find the knowledge we need quickly and accurately.

For newcomers to this field, it may still be unclear what exactly an Embedding model is and what role it plays in a local knowledge base.

Below, we discuss these questions in plain language with practical examples, and share how to prepare local knowledge so that a local RAG system answers questions more accurately and comprehensively.

2. What is the Embedding Model?

Simply put, Embedding puts a "digital coat" on data: it converts raw data of various kinds (text, images, audio, etc.) into a set of numbers, that is, a vector representation.

Embedding vectors are essentially mathematical coordinates that map semantic information into a high-dimensional space. Taking 3D space as an example:

  • "cat"
     → [0.7, -0.3, 0.1]
  • "dog"
     → [0.68, -0.25, 0.15]
  • "engine"
     → [-0.4, 0.8, 0.5]

2.1 Dimensional explanation

  • Low dimensionality (<100): weak semantic discrimination; "apple" (the fruit) may be confused with "Apple" (the phone brand)
  • High dimensionality (>1000): requires more computational resources, but can capture subtle differences such as "running" vs. "jogging"
# Actual model dimension example (models assumed loaded via sentence-transformers; the Hugging Face IDs are illustrative)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-large-zh-v1.5")
jina_model = SentenceTransformer("jinaai/jina-embeddings-v2-base-zh", trust_remote_code=True)
print("BGE dimension:", len(model.encode("Text example")))   # Output: 1024
print("Jina dimension:", len(jina_model.encode("example")))  # Output: 768

2.2 In-depth comparison of commonly used models


| Characteristic | BGE-large-zh-v1.5 | Jina-embeddings-v2 | Nomic-embed-text-v1.5 |
|---|---|---|---|
| Vector dimensions | 1024 | 768 | 768 |
| Maximum text length | 512 tokens | 8192 tokens | 2048 tokens |
| Recommended distance metric | Cosine similarity | Dot product (normalization required) | Cosine similarity |
| Training data volume | 5 billion Chinese tokens | 1.2 trillion multilingual tokens | 200 million English documents |
| Multilingual support | Chinese-first | 100+ languages | English only |
| Model size | 1.2 GB | 890 MB | 420 MB |
| Inference memory requirement | 3 GB | 2.5 GB | 1.8 GB |
| Temperature parameter | Not supported | Dynamic temperature adjustment | Fixed temperature 0.8 |
| Domain adaptability | Legal/financial editions | General domains | Optimized for scientific papers |
In building the local knowledge base, we focus on text embedding, which converts text information into a vector form that is easier for computers to understand and process.

3. Principles of Embedding Model

3.1 Word vectorization

  • One-hot encoding

       One-hot encoding is a simple and direct way to vectorize words. Imagine we have a fruit vocabulary with only three words: "apple", "banana" and "orange". We can assign a binary vector to each word, where only one position is 1 and the rest are 0. For example:

    • "Apple" is represented by [1, 0, 0]
    • "Banana" is represented by [0, 1, 0]
    • "Orange" is represented by [0, 0, 1]

This representation has obvious disadvantages: if the vocabulary becomes very large, the vectors become very long, and they cannot reflect the semantic relationships between words. For example, "apple" and "banana" are both fruits, yet under this representation they appear to have nothing to do with each other.
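The limitation is easy to see in code. Below is a minimal sketch (the three-word vocabulary is illustrative) showing that any two distinct one-hot vectors are orthogonal, so no similarity between "apple" and "banana" is captured:

```python
# Minimal one-hot sketch; the three-word vocabulary is illustrative.
import numpy as np

vocab = ["apple", "banana", "orange"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

print(one_hot["apple"])                             # [1. 0. 0.]
# Distinct words are orthogonal, so their dot product is always 0:
print(np.dot(one_hot["apple"], one_hot["banana"]))  # 0.0 -- no semantic similarity captured
```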

  • Word embeddings

Word embedding is a more advanced word vectorization method; models such as Word2Vec and GloVe are word embedding models.

Take Word2Vec as an example: it learns word vectors by predicting a word's context. Given the sentence "I like to eat apples", Word2Vec learns the vector representation of "apples" from the surrounding words ("like" and "eat").

Eventually, semantically similar words end up close together in the vector space; for example, the vectors of "apple" and "banana" will be near each other because both are fruits.
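As a rough illustration, the sketch below trains a tiny Word2Vec model with gensim; the toy corpus and hyperparameters are purely illustrative, and real models need far larger corpora for stable results:

```python
# Toy Word2Vec sketch with gensim; corpus and hyperparameters are illustrative only.
from gensim.models import Word2Vec

sentences = [
    ["i", "like", "to", "eat", "apples"],
    ["i", "like", "to", "eat", "bananas"],
    ["the", "car", "has", "a", "new", "engine"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=200)

# Words that appear in similar contexts tend to end up with similar vectors
# (results are noisy on a corpus this small).
print(model.wv.similarity("apples", "bananas"))
print(model.wv.similarity("apples", "engine"))
```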

3.2 Sentence Vectorization

  • Simple average / weighted average
    Simple averaging adds up the vectors of each word in the sentence and divides by the number of words to obtain the sentence vector.

For example, for the sentence "I love my motherland", suppose the vector of "I" is [0.1, 0.2, 0.3], the vector of "love" is [0.4, 0.5, 0.6], and the vector of "motherland" is [0.7, 0.8, 0.9]; the sentence vector is then the sum of these three vectors divided by 3. A weighted average instead assigns each word's vector a weight according to the word's importance; for instance, key words in a sentence can be given higher weights. Both are worked through in the sketch below.
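Here is that example worked through with NumPy; the weights used for the weighted average are illustrative assumptions:

```python
# Worked example from the text: average the three word vectors to get a sentence vector.
import numpy as np

w_i = np.array([0.1, 0.2, 0.3])
w_love = np.array([0.4, 0.5, 0.6])
w_motherland = np.array([0.7, 0.8, 0.9])

simple_avg = (w_i + w_love + w_motherland) / 3
print(simple_avg)  # approximately [0.4 0.5 0.6]

# Weighted average: give the key word "motherland" a higher weight (weights are illustrative).
weights = np.array([0.2, 0.3, 0.5])
weighted_avg = weights[0] * w_i + weights[1] * w_love + weights[2] * w_motherland
print(weighted_avg)  # approximately [0.49 0.59 0.69]
```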

  • Recurrent Neural Networks (RNNs)
    An RNN is like a little robot with memory: it processes the words of a sentence one by one, in order, and remembers the information it has already processed.

For example, for the sentence "I went to the supermarket to buy apples today", an RNN first processes "I", then processes "today" while carrying along the information from "I", and so on, finally producing a vector representation of the entire sentence. Gated variants such as LSTM and GRU handle long sentences better because their gating mechanisms help avoid losing important information over long sequences.

  • Convolutional Neural Networks (CNNs)
           CNN is like an inspector with a magnifying glass. It slides a small window (convolution kernel) across the sentence to capture local features in the sentence.

       For example, in the sentence "The pizza in this restaurant is delicious", CNN may discover local features such as "the pizza is delicious" through convolution kernels, and then generate sentence vectors based on these features.

  • Self-attention mechanism (such as the Transformer)
    The self-attention mechanism is like a smart secretary: it attends to the relationship between every word in a sentence and every other word.

For the sentence "Xiao Ming and Xiao Hong go to the park to play together", it analyzes the relationships between "Xiao Ming" and "Xiao Hong", and between them and "park" and "play", then combines this information into the sentence vector. This captures the complex semantic relationships among the words in a sentence more effectively.
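In practice, Transformer-based sentence embeddings are usually obtained through a library such as sentence-transformers. The sketch below is one possible way to do it; the checkpoint name is an illustrative multilingual model, not a recommendation:

```python
# Sketch: Transformer (self-attention) sentence embeddings via sentence-transformers.
# The checkpoint name is an illustrative multilingual model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
sentence = "Xiao Ming and Xiao Hong go to the park to play together"
vec = model.encode(sentence)
print(vec.shape)  # e.g. (384,) for this checkpoint

# Self-attention relates every word to every other word, so semantically
# related sentences land close together in vector space:
print(float(util.cos_sim(model.encode("The two children are playing in the park"), vec)))
```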

3.3 Document Vectorization

  • Simple average/weighted average
    Similar to sentence vectorization, the sentence vectors in a document are averaged (or weighted-averaged) to obtain the document's vector representation.

       For example, a document about travel may contain multiple sentences describing different scenic spots. The document vector is obtained by averaging these sentence vectors.

  • Document Topic Model
    A document topic model (for example, LDA) is like a master classifier: it analyzes the distribution of words in a document to infer the document's topics.

       For example, if the words "football", "match", "player" and so on appear multiple times in a document, it may be judged that the topic of this document is football. Then, the vector representation of the document is generated based on the topic distribution.

  • Hierarchical models (such as doc2vec)
    Doc2vec is an extension of Word2Vec that considers not only the words in a document but also the document's overall information.

       For example, if there is a series of review documents about different movies, doc2vec can learn a unique vector representation for each document while also reflecting the similarities between documents.
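The following is a minimal doc2vec sketch using gensim's Doc2Vec; the toy review documents and hyperparameters are illustrative, and results on such a tiny corpus are noisy:

```python
# Toy Doc2Vec sketch with gensim; the review documents and parameters are illustrative.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

reviews = [
    "a thrilling action movie with spectacular stunts",
    "the action scenes were spectacular and fast paced",
    "a slow romantic drama about two painters",
]
corpus = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(reviews)]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=200)
print(model.dv[0][:5])              # the learned vector for document 0
print(model.dv.similarity(0, 1))    # similar reviews tend to score higher than
print(model.dv.similarity(0, 2))    # dissimilar ones (noisy on a corpus this small)
```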

3.4 The Essence of the Distance Formula

Assume vectors A = [a₁, a₂, a₃] and B = [b₁, b₂, b₃].

| Method | Formula | Applicable scenarios |
|---|---|---|
| Cosine similarity | Σ(aᵢbᵢ) / (√(Σaᵢ²) · √(Σbᵢ²)) | Text retrieval (removes the effect of vector length) |
| Euclidean distance | √(Σ(aᵢ − bᵢ)²) | Image matching / structured data |
| Manhattan distance | Σ\|aᵢ − bᵢ\| | Sparse feature analysis |
| Dot-product similarity | Σ(aᵢbᵢ) | Fast computation on normalized vectors |
| Mahalanobis distance | √((A − B)ᵀ Σ⁻¹ (A − B)), where Σ is the covariance matrix | Scenarios where feature correlations matter |
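These formulas translate directly into a few lines of NumPy. The sketch below reuses the example vectors from section 2; Mahalanobis is only indicated in a comment because it additionally requires a covariance matrix estimated from the data:

```python
# Minimal NumPy versions of the formulas above, reusing the "cat"/"dog" vectors from section 2.
import numpy as np

A = np.array([0.7, -0.3, 0.1])
B = np.array([0.68, -0.25, 0.15])

cosine    = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))
euclidean = np.linalg.norm(A - B)
manhattan = np.abs(A - B).sum()
dot       = A @ B

print(f"cosine={cosine:.4f}  euclidean={euclidean:.4f}  manhattan={manhattan:.4f}  dot={dot:.4f}")
# Mahalanobis additionally needs the inverse covariance matrix `cov` of the data:
# mahalanobis = np.sqrt((A - B) @ np.linalg.inv(cov) @ (A - B))
```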

3.5 Hybrid search strategy

    A[User question] --> B{Question type identification}
    B -->|Technical terms| C[Semantic search]
    B -->|Dates/numbers| D[Keyword search]
    C --> E[Rerank]
    D --> E
    E --> F[Final result]
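The sketch below shows the routing idea from this flowchart in a highly simplified form; the digit-based routing rule and the toy scoring functions are assumptions for illustration only, while a real system would typically combine BM25 keyword retrieval, vector search, and a reranker:

```python
# Highly simplified routing sketch (toy scoring functions, not production code).
import re

docs = [
    "The phone was released on 2024-06-01.",
    "This phone uses the latest processor and has very powerful performance.",
]

def keyword_search(query):
    # Exact token overlap; better for dates, model numbers and other literal strings.
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().strip(".").split())))

def semantic_search(query):
    # Placeholder: a real system would rank by embedding similarity here.
    q = set(query.lower())
    return sorted(docs, key=lambda d: -len(q & set(d.lower())))

def hybrid_search(query):
    # Route questions containing digits to keyword search, everything else to semantic search.
    results = keyword_search(query) if re.search(r"\d", query) else semantic_search(query)
    return results[0]

print(hybrid_search("When was the phone released? 2024"))
print(hybrid_search("How powerful is the processor?"))
```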

Comparison of precision and recall:

| Method | Recall (rough estimate) | Precision (rough estimate) |
|---|---|---|
| Pure semantic search | 78% | 82% |
| Pure keyword search | 65% | 95% |
| Hybrid search | 92% | 89% |

Recall is an indicator worth paying attention to. Suppose we have a collection of 100 documents, 30 of which are about "artificial intelligence", and we use an information retrieval system to find documents about "artificial intelligence". If the system returns 20 documents, of which 15 are indeed about "artificial intelligence", then the other 15 relevant documents were not returned. By the recall formula, recall = 15 / 30 = 50%, meaning the system retrieved only half of the documents related to "artificial intelligence" (its precision, by contrast, would be 15 / 20 = 75%).
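The same arithmetic in a few lines of Python, using the numbers from the example above:

```python
# Recall and precision for the worked example above.
relevant_in_collection = 30   # documents about "artificial intelligence" in the collection
returned = 20                 # documents the system returned
relevant_returned = 15        # returned documents that are actually relevant

recall = relevant_returned / relevant_in_collection   # 15 / 30 = 0.5
precision = relevant_returned / returned              # 15 / 20 = 0.75
print(f"recall={recall:.0%}, precision={precision:.0%}")  # recall=50%, precision=75%
```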

4. Usage of the Embedding Model in the Local Knowledge Base

       In the local knowledge base RAG system, the Embedding model is mainly responsible for two important things:

4.1 Local Data Vectorization

The text data in the local knowledge base, such as documents and materials, is converted into vectors by the Embedding model and then stored in a vector database.

       For example, an e-commerce company has documents such as product manuals, user reviews, FAQs, etc. We can use the Embedding model to convert these documents into vectors sentence by sentence or paragraph by paragraph.

       Suppose there is a sentence like this in the product manual: "This phone uses the latest processor and has very powerful performance." The Embedding model will convert it into a vector and then store it in the vector database, just like putting this page of the book in a specific location in the library.
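As a concrete illustration, the sketch below embeds a couple of passages and stores them in a FAISS index; the model ID, passages, and the choice of FAISS are assumptions for illustration, and any embedding model and vector store work similarly:

```python
# Illustrative indexing sketch: embed passages and store them in a FAISS index.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

passages = [
    "This phone uses the latest processor and has very powerful performance.",
    "The battery supports fast charging and lasts a full day.",
]

model = SentenceTransformer("BAAI/bge-large-zh-v1.5")
vectors = model.encode(passages, normalize_embeddings=True)     # shape: (2, 1024)

index = faiss.IndexFlatIP(vectors.shape[1])                     # inner product == cosine after normalization
index.add(np.asarray(vectors, dtype="float32"))
```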

4.2 User Input Vectorization

       When a user asks a question, the Embedding model is also used to convert the question into a vector, and then the local data vector most similar to the question vector is searched in the vector database.

       For example, if a user asks "How is the performance of this phone?", the Embedding model will convert this question into a vector and then search the vector database for the closest vector. It may find the previously stored vector describing the phone's performance and locate the corresponding text content, just like quickly finding the relevant book page in a library based on the index.
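Continuing the indexing sketch from section 4.1, the query side could look like this (again an illustrative sketch, not the only way to do it):

```python
# Continuing the sketch above: embed the question and search the vector index.
question = "How is the performance of this phone?"
query_vec = model.encode([question], normalize_embeddings=True)

scores, ids = index.search(np.asarray(query_vec, dtype="float32"), 1)
print(passages[ids[0][0]], scores[0][0])   # the closest stored passage and its similarity score
```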

5. How to prepare local knowledge to improve the accuracy and comprehensiveness of local RAG answers

5.1 Data Collection

Data collection is like gathering building materials for our knowledge base: try to collect as much data related to the topic as possible.

       Taking building a local knowledge base about food as an example, we can collect data from the following aspects:

  • Professional books: such as gourmet cooking textbooks, local gourmet culture books, etc. These books usually contain rich theoretical knowledge and classic recipes.
  • Authoritative websites: food websites, official websites of the catering industry, etc. These websites will have the latest food information, restaurant recommendations, special dishes introductions, etc.
  • Academic papers: Academic papers on food science, food culture, etc. can provide more in-depth theoretical support for the knowledge base.
  • Social media: sharing by food bloggers, user food experience reviews, etc. These contents can reflect the public's taste preferences and actual food experience.

5.2 Data Cleansing

      The collected data may be like a pile of messy stones with a lot of impurities that need to be cleaned.

     For example, the food data we collect from the Internet may contain duplicate content, erroneous information, and irrelevant advertisements. We need to clean up these impurities and only keep accurate content related to food knowledge.

5.3 Data Structuring

       Data structuring is like classifying and organizing our building materials to facilitate subsequent search and use. For the food knowledge base, we can classify it according to the following structure:

  • Cuisine classification: such as Sichuan cuisine, Cantonese cuisine, Shandong cuisine, etc. Each cuisine is further subdivided into specific dishes.
  • Dish types: divided into stir-fried dishes, stews, soups, desserts, etc.
  • Ingredient classification: classification based on the main ingredients, such as meat, vegetables, seafood, etc.

       Through such structured classification, we can manage and retrieve data more conveniently. For example, when users want to find stir-fried dishes in Sichuan cuisine, they can quickly locate relevant dish information.

5.4 Data Annotation

       Labeling data can make the knowledge base more intelligent.

       In the food knowledge base, we can mark the difficulty level of the dishes (easy, medium, difficult), suitable groups of people (children, the elderly, fitness people, etc.), cooking time and other information.

      In this way, when searching, more accurate matching can be performed based on the user's specific needs.

For example, if the user is a fitness enthusiast looking for a low-calorie, high-protein dish, the labeled information makes it possible to quickly filter for dishes that meet those requirements, as in the sketch below.
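One possible way to represent such annotations is as metadata attached to each entry, which the retrieval layer can filter on before or after vector search; the fields and values below are illustrative assumptions:

```python
# Illustrative annotation scheme: each dish carries metadata that retrieval can filter on.
dishes = [
    {"name": "Steamed chicken breast", "difficulty": "easy",
     "audience": ["fitness"], "calories": "low", "protein": "high", "minutes": 20},
    {"name": "Braised pork belly", "difficulty": "medium",
     "audience": ["general"], "calories": "high", "protein": "high", "minutes": 90},
]

# A fitness user asking for low-calorie, high-protein dishes is matched on the labels,
# not only on text similarity:
hits = [d for d in dishes if d["calories"] == "low" and d["protein"] == "high"]
print([d["name"] for d in hits])   # ['Steamed chicken breast']
```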

6. Conclusion

By understanding these technical details, even beginners can make professional choices: when processing Chinese contract documents, choose the BGE model with cosine similarity; when processing multinational customer-service conversation logs, Jina's multilingual support is the better choice.

Remember, the ultimate basis for choosing a model is your actual use case: the one that fits is the best!