How does AI "understand" text? A plain-language guide to Embedding

Explore how AI "understands" text and gain an in-depth understanding of embedding technology.
Core content:
1. Embedding basics: converting information into numeric vectors
2. Embedding models: an introduction to the machine learning models that generate vectors
3. Practical applications and the importance of embeddings
1. What is Embedding?
In simple terms:
Embedding is the process of converting words, sentences, or any other kind of information into a set of numbers (a vector) so that computers can understand and process them.
It is as if every word or concept is "placed" (embedded) in a huge space, where each word has its own unique "position" coordinate (a numeric vector). The closer two words are, the more similar their meanings; the farther apart they are, the more different their meanings.
For example:
"Cat" → [0.5, 1.2, 0.3, …]
"Dog" → [0.51, 1.19, 0.29, …]
Because "cat" and "dog" have similar meanings, their coordinates are very close; the coordinates of "apple", by contrast, may be far away, because its meaning is unrelated.
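The "closer means more similar" idea is usually measured with cosine similarity. Here is a minimal sketch using toy 3-dimensional vectors (the values are invented for illustration; real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings (illustrative values only).
cat   = [0.5, 1.2, 0.3]
dog   = [0.51, 1.19, 0.29]
apple = [1.4, 0.1, 0.9]

print(cosine_similarity(cat, dog))    # close to 1.0 -> similar meanings
print(cosine_similarity(cat, apple))  # noticeably lower -> less related
```

Running this shows "cat" and "dog" scoring near 1.0 while "cat" and "apple" score much lower, which is exactly the geometric intuition described above.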
2. What is an Embedding Model?
An embedding model is a machine learning model specifically used to generate embeddings.
Common embedding models include Word2Vec, GloVe, FastText, BERT's embedding layer, and OpenAI's text embedding models (such as text-embedding-ada-002).
The purpose of these models is to:
Convert any text input (a word, a sentence, or even an entire article) into a fixed-length numeric vector that captures the semantic relationships behind the words or text.
3. What is the role of Embedding?
Embedding has very important practical application value:
1. Similarity Search and Recommendation
When you search for something similar to "apple", the embedding model can help you quickly find related words or concepts such as "pear" and "fruit".
Recommendation engines also often use embedding to make accurate recommendations.
2. Foundations of Natural Language Processing Tasks
As a preparatory step for large language models (LLMs): all text is first converted into embedding vectors before any subsequent language understanding or generation takes place.
3. Classification Tasks and Sentiment Analysis
Embedding vectors allow computers to quickly determine the sentiment (positive, negative, neutral) or topic classification of a sentence.
4. Information Retrieval and Question Answering (RAG)
Embeddings make it possible to quickly find the answers or document content that match a question, enabling efficient, intelligent question answering.
4. Relationship between Embedding and LLM
A large language model (LLM) essentially first converts the user's input (the prompt) into strings of numbers through embedding in order to capture the meaning behind it.
It then produces the corresponding answer through complex neural-network inference, and finally those numbers are converted back into text that humans can understand.
In other words: embedding is the first and most important step by which large language models understand the world and achieve intelligent communication.
5. Detailed description of the relationship between RAG and Embedding
We all know that RAG can be simply divided into two steps:
Retrieval: After a user asks a question, the most relevant content fragment is first searched in a pre-prepared knowledge base.
Generation: The retrieved content and the user’s question are sent to the Large Language Model (LLM) to generate the final answer.
The embedding model plays its core role in the first step, the retrieval stage.
Details of the role of the Embedding Model in RAG
The Embedding Model is mainly responsible for:
1) Convert the knowledge base content and user questions into vectors
The embedding model converts the documents or text fragments of the knowledge base into numeric vectors in advance and builds a vector database.
Every time a user asks a question, the question itself is converted into a vector through the embedding model.
Example:
User question: "How can I reduce anxiety?"
→ Embedding: [0.21, 0.92, 0.15, …]
Document excerpt: "Meditation can effectively reduce anxiety..."
→ Embedding: [0.20, 0.93, 0.14, …]
2) Perform semantic similarity retrieval
After the Embedding model converts the question into a vector, the system quickly retrieves the most relevant text fragments from the vector database through vector similarity calculations (such as cosine similarity).
The higher the semantic similarity (i.e., the smaller the vector distance), the more likely the fragment is to contain the accurate answer or the knowledge relevant to the user's question.
Example:
Vector similarity calculation results:
[User Question] ↔ [Document Snippet 1] → 0.95 ✅
[User Question] ↔ [Document Snippet 2] → 0.52 ❌
The system selects document snippet 1, which has the higher similarity, and provides it to the large language model to generate the answer.
6. What is the relationship between Tokenization and Embedding?
1. What is Tokenization?
Tokenization refers to breaking the original text down into smaller units called "tokens". In large-model training:
A token is the basic unit of model processing and can be:
Word-based
Subword (such as BPE or WordPiece)
Character-based
For example:
Original sentence:
"I like learning AI technology"
Possible results after tokenization:
["I", "like", "learning", "AI", "technology"]
Tokenization exists so that the model can process text data more efficiently: the model cannot process an entire sentence directly, so the sentence must be broken down into small units that are easy to process.
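The example above can be reproduced with a minimal word-level tokenizer. This is a deliberate simplification: real LLM tokenizers use subword schemes such as BPE or WordPiece and handle punctuation, casing, and out-of-vocabulary words.

```python
def tokenize(text):
    # Minimal word-level tokenization: split on whitespace.
    # Production tokenizers (BPE, WordPiece) split further into subwords.
    return text.split()

tokens = tokenize("I like learning AI technology")
print(tokens)  # ['I', 'like', 'learning', 'AI', 'technology']
```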
2. What is Embedding?
Embedding refers to converting each token produced by the decomposition above into a numeric vector.
The Embedding model assigns a fixed-length numeric vector to each token.
These vectors contain the semantic information of each token.
Another example:
Token "like" is converted into Embedding vector:
"like" → [0.32, 1.20, -0.24, ...]
Token "learning" is converted into an Embedding vector:
"learning" → [0.11, 0.85, -0.41, ...]
Embedding allows the model to understand and process the meaning behind tokens in a mathematical way.
3. What is the relationship between Tokenization and Embedding?
In model training, they are applied in order:
Step 1: Tokenization (convert the sentence into a token sequence).
Step 2: Embedding (convert each token into a vector).
That is:
Original text → Tokenization → Token sequence → Embedding → Numeric vector sequence
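The two-step pipeline can be sketched end to end. The embedding table below is invented for illustration; in a trained model these values are learned parameters, and unknown tokens are handled by subword splitting rather than a single fallback vector:

```python
# Toy embedding table: token -> fixed-length vector (values invented).
embedding_table = {
    "I":        [0.10, 0.40],
    "like":     [0.32, 1.20],
    "learning": [0.11, 0.85],
}
UNK = [0.0, 0.0]  # fallback vector for tokens outside the toy vocabulary

def embed(text):
    tokens = text.split()                                  # Step 1: Tokenization
    return [embedding_table.get(t, UNK) for t in tokens]   # Step 2: Embedding

vectors = embed("I like learning AI")
print(vectors)  # four fixed-length vectors, one per token ("AI" maps to UNK)
```

The output is exactly the "numeric vector sequence" at the end of the pipeline: one fixed-length vector per token, ready to be fed into the model's neural network layers.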