AI Search and Vector Data: How do models encode information and data into knowledge?

Written by
Jasper Cole
Updated on: June 20, 2025

Explore how AI search transforms data into knowledge and reveals the core technologies behind vector databases.

Core content:
1. Vector database: data semantic encoding and similarity measurement
2. Vector embedding technology: processing unstructured data
3. Application scenarios and methods: NLP, CV, recommendation systems, and multimodal applications


For each concept, we provide a deeper analysis and description organized around its application principles, typical application scenarios, and practical application methods.




1. Intro to Vector Databases

Core idea: Encode the semantic meaning of data as vectors in a high-dimensional space and measure the similarity of data items by computing the distance between their vectors (e.g., cosine similarity or Euclidean distance), enabling search based on "meaning" rather than literal wording.
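To make the distance measures concrete, here is a minimal NumPy sketch (with made-up three-dimensional toy vectors rather than real embeddings) computing cosine similarity and Euclidean distance:

```python
# Toy vectors standing in for embeddings; real embeddings have hundreds of dimensions.
import numpy as np

a = np.array([0.2, 0.7, 0.1])
b = np.array([0.25, 0.6, 0.15])

cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean_distance = np.linalg.norm(a - b)

print(f"cosine similarity:  {cosine_similarity:.4f}")
print(f"euclidean distance: {euclidean_distance:.4f}")
```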



  • Unstructured Data Objects:



    • Internal documents:
       Meeting minutes, reports, emails, chat logs, etc.



    • Internet content:  news articles, blog posts, social media updates, user comments.



    • Multimedia materials:  product photo gallery, surveillance video clips, and audio files uploaded by users.



    • Application principle:
       Traditional databases rely on structured schemas for querying and cannot effectively process text, images, audio, and video data with diverse formats and no fixed structure. Vector embedding technology can convert these objects into a unified numerical representation (vector), making them computable and comparable.






  • Application method:
     First identify the data type, then select an appropriate tool or pipeline (such as OCR, a text-extraction library, or an audio/video transcription tool) to convert it into a processable format (plain text, image files), and finally feed the result into an embedding model.



  • Vector:



    • Application principle:
       A vector is the coordinate representation of a data item in a high-dimensional semantic space; each dimension encodes part of an abstract feature of the data. Vectors that are close together in the space usually represent semantically similar original data.



    • Application scenarios:
       Stored in a vector database as the basic unit for data indexing and querying.



    • Application method:
       Generated by the embedding model, usually represented as an array of floating point numbers or integers. Its dimensionality (e.g. 384, 768, 1536 dimensions) is determined by the selected embedding model.



  • Vector Embedding:



    • Natural Language Processing (NLP):
       Text classification, sentiment analysis, question answering system, machine translation, semantic search.



    • Computer Vision (CV):
       Image retrieval (image search), face recognition, object detection, and image similarity recommendation.



    • Recommendation systems:
       Make recommendations based on similarity of user behavior or item content.



    • Multimodal Applications:
       Cross-modal retrieval of text and images.



    • Application principle:
       Based on the idea of "representation learning" in machine learning (especially deep learning): by learning from large amounts of data, the model maps complex inputs (such as words, sentences, or image pixels) into a vector space that is low-dimensional (relative to the complexity of the original data) but information-dense, so that semantically similar objects end up closer together in that space. This usually takes advantage of the "distributional hypothesis" (words/objects appearing in similar contexts have similar meanings).






  • Application method:
     Select a suitable pre-trained embedding model (such as Sentence-BERT for sentences, CLIP for images and text, Word2Vec for words), input the pre-processed data into the model, and obtain the output vector representation. Depending on the task, the model may need to be fine-tuned.
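As a minimal sketch of this step, the snippet below assumes the sentence-transformers package and the public all-MiniLM-L6-v2 checkpoint (384-dimensional output); any other sentence-embedding model is used the same way:

```python
# Generate dense embeddings for a few sentences and compare them.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "How do I reset my password?",
    "Steps to recover account access",
    "Best hiking trails near Denver",
]
embeddings = model.encode(sentences)          # shape: (3, 384)
print(embeddings.shape)

# Semantically related sentences should score higher than unrelated ones.
print(util.cos_sim(embeddings[0], embeddings[1]))  # related pair -> higher
print(util.cos_sim(embeddings[0], embeddings[2]))  # unrelated pair -> lower
```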



  • Embeddings Model:



    • Select Model:
       Choose an appropriate pre-trained model based on the data type (text, image, audio, multimodal) and specific task (general semantics, domain-specific) (e.g. the Hugging Face Transformers library provides a large selection).



    • Model deployment:
       Deploy models as services or integrate into data processing pipelines.



    • Fine-tuning (Optional):
       If a general model does not work well in a specific domain, you can use domain-related annotated data to fine-tune the model to improve the embedding quality.



    • Application principle:
       Use neural networks (such as Transformer, CNN, RNN) to learn patterns and relationships in data. For text, the model learns the contextual relationship between words and sentences; for images, it learns the relationship between pixels, textures, shapes, and objects. Multimodal models learn the correspondence between different modal data.



    • Application scenarios:
       It is the core tool for generating vector embeddings and is used in all scenarios where data needs to be converted into vector representations.






  • Vector-Based Index:



    • HNSW principle:
       Build a hierarchical graph; during search, start in the sparse top-level graph to make large jumps across the space, then gradually descend to the denser bottom-level graphs for a precise local search.



    • IVF Principle:
       First, the vector space is clustered into multiple Voronoi cells, and the index stores the cell to which each vector belongs and the list of vectors in the cell. When searching, the cells closest to the query vector are first located, and then precise searches are performed in these cells.



    • Application principle:
       Solve the "curse of dimensionality" problem of exact nearest-neighbor search in high-dimensional space. By constructing special data structures (such as graphs, trees, hashes), vectors are organized so that the search can quickly locate the region that likely contains the nearest neighbors, sacrificing a small amount of accuracy (returning approximate nearest neighbors, ANN) in exchange for orders-of-magnitude speedups.



  • Application scenarios:
     The core component of the vector database is used to accelerate similarity search. Without efficient indexing, searching large vector datasets will be very slow.



  • Application method:
     When creating a collection in a vector database, you usually need to specify the index type (such as HNSW, IVF_FLAT, IVF_PQ) and related parameters (such as M and efConstruction for HNSW, nlist for IVF). The database will automatically build or update the index when data is inserted. When querying, balance search speed and recall by setting search parameters (such as efSearch for HNSW, nprobe for IVF).
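The following sketch shows these parameters in Faiss's HNSW implementation (random placeholder vectors stand in for real embeddings); vector databases expose equivalent settings through their collection configuration:

```python
# Build and query an HNSW index with Faiss.
import numpy as np
import faiss

d = 128                                                 # vector dimensionality
xb = np.random.random((10_000, d)).astype("float32")    # database vectors
xq = np.random.random((5, d)).astype("float32")         # query vectors

index = faiss.IndexHNSWFlat(d, 32)       # M = 32 connections per node
index.hnsw.efConstruction = 200          # build-time candidate list size
index.add(xb)

index.hnsw.efSearch = 64                 # query-time speed/recall trade-off
distances, ids = index.search(xq, 4)     # 4 approximate nearest neighbors per query
print(ids)
```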



  • Vector Database:



    • Semantic Search Engines:
       Provide a search experience that is smarter than keyword matching and better understands user intent.



    • Recommendation systems:
       Recommend similar content based on user portrait vectors or item vectors.



    • Question Answering and Dialogue (RAG):
       Quickly retrieve contextual information relevant to user questions from large collections of documents.



    • Image/Video/Audio Library Management:
       Implement features such as searching by image and finding similar audio.



    • Anomaly Detection:
       Look for outliers that deviate far from the normal data pattern (vector).



    • Application principle:
       A system that integrates vector storage, efficient ANN indexing, vector query (based on similarity measurement), and traditional database management functions (such as data addition, deletion, modification, and query, metadata filtering, scalability, and high availability).






  • Application method:



    • Selection:
       Choose a suitable vector database (Weaviate, Pinecone, Milvus, Qdrant, etc.) based on performance requirements, data size, deployment method (cloud service vs. local deployment), ecosystem integration, and other factors.



    • Deployment and Configuration:
       Install and deploy the database, create a collection, and configure vector dimensions, distance metrics, index types, and parameters.



    • Data import:
       The original data is converted into vectors through the embedding model and imported into the database together with its metadata.



    • Query:
       Initiate a vector similarity search request through the SDK or API, and filter based on metadata.



    • Operations:
       Monitor database performance and perform operations such as scaling, backup, and recovery as needed.
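As one possible end-to-end illustration of this workflow, the sketch below uses Chroma as an example embedded vector database; the collection name, documents, and metadata are invented for the example, and other databases expose the same concepts (collections, vectors, metadata filters) through their own APIs:

```python
# Create a collection, import documents, and run a filtered similarity query.
import chromadb

client = chromadb.Client()                       # in-memory instance
collection = client.create_collection(name="articles")

# Data import: Chroma applies a default embedding model to the raw documents;
# precomputed vectors could also be passed via the `embeddings` argument.
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "How to configure HNSW index parameters",
        "Quarterly financial report for 2024",
        "Tuning efSearch for better recall",
    ],
    metadatas=[{"topic": "search"}, {"topic": "finance"}, {"topic": "search"}],
)

# Query: vector similarity search combined with a metadata filter.
results = collection.query(
    query_texts=["how do I tune my vector index?"],
    n_results=2,
    where={"topic": "search"},
)
print(results["ids"], results["distances"])
```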




2. Search

Core idea: Combine the advantages of different search paradigms and compensate for their individual shortcomings to achieve better overall results; in particular, combine the precision of keyword matching with the semantic understanding of vector search.



  • Sparse Vectors:



    • Application principle:
       Based on word frequency statistics and inverse document frequency, the vector dimension is equal to the vocabulary size, with each dimension corresponding to one word. The vector value usually indicates the importance of the word in the document (such as TF-IDF or BM25 score), and most dimensions are zero. It captures the "exact match" information of the bag of words model.



    • Application scenarios:
       Traditional keyword search engines, scenarios that require exact matching of terms or IDs, and the keyword part of hybrid searches.



    • Application method:
       Use libraries such as Scikit-learn's TfidfVectorizer or a dedicated BM25 library to generate sparse vectors from text. In some vector databases (such as Weaviate), BM25 can be used directly for keyword retrieval; the underlying mechanism is similar to sparse vectors.
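A minimal sketch with scikit-learn's TfidfVectorizer (the toy corpus is invented for the example):

```python
# Build sparse lexical vectors: one column per vocabulary term, mostly zeros.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "vector databases accelerate similarity search",
    "keyword search relies on exact term matching",
    "hybrid search combines keyword and vector search",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)        # scipy sparse matrix, shape (3, vocab_size)

print(X.shape)
print(vectorizer.get_feature_names_out())   # the vocabulary behind the dimensions
print(X[0].toarray())                       # mostly zeros: a sparse vector
```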



  • Dense Vectors:



    • Application principle:
       The input (text, image, etc.) is mapped into a lower-dimensional continuous vector space through a deep learning model, capturing the semantic, contextual, and abstract features of the input rather than the surface vocabulary.



    • Application scenarios:
       Semantic search, similarity recommendation, clustering, classification, semantic retrieval part in RAG.



    • Application method:
       Use pre-trained embedding models (such as SBERT, CLIP) to generate dense vectors and store them in the vector database for similarity search.



  • BM25/BM25F:



    • Application principle:
       Based on the probabilistic information retrieval model, TF-IDF is improved. It takes into account term frequency (TF), inverse document frequency (IDF), document length normalization (penalizing too long documents) and term frequency saturation (after the term frequency increases to a certain level, its contribution to relevance no longer increases linearly). BM25F further allows different weights to be given to different fields of the document (such as title, body).



    • Application scenarios:
       Keyword ranking algorithm widely used in modern search engines; calculate keyword relevance scores in hybrid searches.



    • Application method:
       It can be used as a standalone ranking function library or through search engines (such as Elasticsearch, OpenSearch) and the keyword-search modules of some vector databases. It requires the query terms and the document collection as input.



  • Alpha Parameter:



    • Application principle:
       In the final fusion stage of hybrid search, it is used as a hyperparameter to linearly combine (or otherwise) the relevance scores from different retrieval strategies (such as vector search and keyword search). It determines whether the final ranking is more biased towards semantic matching or keyword matching.



    • Application scenarios:
       Adjust the behavior of the hybrid search system to suit different types of queries or datasets. For example, you might want to give more weight to keywords for queries containing product codes, and more weight to vectors for descriptive queries.



    • Application method:
       When issuing a hybrid search query, specify an alpha value (usually between 0 and 1); some vector databases expose an alpha parameter directly on hybrid queries. The best value usually needs to be determined through experimentation and evaluation.
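Conceptually, the fusion step can be as simple as the convex combination sketched below; real engines differ in how they normalize the two score distributions before blending, so treat this only as an illustration of what alpha controls:

```python
# Blend normalized vector and keyword relevance scores with an alpha weight.
def blend(vector_score: float, keyword_score: float, alpha: float = 0.5) -> float:
    """alpha = 1.0 -> pure vector ranking, alpha = 0.0 -> pure keyword ranking."""
    return alpha * vector_score + (1 - alpha) * keyword_score

print(blend(0.82, 0.40, alpha=0.75))   # biased toward semantic relevance
print(blend(0.82, 0.40, alpha=0.25))   # biased toward keyword relevance
```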



  • Fusion Algorithm:



    • Application principle:
       It aims to merge the result lists from multiple (usually heterogeneous) ranking systems to generate a single, better ranked list. The common RRF (Reciprocal Rank Fusion) algorithm does not rely on the size of the original score, but calculates the fusion score based on the ranking of the document in each list, giving higher weight to the top-ranked results, and has better robustness.



    • Application scenarios:
       Hybrid search (combining vector search and keyword search results), integrated search (combining results from different search engines or databases).



    • Application method:
       Get the result list (document ID and ranking) of each search system, apply RRF or other fusion algorithms (such as CombSUM, CombMNZ) to calculate the fusion score of each document, and then finally sort all documents according to the fusion score.
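A minimal sketch of Reciprocal Rank Fusion over two invented result lists; k = 60 is the constant commonly used in the literature:

```python
# RRF: each system contributes 1 / (k + rank) per document, so agreement across
# systems is rewarded and the raw score scales are ignored.
from collections import defaultdict

def rrf(result_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:                    # each list is already ranked
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]
vector_hits  = ["doc1", "doc5", "doc3"]
print(rrf([keyword_hits, vector_hits]))             # doc1 and doc3 rise to the top
```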



  • Hybrid Search:



    • E-commerce search:
       Users may search for exact models (keywords) or descriptive requirements (semantics).



    • Enterprise Knowledge Base Search:
       You need to find documents that contain a specific term, but also documents that discuss related concepts.



    • General Web Search:
       Balance exact matching and intent understanding.



    • Application principle:
       Perform keyword search (leveraging its precision for terms, codes, and names) and vector search (leveraging its semantic understanding for synonyms, near-synonyms, and concept matching) at the same time, then merge the results with a fusion algorithm. The goal is to combine their strengths, compensate for their weaknesses, and improve overall recall and precision.






  • Application method:



    • Use a database or framework that supports hybrid search (such as some plugins for Elasticsearch).



    • Alternatively, query the keyword system and the vector system separately, and then implement the fusion logic yourself.



    • You will typically need to tune the alpha parameter and select a suitable fusion algorithm.



  • Keyword Search:



    • Application principle:
       Based on exact matches of words in the text, possibly after normalization (such as stemming or lemmatization). Usually an inverted index is used to speed up finding documents that contain the query terms, and algorithms such as BM25 or TF-IDF are used to rank them.



    • Application scenarios:
       Need to find documents containing specific words, codes, identifiers. The core of traditional search engines.



    • Application method:
       Use search engines such as Elasticsearch, OpenSearch, Solr, or the full-text search function of the database.



  • Semantic/Vector Search:



    • Application principle:
       Both the query and the document are converted into vector embeddings, and then the document vector that is most similar (closest) to the query vector is found in the vector space. Similarity calculations (such as cosine similarity) measure the degree of semantic closeness.



    • Application scenarios:
       Search that understands user intent, question-answering systems, recommendation systems, image search, and other scenarios that require matching based on "meaning".



    • Application method:
       Use a vector database or a library that supports vector search (such as Faiss, Annoy). The data needs to be embedded first.




3. Hierarchical Navigable Small World (HNSW)

Core idea: Accelerate the approximate nearest neighbor search in high-dimensional space by building a multi-layer proximity graph.



  • Hierarchical Graph Structure:



    • M : The maximum number of connections for each node in the graph, which affects the density and memory usage of the graph.



    • efConstruction : The size of the dynamic neighbor list when building the index, which affects the index construction time and quality.



    • Application principle:
       Simulates the characteristics of small-world networks (such as social networks) in the real world. The top-level graph has few nodes and long edge connections (similar to highways), which is used to quickly cross the vector space; the bottom-level graph has many nodes and short edge connections (similar to local roads), which is used for fine-grained search in the target area. Each layer is a proximity graph, where nodes are vectors and edges connect similar vectors.



    • Application scenarios:
       As the core implementation of efficient ANN indexing in vector databases (such as Weaviate, Milvus, Qdrant) or vector retrieval libraries (such as Faiss, NMSLIB).



    • Application method:
       When creating an index, select the HNSW type and configure the key parameters listed above (M and efConstruction), and set efSearch at query time to control the speed/recall trade-off.



  • ANN - Approximate Nearest Neighbor:



    • Application principle:
       In high-dimensional space, it is very time-consuming to accurately find the nearest neighbor. ANN algorithms aim to quickly find "close enough" neighbors of a query point with a very high probability, sacrificing absolute accuracy in theory in exchange for a huge speed increase in practical applications. Various ANN algorithms use different strategies (such as graphs, trees, hashing, quantization) to compress the search space or the vector itself.



    • Application scenarios:
       All scenarios that require fast similarity search on large-scale high-dimensional data, such as vector database query, recommendation system, clustering, etc.



    • Application method:
       Select and configure a specific ANN indexing algorithm (such as HNSW, IVF, LSH, PQ). When querying, you can set parameters (such as efSearch for HNSW) to balance search speed and recall/precision.



  • Layered Navigation:



    • Application principle:
       The core mechanism of HNSW search. Starting from one (or more) entry points in the top-level graph, greedily move along the edges to nodes closer to the query vector in the current level. When no closer nodes can be found in the current level, descend to the next level, use the current node as the entry point, and continue the greedy search. This process is repeated until the bottom level (level 0) is reached, and a final refined search is performed at that level.



    • Application scenarios:
       The internal workflow of the HNSW index when executing a query.



    • Application method:
      The user initiates the search through the query interface, and the HNSW module inside the database performs this navigation process. The user can usually control the depth and breadth of the search by setting the efSearch parameter (the dynamic candidate list size) at query time, trading off speed against accuracy.



  • Graph-Based Index:



    • Application principle:
       Treat each vector in the dataset as a node in the graph, and add an edge between two vectors if they are similar enough (close). When searching, start from a certain entry node and jump between neighbors along the edges of the graph, gradually approaching the query vector. HNSW is an advanced hierarchical implementation of this approach.



    • Application scenarios:
       It is particularly suitable for ANN search over high-dimensional data and generally outperforms tree-based indexes (whose performance degrades in high dimensions) and LSH (which usually requires long hash codes to reach high accuracy).



    • Application method:
       HNSW is one of the most commonly used graph-based indexes. Other graph indexing algorithms (such as NSG, FANNG) also exist, but HNSW is widely adopted due to its robustness and performance.




4. Multimodal RAG

Core idea: Extend the capabilities of RAG from plain text to include multiple data types such as images, audio, and video, so that LLM can understand and generate based on the retrieved mixed information.



  • Cross Modal Reasoning:



    • Visual Question Answering (VQA):
       Answer questions about images based on their content.



    • Image/Video Caption Generation:
       Generate text descriptions for images or videos.



    • Text to Image Generation:
       Generate an image from a text description.



    • Multimodal RAG:
       Understand the mixed context of retrieved images, texts, audio and video clips, etc.



    • Application principle:
       AI models are able to understand and relate information from different modalities (such as text descriptions and image content). This usually relies on mapping data from different modalities into a shared semantic space (via multimodal embedding models) so that the model can compare and integrate them.






  • Application method:
     Rely on powerful multimodal models (such as CLIP, BLIP, Flamingo, GPT-4V, etc.), which learn cross-modal associations during training.



  • Multimodal Embedding Model:



    • Cross-modal retrieval:
       Search images with text, search text with images, search videos with images, etc.



    • Retrieval phase of multimodal RAG:
       Retrieve related images or video clips using text queries and vice versa.



    • Application principle:
       Through a specific network structure (such as dual encoders or fusion encoders) and training objectives (such as contrastive learning), a unified vector space is learned so that semantically related objects of different modalities (such as pictures of "dog" and texts of "dog") are close in distance in this space. CLIP uses contrastive learning and is trained on a large number of image-text pairs, forcing matching image-text pairs to have similar embeddings and mismatching ones to be far apart.






  • Application method:
      Use a pre-trained multimodal embedding model (such as the CLIP model in the sentence-transformers library): input data from different modalities and obtain vector embeddings that can be compared in the same space.
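A minimal sketch using the CLIP checkpoint distributed for sentence-transformers; the image file name is a placeholder, and Pillow is assumed to be installed:

```python
# Embed an image and two captions into the same space and compare them.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

image_embedding = model.encode(Image.open("dog.jpg"))   # placeholder image file
text_embeddings = model.encode(["a photo of a dog", "a photo of a city skyline"])

# Matching image-text pairs should land closer together in the shared space.
print(util.cos_sim(image_embedding, text_embeddings))
```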



  • Multimodal contrastive fine-tuning:



    • Application principle:
       An application of contrastive learning, specifically for optimizing multimodal embeddings. By constructing positive pairs (such as matching images and captions) and negative pairs (such as mismatched images and captions), the model is trained to bring the embedding distance of positive pairs closer and the embedding distance of negative pairs farther apart. This makes the embedding space more sensitive to semantic relationships.



    • Application scenarios:
       Improve the performance of multimodal embedding models for specific domains or tasks, such as fine-tuning on medical images and reports, or specific product images and descriptions to improve retrieval accuracy.



    • Application method:
       Collect paired multimodal data in the target domain and further train the pre-trained multimodal embedding model using a contrastive loss function (such as InfoNCE loss).



  • Any-to-any Search:



    • Application principle:
       Based on a unified multimodal embedding space, data of any modality can be encoded into a vector and used as a query to retrieve data of any other modality with a similar vector distance in the database.



    • Application scenarios:
       A powerful multimodal search engine and digital asset management system that allows users to find all types of relevant content in the most convenient way (text, images, audio).



    • Application method:
       This requires a vector database that supports storing and querying multimodal embeddings, and a multimodal model that can generate high-quality embeddings in a shared space. At query time, the query (whatever its modality) is first converted into a vector, and a similarity search is then performed in the database.



  • Multimodal RAG:



    • Interpreting diagrams or complex images:
       Users upload pictures to ask questions, RAG retrieves relevant textual knowledge, and LLM combines the two to answer.



    • Questions and answers based on video content:
       Retrieve video keyframes and associated text to answer questions about video content.



    • Generate product descriptions or reports that include images.



    • Application principle:
       It combines the retrieval enhancement ideas of RAG with multimodal processing capabilities. When a user asks a query, the system not only retrieves text, but also images, audio, video clips, etc., and provides this multimodal information together with the original query to an LLM (such as GPT-4V, Gemini) that can understand multimodal input, which generates the final answer.






  • Application method:
     Build process:



  • 1. Use the multimodal embedding model to embed the multimodal data in the knowledge base and store it in the vector database.



  • 2. Embed the user query (which could be text or image, etc.).



  • 3. Perform cross-modal retrieval in the vector database to obtain relevant multimodal data blocks.



  • 4. Input the query together with the retrieved multimodal context (which may need to be encoded in a special format) to the multimodal LLM.



  • 5. LLM generates a response.








5. Databases

Core idea: Provide appropriate management and access mechanisms for different types of data (structured, unstructured, relational, graph-like) and different application requirements (storage, retrieval, analysis, scalability).



  • Graph Database:



    • Social Network Analysis:
       Friendships, community discovery, and influence analysis.



    • Recommendation Engine:
       Recommendations are made based on user-item interactions and item similarities.



    • Fraud Detection:
       Identify unusual connection patterns and linked accounts.



    • Knowledge Graph:
       Build and query networks of entities and their relationships.



    • Network and IT Operations:
       Dependency analysis and impact scope determination.



    • Application principle:
       Based on graph theory, nodes are used to represent entities, edges are used to represent relationships between entities, and properties are used to store node and edge information. Query languages (such as Cypher and Gremlin) optimize traversal and pattern matching operations on complex relationships (such as finding friends of friends, shortest paths, etc.).






  • Application method:  Select a graph database (such as Neo4j, ArangoDB, NebulaGraph), define the graph model (node labels, edge types, attributes), import data, and use graph query languages for query and analysis.



  • Inverted Indexes:



    • Application principle:
       The core is a data structure, which usually consists of two parts: a term dictionary and a postings list. The dictionary stores the terms (or other index units) that appear in all documents and points to their postings lists. The postings list records the document ID, term frequency, position, and other information containing the term. When querying, first search for the query term in the dictionary, obtain the corresponding postings list, and then find matching documents by merging (such as AND operation to obtain intersection, OR operation to obtain union) the postings lists of multiple terms.



    • Application scenarios:  the core of full-text search engines (such as Elasticsearch, Lucene); the text search function of the database; traditional information retrieval systems.



    • Application method:  When creating an index in a search engine or a database that supports full-text search, the system will automatically build an inverted index for the specified text field. Users enter keywords through the query interface to search.
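As a toy illustration of the dictionary-plus-postings-list idea described above (not how a production engine is implemented), consider:

```python
# A minimal inverted index: each term maps to the set of document IDs containing it.
# Real engines also store term frequencies, positions, and skip structures.
from collections import defaultdict

docs = {
    1: "vector search finds similar meaning",
    2: "keyword search finds exact terms",
    3: "hybrid search combines both",
}

index: dict[str, set[int]] = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# AND query: intersect the postings lists of both terms.
print(index["search"] & index["exact"])   # -> {2}
```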



  • Sharding:



    • Application principle:
       A form of horizontal partitioning that distributes the rows (or documents) of a database (table or collection) across multiple independent database instances (shards). A shard key is used to determine which shard a row belongs to. When querying, the routing component (such as mongos in MongoDB) directs the request to the corresponding shard based on the shard key in the query condition, or broadcasts the request to all shards and aggregates the results when it cannot be determined.



    • Application scenarios:  Processing large-scale data sets that exceed the capacity or performance limit of a single machine; improving the concurrency of writing and reading; improving system availability (a shard failure does not affect other shards). Widely used in large relational databases, NoSQL databases, and vector databases.



    • Application method:  Configure at the database level. You need to select a suitable sharding key (which affects data distribution balance and query efficiency) and determine the sharding strategy (such as hash sharding, range sharding). The database cluster usually automatically handles data distribution and query routing. For example, in Milvus, you can configure the number of shards ( shard_num ) of a Collection.



  • Multi Tenancy:



    • Application principle:
       On shared infrastructure (hardware, database instances), multiple customers (tenants) are served through logical isolation at the software level (such as using tenant IDs to distinguish data, independent access control policies, and resource quota limits). Each tenant feels like they are using an independent system, but in fact they share resources, thereby reducing costs.



    • Application scenarios:  SaaS (Software as a Service) applications, such as CRM systems, project management tools, cloud database services (including many vector database cloud services).



    • Application method:  For service providers, it is necessary to design a tenant isolation mechanism at the application architecture level. For users, they usually obtain an isolated tenant environment when registering for a service, access it through an API Key or a specific account, and do not need to care about the underlying sharing details.



  • Relational Database:



    • Application principle:
       Based on the relational model, data is stored in tables with fixed columns (fields) and data types, and rows (records) represent instances. Entity integrity and referential integrity are enforced through primary keys and foreign keys. SQL is used for data definition, manipulation, and querying, supporting ACID transactions (atomicity, consistency, isolation, and durability) to ensure data consistency and reliability.



    • Application scenarios:  transaction processing systems (OLTP), business data management, scenarios that require strong consistency guarantees, and storage of structured data. For example, order management, user accounts, and inventory systems.



    • Application method:  Design the database schema, create the table structure, and use SQL statements to perform add, delete, modify, and query operations. Query performance can be optimized through indexes (such as B-Tree indexes).



  • PQ (Product Quantization):



    • Application principle:
       A vector lossy compression technique designed to reduce storage space and speed up distance calculations.



  1. Segmentation:
     Split the original high-dimensional vector (say, D dimensions) into M low-dimensional sub-vectors (each of D/M dimensions).



  2. Clustering:  Cluster the i-th subvector of all vectors in the data set (usually using K-Means) to obtain K cluster centers (codewords) to form a codebook for the i-th subspace.



  3. Quantization:  For any original vector, replace each of its subvectors with the ID of the closest codeword in the subspace codebook to which it belongs.



  4. Storage:  The original vector is compressed into a short code consisting of M codeword IDs.



  5. Distance calculation:  The distance between compressed vectors can be quickly estimated by pre-calculating the distance between codewords or using approximate distances (such as asymmetric distance calculation ADC).



  • Application scenarios:
     Large-scale vector similarity search, often used in conjunction with other indexing structures such as IVF (called IVF_PQ), to compress vectors stored in inverted lists, or directly as an indexing method.



  • Application method:
     When creating an index in a vector database or library (such as Faiss), select the PQ or IVF_PQ type and specify parameters such as the number of sub-vectors M and the number of codewords K per subspace (usually K = 256, i.e., 8 bits per sub-vector). Choosing appropriate parameters requires a trade-off between compression rate, memory usage, search speed, and accuracy.
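A minimal Faiss sketch of an IVF_PQ index with invented parameter values and random placeholder vectors:

```python
# IVF_PQ: vectors are assigned to nlist coarse cells (IVF) and compressed with
# product quantization (m sub-vectors, 8 bits each -> 256 codewords per subspace).
import numpy as np
import faiss

d, nlist, m = 128, 100, 16                 # dimension, cells, PQ sub-vectors
xb = np.random.random((20_000, d)).astype("float32")
xq = np.random.random((5, d)).astype("float32")

quantizer = faiss.IndexFlatL2(d)           # coarse quantizer for the IVF layer
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
index.train(xb)                            # learn cell centroids and PQ codebooks
index.add(xb)

index.nprobe = 10                          # how many cells to visit per query
distances, ids = index.search(xq, 4)
print(ids)
```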




6. Large Language Models (LLMs)

Core idea: Pre-train very large neural networks (usually based on the Transformer architecture) on massive text corpora to learn the statistical regularities, grammatical structure, and semantic relations of language, along with a degree of world knowledge and reasoning ability, so that the model can understand and generate natural language.



    • Large Language Model (LLM):



      • Content Generation:
         Write articles, emails, code, poetry, advertising copy, etc.



      • Question answering and dialogue:  intelligent customer service, chatbots, virtual assistants.



      • Text Summarization and Translation.



      • Sentiment analysis and text classification.



      • Generator in RAG:  Generates answers based on the retrieved context.



      • Application principle:
         Self-supervised learning based on the Transformer architecture. The model learns language representation by predicting the next word in the text (Causal LM, such as GPT) or masked words (Masked LM, such as BERT) during the pre-training phase. The huge number of model parameters and training data enables it to capture complex language phenomena.






    • Application method:
       Call pre-trained models through APIs (such as OpenAI API, Google Gemini API), or download open source models (such as Llama, Mistral) for local deployment and use/fine-tuning. Prompt Engineering is required to guide the model to produce the desired output.
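A minimal sketch of the API route, assuming the openai Python package (v1-style client), a model name such as gpt-4o-mini, and an OPENAI_API_KEY environment variable; other providers and locally deployed open-source models follow the same prompt-in, text-out pattern:

```python
# Call a hosted chat model with a system prompt and a user prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain vector embeddings in two sentences."},
    ],
)
print(response.choices[0].message.content)
```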



    • Finetuning:



      • Specific area questions and answers:
         Fine-tuned for specialized fields such as medicine, law, and finance.



      • Specific style or role-playing:  Have the model imitate the speaking style of a specific person.



      • Improving performance on specific tasks:  e.g. fine-tuning on specific types of summarization or translation tasks.



      • Instruction following:  Through instruction tuning, the model can better understand and execute user instructions.



      • Application principle:
         Based on the pre-trained general LLM, additional training is performed using labeled datasets for specific tasks or domains. This allows the model to adjust its parameters to better adapt to the characteristics of the target task and data distribution, while retaining the general language capabilities learned in the pre-training phase.






    • Application method:
       Prepare a labeled dataset (input-output pairs) for the target task, select a suitable pre-trained model, and fine-tune it using a training framework (such as the Trainer API in the Hugging Face transformers library). Hyperparameters such as the learning rate and the number of epochs need to be tuned. The fine-tuned model can then be deployed and used.



    • Multi-modal:



      • Application principle:
         Refers to the ability of an AI system to process and integrate inputs from multiple senses (text, images, audio, video, etc.). Multimodal LLMs typically have encoders that can process inputs from different modalities and fuse these representations into a common space, which is then processed and generated by the core language model.



      • Application scenarios:  visual question answering, image description, text-to-image generation, multimodal dialogue systems, multimodal RAG.



      • Application method:  Use LLM APIs that support multimodal input (such as GPT-4V, Gemini Pro Vision) or deploy corresponding open source models. When inputting, you need to provide data of different modalities (for example, text and image URLs or Base64 encoding) as required by the model.



    • Embedding model (LLM context):



      • Application principle:
         Inside LLM, the input text is first tokenized, and then each token is converted into a vector embedding. These initial embeddings (usually including word embeddings, position embeddings, paragraph embeddings, etc.) are the starting point for LLM to process sequence information, and subsequent Transformer layers will continuously transform and enrich these embedding representations. At the same time, independent embedding models (such as Sentence-BERT) are often used in the retrieval stage of RAG to specifically generate text representations for similarity search.



      • Application scenarios:  An essential part of LLM internal processing flow; the retrieval component of the RAG system.



      • Application method:  For LLM internal embedding, it is an inherent part of the model. For RAG retrieval, a separate, high-quality text embedding model needs to be selected and used.



    • Chunking (LLM context):



      • Application principle:
         Since the Transformer architecture of LLM usually has a fixed context window size limit (such as 4k, 8k, 32k, 128k tokens), it cannot process very long documents at once. Chunking is the process of splitting long documents into segments smaller than the context window limit so that the most relevant chunks can be processed chunk by chunk or retrieved in RAG.



      • Application scenarios:  Processing documents that exceed the LLM context length; preparing knowledge base data for the RAG system.



      • Application method:  Choose a suitable chunking strategy (fixed size, recursive, semantic, etc.), and use the chunking utilities provided by frameworks such as LangChain and LlamaIndex. Parameters such as chunk size and overlap need to be tuned to balance information integrity and processing efficiency.



    • Generative AI:



      • Application principle:
         Different from discriminative AI (such as classification and regression), the goal of generative AI is to learn the underlying distribution of training data and to sample and generate new data that is similar to but different from the training data. LLM generates text by learning the probability distribution of language, and diffusion models generate images by gradually removing noise.



      • Application scenarios:  content creation, art design, data enhancement, virtual world construction, drug discovery, etc. LLM is an important branch.



      • Application method:  Use the corresponding generation model (LLM, Stable Diffusion, Midjourney, etc.) to guide the generation process by providing prompts or conditions.



    • Transformer Model:



      • Application principle:
         At its core is the self-attention mechanism, which lets the model, when processing each element in a sequence, dynamically compute the importance (attention weight) of every other element to the current one and aggregate information accordingly. This enables the model to capture long-distance dependencies effectively and to compute in parallel. Multi-head attention lets the model attend to information from different representation subspaces, and positional encoding supplies sequence order. Encoder-decoder, encoder-only, and decoder-only variants are all widely used.



      • Application scenarios:  The infrastructure of almost all modern top LLMs; machine translation, text summarization, question-answering systems, and many other NLP tasks.



      • Application method:  It is integrated into various LLM and NLP models as the underlying architecture. Researchers and developers usually use pre-trained models based on Transformer and adapt them to specific tasks through fine-tuning.




    7. Information Retrieval/Search

    Core idea: Find relevant and useful information (usually documents or data fragments) from large-scale data sets based on the user's information needs (usually expressed as queries).



    • Reranking:



      • Application principle:
         The idea of ​​phased retrieval. The first phase (recall/retrieval) uses fast but relatively rough methods (such as BM25, ANN vector search) to filter out candidate sets (such as Top K=100) from massive data. The second phase (reranking) applies a more complex, more accurate but computationally more expensive model (such as the Transformer-based cross encoder) to re-score and rank this smaller candidate set to obtain the final result (such as Top N=10). The cross encoder inputs the query and each candidate document into the model at the same time, which can make a deeper relevance judgment.



      • Application scenarios:  Improve the accuracy and relevance of the final results of search engines, recommendation systems, and RAG systems.



      • Application method:  After the initial retrieval, the query and each candidate document are fed as a pair into a re-ranking model (such as the Cross-Encoder models in the sentence-transformers library), and the candidate set is re-ranked according to the relevance scores output by the model.
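A minimal sketch with a public cross-encoder checkpoint from sentence-transformers; the query and candidate passages are invented for the example:

```python
# Second-stage reranking: score (query, passage) pairs with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do I tune HNSW for higher recall?"
candidates = [                     # e.g., the Top-K results of a first-stage search
    "Increase efSearch to trade query speed for recall.",
    "Quarterly revenue grew by 12% year over year.",
    "efConstruction controls index build quality.",
]
scores = reranker.predict([(query, doc) for doc in candidates])

# Re-order the candidates by the cross-encoder's relevance scores.
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {doc}")
```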



    • Retrieval Enhanced Generation (RAG):



      • Enterprise Intelligence Q&A:
         Answer employee or customer questions based on your internal document repository.



      • Knowledge assistants in specific fields:  such as medical and legal consulting assistance.



      • Question-answering systems that require cited sources or up-to-date information:  compensates for the LLM's knowledge cutoff and hallucination problems.



      • Application principle:
         It combines the advantages of information retrieval (finding relevant information from a knowledge base) and natural language generation (generating fluent and coherent answers using LLM). Basic process: Receive user query -> Use the query (after embedding) to retrieve relevant document fragments (context) in the vector database (or other knowledge base) -> Combine the original query and the retrieved context into an extended prompt -> Input this prompt into LLM -> LLM generates the final answer based on the provided context.




    • Application method:



      • Building a knowledge base:
         Prepare documents, segment them, embed them, and store them in a vector database.



      • Implement the retriever:  Select the embedding model and vector database to implement the function of retrieving the Top K relevant blocks based on the query.



      • Implement the generator:  select LLM, design an appropriate prompt template, integrate the query and retrieved context, and input it into LLM.



      • Integration and Optimization:  Use frameworks such as LangChain, LlamaIndex to simplify the building process, and apply various advanced RAG techniques for optimization.
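A minimal sketch of the whole loop under stated assumptions (Chroma with its default embedding model as the retriever, an OpenAI chat model as the generator, invented documents); frameworks such as LangChain or LlamaIndex wrap these same steps:

```python
# End-to-end RAG sketch: store chunks, retrieve context, generate an answer.
import chromadb
from openai import OpenAI

# 1. Knowledge base: chunked documents embedded and stored in a vector database.
collection = chromadb.Client().create_collection("kb")
collection.add(
    ids=["c1", "c2"],
    documents=[
        "Our refund policy allows returns within 30 days of purchase.",
        "Support is available Monday to Friday, 9am to 5pm CET.",
    ],
)

# 2. Retriever: fetch the Top-K chunks most relevant to the user query.
question = "How long do customers have to return an item?"
retrieved = collection.query(query_texts=[question], n_results=2)["documents"][0]
context = "\n".join(retrieved)

# 3. Generator: combine query and retrieved context into a prompt for the LLM.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
response = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```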








    8. Embedding Types

    Core idea: Develop vector embedding representation methods with different characteristics to meet different needs (semantic capture, keyword matching, storage efficiency, representation granularity).



    • Variable Dimensions:



      • Resource-constrained devices:
         Use low-dimensional prefixes for inference on edge devices or in memory-constrained environments.



      • Multi-stage search:
         First use low-dimensional embedding for fast rough screening, and then use high-dimensional embedding for precise sorting.



      • Adaptive Search:
         Select an appropriate dimension based on the complexity of the query or the accuracy requirement.



      • Application principle:
         Allows the dimension of the embedding vector to be dynamically adjusted as needed. For example, the embedding trained by Matryoshka Representation Learning (MRL) has a prefix of the vector (such as the first 64 dimensions, the first 128 dimensions) that is itself an effective, low-dimensional embedding. This provides a flexible trade-off between computational resources/latency and representation accuracy.




    • Application method:
       Use pre-trained models that support variable dimensions (such as models trained with MRL). You can store the complete embedding and truncate it to prefixes of different lengths as needed for queries or downstream tasks.
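A minimal sketch of prefix truncation; it assumes the embedding was produced by an MRL-trained model (otherwise the prefix is not a meaningful embedding) and uses a random placeholder vector:

```python
# Keep the first k dimensions of a Matryoshka-style embedding and re-normalize.
import numpy as np

full = np.random.random(768).astype("float32")   # stand-in for a 768-d embedding

def truncate(embedding: np.ndarray, dims: int) -> np.ndarray:
    prefix = embedding[:dims]
    return prefix / np.linalg.norm(prefix)        # re-normalize for cosine search

fast_coarse = truncate(full, 128)    # cheap first-pass retrieval
precise = truncate(full, 768)        # full-accuracy reranking
print(fast_coarse.shape, precise.shape)
```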



    • Sparse Embeddings:



      • Searches that require an exact match of a term, ID, or code.



      • The keyword-search part of hybrid search, complementary to dense embeddings.



      • Fields such as law and medicine that require precise terminology retrieval.



      • Application principle:
         It mainly captures information at the vocabulary level. The vector dimension is equal to the vocabulary size. Only the dimension corresponding to the word that appears in the document has a non-zero value (usually the weight of the word, such as the BM25 score or the learned weight such as SPLADE). It is good at exact keyword matching.




    • Application method:  Use TF-IDF or BM25 to compute them, or use specially trained sparse-representation models (such as SPLADE) to generate learned sparse vectors. Storage and querying usually require specific data structures (such as inverted indexes).



    • Quantized Embeddings:



      • Very large-scale vector datasets:
         When billions or more vectors need to be stored and retrieved, memory and storage costs are the main bottlenecks.



      • Scenarios that require low latency queries:
         Quantized distance calculations are usually faster.



      • Memory-constrained environments.



      • Application principle:
         By reducing the precision of the values ​​in the vector (such as float32 -> int8 or binary) or using a codebook (such as PQ), the storage size of the vector is greatly compressed and the distance calculation is accelerated (such as int8 calculation is faster, and binary Hamming distance calculation is extremely fast). This is a lossy compression, which requires a trade-off between compression rate and accuracy loss.




    • Application method:  When creating an index in a vector database (such as Milvus, Qdrant) or library (such as Faiss), select an index type that supports quantization (such as IVF_PQ, ScalarQuantizer, or IndexBinaryFlat). You need to configure quantization parameters (such as M and K for PQ, or the target precision, e.g., int8 or binary). The quantization process is usually completed during the index building phase.



    • Multi-vector Embeddings:



      • Passage Retrieval:
         It often outperforms a single vector representation when exact matches to specific phrases or terms in a query are required.



      • Question answering systems that require fine-grained relevance judgment.



      • Search tasks that are sensitive to word order or exact wording.



      • Application principle:
         Unlike compressing an entire text (such as a sentence or paragraph) into a single vector, the multi-vector method generates separate vectors for smaller units in the text (such as each word or token). A "late interaction" strategy is used during retrieval: the vector of each query token is compared with the vectors of each document token, and these fine-grained similarity scores are aggregated to more accurately capture word-level matching and contextual relationships.






    • Application method:  Use models that support multi-vector representation (such as ColBERT). A specialized index structure and query logic are required to support late-interaction scoring. Compared with single-vector retrieval, the computational cost is usually higher.



    • Dense Embeddings:



      • Semantic search (core application):
         Understand user intent and find conceptually related content.



      • NLP tasks such as text clustering, classification, and sentiment analysis.



      • Content representation in recommender systems.



      • The semantic retrieval part of the RAG system.



      • Application principle:
         Captures the deep semantic meaning of text or data, rather than superficial vocabulary. The vector dimension is relatively low (hundreds to thousands of dimensions), and most elements are non-zero. It is learned by deep learning models on a large amount of data, and objects with similar meanings are close in distance in the vector space.




    • Application method:  Use pre-trained dense embedding models (such as Sentence-BERT, Universal Sentence Encoder, OpenAI Ada embeddings, etc.) to generate vectors, store them in a vector database, and query them using metrics such as cosine similarity.



    • Binary Embeddings:



      • Extremely resource-constrained environments:
         Such as mobile devices or embedded systems.



      • Very large datasets that are extremely sensitive to storage costs.



      • Scenarios where a certain loss of accuracy can be accepted in exchange for extremely high speed.



      • Application principle:
         The extreme form of quantization: each dimension of the vector is compressed to a single bit (0 or 1). It is very memory efficient, and the distance between vectors can be calculated using the very fast Hamming distance (the number of corresponding bits that differ) or Jaccard similarity.




    • Application method:
       Use a model or technique that specifically generates binary embeddings. You need an index that supports binary vectors and Hamming distance (such as Faiss's IndexBinaryFlat or LSH-based binary indexing). The accuracy loss is usually more noticeable than with int8 or PQ quantization.
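A toy sketch of the underlying arithmetic (sign binarization, bit packing, and Hamming distance via XOR) using NumPy with random placeholder vectors:

```python
# Binarize two float vectors by sign, pack them into bits, and compare them.
import numpy as np

a = np.random.randn(256)
b = np.random.randn(256)

a_bits = np.packbits(a > 0)          # 256 dims -> 32 bytes
b_bits = np.packbits(b > 0)

hamming = np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum()
print(f"Hamming distance: {hamming} of 256 bits differ")
```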







    9. Chunking Techniques

    Core idea: Split large documents or data streams into smaller, more easily processed units (chunks) to fit model constraints (such as context windows) and to optimize retrieval efficiency and relevance.



    • Semantic Chunking:



      • Application principle:
         Try to divide the boundaries based on the semantic content of the text, with the goal of each chunk containing a relatively complete and coherent unit of thought or topic. This is usually achieved by analyzing the similarity of the embedding vectors between sentences or paragraphs, and splitting where the similarity drops significantly.



      • Application scenarios:
         When dealing with texts with strong narrative and close logical connections between paragraphs (such as articles and reports), we hope that the context retrieved by RAG is as semantically complete as possible.



      • Application method:
         Use NLP libraries (such as spaCy) to perform sentence segmentation, calculate the embedding similarity of adjacent sentences (or sentence groups), set a similarity threshold or detect mutation points to determine the chunk boundaries. Some frameworks (such as LlamaIndex) provide experimental semantic chunkers.



    • Recursive Chunking:



      • Application principle:
         Takes a list of delimiters ordered by priority (e.g., paragraph breaks "\n\n", newlines "\n", spaces, periods) and tries to split the text with the highest-priority delimiter first. If a resulting chunk is still too large, it is recursively split with the next delimiter in the list, until the chunk size meets the requirement or no lower-priority delimiter remains. This respects the natural structure of the text as much as possible while controlling chunk size.



      • Application scenarios:
         It can process texts that have certain structures (such as paragraphs and line breaks) but are not completely regular. It can also provide a relatively robust way to segment texts when the specific structure of the text is unknown.



      • Application method:
         This is the default or recommended method commonly used in frameworks such as LangChain and LlamaIndex. You need to specify the chunk size ( chunk_size ), chunk overlap ( chunk_overlap ), and a list of delimiters.
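A minimal sketch assuming the langchain-text-splitters package (older LangChain versions expose the same class as langchain.text_splitter) and a local report.txt file, which is a placeholder:

```python
# Recursive character splitting with explicit size, overlap, and delimiter priority.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,                        # max characters per chunk
    chunk_overlap=50,                      # shared characters between neighbors
    separators=["\n\n", "\n", " ", ""],    # tried in priority order
)
chunks = splitter.split_text(open("report.txt", encoding="utf-8").read())
print(len(chunks), chunks[0][:80])
```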



    • LLM-based chunking:



      • Application principle:
         Use the LLM's ability to understand the text content to guide the chunking process. You can use carefully designed prompts to let the LLM directly divide the text into semantically coherent chunks, or identify logical chapters or major topics in the text as the basis for chunking.



      • Application scenarios:
         Processing documents with complex structures that require deep understanding to be segmented reasonably (such as legal contracts and research papers); scenarios that pursue the highest quality semantic blocks.



      • Application method:
         Design a suitable prompt and call the LLM API for chunking. This is usually slower and more costly than algorithm chunking, and the benefits need to be carefully evaluated.



    • Fixed size chunks:



      • Application principle:
         The simplest method: split the text by a fixed number of characters or tokens. Usually an overlap is set so that the end of each chunk is repeated at the beginning of the next, reducing the risk of information being cut off at chunk boundaries.



      • Application scenarios:
         It processes plain text with no structure or unreliable structure information; it is simple and fast to implement and serves as a baseline method.



      • Application method:
         Cut directly by length, setting the chunk_size and chunk_overlap parameters. Note that character counts and token counts differ across languages.
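A minimal character-based sketch; token-based variants work the same way using a tokenizer's counts:

```python
# Fixed-size chunking with overlap, counted in characters.
def fixed_size_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "word " * 400
chunks = fixed_size_chunks(sample, chunk_size=200, chunk_overlap=20)
print(len(chunks), len(chunks[0]))
```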



    • Chunking based on document structure:



      • Application principle:
         Use the document's inherent structural markers (such as HTML tags, Markdown headings, JSON keys, PDF bookmarks, or paragraph formatting) as chunk boundaries.



      • Application scenarios:
         Process well-formatted and clearly structured documents, such as web pages (HTML), Markdown files, code files, and structured PDF reports. The logical hierarchy of the original text can be preserved to the greatest extent.



      • Application method:
         Use the corresponding document parsing library (such as BeautifulSoup for HTML, PyMuPDF for PDF) to extract the structural information and the corresponding content blocks. For example, you can divide the content into chunks according to HTML tags such as <p>, <div>, and <h1>, or by Markdown heading levels.




    10. Advanced RAG Techniques

    Core idea: Optimize each link of the basic RAG process (query understanding, data preparation, retrieval, generation) to improve the accuracy, relevance, reliability and efficiency of the final answer.





    • Reasoning and Action (ReAct):



      • Application principle:
         The LLM does not merely generate answers passively from the retrieved context; it actively plans and executes a series of "thought-action-observation" steps to solve the problem. Actions can include calling external tools such as search engines, calculators, or database query APIs.



      • Application scenarios:
         Handle complex queries that require multi-step reasoning, require external real-time information, or require interaction with external systems to answer. For example, "Find the latest stock price of company X and compare it to company Y."



      • Application method:
         Use a framework that supports the ReAct mode (such as LangChain Agents) or implement the prompt loop yourself. You need to provide the LLM with the available tool set and its description, and design prompts that can guide it to perform ReAct reasoning.



    • Tree of Thoughts (ToT):



      • Application principle:
         Allow LLMs to explore multiple branches of possibility during reasoning. For each step, the LLM can generate multiple "ideas" (possible next steps or solutions), then evaluate these ideas (perhaps through self-reflection or calling an external evaluator), and choose the most promising branch to continue exploring, even backtracking.



      • Application scenarios:
         Solve complex problems or generative tasks that require exploratory thinking, do not have a single correct path, or require evaluating multiple solutions. In RAG, this can be used to explore different interpretations of a query or evaluate combinations of multiple retrieved context fragments.



      • Application method:
         Implement complex prompting strategies or use specialized agent frameworks to manage state, generate, and evaluate multiple concurrent thought paths. Computationally expensive.



    • Chain of Thought (CoT):



      • Application principle:
         Prompts such as “think step by step” guide the LLM to output detailed reasoning steps before giving the final answer. This explicit, decomposed thinking process helps the LLM understand the problem more accurately, use contextual information, and reduce reasoning errors.



      • Application scenarios:
         Improve LLM's performance on tasks that require logical reasoning, mathematical calculations, or complex instruction following; in RAG, help LLM better integrate and utilize retrieved contextual information to construct answers.



      • Application method:
         Include an instruction in the final generation prompt that guides the LLM to think step by step, as in the sketch below. You can use either "Zero-shot CoT" (just add the instruction) or "Few-shot CoT" (provide several worked examples with reasoning steps in the prompt).
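
A small illustrative template showing how a zero-shot CoT instruction can be appended to a RAG generation prompt (the exact wording is an assumption, not a prescribed formula):

```python
COT_RAG_PROMPT = """Use only the context below to answer the question.

Context:
{context}

Question: {question}

Let's think step by step, then give the final answer on a new line
starting with "Answer:"."""

prompt = COT_RAG_PROMPT.format(
    context="...retrieved chunks...",
    question="...user question...",
)
```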



    • Data cleaning:



      • Application principle:
         The quality of RAG's knowledge base directly affects the retrieval and generation results. "Garbage in, garbage out." Cleaning aims to remove noise, errors, and irrelevant information to improve data quality.



      • Application scenarios:
         The data preparation phase of a RAG system, especially when the source data comes from multiple origins and varies in quality.



      • Application method:
         Use regular expressions, scripts, or specialized libraries to remove HTML tags, advertisements, headers and footers, and repeated whitespace; perform spelling correction, standardize date formats, handle missing values, etc. (see the sketch below).
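
A minimal cleaning sketch based on regular expressions; the specific patterns (residual HTML tags, a page-footer pattern, whitespace runs) are illustrative and would be adapted to the actual corpus:

```python
import re

def clean_text(raw: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)            # strip residual HTML tags
    text = re.sub(r"Page \d+ of \d+", " ", text)   # drop page headers/footers (example pattern)
    text = re.sub(r"[ \t]+", " ", text)            # collapse repeated spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)         # collapse runs of blank lines
    return text.strip()
```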



    • Data extraction and analysis:



      • Application principle:
         Accurately extract processable text content and metadata from various raw formats (PDF, DOCX, HTML, PPT, etc.) and possibly parse their structure.



      • Application scenarios:
         The first step of building the RAG knowledge base, whenever the source files are not plain text.



      • Application method:
         Use Python libraries such as PyPDF2, python-docx, BeautifulSoup, and unstructured.io (see the sketch below). For text in scanned PDFs or images, use OCR tools (such as Tesseract or PaddleOCR).
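
For example, a short sketch that pulls plain text out of a PDF with PyPDF2 and out of HTML with BeautifulSoup (file paths are placeholders):

```python
from PyPDF2 import PdfReader
from bs4 import BeautifulSoup

def pdf_to_text(path: str) -> str:
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def html_to_text(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    return soup.get_text(separator="\n", strip=True)
```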



    • Data conversion:



      • Application principle:
         Convert the extracted and cleaned data into a unified and regular format (such as plain text, Markdown, structured JSON) to facilitate subsequent segmentation, embedding, and LLM processing.



      • Application scenarios:
         Ensure consistency of data format within the RAG knowledge base.



      • Application method:
         Write conversion scripts as needed, for example converting extracted tables to Markdown format or unifying all text into UTF-8 encoding (see the small example below).
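
As a small example of such a conversion script, the helper below turns an extracted table (header plus rows) into a Markdown table; the function name and input shape are assumptions:

```python
def rows_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Render an extracted table as a Markdown table."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)

# rows_to_markdown(["Quarter", "Revenue"], [["Q1", "1.2M"], ["Q2", "1.5M"]])
```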



    • Embedding model fine-tuning:



      • Application principle:
         A general-purpose embedding model may not be able to understand the subtle differences between terms or concepts in a specific domain well. By fine-tuning on (query, related documents) pairs in the target domain, the embedding model can better capture the semantic relationships in that domain, thereby improving retrieval accuracy.



      • Application scenarios:
         RAG systems in professional fields (such as medicine, law, finance, and scientific research) have extremely high requirements for retrieval relevance.



      • Application method:
         Prepare a dataset of domain-related queries and their corresponding relevant/irrelevant documents, then further train the pre-trained embedding model with libraries such as sentence-transformers (usually with a contrastive loss, triplet loss, etc.), as in the sketch below.
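
A compact sketch of contrastive fine-tuning using the classic sentence-transformers training API; the base model name, example pair, and hyperparameters are illustrative:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")  # any pre-trained embedding model

# Each example pairs a domain query with a passage known to be relevant to it.
train_examples = [
    InputExample(texts=["What is the limitation period for fraud claims?",
                        "The limitation period for fraud claims is ... (domain passage)"]),
    # ... more (query, relevant passage) pairs ...
]

loader = DataLoader(train_examples, shuffle=True, batch_size=16)
# In-batch negatives: other passages in the batch serve as negatives (contrastive objective).
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("domain-finetuned-embedder")
```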



    • Distance Thresholding:



      • Application principle:
         The results returned by vector search are sorted by similarity (or distance). By setting a threshold and retaining only the results with a similarity higher than the threshold (or a distance lower than the threshold), you can filter out those "noise" results that are relatively close in the vector space but may not actually be relevant to the query.



      • Application scenarios:
         Improve the signal-to-noise ratio of the context fed into the LLM and avoid irrelevant information interfering with the generation.



      • Application method:
         After receiving the top-K results returned by the vector database, check each result's similarity score and pass only those above the preset threshold to the generation stage, as in the sketch below. The threshold needs to be tuned based on experience or experiments.
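
A minimal sketch of the filtering step, assuming each result carries a similarity score where higher means more similar (flip the comparison for distance metrics):

```python
def filter_by_score(results: list[dict], threshold: float = 0.75) -> list[dict]:
    """Keep only hits whose similarity score meets the threshold.

    Assumes each result looks like {"text": ..., "score": ...}.
    """
    return [r for r in results if r["score"] >= threshold]

# kept = filter_by_score(top_k_results, threshold=0.8)
```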



    • Prompt Engineering:



      • Application principle:
         The output quality of LLM is highly dependent on the quality of the input prompts. In RAG, how to organize the original query, how to present the retrieved context, and what kind of instructions to give to guide LLM to generate answers based on this information are all part of prompt engineering.



      • Application scenarios:
         The generation stage of RAG is the key link in determining the quality of the final answer.



      • Application method:
         Design clear, specific prompt templates. Experiment with different ways of presenting context (e.g., numbered lists, separators), different wording of instructions (e.g., “Answer the question based on the following information,” “Summarize the following information”), whether to require citing sources, etc.
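
A hedged example of such a template: numbered context passages, an instruction to answer only from them, and a request to cite passage numbers (all wording choices here are assumptions):

```python
RAG_PROMPT = """You are a helpful assistant. Answer the question using ONLY the
numbered context passages below. If the context is insufficient, say so.
Cite the passage numbers you used, e.g. [1].

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return RAG_PROMPT.format(context=context, question=question)
```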



    • Context Compression:



      • Summary:
         Each retrieved chunk is summarized by another LLM call.



      • Selective extraction:
         Based on the overlap or relevance score with the query, extract only the key sentences in each chunk (see the sketch after this list).



      • Filter:
         Remove duplicate or highly similar chunks.



      • Use a specialized context compression tool such as LongLLMLingua.



      • Application principle:
         The retrieved context may be large, exceeding the LLM context window limit, or contain redundant information. Compression aims to reduce the length of the context while retaining the most important information to fit within the window limit, reduce cost, and improve the focus of the LLM.



      • Application scenarios:
         When the retrieved context is too long or contains much redundancy.
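
As one concrete instance of the selective-extraction idea listed above, a hedged sketch that keeps only the sentences in each chunk with the highest word overlap with the query (a production system might use an LLM or a tool such as LongLLMLingua instead):

```python
import re

def compress_chunk(chunk: str, query: str, keep: int = 3) -> str:
    """Keep the `keep` sentences with the highest word overlap with the query."""
    sentences = re.split(r"(?<=[.!?])\s+", chunk)
    query_words = set(query.lower().split())
    ranked = sorted(
        sentences,
        key=lambda s: len(query_words & set(s.lower().split())),
        reverse=True,
    )
    kept = set(ranked[:keep])
    # Preserve the original order of the kept sentences.
    return " ".join(s for s in sentences if s in kept)
```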




    • Metadata filtering:



      • Application principle:
         Leverage metadata associated with vectors stored in the vector database (such as document source, creation date, author, category tags, etc.) to filter before or while searching for vectors. This can greatly narrow the search scope and improve efficiency and relevance.



      • Application scenarios:
         When a user query implies a requirement for specific metadata (such as "find reports on Company A from the past week"); or when the RAG system needs to retrieve information from a specific subset (such as documents for a certain department).



      • Application method:
         When calling the vector database's query interface, pass metadata filter conditions along with the query, as in the sketch below. Most modern vector databases support this feature.
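
A sketch using ChromaDB's `where` filter as one example of this pattern; the collection name and metadata field are assumptions, and other vector databases expose equivalent filter parameters:

```python
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("reports")

# Retrieve only vectors whose metadata matches the filter, alongside the vector search.
results = collection.query(
    query_texts=["quarterly revenue of Company A"],
    n_results=5,
    where={"department": "finance"},   # metadata filter condition
)
```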



    • Query routing:



      • Application principle:
         For complex RAG systems with multiple knowledge base indexes (which may correspond to different domains, different data types, or different update frequencies), query routing aims to intelligently direct user queries to the most appropriate index or processing flow based on the intent or content of the query.



      • Application scenarios:
         Large-scale enterprise-level RAG systems; RAG applications that need to integrate different data sources (such as document libraries, databases, APIs).



      • Application method:
         Add a query classification step at the beginning of the RAG process (using simple rules, keyword matching, or an LLM for intent recognition), and select the index and prompt template to use based on the classification result, as in the sketch below.
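
A minimal rule-based router sketch; the route names and keyword lists are illustrative, and an LLM classifier could replace the keyword check:

```python
ROUTES = {
    "hr_policies": ["vacation", "leave", "benefits", "payroll"],
    "engineering_docs": ["api", "deployment", "architecture", "bug"],
}

def route_query(query: str, default: str = "general_kb") -> str:
    """Pick the index whose keywords best match the query (simple keyword routing)."""
    q = query.lower()
    scores = {name: sum(kw in q for kw in kws) for name, kws in ROUTES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

# route_query("How many vacation days do I get?")  ->  "hr_policies"
```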



    • Data preprocessing:



      • Application principle:
         This is the foundational work for building a high-quality RAG knowledge base, covering all preparation steps from raw data to clean, well-organized text blocks ready for embedding.



      • Application scenarios:
         The initial phase of any RAG system build.



      • Application method:
         Comprehensively use data cleaning, extraction, analysis, conversion, segmentation and other technologies to form a complete data processing pipeline.



    • Autocut:



      • Application principle:
         (Speculative) An automated post-retrieval step that further prunes or filters out context judged to be only weakly relevant to the core intent of the query before generation, even if it sits close in vector space. It may rely on more elaborate rules or models to judge each piece of information's actual contribution.



      • Application scenarios:
         Further purify the context and reduce the burden of LLM in processing irrelevant information.



      • Application method:
         It may be a specific library, algorithm or service that implements this functionality. The exact method is unknown, but the goal is to optimize the signal-to-noise ratio of the context.



    • Query Rewrite:



      • Application principle:
         The user's original query may be vague, contain typos, be overly complex, or be phrased differently from the documents in the knowledge base. Query rewriting uses an LLM or other techniques to optimize, clarify, or decompose the original query so that it is better suited to subsequent retrieval (especially vector retrieval) and generation.



      • Application scenarios:
         Improved handling of unstructured, natural language queries; handling of complex or multi-intent queries.



      • Application method:
         Before searching, feed the user query into an LLM and prompt it to rewrite, correct, clarify, or decompose it. Examples include HyDE (Hypothetical Document Embeddings), which generates a hypothetical answer and searches with its embedding (see the sketch below), and Step-back Prompting, which generates a more general question before searching.
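
A hedged sketch of the HyDE idea mentioned above: ask the LLM for a hypothetical answer, embed that answer, and retrieve with its vector. Both `call_llm` and `embed` are hypothetical stand-ins for real clients:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM client -- replace with the real API call."""
    raise NotImplementedError

def embed(text: str) -> list[float]:
    """Hypothetical embedding client -- replace with the real embedding model."""
    raise NotImplementedError

def hyde_query_vector(user_query: str) -> list[float]:
    # 1. Ask the LLM to write a plausible (hypothetical) answer passage.
    hypothetical_doc = call_llm(
        f"Write a short passage that would answer this question:\n{user_query}"
    )
    # 2. Embed the hypothetical passage and use that vector for retrieval.
    return embed(hypothetical_doc)
```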



    • Query expansion:



      • Dictionary-based:
         Expand with synonym dictionaries such as WordNet.



      • LLM-based:
         Have the LLM generate words or phrases related to the query.



      • Embedding-space expansion:
         Find other word vectors near the query vector and use them for expansion.



      • Add the expanded terms to the original query for retrieval (may need to adjust the weights).



      • Application principle:
         To improve recall, synonyms, related concepts, or domain terms can be automatically added to the original query, especially when the query is short or does not contain all the relevant terms.



      • Application scenarios:
         When there is concern that the initial query may not retrieve all relevant documents (recall-first scenarios).