After building an image RAG search myself, I realized my previous understanding of RAG didn't go far enough.

Written by
Iris Vance
Updated on: July 9, 2025

Image RAG is a breakthrough in multimodal AI applications. This article analyzes the principles and practice of image RAG technology and walks you through building a high-performance image RAG system.

Core content:
1. An overview of image RAG technology and its advantages in multimodal applications
2. Data preprocessing and feature extraction: the key steps that encode an image from pixels to vectors
3. A practical guide with code examples: applying the CLIP model and extracting image features


With the rapid development of AI technology, Image RAG (Retrieval-Augmented Generation) is fast becoming the "killer feature" of multimodal applications. Whether it is search-by-image on e-commerce platforms or text-to-illustration in education, Image RAG delivers striking results by efficiently combining retrieval with generation.

Have you ever encountered such a scenario:

  • When you upload a picture of a piece of clothing to an e-commerce platform, the system not only finds similar styles but also automatically generates outfit suggestions;

  • When you are studying a complex medical article, AI intelligently matches relevant medical images and can even generate explanations that combine pictures and text;

  • When a designer uploads a sketch, AI retrieves images in related styles and generates creative drafts that better match the design intent.

Behind these magic-seeming features lies the power of Image RAG. By combining retrieval with generation, it keeps the results accurate while making the content more creative, lifting AI's understanding to a new level. This article provides a detailed practical guide with complete code examples to help you get started quickly and build a high-performance Image RAG system!


1. What is Image RAG?

Simply put, Image RAG is a technique that combines image retrieval with a generative model. The core idea: first retrieve the images or information most relevant to the user's input from a large collection, then feed those retrieval results as context into a generative model to produce a high-quality response. (In some scenarios, the retrieved images are returned directly, with no generation step.) Compared with retrieval-only or generation-only approaches, Image RAG has the following advantages:


  • Accuracy: the retrieval step ensures results are highly relevant to the user's input.

  • Creativity: the generative model further enriches the output content.

  • Multimodality: multiple input forms, such as text and images, are supported.

Next, we break down each stage of the image RAG pipeline, with detailed implementation steps and code so you can put it into practice directly.
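To keep the big picture in view, here is a minimal pseudocode-style sketch of the pipeline we are about to build. Every name in it (encode_with_clip, vector_index, fetch_metadata, build_prompt, generative_model) is a placeholder for a component implemented in the sections below, not an existing API:

# High-level Image RAG pipeline (placeholder names, filled in by later sections)
def image_rag(user_input):
    # 1. Encode the query (text or image) into a vector -- Section 2
    query_vec = encode_with_clip(user_input)

    # 2. Retrieve the Top-K nearest items from the vector index -- Sections 3-4
    top_k_ids = vector_index.search(query_vec, k=5)

    # 3. Attach business metadata (IDs, categories, prices) -- Sections 3-4
    context = fetch_metadata(top_k_ids)

    # 4. Generate a response grounded in the retrieved context -- Section 5
    prompt = build_prompt(user_input, context)
    return generative_model(prompt)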


2. Data preprocessing and feature extraction: Laying a solid foundation

1. Image Encoding: From Pixels to Vectors

The first step of image RAG is to convert each image into a feature vector the machine can work with. We recommend the CLIP model (ViT-B/32), a powerful model from OpenAI that maps images and text into the same vector space, making it well suited to multimodal tasks.

Tool Selection:

  • Model: CLIP (ViT-B/32), for its excellent image-text alignment.

  • Framework: Hugging Face's transformers library, which is simple to use and has broad community support.

Preparation:

Make sure you have the necessary libraries installed:

pip install torch transformers pillow numpy

Code implementation:

The following is a complete image feature extraction example. Make sure you have a picture named image.jpg in your working directory:

from PIL import Image
from transformers import CLIPProcessor, CLIPModel
import numpy as np

# Load the pre-trained CLIP model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Open and process the image
image = Image.open("image.jpg").convert("RGB")  # Make sure the image is in RGB format
inputs = processor(images=image, return_tensors="pt")  # Convert to PyTorch tensors

# Extract image features
image_features = model.get_image_features(**inputs).detach().numpy()

# Normalize the feature vectors
image_features = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)

# Check the results
print("Image feature dimension:", image_features.shape)  # should be (1, 512)
print("Image feature example:", image_features[0][:5])  # View the first 5 values

Note:

  • Why normalize: normalization scales every feature vector to unit length, so that differences in vector magnitude do not distort the cosine-similarity computation during retrieval (see the snippet after this list).

  • Exception handling: if an image cannot be opened or has the wrong format, add a try-except block to keep the code robust:

try:
    image = Image.open("image.jpg").convert("RGB")
except Exception as e:
    print(f"Image loading failed: {e}")
    exit(1)
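To make the normalization point concrete, here is a tiny self-contained sketch (with made-up 2-D vectors) showing that once vectors are unit length, the inner product used by the Faiss index later in this article is exactly cosine similarity:

import numpy as np

a = np.array([[3.0, 4.0]])  # length 5
b = np.array([[6.0, 8.0]])  # same direction, length 10

# The raw inner product mixes direction with magnitude
print(a @ b.T)  # [[50.]]

# After normalization the inner product is pure cosine similarity
a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
print(a_n @ b_n.T)  # [[1.]] -- identical directions score 1.0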

2. Text Alignment: Integrating Text and Images

If your application needs to support text queries (such as "red dress"), you must convert the text into a vector aligned with the image features. CLIP's strength is precisely that it can process images and text in the same space.

Code implementation:

# Define the text query
text = "a red dress"
text_inputs = processor(text=text, return_tensors="pt", padding=True)

# Extract text features
text_features = model.get_text_features(**text_inputs).detach().numpy()

# Normalize text features
text_features = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)

# Check the results
print("Text feature dimension:", text_features.shape)  # should be (1, 512)
print("Text feature example:", text_features[0][:5])  # View the first 5 values

Note:

  • Keep the text input concise; overly long sentences degrade feature quality.

  • For multilingual support, try a multilingual CLIP variant (for example, sentence-transformers/clip-ViT-B-32-multilingual-v1).

3. Metadata association: a bridge between business and technology

After feature extraction, you need to associate each feature vector with business data (such as product ID and price). We recommend using Pandas to save the data in Parquet format, which is efficient and saves disk space.

Preparation:

Install Pandas:

pip install pandas pyarrow

Code implementation:

import  pandas  as  pd
# Construct metadata
metadata = {
    "image_id": ["img_001"],
    "feature": [image_features.tobytes()],  # Convert to binary for storage
    "category": ["dress"],
    "price": [299]
}

# Create a DataFrame and save it as Parquet
df = pd.DataFrame(metadata)
df.to_parquet("image_metadata.parquet", engine="pyarrow")

# Verify the saved result
print("Metadata has been saved to image_metadata.parquet")
df_loaded = pd.read_parquet("image_metadata.parquet")
print(df_loaded)

Note:

  • Binary storage: tobytes() converts the NumPy array to raw bytes, saving space and simplifying database storage.

  • Batch processing: for a large number of images, extract features in a loop and collect the rows before building the DataFrame, as sketched below.
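A minimal sketch of that batch loop, reusing the model and processor loaded earlier; the images/*.jpg glob pattern is a hypothetical directory layout:

import glob
import numpy as np
import pandas as pd
from PIL import Image

rows = []
for img_path in glob.glob("images/*.jpg"):  # hypothetical image directory
    image = Image.open(img_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    feat = model.get_image_features(**inputs).detach().numpy()
    feat = feat / np.linalg.norm(feat, axis=1, keepdims=True)
    rows.append({"image_id": img_path, "feature": feat.tobytes()})

# Build the DataFrame once at the end; appending row by row to a DataFrame is slow
pd.DataFrame(rows).to_parquet("image_metadata.parquet", engine="pyarrow")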



3. Index building: making retrieval as fast as lightning

1. Vector indexing: Faiss comes to the rescue

With the feature vectors in hand, the next step is to build an index for fast retrieval. We recommend Faiss, Facebook's open-source dense-vector retrieval library, which supports GPU acceleration and is extremely efficient.

Preparation:

Install Faiss (the CPU version is used here as an example; the GPU version is installed separately):

pip install faiss-cpu

Assume you have extracted features from multiple images and saved them as all_image_features.npy (shape (n, 512), where n is the number of images):

# Example: generate simulated features
import numpy as np

np.random.seed(42)
all_image_features = np.random.randn(1000, 512).astype(np.float32)  # 1000 images

# Normalize so that the inner-product index below measures cosine similarity
all_image_features = all_image_features / np.linalg.norm(all_image_features, axis=1, keepdims=True)

np.save("all_image_features.npy", all_image_features)

Basic index:

import faiss
# Define the feature dimension
dim = 512  # CLIP feature dimension

# Create an inner-product similarity index
index = faiss.IndexFlatIP(dim)

# Load features and add them to the index
features = np.load("all_image_features.npy")
index.add(features)

# Save the index
faiss.write_index(index, "image_index.faiss")

# Verify the index
print("Total number of vectors in index:", index.ntotal)  # should be 1000

Optimizing the index: speeding up with IVFFlat

With millions of vectors, IndexFlatIP becomes slow because every search scans the whole collection. Use IVFFlat (an inverted-file index), which clusters the vectors to narrow the search scope.

# Define the number of cluster centers
nlist = 100  # Adjust to the amount of data; sqrt(n) is a common rule of thumb
quantizer = faiss.IndexFlatIP(dim)

# Create the IVFFlat index
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)

# Train the index
index.train(features)

# Add the features
index.add(features)

# Save the index
faiss.write_index(index, "image_index_ivf.faiss")

# Set the search scope (optional)
index.nprobe = 10  # Probe 10 cluster centers to balance speed and accuracy

Note:

  • Choosing nlist: the larger the dataset, the larger nlist should be, but an overly large value increases training time.

  • GPU acceleration: if you have a GPU, install faiss-gpu; you can use GpuIndexFlatIP directly or move an existing CPU index to the GPU, as sketched below.
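As one option, a minimal sketch of moving the already-built CPU index onto GPU 0 (this assumes faiss-gpu is installed and index is the index built above):

import faiss

res = faiss.StandardGpuResources()                 # allocate GPU resources
gpu_index = faiss.index_cpu_to_gpu(res, 0, index)  # copy the CPU index to GPU 0
D, I = gpu_index.search(features[:1], 5)           # same search API as on the CPU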

2. Metadata storage: SQLite comes into play

When returning search hits, business information must come back along with the vectors. We use SQLite to store image IDs, features, and metadata.

Preparation:

SQLite requires no extra installation; Python ships with built-in support.

Code implementation:

import  sqlite3
# Connect to the database
conn = sqlite3.connect("image_rag.db")

# Create the table
conn.execute('''CREATE TABLE IF NOT EXISTS images
                (id TEXT PRIMARY KEY, feature BLOB, category TEXT, price INT)''')

# Insert sample data
conn.execute("INSERT OR REPLACE INTO images VALUES (?, ?, ?, ?)",
             ("img_001", image_features.tobytes(), "dress", 299))

# Commit the changes
conn.commit()

# Verify the data
cursor = conn.cursor()
cursor.execute("SELECT * FROM images WHERE id='img_001'")
print("Query result:", cursor.fetchone())
conn.close()

Note:

  • Primary-key uniqueness: id is declared TEXT PRIMARY KEY to prevent duplicate inserts.

  • Batch insert: for larger datasets, executemany improves efficiency, as shown below.
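A minimal executemany sketch with hypothetical sample rows, assuming the connection from above is still open:

# Hypothetical extra rows; in practice, build these from your batch pipeline
rows = [
    ("img_002", image_features.tobytes(), "shirt", 199),
    ("img_003", image_features.tobytes(), "skirt", 259),
]
conn.executemany("INSERT OR REPLACE INTO images VALUES (?, ?, ?, ?)", rows)
conn.commit()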


4. Search phase: Find the best matching content

1. Handling User Input

The user may enter text (such as "blue shirt") or an image; either way, we need to convert the input into a vector.

Text query:

text = "a blue shirt"text_inputs = processor(text=text, return_tensors="pt", padding=True)text_features = model.get_text_features(**text_inputs).detach().numpy()text_features = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)

Image query:

query_image = Image.open("query.jpg").convert("RGB")
query_inputs = processor(images=query_image, return_tensors="pt")
query_features = model.get_image_features(**query_inputs).detach().numpy()
query_features = query_features / np.linalg.norm(query_features, axis=1, keepdims=True)

2. Perform a search

Use Faiss to retrieve the Top-K results and then pull metadata from SQLite.

Code implementation:

# Load the index
index = faiss.read_index("image_index_ivf.faiss")

# Retrieve the Top-5 results
k = 5
D, I = index.search(query_features, k)  # D holds the similarities, I the indices

# Connect to the database
conn = sqlite3.connect("image_rag.db")
cursor = conn.cursor()

# Fetch the metadata
results = []
for idx in I[0]:
    cursor.execute("SELECT id, category, price FROM images WHERE rowid=?", (int(idx) + 1,))
    results.append(cursor.fetchone())

conn.close()

# Output the results
print("Search results:")
for result in results:
    print(f"ID: {result[0]}, Category: {result[1]}, Price: {result[2]}")

Note:

  • Rowid offset: Faiss indices start at 0 while SQLite rowids start at 1, hence idx + 1. This mapping assumes rows were inserted in the same order the vectors were added to the index.

  • Exception handling: if the index or database is empty, add a check (see the sketch after this list).
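A minimal sketch of those guards, assuming the index, cursor, and I from the search above:

# Refuse to search an empty index
if index.ntotal == 0:
    raise RuntimeError("Faiss index is empty -- build it before searching")

results = []
for idx in I[0]:
    cursor.execute("SELECT id, category, price FROM images WHERE rowid=?", (int(idx) + 1,))
    row = cursor.fetchone()
    if row is None:  # the vector has no matching metadata row
        continue
    results.append(row)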

3. Reranking (optional)

If you need higher accuracy, a cross-encoder can re-score the retrieved results.

Preparation:

pip install sentence-transformers

Code implementation:

from sentence_transformers import CrossEncoder
# Load the cross encoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Assume the search results come with description text
result_descriptions = ["Red silk dress", "Blue cotton shirt"]
query_text = "a blue shirt"
pairs = [(query_text, desc) for desc in result_descriptions]

# Calculate relevance scores
scores = reranker.predict(pairs)

# Sort by score (highest first)
sorted_indices = np.argsort(scores)[::-1]
print("Order after reranking:", sorted_indices)


5. Generation phase: from retrieval to creative output

1. Multimodal prompts

Combine the user query and the retrieval results into a prompt and hand it to the generative model.

Code implementation:

user_query = "a blue shirt"
prompt = f"""User Query: {user_query}
Retrieved Images: [img_001.jpg, img_002.jpg] (Categories: dress, shirt)
Retrieved Text: "This blue shirt is made of cotton, priced at $49."
Task: Generate a response explaining why these results are relevant."""


2. Calling the generative model

Since GPT-4 is not open source, we use EleutherAI/gpt-neo-1.3B as an example.

Preparation:

pip install transformers

Code implementation:

from transformers import pipeline

# Load the generative model
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

# Generate a response
response = generator(prompt, max_length=200, num_return_sequences=1)
print("Generated result:", response[0]["generated_text"])

Note:

  • Model selection: for stronger results, use a paid API (such as OpenAI's GPT-4).

  • Length control: adjust max_length to your needs.

3. Structured output

Want JSON-formatted output? Just specify it in the prompt, and parse the result defensively, as sketched after the code below.

Code implementation:

prompt += "\nFormat the response as JSON with keys: 'product_id', 'reason'."
response = generator(prompt, max_length=200)
print("Structured output:", response[0]["generated_text"])
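Generative models often wrap the JSON in extra prose, so it is worth parsing defensively. A minimal sketch (the regex simply grabs the first {...} span, which is an assumption, not a guarantee):

import json
import re

generated = response[0]["generated_text"]
match = re.search(r"\{.*\}", generated, re.DOTALL)  # take the first {...} span
if match:
    try:
        structured = json.loads(match.group())
        print("Parsed JSON:", structured)
    except json.JSONDecodeError:
        print("Model produced malformed JSON; falling back to raw text")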


6. Efficiency optimization: making the system faster and stronger

1. Model Optimization

Distilled model:

Using CLIP Lite:

model = CLIPModel.from_pretrained("asus-uwk/distil-clip-vit-base-patch32")

Quantization acceleration:

from torch.quantization import quantize_dynamic
import torch

model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

2. Index optimization

Search by category:

cursor.execute("SELECT category FROM images WHERE id=?", ("img_001",))category = cursor.fetchone()[0]sub_index = faiss.read_index(f"indices/{category}.faiss")D, I = sub_index.search(query_features, k=5)

3. Caching strategy

Use Redis to cache high-frequency queries:

Pre-preparation:

pip install redis

# Start the Redis server (Redis itself must be installed locally)
redis-server

Code implementation:

import redis
import json

r = redis.Redis(host="localhost", port=6379, db=0)
cache_key = f"retrieval:{hash(str(query_features))}"

if r.exists(cache_key):
    results = json.loads(r.get(cache_key))
else:
    D, I = index.search(query_features, k=5)
    results = [{"id": int(i)} for i in I[0]]  # int() so the IDs are JSON-serializable
    r.setex(cache_key, 86400, json.dumps(results))  # Cache for 24 hours

print("Cached results:", results)


7. End-to-end case: e-commerce product search by image

1. Data Preparation

  • Image path : /tanjp/data/products/*.jpg

  • Metadata : product_id, image_path, category, price, description

2. Batch Feature Extraction

Save the following as extract_features.py:

import argparse
import glob
import numpy as np
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load the model and processor once for the whole batch
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

parser = argparse.ArgumentParser()
parser.add_argument("--input_dir", default="/tanjp/data/products")
parser.add_argument("--output", default="features.npy")
args = parser.parse_args()

images = glob.glob(f"{args.input_dir}/*.jpg")
features = []
for img_path in images:
    image = Image.open(img_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    feat = model.get_image_features(**inputs).detach().numpy()
    features.append(feat[0])

np.save(args.output, np.array(features))

Run:

python extract_features.py --input_dir /tanjp/data/products --output features.npy

3. Service deployment (FastAPI)

Preparation:

pip install fastapi uvicorn

Code implementation:

from fastapi import FastAPI, File, UploadFile
import io

# model, processor, and index are assumed to be loaded at startup,
# exactly as in the earlier sections
app = FastAPI()

@app.post("/search")
async def search(image: UploadFile = File(...)):
    img_bytes = await image.read()
    query_image = Image.open(io.BytesIO(img_bytes)).convert("RGB")
    inputs = processor(images=query_image, return_tensors="pt")
    query_features = model.get_image_features(**inputs).detach().numpy()
    query_features = query_features / np.linalg.norm(query_features, axis=1, keepdims=True)
    D, I = index.search(query_features, k=5)
    return {"results": [int(i) for i in I[0]]}

Run:

uvicorn main:app --host 0.0.0.0 --port 8000

4. Front-end call

<input type="file" id="imageInput" accept="image/*">
<button onclick="search()">Search</button>
<div id="results"></div>

<script>
async function search() {
    const file = document.getElementById("imageInput").files[0];
    const formData = new FormData();
    formData.append("image", file);
    const response = await fetch("http://localhost:8000/search", {
        method: "POST",
        body: formData
    });
    const data = await response.json();
    document.getElementById("results").innerText = JSON.stringify(data.results);
}
</script>


8. Solutions to common problems

When building an image RAG system, you may run into performance bottlenecks, data-management issues, index-tuning problems, and more. Here are some common issues and their solutions.

1. Memory usage is too high. How do we optimize it?

Problem analysis:

  • Storing large numbers of high-dimensional feature vectors makes memory requirements skyrocket.

  • Loading all of the data at retrieval time easily causes OOM (out-of-memory) errors.

Solution:

  • Use LMDB for efficient storage: LMDB (Lightning Memory-Mapped Database) is a high-performance key-value store well suited to large-scale feature vectors.

import lmdb
import numpy as np

# Create the LMDB database
env = lmdb.open("features.lmdb", map_size=10**12)  # up to 1 TB of address space

# Store a feature vector
with env.begin(write=True) as txn:
    txn.put("img_001".encode(), np.random.rand(512).astype(np.float32).tobytes())
  • Load data in batches: avoid loading everything at once; read vectors on demand instead, as sketched below.
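A minimal sketch of such an on-demand read against the LMDB store created above:

# Read back a single vector only when it is actually needed
with env.begin() as txn:
    raw = txn.get("img_001".encode())
    if raw is not None:
        vec = np.frombuffer(raw, dtype=np.float32)  # zero-copy view of the stored bytes
        print(vec.shape)  # (512,)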

2. Retrieval is too slow. How do we speed it up?

Problem analysis:

  • Using IndexFlatIP directly for vector retrieval slows down as the data grows, because every query scans all vectors.

Solution:

  • Optimize the index with IVFFlat:

import  faiss
# 512-dimensional features, 100 cluster centers
quantizer = faiss.IndexFlatIP(512)
index = faiss.IndexIVFFlat(quantizer, 512, 100, faiss.METRIC_INNER_PRODUCT)
index.train(np.random.rand(10000, 512).astype(np.float32))
index.add(np.random.rand(10000, 512).astype(np.float32))
  • Use GPU acceleration: for especially large datasets, the Faiss GPU build improves retrieval efficiency.

res = faiss.StandardGpuResources()
index = faiss.GpuIndexFlatIP(res, 512)

3. How do we keep search results interpretable for the business?

Problem analysis:

  • Retrieval based purely on vector matching may return results that are relevant but make no business sense.

Solution:

  • Post-process with metadata: after retrieval, filter the results a second time, for example:

    • Filter by category: clothing searches shouldn't return shoes.

    • Filter by price: don't show items outside the user's budget.

results = [r for r in retrieved_results if r["category"] == "clothing"]


9. Application scenarios: from creativity to practicality

1. E-commerce: Search products by image + generate personalized recommendations

  • When a user uploads a photo of a piece of clothing, AI can:

  • Retrieve similar items and provide purchase links.

  • Generate recommendation copy, such as "This dress matches your style and is 20% off."

2. Education: Intelligent illustration generation

  • Given a biology textbook, AI can:

  • Retrieve relevant anatomical illustrations. 

  • Generate an illustration that meets teaching needs. 

3. Design Creativity: AI-Assisted Art Creation

  • When a designer uploads a sketch, AI can:

  • Retrieve reference images of similar style. 

  • Generate an advanced version of the design. 


10. Tips for improving efficiency

1. Vector search optimization

  • Use an HNSW index: better suited to high-dimensional vectors than IVFFlat (see the sketch after this list).

  • Hierarchical indexing: use IVFFlat for coarse screening first, then IndexFlatIP for fine re-scoring to speed up computation.
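A minimal HNSW sketch in Faiss under the same 512-dimensional setup; M=32 and efSearch=64 are assumed, typical values rather than tuned ones:

import faiss
import numpy as np

dim, M = 512, 32
hnsw = faiss.IndexHNSWFlat(dim, M, faiss.METRIC_INNER_PRODUCT)
hnsw.hnsw.efSearch = 64  # higher = more accurate but slower queries

vecs = np.random.randn(1000, dim).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit length for cosine/IP
hnsw.add(vecs)  # HNSW needs no separate training step

D, I = hnsw.search(vecs[:1], 5)
print(I)  # the first hit should be the query vector itself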

2. Generative Model Acceleration

  • Use DistilGPT-2 instead of GPT-4 to reduce computational overhead. 

  • Quantize the model: use torch.quantization to make model computation more efficient.

from torch.quantization import quantize_dynamic
import torch
import torch.nn as nn

model = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

3. Operational optimization

  • Use Redis to cache high-frequency search results and avoid repeated computation.

  • Update the index incrementally on a schedule to keep the data fresh, as sketched below.
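A minimal sketch of such an incremental refresh, where new_features is a hypothetical (m, 512) array of vectors extracted since the last update:

# Append newly extracted vectors to the existing index and persist it again.
# With IVFFlat, consider retraining periodically if the data distribution drifts.
index = faiss.read_index("image_index_ivf.faiss")
index.add(new_features.astype(np.float32))  # new_features: hypothetical new vectors
faiss.write_index(index, "image_index_ivf.faiss")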


11. Summary

This article has walked you through the complete technical pipeline of image RAG: from data preprocessing to retrieval optimization to generation enhancement, with a detailed hands-on guide so you can quickly build your own AI application.

Whether you want to optimize e-commerce search, enhance educational experiences, or empower creative design, Image RAG can significantly improve the intelligence and practicality of AI-generated output.