Images can also be added to the knowledge base through RAG

Written by Jasper Cole
Updated on: June 24, 2025
Recommendation

Explore how RAG can be applied to images and multimodal data, opening up a new kind of knowledge base.

Core content:
1. The multimodal embedding capabilities of Cohere Embed v4 and how they improve RAG performance
2. How Cohere Embed v4 and Gemini Flash 2.5 divide the work and collaborate
3. How to build a vision-based retrieval-augmented generation system for intelligent retrieval and answer generation over images and text


We know that Retrieval-Augmented Generation (RAG) effectively alleviates the knowledge limitations of large models in specialized domains by combining external knowledge bases with generative models. Traditional knowledge bases are mostly text-based and usually rely on plain-text embeddings for semantic search and content retrieval.

However, as demand for multimodal data grows and document-processing scenarios become more complex, traditional methods often hit performance bottlenecks when handling mixed-format documents (such as PDFs containing text, images, and tables) or long-context content. Cohere Embed v4 offers an innovative answer to these challenges: its multimodal embedding capability and long-context support significantly improve the performance and applicability of RAG systems.

Cohere Embed v4, released on April 15, 2025, is an enterprise-grade multimodal embedding model. It can handle text, images, and mixed formats (such as PDFs), which makes it well suited to scenarios that involve complex documents. Its key features are as follows:

  • Multimodal support: Documents containing text and images, such as PDFs and presentation slides, can be embedded uniformly.
  • Long context: supports context lengths of up to 128K tokens (roughly 200 pages), suitable for long documents.
  • Multilingual capability: covers more than 100 languages and supports cross-language search without needing to detect or translate the language.
  • Security and efficiency: optimized for industries such as finance and healthcare, it can be deployed in a virtual private cloud or on-premises, and it offers compressed embeddings that can cut storage costs by up to 83% (a sketch of this option follows below).
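To illustrate the compressed-embedding option, here is a minimal sketch of requesting int8 embeddings alongside float ones through the Cohere SDK. It assumes the embedding_types parameter accepts "int8" in addition to "float"; the sample text and environment variable are illustrative, not part of the original article.

# Minimal sketch: request compressed (int8) embeddings to reduce storage.
# Assumes `pip install cohere` and an API key in COHERE_API_KEY; the sample text is illustrative.
import os
import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

resp = co.embed(
    model="embed-v4.0",
    input_type="search_document",
    embedding_types=["float", "int8"],  # int8 vectors take roughly 4x less space than float32
    texts=["Quarterly revenue grew 12% year over year."],
)

print(len(resp.embeddings.float[0]))  # full-precision vector
print(len(resp.embeddings.int8[0]))   # compressed vector: same dimension, smaller dtype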

Next, let's put Cohere Embed v4 to the test. As an embedding model, it needs to work together with a large model such as Gemini Flash 2.5.

First, let's look at how Cohere Embed v4 and Gemini Flash 2.5 relate to each other and how exactly they collaborate on this task.

We want to implement a vision-based Retrieval-Augmented Generation (RAG) system. In this system, Cohere Embed v4 and Gemini Flash 2.5 play different roles and work together to complete the task:

  • Cohere Embed v4  is responsible for the retrieval part. It converts images and text into vector representations (embeddings) and then uses these embeddings to search for images that are most relevant to the user's question.
  • Gemini Flash 2.5  is responsible for the generation part. It is a powerful visual language model (VLM) that can understand images and text and generate answers based on them.

How do they work together to accomplish the task? Here is the process of their collaboration:

  1. Image embedding: first, Cohere Embed v4 encodes all images into embeddings, which are then stored.
  2. Question embedding: when a user asks a question, Cohere Embed v4 also encodes the question into an embedding.
  3. Retrieval: the system compares the question embedding with the image embeddings to find the images most relevant to the question (sketched in code right after this list).
  4. Answer generation: the retrieved image and the user's question are sent to Gemini Flash 2.5, which generates the final answer based on both.
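In code, the retrieval step is just a nearest-neighbor search in embedding space. Here is a minimal NumPy sketch of that step; the random vectors are placeholders standing in for the real embeddings produced later in the article.

# Minimal sketch of the retrieval step: pick the stored image whose embedding
# is closest to the question embedding. The random data below is a placeholder.
import numpy as np

doc_embeddings = np.random.rand(9, 1536)  # placeholder: one row per embedded image
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

query_emb = np.random.rand(1536)          # placeholder: the embedded question
query_emb /= np.linalg.norm(query_emb)

scores = doc_embeddings @ query_emb       # cosine similarity per image (unit vectors)
best_idx = int(np.argmax(scores))         # index of the most relevant image
print("Most relevant image index:", best_idx)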

Summary

In short, Cohere Embed v4 acts as the information retriever, finding images relevant to the question, while Gemini Flash 2.5 acts as the answer generator, producing answers based on the retrieved images and the question. Together they implement a vision-based RAG system that lets users obtain information from images by asking questions in natural language.

The experimental code below is mainly meant as a reference for the overall approach when you actually build a knowledge base from images or PDFs.
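For PDFs, one workable approach is to render each page as an image and push the pages through the same image-embedding pipeline shown below. This is a hedged sketch assuming the third-party pdf2image package (which needs a local poppler installation); the file name is illustrative.

# Sketch: convert a PDF into per-page images so they can be embedded like any other image.
# Assumes `pip install pdf2image` plus poppler; "report.pdf" is an illustrative file name.
import os
from pdf2image import convert_from_path

os.makedirs("img", exist_ok=True)
pages = convert_from_path("report.pdf", dpi=200)  # one PIL image per page

for i, page in enumerate(pages):
    page.save(os.path.join("img", f"report_page_{i + 1}.png"), "PNG")
# The saved PNGs can now be embedded with the same Embed v4 calls used below.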

Experiment Code

The following code demonstrates a purely visual RAG approach that works even for complex infographics. It consists of two parts:

  • Cohere Embed v4, Cohere's state-of-the-art text and image retrieval model. It lets us embed and search complex images such as infographics without any preprocessing.
  • A vision LLM: we use Google's Gemini Flash 2.5, which accepts images and text questions as input and can answer questions based on them.

First, let's look at a question-and-answer example once the system is built.

Code:

# Define the query
question = "Please explain the picture of geese in Chinese"

# Search for the most relevant image
top_image_path = search(question)

# Answer the query using the retrieved image
answer(question, top_image_path)

The answer based on the retrieved image is as follows:

This answer is good, even though it turns out the image is upside down. Cohere gets the credit for finding the image in the library from the question, and Gemini gets the credit for interpreting it.

Let’s try another one.

# Define the query
question = "I remember there was a picture with a cat in it. Please explain what that picture is about?"

# Search for the most relevant image
top_image_path = search(question)

# Answer the query using the retrieved image
answer(question, top_image_path)

The answer is as follows:

The installation steps and full code follow.

Visit cohere.com, register and get an API key.

pip install -q cohere
# Create the Cohere API client. Get your API key from cohere.com
import cohere

cohere_api_key = "<<YOUR_COHERE_KEY>>"  # Replace with your Cohere API key
co = cohere.ClientV2(api_key=cohere_api_key)

Go to Google AI Studio to generate an API key for Gemini. Then, install the Google Generative AI SDK.

pip install -q google-genai
from google import genai

gemini_api_key = "<<YOUR_GEMINI_KEY>>"  # Replace with your Gemini API key
client = genai.Client(api_key=gemini_api_key)
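Optionally, you can run a quick sanity check that the Gemini key works before building the image index. This is a small sketch; the model name is the same one used later in this article, and the prompt text is illustrative.

# Optional sanity check: confirm the Gemini client responds.
# The model name matches the one used later in this article; the prompt is illustrative.
test_response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="Reply with the single word: ready",
)
print(test_response.text)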
import requests
import os
import io
import base64
import PIL.Image
import tqdm
import time
import numpy as np

# Some helper functions to resize images and to convert them to base64 format
max_pixels = 1568 * 1568  # Max resolution for images

# Resize too-large images
def resize_image(pil_image):
    org_width, org_height = pil_image.size

    # Resize image if too large
    if org_width * org_height > max_pixels:
        scale_factor = (max_pixels / (org_width * org_height)) ** 0.5
        new_width = int(org_width * scale_factor)
        new_height = int(org_height * scale_factor)
        pil_image.thumbnail((new_width, new_height))

# Convert an image to a base64 string before sending it to the API
def base64_from_image(img_path):
    pil_image = PIL.Image.open(img_path)
    img_format = pil_image.format if pil_image.format else "PNG"

    resize_image(pil_image)

    with io.BytesIO() as img_buffer:
        pil_image.save(img_buffer, format=img_format)
        img_buffer.seek(0)
        img_data = f"data:image/{img_format.lower()};base64," + base64.b64encode(img_buffer.read()).decode("utf-8")

    return img_data

# List of images, both local and remote.
# The local test images (test*.webp) are expected to already exist in ./img.
images = {
    "test1.webp": "./img/test1.webp",
    "test2.webp": "./img/test2.webp",
    "test3.webp": "./img/test3.webp",
    "tesla.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbef936e6-3efa-43b3-88d7-7ec620cdb33b_2744x1539.png",
    "netflix.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23bd84c9-5b62-4526-b467-3088e27e4193_2744x1539.png",
    "nike.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5cd33ba-ae1a-42a8-a254-d85e690d9870_2741x1541.png",
    "google.png": "https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F395dd3b9-b38e-4d1f-91bc-d37b642ee920_2741x1541.png",
    "accenture.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08b2227c-7dc8-49f7-b3c5-13cab5443ba6_2741x1541.png",
    "tecent.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ec8448c-c4d1-4aab-a8e9-2ddebe0c95fd_2741x1541.png",
}

# Download the images and calculate an embedding for each one
img_folder = "img"
os.makedirs(img_folder, exist_ok=True)

img_paths = []
doc_embeddings = []
for name, url in tqdm.tqdm(images.items()):
    img_path = os.path.join(img_folder, name)
    img_paths.append(img_path)

    # Download the image if it is not already on disk
    if not os.path.exists(img_path):
        response = requests.get(url)
        response.raise_for_status()

        with open(img_path, "wb") as fOut:
            fOut.write(response.content)

    # Get the base64 representation of the image
    api_input_document = {
        "content": [
            {"type": "image", "image": base64_from_image(img_path)},
        ]
    }

    # Call the Embed v4.0 model with the image information
    api_response = co.embed(
        model="embed-v4.0",
        input_type="search_document",
        embedding_types=["float"],
        inputs=[api_input_document],
    )

    # Append the embedding to our doc_embeddings list
    emb = np.asarray(api_response.embeddings.float[0])
    doc_embeddings.append(emb)

doc_embeddings = np.vstack(doc_embeddings)
print("\n\nEmbeddings shape:", doc_embeddings.shape)

Running this prints the shape of the image embeddings: Embeddings shape: (9, 1536), i.e. nine images, each represented by a 1536-dimensional vector.
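Because these embeddings are the knowledge base, it is often worth persisting them so they are not recomputed on every run. Here is a minimal sketch using NumPy and JSON; the file names are illustrative, and doc_embeddings and img_paths come from the code above.

# Sketch: save the image index to disk and reload it later.
# File names are illustrative; doc_embeddings and img_paths come from the code above.
import json
import numpy as np

np.save("doc_embeddings.npy", doc_embeddings)  # the embedding matrix
with open("img_paths.json", "w") as f:
    json.dump(img_paths, f)                    # keep image paths aligned with the rows

# Later, or in another session:
doc_embeddings = np.load("doc_embeddings.npy")
with open("img_paths.json") as f:
    img_paths = json.load(f)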

The following shows a simple pipeline of vision-based RAG (retrieval-augmented generation).

  1. First we execute search(): we compute an embedding for our question. We can then use that embedding to search through our library of pre-embedded images to find the most relevant image, and then return that image.

  2. In answer(), the question and the image are sent to Gemini to get the final answer to the question.

# Search allows us to find relevant images for a given question using Cohere Embed v4
def search(question, max_img_size=800):
    # Compute the embedding for the query
    api_response = co.embed(
        model="embed-v4.0",
        input_type="search_query",
        embedding_types=["float"],
        texts=[question],
    )

    query_emb = np.asarray(api_response.embeddings.float[0])

    # Compute cosine similarities
    cos_sim_scores = np.dot(query_emb, doc_embeddings.T)

    # Get the most relevant image
    top_idx = np.argmax(cos_sim_scores)

    # Show the image
    print("Question:", question)

    hit_img_path = img_paths[top_idx]

    print("Most relevant image:", hit_img_path)
    image = PIL.Image.open(hit_img_path)
    max_size = (max_img_size, max_img_size)  # Adjust the size as needed
    image.thumbnail(max_size)
    display(image)  # display() is available in notebook environments (IPython)
    return hit_img_path

# Answer the question based on the information from the image
# Here we use Gemini 2.5 Flash as a powerful vision LLM
def answer(question, img_path):
    prompt = [f"""Answer the question based on the following image.
Don't use markdown.
Please provide enough context for your answer.

Question: {question}""", PIL.Image.open(img_path)]

    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-04-17",
        contents=prompt,
    )

    answer = response.text
    print("LLM Answer:", answer)
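One note on the retrieval math in search(): np.dot equals cosine similarity only when the vectors have unit length. If you are unsure whether the returned embeddings are normalized, a defensive option is to normalize them yourself before comparing. This is a safeguard sketch, not a documented requirement of Embed v4; doc_embeddings and query_emb stand for the arrays used in the code above.

# Defensive sketch: normalize embeddings so the dot product is exactly cosine similarity.
# This is a safeguard, not a documented requirement of Embed v4.
import numpy as np

def normalize_rows(matrix):
    # Divide each row by its L2 norm, guarding against division by zero.
    norms = np.linalg.norm(matrix, axis=-1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)

doc_embeddings_norm = normalize_rows(doc_embeddings)
query_emb_norm = query_emb / max(np.linalg.norm(query_emb), 1e-12)
cos_sim_scores = doc_embeddings_norm @ query_emb_norm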

Now we can ask questions about the images.

# Define the query
question = "Please explain Nike's data in Chinese"

# Search for the most relevant image
top_image_path = search(question)

# Use the image to answer the query
answer(question, top_image_path)

Here is the answer: