Chroma, an open source AI native vector database essential for RAG implementation

Explore the AI native vector database Chroma to help implement RAG technology and multimodal retrieval.
Core content:
1. The core concepts of Chroma database and its application advantages in RAG technology
2. Chroma installation and basic configuration methods
3. Create, read, update, and delete (CRUD) operations in Chroma
1. Chroma Core Concepts and Advantages
1. What is Chroma?
Chroma is an open source vector database designed for efficient storage and retrieval of high-dimensional vector data. Its core capability is semantic similarity search: it quickly matches embedding vectors for text, images, and other data, and is widely used in scenarios such as large model context enhancement (RAG), recommendation systems, and multimodal retrieval. Unlike traditional databases, Chroma measures data relevance by vector distance (such as cosine similarity and Euclidean distance) rather than keyword matching.
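To make "relevance by vector distance" concrete, here is a minimal pure-Python sketch of the two measures named above. The vectors are made-up toy values, and the helper names are illustrative, not part of Chroma.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 2-dimensional "embeddings" (illustrative values only)
query = [1.0, 0.0]
doc_same_direction = [2.0, 0.0]
doc_orthogonal = [0.0, 1.0]

print(cosine_similarity(query, doc_same_direction))
print(cosine_similarity(query, doc_orthogonal))
print(euclidean_distance(query, doc_same_direction))
```

Note that the two measures can disagree: doc_same_direction points the same way as the query (cosine similarity 1.0) yet is not at distance zero, which is one reason the choice of metric matters for a collection.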
GitHub address:
https://github.com/chroma-core/chroma
Official documentation:
https://docs.trychroma.com/
2. Core advantages
Lightweight and easy to use: Chroma is embedded in your application as a Python/JS package, with no separate deployment required, which suits rapid prototyping.
Flexible integration: supports custom embedding models (such as OpenAI and HuggingFace) and is compatible with frameworks such as LangChain.
High-performance retrieval: indexes are built with the HNSW algorithm, targeting millisecond-level responses over millions of vectors.
Multiple storage modes: memory mode is used for development and debugging, and persistence mode stores data on disk for production environments.
2. Installation and basic configuration
1. Install Chroma
Chroma supports Windows and Ubuntu and requires Python >= 3.9.
Create a virtual environment and install:
# Create a virtual environment
conda create -n chromadb python==3.10
# Activate it
conda activate chromadb
# Install chromadb
pip install chromadb
Note: Chroma is a local embedded database by default and does not natively support remote access like traditional databases (such as the client-server model of PostgreSQL).
That said, Chroma also provides a client-server mode. Start the server as follows:
# Start the server; the default port is 8000
chroma run --path /db_path
2. Initialize the client
Memory mode (debugging, experimental scenarios):
import chromadb
client = chromadb.Client()
Persistence mode (production environment):
When creating the client, you can configure the local storage path:
import chromadb
# Save data to a local directory; use an absolute path
client = chromadb.PersistentClient(path="/path/to/save")
Client-Server mode client:
The first two modes are local: the Chroma database and the client code must be on the same machine. In client-server mode, the server can be deployed independently and accessed through HttpClient.
import chromadb
chroma_client = chromadb.HttpClient(host='localhost', port=8000)
3. Add, delete, modify and query operations
1. Create a Collection
A collection is the basic unit for managing data in Chroma, similar to a table in a traditional database. The name of a collection has the following constraints:
The name must be between 3 and 63 characters long.
The name must start and end with a lowercase letter or number and can contain dots, dashes, and underscores.
The name must not contain two consecutive dots.
The name cannot be a valid IP address.
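The naming rules above can be checked before calling Chroma. The following is a rough sketch that encodes only the four constraints as listed here; the function name is hypothetical, and Chroma's own validation remains authoritative (it may differ between versions).

```python
import ipaddress
import re

# First and last character: lowercase letter or digit.
# Middle characters (1 to 61 of them): also dots, dashes, underscores.
# Total length is therefore 3 to 63 characters.
_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")

def is_valid_collection_name(name: str) -> bool:
    """Check a name against the constraints listed in this article."""
    if not _NAME_PATTERN.fullmatch(name):
        return False
    if ".." in name:  # no two consecutive dots
        return False
    try:
        ipaddress.ip_address(name)  # a parseable IP address is rejected
    except ValueError:
        return True
    return False

for candidate in ["my_collection", "ab", "a..b", "192.168.0.1", "_hidden"]:
    print(candidate, is_valid_collection_name(candidate))
```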
# Create (emb_fn is an embedding function you define; see the custom example below)
collection = client.create_collection(name="my_collection", embedding_function=emb_fn)
# Get an existing collection
collection = client.get_collection(name="my_collection", embedding_function=emb_fn)
# Get the collection if it exists; otherwise create it
collection = client.get_or_create_collection(name="my_collection2")
If no embedding function is provided, Chroma falls back to its default Sentence Transformers embedding function, which uses the small all-MiniLM-L6-v2 model and mainly targets English text. For other languages, you will usually want to define a custom embedding function:
import chromadb
from sentence_transformers import SentenceTransformer

class SentenceTransformerEmbeddingFunction:
    def __init__(self, model_path: str, device: str = "cuda"):
        self.model = SentenceTransformer(model_path, device=device)

    def __call__(self, input: list[str]) -> list[list[float]]:
        if isinstance(input, str):
            input = [input]
        return self.model.encode(input, convert_to_numpy=True).tolist()

# Create/load the collection (with the custom embedding function)
embed_model = SentenceTransformerEmbeddingFunction(
    model_path=r"D:\Test\LLMTrain\testllm\llm\BAAI\bge-m3",
    device="cuda"  # No GPU? Change to "cpu"
)
# Create a client and collection
client = chromadb.Client()
collection = client.create_collection("my_knowledge_base",
                                      metadata={"hnsw:space": "cosine"},
                                      embedding_function=embed_model)
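The essential contract of a custom embedding function is small: it is called with a list of strings and returns one equal-length list of floats per string. The sketch below shows that contract with a deterministic hash-based stand-in that needs no model download. ToyHashEmbeddingFunction is a hypothetical name, its vectors carry no semantic meaning, and whether a given Chroma version accepts a plain callable class without extra methods is an assumption to verify.

```python
import hashlib
import math

class ToyHashEmbeddingFunction:
    """Deterministic stand-in for testing; NOT semantically meaningful."""

    def __init__(self, dim: int = 8):
        if not 1 <= dim <= 32:  # a SHA-256 digest has 32 bytes
            raise ValueError("dim must be between 1 and 32")
        self.dim = dim

    def __call__(self, input: list[str]) -> list[list[float]]:
        if isinstance(input, str):
            input = [input]
        vectors = []
        for text in input:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            raw = [digest[i] / 255.0 for i in range(self.dim)]
            # Normalize to unit length; fall back to 1.0 for an all-zero vector
            norm = math.sqrt(sum(v * v for v in raw)) or 1.0
            vectors.append([v / norm for v in raw])
        return vectors

toy_fn = ToyHashEmbeddingFunction(dim=8)
toy_vectors = toy_fn(["What is RAG?", "Vector databases store embeddings"])
print(len(toy_vectors), len(toy_vectors[0]))
```

A stand-in like this is only useful for exercising the write and query plumbing; for real retrieval quality, a trained model such as the bge-m3 example above is needed.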
When creating a collection, you can configure the following parameters:
name: the name of the collection; required.
embedding_function: the embedding function to use; if omitted, the default embedding model is used.
metadata: optional settings, such as the indexing method.
from datetime import datetime
collection = client.create_collection(
    name="my_collection",
    embedding_function=emb_fn,
    metadata={
        "description": "my first Chroma collection",
        "created": str(datetime.now())
    }
)
There are some common methods for collections:
peek() - Returns a list of the first 10 items in a collection.
count() - Returns the number of items in the collection.
modify() - Renames a collection.
collection.peek()
collection.count()
collection.modify(name="new_name")
2. Write data
When writing data, configure the following parameters:
documents: the original text chunks.
metadatas: metadata describing each text chunk, as key-value pairs.
ids: the unique identifier of each text chunk; every document must have its own unique id. Adding the same id twice keeps only the value that was stored first.
embeddings: precomputed vectors for the text chunks, which are written directly. If omitted, the collection's specified or default embedding function vectorizes the documents on write.
collection.add(
    documents=["lorem ipsum...", "doc2", "doc3", ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    ids=["id1", "id2", "id3", ...]
)
or
collection.add(
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    ids=["id1", "id2", "id3", ...]
)
3. Modify data
Provide the ids (unique identifiers) of the items to update, along with the new content.
collection.update(
    ids=["doc1"],  # Use an existing ID
    documents=["RAG is a retrieval enhancement generation technology 222"]
)
4. Upsert data
Chroma also supports upsert operations, which update existing items or add them if they do not yet exist.
collection.upsert(
    ids=["id1", "id2", "id3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents=["doc1", "doc2", "doc3", ...],
)
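The difference between add, update, and upsert can be modeled with a toy in-memory class. This is only a sketch of the semantics described in this article, not Chroma's implementation: ToyCollection is a hypothetical name, and the choice to silently skip unknown ids in update is an assumption.

```python
class ToyCollection:
    """Toy model of id handling; stores documents only."""

    def __init__(self):
        self._docs = {}

    def add(self, ids, documents):
        for id_, doc in zip(ids, documents):
            # An id that already exists keeps the value stored first
            self._docs.setdefault(id_, doc)

    def update(self, ids, documents):
        for id_, doc in zip(ids, documents):
            # Assumption: unknown ids are skipped rather than created
            if id_ in self._docs:
                self._docs[id_] = doc

    def upsert(self, ids, documents):
        for id_, doc in zip(ids, documents):
            # Overwrite if present, insert if absent
            self._docs[id_] = doc

    def get(self, id_):
        return self._docs.get(id_)

toy = ToyCollection()
toy.add(ids=["id1"], documents=["first version"])
toy.add(ids=["id1"], documents=["second version"])   # ignored: id1 exists
toy.update(ids=["id2"], documents=["never stored"])  # skipped: id2 unknown
toy.upsert(ids=["id1", "id2"], documents=["third version", "new item"])
print(toy.get("id1"), "|", toy.get("id2"))
```

In practice, upsert is the convenient choice for re-running an ingestion job, because the same call works whether or not the ids were written before.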
5. Delete data
Chroma supports deleting items from a collection by ID using delete. The embeddings, documents, and metadata associated with each item are deleted as well.
delete also accepts a where filter. If no ids are provided, it deletes every item in the collection that matches the where filter.
# Delete by ids
collection.delete(ids=["doc1"])
# Delete with a where condition
collection.delete(
    ids=["id1", "id2", "id3", ...],
    where={"chapter": "20"}
)
6. Query data
(1) Query all data
all_docs = collection.get()
print("All documents in the collection:", all_docs)
(2) Query by ids
Items can be retrieved from a collection by ID using get, optionally combined with a where filter:
collection.get(
    ids=["id1", "id2", "id3", ...],
    where={"style": "style1"}
)
(3) Query Embedding
Chroma collections can be queried in a variety of ways using the query method, for example with query_embeddings.
collection.query(
    query_embeddings=[[11.1, 12.1, 13.1], [1.1, 2.3, 3.2], ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains": "search_string"}
)
For each query embedding, the query returns the n_results closest matches, ordered from nearest to farthest.
An optional where filter dictionary restricts results by the metadata associated with each document.
In addition, where_document accepts a filter dictionary that restricts results by document content.
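Both filter dictionaries support operators beyond simple equality. The sketch below lists a few common shapes as plain Python dictionaries. The field names (chapter, page, source) and values are hypothetical, and the operator set should be checked against the documentation for your Chroma version.

```python
# Metadata filters, passed as where=...
equal_short = {"chapter": "3"}                       # shorthand for $eq
equal_explicit = {"chapter": {"$eq": "3"}}
numeric_range = {"page": {"$gte": 10}}               # hypothetical numeric field
one_of = {"source": {"$in": ["tech_doc", "tutorial"]}}
combined = {"$and": [{"chapter": "3"}, {"source": {"$ne": "draft"}}]}

# Document-content filters, passed as where_document=...
contains = {"$contains": "search_string"}
excludes = {"$not_contains": "deprecated"}
either = {"$or": [{"$contains": "RAG"}, {"$contains": "retrieval"}]}

for name, flt in [("combined", combined), ("either", either)]:
    print(name, flt)
```

These dictionaries are accepted by query, get, and delete alike, so a filter tested with get can be reused unchanged in a query.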
(4) Query similar documents
You can also pass a set of query texts query_texts. Chroma will first embed each query text with the collection's embedding function and then execute the query using the resulting embeddings.
# Query similar documents
results = collection.query(
    query_texts=["What is RAG technology?"],
    n_results=3
)
print("Query results", results)
Query result configuration
When using get or query, the include parameter specifies which data to return: embeddings, documents, and metadatas. include is a list and can take multiple values.
For query, distances are also returned by default.
For performance reasons, embeddings are not returned by default and appear as None. To return them, add embeddings to include.
IDs are always returned.
The return value contains the included parameter, which indicates the types of data returned this time.
The embeddings will be returned as a 2D NumPy array.
# Only get documents and ids
collection.get(
    include=["documents"]
)
collection.query(
    query_embeddings=[[11.1, 12.1, 13.1], [1.1, 2.3, 3.2], ...],
    include=["documents"]
)
Example of query results
{
    'ids': [['doc1', 'doc3', 'doc2']],
    'embeddings': None,
    'documents': [['RAG is a retrieval-enhanced generation technology', 'Three Heroes Fighting Lü Bu', 'Vector database stores embedded representations of documents']],
    'uris': None,
    'included': ['metadatas', 'documents', 'distances'],
    'data': None,
    'metadatas': [[{'source': 'tech_doc'}, {'source': 'tutorial1'}, {'source': 'tutorial'}]],
    'distances': [[0.2373753786087036, 0.7460092902183533, 0.7651787400245667]]
}
4. Practical Operation
Insert a batch of data into the vector database, and then find similar data from the vector database based on a question.
1. Install the packages
pip install sentence_transformers
pip install modelscope
2. Download the Embedding model to your local computer
# Download the model
from modelscope import snapshot_download
model_dir = snapshot_download('BAAI/bge-m3', cache_dir=r"D:\Test\LLMTrain\testllm\llm")
3. Core logic: writing data and querying similarity
import chromadb
from sentence_transformers import SentenceTransformer

class SentenceTransformerEmbeddingFunction:
    def __init__(self, model_path: str, device: str = "cuda"):
        self.model = SentenceTransformer(model_path, device=device)

    def __call__(self, input: list[str]) -> list[list[float]]:
        if isinstance(input, str):
            input = [input]
        return self.model.encode(input, convert_to_numpy=True).tolist()

# Create/load the collection (with the custom embedding function)
embed_model = SentenceTransformerEmbeddingFunction(
    model_path=r"D:\Test\LLMTrain\testllm\llm\BAAI\bge-m3",
    device="cpu"  # Use "cuda" if a GPU is available
)
# Create the client and collection
client = chromadb.PersistentClient(path=r"D:\Test\LLMTrain\chromadb_test\chroma_data")
collection = client.get_or_create_collection("my_knowledge_base",
                                             metadata={"hnsw:space": "cosine"},
                                             embedding_function=embed_model)
# Add documents
collection.add(
    documents=["Embedding representation of documents stored in vector databases", "Three Heroes Fighting Lu Bu", "RAG is a retrieval enhancement generation technology"],
    metadatas=[{"source": "tech_doc"}, {"source": "tutorial"}, {"source": "tutorial1"}],
    ids=["doc1", "doc2", "doc3"]
)
# Query similar documents
results = collection.query(
    query_texts=["What is RAG technology?"],
    n_results=3
)
print("Query results", results)
Running the script prints:
Query results
{
    'ids': [['doc3', 'doc2', 'doc1']],
    'embeddings': None,
    'documents': [['RAG is a retrieval enhancement generation technology', 'Three Heroes Fighting Lu Bu', 'Embedding representation of documents stored in vector databases']],
    'uris': None,
    'included': ['metadatas', 'documents', 'distances'],
    'data': None,
    'metadatas': [[{'source': 'tutorial1'}, {'source': 'tutorial'}, {'source': 'tech_doc'}]],
    'distances': [[0.2373753786087036, 0.7460092902183533, 0.7651787400245667]]
}
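Because query returns one nested list per query text, the result is usually easier to work with after pairing up the parallel lists. The sketch below does that for the ids, metadatas, and distances shown above. flatten_hits is a hypothetical helper, and converting distance to similarity as 1 - distance assumes the collection uses the cosine space, as configured in this example.

```python
# Subset of the result printed above, copied as a literal for illustration
results = {
    "ids": [["doc3", "doc2", "doc1"]],
    "metadatas": [[{"source": "tutorial1"}, {"source": "tutorial"}, {"source": "tech_doc"}]],
    "distances": [[0.2373753786087036, 0.7460092902183533, 0.7651787400245667]],
}

def flatten_hits(results, query_index=0):
    """Pair up the parallel lists for one query into a list of dicts."""
    rows = zip(
        results["ids"][query_index],
        results["metadatas"][query_index],
        results["distances"][query_index],
    )
    return [
        {
            "id": id_,
            "source": meta["source"],
            "distance": dist,
            "similarity": 1.0 - dist,  # valid for cosine distance only
        }
        for id_, meta, dist in rows
    ]

hits = flatten_hits(results)
for hit in hits:
    print(hit["id"], hit["source"], round(hit["similarity"], 3))
```

The smallest distance comes first, so doc3 (the RAG sentence) is the best match for the question, while the other two documents are clearly farther away.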