In today's AI era, Retrieval-Augmented Generation (RAG) has become a key technique for improving the quality of answers from large language models. This article explores how to combine knowledge graphs and vector databases to build a smarter graph RAG system that makes AI answers more accurate and reliable.
What is GraphRAG and why do we need it?
Imagine that traditional RAG is a diligent librarian with limited vision who can only find relevant books by matching keywords. Graph RAG is more like a knowledgeable expert who not only knows the content of each book, but also understands the intricate connections between them.
Traditional RAG can retrieve relevant fragments from massive document collections, but it sees each fragment in isolation: the content is visible, while the connections between pieces of content are not. Graph RAG opens the door to these data relationships by introducing the "magic key" of the knowledge graph, allowing the AI to understand not only "what" but also "why" and "how", and so to provide more comprehensive and in-depth answers.
There are currently three main ways to implement graph RAG: vector database search, knowledge graph retrieval, and a hybrid of the two.
Experimental comparison: advantages and disadvantages of three methods
We take an e-commerce product recommendation system as an example and compare how the three methods perform in semantic search, similarity calculation and RAG.
Method 1: Vector database search
First, we vectorize product descriptions and user reviews and store them in the Milvus vector database:
    from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

    # Assumes a Milvus connection is already open (pymilvus connections.connect),
    # and that df and get_embedding are defined elsewhere.

    # Define data schema
    collection_name = "products"
    dim = 1536  # OpenAI embedding dimension
    fields = [
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("title", DataType.VARCHAR, max_length=200),
        FieldSchema("description", DataType.VARCHAR, max_length=2000),
        FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=dim),
    ]

    # Create collection
    collection = Collection(name=collection_name, schema=CollectionSchema(fields))

    # Write data, one row per insert (column-based: one list per field)
    for index, row in df.iterrows():
        embedding = get_embedding(row.title + " " + row.description)
        collection.insert([[index], [row.title], [row.description], [embedding]])
    collection.flush()  # make the inserted rows durable and searchable
Semantic search test: When searching for "lightweight waterproof sports shoes", related products are returned:
- "Ultra-light breathable running shoes"
- "Waterproof outdoor hiking shoes"
- "Multifunctional sports training shoes"
Here, the vector database demonstrates good semantic understanding: it finds functionally related products even when their descriptions do not match the query wording exactly.
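To make the ranking step concrete, here is a minimal sketch of what a similarity search computes, using hand-made three-dimensional vectors in place of real embeddings. The vectors, the "Leather dress shoes" entry and the helper names are illustrative assumptions, not the article's data, and a real vector database such as Milvus would typically use an approximate index rather than this brute-force scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": the axes loosely stand for (lightweight, waterproof, formal).
catalog = {
    "Ultra-light breathable running shoes": [0.9, 0.2, 0.0],
    "Waterproof outdoor hiking shoes": [0.3, 0.9, 0.0],
    "Leather dress shoes": [0.1, 0.1, 0.9],  # hypothetical unrelated product
}
query_vector = [0.7, 0.7, 0.0]  # stands in for "lightweight waterproof sports shoes"

def top_k(query, items, k=2):
    """Return the k titles whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query, vec), title) for title, vec in items.items()]
    scored.sort(reverse=True)
    return [title for _, title in scored[:k]]

results = top_k(query_vector, catalog)
print(results)
```

Neither of the two returned titles contains both query words, yet both rank above the unrelated product because their vectors point in a similar direction to the query.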
When the user asks "Recommend shoes for running in the rain", the system retrieves relevant products and generates suggestions:
Here are some recommended shoes for running in the rain:
- Waterproof and breathable running shoes XYZ use special rubber outsoles to provide excellent grip
- All-weather sports shoes ABC are equipped with water-repellent fabrics and are lightweight and suitable for long-distance running
- Professional cross-country running shoes DEF have a drainage design that can dry quickly even when stepping on water
However, we discovered a problem: the vector database may return products that look similar on the surface but do not match functionally (such as fashionable casual shoes). These extra results cause "context pollution" and make the recommendations generated by the LLM less precise.
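The sketch below illustrates how context pollution arises, assuming a simple prompt builder that pastes every retrieved product into the LLM context. The `build_prompt` helper, the prompt wording and the product descriptions are hypothetical and only serve to show the mechanism.

```python
def build_prompt(question, retrieved):
    """Paste every retrieved product into the context, with no functional check."""
    context = "\n".join(f"- {p['title']}: {p['description']}" for p in retrieved)
    return (
        "Answer the question using only the products listed below.\n"
        f"Products:\n{context}\n"
        f"Question: {question}"
    )

# Hypothetical top results of a vector search; the second one is a mismatch.
retrieved = [
    {"title": "Waterproof and breathable running shoes XYZ",
     "description": "Rubber outsole with strong grip on wet roads."},
    {"title": "Fashionable casual shoes",
     "description": "Sporty look for everyday wear; not water-resistant."},
]

prompt = build_prompt("Recommend shoes for running in the rain", retrieved)
print(prompt)
```

Because the builder trusts the similarity ranking, the casual shoes reach the LLM alongside the genuine match, and the model has to work around them when composing its answer.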
Method 2: Knowledge Graph Retrieval
Next, we build the same data into a knowledge graph:
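    from rdflib import Graph, Literal, Namespace, RDF

    # Assumed vocabulary under http://example.org/; adjust to your own namespace
    EX = Namespace("http://example.org/")
    Product, Feature = EX["Product"], EX["Feature"]
    name, description, hasFeature = EX["name"], EX["description"], EX["hasFeature"]
    g = Graph()

    # Create entities and relationships
    # (product_uri, row and features come from the loop over the product data)
    g.add((product_uri, RDF.type, Product))
    g.add((product_uri, name, Literal(row['title'])))
    g.add((product_uri, description, Literal(row['description'])))

    # Add product attributes and classification relationships
    for feature in features:
        feature_uri = create_valid_uri("http://example.org/feature", feature)
        g.add((feature_uri, RDF.type, Feature))
        g.add((product_uri, hasFeature, feature_uri))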
Semantic search test: We searched not only for the "waterproof" tag, but also used the hierarchical relationships in the product ontology to find related concepts such as "water-repellent" and "quick-drying":
    # Get related concepts of waterproofing
    related_concepts = get_all_related_concepts("WaterProof", depth=2)

    # Convert all concepts to URIs for the query
    # (flat_list is related_concepts flattened into a single list; that step is not shown here)
    feature_terms = [convert_to_feature_uri(term) for term in flat_list]
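The article does not show how `get_all_related_concepts` works. One plausible reading of the `depth` argument is a breadth-first expansion over the ontology that stops after a fixed number of hops. The sketch below implements that idea on a small hypothetical ontology held in a dictionary; the function name `related_concepts_within` and every concept other than "WaterProof" are assumptions.

```python
from collections import deque

# Hypothetical slice of a product ontology: concept -> directly related concepts.
ONTOLOGY = {
    "WaterProof": ["WaterRepellent", "QuickDrying"],
    "WaterRepellent": ["CoatedFabric"],
    "QuickDrying": ["MeshUpper"],
    "CoatedFabric": ["PolyurethaneCoating"],
}

def related_concepts_within(start, depth):
    """Breadth-first expansion: concepts reachable from start in at most depth hops."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        concept, dist = queue.popleft()
        if dist == depth:
            continue  # do not expand beyond the requested depth
        for neighbor in ONTOLOGY.get(concept, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append((neighbor, dist + 1))
    return order

expanded = related_concepts_within("WaterProof", depth=2)
print(expanded)
```

With `depth=2` the expansion reaches the two direct neighbors and their neighbors, but stops before "PolyurethaneCoating", which sits three hops away. A larger depth widens recall at the cost of pulling in looser concepts.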
The result is:
- "All-weather waterproof hiking shoes" (labels: waterproof, wear-resistant, outdoor)
- "Quick-drying Water Trekking Shoes" (labels: quick-drying, non-slip, water sports)
- "Gore-Tex Professional Running Shoes" (labels: waterproof, breathable, professional running)
The advantage of knowledge graphs is that the results are highly explainable: we know why each product was selected.
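The explainability claim can be illustrated with a small sketch that records, for each selected product, which of the requested concepts it matched. It uses the product labels listed above as plain Python sets rather than an RDF graph; the "Canvas casual sneakers" entry and the `explain_matches` helper are hypothetical additions for contrast.

```python
products = {
    "All-weather waterproof hiking shoes": {"waterproof", "wear-resistant", "outdoor"},
    "Quick-drying Water Trekking Shoes": {"quick-drying", "non-slip", "water sports"},
    "Gore-Tex Professional Running Shoes": {"waterproof", "breathable", "professional running"},
    "Canvas casual sneakers": {"fashion", "casual"},  # hypothetical non-matching product
}

# "waterproof" plus the related concepts found through the ontology.
wanted = {"waterproof", "water-repellent", "quick-drying"}

def explain_matches(catalog, concepts):
    """Map each selected product to the requested concepts it actually carries."""
    explanations = {}
    for title, labels in catalog.items():
        matched = sorted(labels & concepts)
        if matched:
            explanations[title] = matched
    return explanations

explanations = explain_matches(products, wanted)
for title, reasons in explanations.items():
    print(f"{title}: selected because of {', '.join(reasons)}")
```

Every result comes with the exact label that caused it to be selected, which is the kind of justification a pure similarity score cannot provide.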
Method 3: Hybrid approach
Finally, we combine the advantages of both approaches:
    from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

    # Same schema as in Method 1, plus the product's features and its URI in the knowledge graph.
    # Use a collection name different from Method 1's, since the schema differs.
    fields = [
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("title", DataType.VARCHAR, max_length=200),
        FieldSchema("description", DataType.VARCHAR, max_length=2000),
        FieldSchema("features", DataType.VARCHAR, max_length=500),
        FieldSchema("product_uri", DataType.VARCHAR, max_length=200),
        FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=dim),
    ]
    collection = Collection(name=collection_name, schema=CollectionSchema(fields))

Let's first use vector search to get some preliminary results:

    # Search for shoes suitable for rainy day running
    search_params = {"metric_type": "COSINE", "params": {"nprobe": 10}}
    results = collection.search(
        [get_embedding("Shoes suitable for running in rainy days")],
        "embedding",
        search_params,
        limit=20,
        output_fields=["title", "description", "features", "product_uri"],
    )

Then use the knowledge graph to filter and sort:

    # Keep only products that are truly waterproof and suited to running.
    # The ex: prefix is an assumed namespace; use the one the graph was built with.
    # The three %s placeholders take the candidate product URIs and the two feature URI lists.
    query = """
    PREFIX ex: <http://example.org/>
    SELECT ?product ?title ?description
    WHERE {
        ?product ex:hasFeature ?feature1 .
        ?product ex:hasFeature ?feature2 .
        ?product ex:name ?title .
        ?product ex:description ?description .
        FILTER (?product IN (%s) && ?feature1 IN (%s) && ?feature2 IN (%s))
    }
    """
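The query above leaves three `%s` placeholders to fill. As a minimal sketch of that step, the helper below renders lists of URIs as SPARQL IRI references and substitutes them into a shortened template that keeps only the FILTER clause. The helper name `to_sparql_list` and all example URIs are assumptions for illustration.

```python
def to_sparql_list(uris):
    """Render URIs as a comma-separated list of SPARQL IRI references."""
    return ", ".join(f"<{uri}>" for uri in uris)

# Shortened template with the same three placeholders as the full query.
filter_template = "FILTER (?product IN (%s) && ?feature1 IN (%s) && ?feature2 IN (%s))"

# Hypothetical inputs: candidates from the vector search and two feature groups.
candidate_products = ["http://example.org/product/1", "http://example.org/product/7"]
water_features = [
    "http://example.org/feature/waterproof",
    "http://example.org/feature/water-repellent",
]
running_features = ["http://example.org/feature/running"]

filter_clause = filter_template % (
    to_sparql_list(candidate_products),
    to_sparql_list(water_features),
    to_sparql_list(running_features),
)
print(filter_clause)
```

Plain string substitution is adequate when the URIs come from your own graph and vector store. If any of them could originate from user input, they should be validated first, since formatting untrusted text into a query invites injection.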
This hybrid approach addresses the context pollution problem and ultimately returns shoes that are truly suitable for running on rainy days:
- "GTX waterproof professional marathon running shoes"
- "All-weather water-resistant racing running shoes"
- "Anti-slip and waterproof cross-country running shoes"
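The two-stage flow can be summarized in a self-contained sketch: vector search proposes candidates in similarity order, and the knowledge graph keeps only those that carry at least one feature from every required group. Plain dictionaries stand in for Milvus and the RDF graph here, and all URIs, scores and feature sets are hypothetical.

```python
# Vector search hits, best first: (product URI, similarity score).
vector_hits = [
    ("ex:p1", 0.91),
    ("ex:p2", 0.89),
    ("ex:p3", 0.87),
    ("ex:p4", 0.80),
]

# Features each product has in the knowledge graph.
graph_features = {
    "ex:p1": {"waterproof", "running"},
    "ex:p2": {"fashion", "casual"},          # similar wording, wrong function
    "ex:p3": {"water-repellent", "running"},
    "ex:p4": {"waterproof", "hiking"},       # waterproof, but not for running
}

WATER_FEATURES = {"waterproof", "water-repellent", "quick-drying"}
RUNNING_FEATURES = {"running"}

def graph_filter(hits, features, *required_groups):
    """Keep hits that have at least one feature from every required group."""
    kept = []
    for uri, score in hits:
        product_features = features.get(uri, set())
        if all(product_features & group for group in required_groups):
            kept.append((uri, score))
    return kept  # vector-similarity order is preserved

final = graph_filter(vector_hits, graph_features, WATER_FEATURES, RUNNING_FEATURES)
print(final)
```

The casual shoes and the hiking shoes drop out even though their similarity scores are high, which is the effect the hybrid method relies on to keep the LLM context clean.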
Conclusion and practical suggestions