Retrieval-Augmented Generation (RAG) has become a key technique for improving the quality of answers from large language models. This article explores how to combine knowledge graphs and vector databases to build a smarter Graph RAG system that makes AI answers more accurate and reliable.

What is Graph RAG and why do we need it?

Imagine that traditional RAG is like a diligent but short-sighted librarian who can only find relevant books by keyword matching. Graph RAG is like a knowledgeable expert who not only knows the content of each book, but also understands the intricate connections between them.

Traditional RAG can retrieve relevant fragments from massive document collections, but it is like reading through a keyhole: you can see the content, but not the connections between pieces of content. Graph RAG unlocks those relationships by introducing a knowledge graph, allowing the AI to understand not only "what" but also "why" and "how", and so to give more comprehensive, in-depth answers.

There are currently three main ways to implement Graph RAG:

The first is vector-based retrieval, which vectorizes the knowledge graph, stores it in a vector database, and retrieves from it through similarity matching.

The second is prompt-to-query retrieval, which uses an LLM to convert a natural-language question into a SPARQL or Cypher query that runs directly against the knowledge graph.

The last is a hybrid method that combines the advantages of both: it first uses vector search for preliminary retrieval, and then uses the knowledge graph to filter and refine the results.
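To make the second approach more concrete, here is a minimal sketch of how a prompt for natural-language-to-SPARQL translation might be assembled. The function name `build_query_prompt`, the prompt wording, and the `ex:` terms are all assumptions for illustration; the article does not show its actual prompt, and the call to the LLM itself is left out.

```python
def build_query_prompt(question: str, classes: list[str], predicates: list[str]) -> str:
    """Assemble a prompt asking an LLM to translate a question into SPARQL."""
    return "\n".join([
        "Translate the question into a SPARQL SELECT query.",
        "Use only the classes and predicates listed below.",
        f"Classes: {', '.join(classes)}",
        f"Predicates: {', '.join(predicates)}",
        f"Question: {question}",
        "SPARQL:",
    ])


# Hypothetical schema terms; a real system would list the graph's own vocabulary
prompt = build_query_prompt(
    "Which products are waterproof?",
    classes=["ex:Product", "ex:Feature"],
    predicates=["ex:hasFeature", "ex:name"],
)
print(prompt)
```

Listing the graph's vocabulary in the prompt is one common way to keep the generated query within the predicates that actually exist in the graph.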

Experimental comparison: advantages and disadvantages of three methods

We take an e-commerce product recommendation system as an example to compare how the three methods perform in semantic search, similarity calculation, and RAG.

Method 1: Vector database search

First, we vectorize product descriptions and user reviews and store them in the Milvus vector database:

    from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections

    # Connect to a running Milvus instance (host and port are assumptions for a local setup)
    connections.connect(alias="default", host="localhost", port="19530")

    # Define data schema
    collection_name = "products"
    dim = 1536  # OpenAI embedding dimension

    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
        FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=200),
        FieldSchema(name="description", dtype=DataType.VARCHAR, max_length=2000),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim),
    ]

    # Create collection
    collection = Collection(name=collection_name, schema=CollectionSchema(fields))

    # Write data; df (the product table) and get_embedding (the embedding helper) are defined elsewhere
    for index, row in df.iterrows():
        embedding = get_embedding(row.title + " " + row.description)
        collection.insert([[int(index)], [row.title], [row.description], [embedding]])
    collection.flush()

Semantic search test: When searching for "lightweight waterproof sports shoes", related products are returned:

"Ultra-light breathable running shoes"

"Waterproof outdoor hiking shoes"

"Multifunctional sports training shoes"

Here, the vector database demonstrates good semantic understanding: it finds functionally related products even when their descriptions do not match the query exactly.
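The similarity matching behind this kind of search can be illustrated with a small pure-Python sketch. The three-dimensional vectors below are made up for illustration (real embeddings have 1536 dimensions in this article's setup), and `top_k` is a hypothetical helper, not part of Milvus.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, items, k=2):
    """Return the k (title, score) pairs most similar to the query."""
    scored = [(title, cosine_similarity(query_vec, vec)) for title, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings"; the axes loosely stand for lightweight, waterproof, formal
catalog = {
    "Ultra-light breathable running shoes": [0.9, 0.2, 0.0],
    "Waterproof outdoor hiking shoes": [0.3, 0.9, 0.0],
    "Leather dress shoes": [0.1, 0.1, 0.9],
}
query = [0.7, 0.7, 0.0]  # stands in for "lightweight waterproof sports shoes"

results = top_k(query, catalog, k=2)
for title, score in results:
    print(f"{score:.3f}  {title}")
```

Neither of the two returned titles contains all of the query's words; they rank highly only because their vectors point in a similar direction, which is the behavior described above.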

When the user asks "Recommend shoes for running in the rain", the system retrieves relevant products and generates suggestions:

Here are some recommended shoes for running in the rain:

- Waterproof and breathable running shoes XYZ use a special rubber outsole to provide excellent grip
- All-weather sports shoes ABC are made with water-repellent fabric and are lightweight, which makes them suitable for long-distance running
- Professional cross-country running shoes DEF have a drainage design that dries quickly even after stepping in water
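The generation step can be sketched as follows: the retrieved products are formatted into a context block and placed in the prompt ahead of the user's question. The function `build_rag_prompt` and the prompt wording are assumptions for illustration; the article does not show its prompt, and the LLM call is omitted.

```python
def build_rag_prompt(question, products):
    """Format retrieved products as context and append the user's question."""
    context = "\n".join(f"- {p['title']}: {p['description']}" for p in products)
    return (
        "Answer the question using only the products listed below.\n"
        f"Products:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Stand-in for the search results; the descriptions are abbreviated from the example above
retrieved = [
    {"title": "Waterproof and breathable running shoes XYZ",
     "description": "special rubber outsole, excellent grip"},
    {"title": "All-weather sports shoes ABC",
     "description": "water-repellent fabric, lightweight"},
]

prompt = build_rag_prompt("Recommend shoes for running in the rain", retrieved)
print(prompt)
```

Whatever ends up in `retrieved` is passed to the model verbatim, so any off-target product in the search results goes straight into the context.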

However, we discovered a problem: the vector database may also return products that are superficially similar but functionally mismatched (such as fashionable casual shoes). This causes "context pollution" and makes the recommendations generated by the LLM less precise.

Method 2: Knowledge Graph Retrieval

Next, we build the same data into a knowledge graph:

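    from rdflib import Graph, Literal, Namespace, RDF

    # Assumed namespace; the original does not show how these terms are defined
    EX = Namespace("http://example.org/")
    Product, Feature = EX.Product, EX.Feature
    name, description, hasFeature = EX.name, EX.description, EX.hasFeature

    g = Graph()

    # Create entities and relationships
    # (product_uri, row and features come from a loop over the product table, not shown here)
    g.add((product_uri, RDF.type, Product))
    g.add((product_uri, name, Literal(row['title'])))
    g.add((product_uri, description, Literal(row['description'])))

    # Add product attributes and classification relationships
    for feature in features:
        feature_uri = create_valid_uri("http://example.org/feature", feature)
        g.add((feature_uri, RDF.type, Feature))
        g.add((product_uri, hasFeature, feature_uri))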

Semantic search test: We not only searched for the "waterproof" tag, but also used the hierarchical relationships in the product ontology to search for related concepts such as "water-repellent" and "quick-drying":

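    # Get concepts related to waterproofing, up to two levels away in the ontology
    related_concepts = get_all_related_concepts("WaterProof", depth=2)

    # Convert all concepts to URIs for the query
    # (flat_list is the flattened form of related_concepts; the flattening step is not shown)
    feature_terms = [convert_to_feature_uri(term) for term in flat_list]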

The result is:

"All-weather waterproof hiking shoes" (tags: waterproof, wear-resistant, outdoor)

"Quick-drying water trekking shoes" (tags: quick-drying, non-slip, water sports)

"Gore-Tex professional running shoes" (tags: waterproof, breathable, professional running)

The advantage of knowledge graphs is that the results are highly explainable and we know why each product was selected.
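The concept expansion and the explainability it brings can be sketched in plain Python. This is a hypothetical stand-in for the article's `get_all_related_concepts`, whose implementation and return type are not shown: here it is a breadth-first walk over a small hand-written concept hierarchy, and the `RELATED` table and `explain_matches` helper are assumptions for illustration.

```python
from collections import deque

# Hypothetical concept hierarchy: concept -> directly related concepts
RELATED = {
    "waterproof": ["water-repellent", "quick-drying"],
    "water-repellent": ["splash-resistant"],
    "splash-resistant": ["stain-resistant"],
}


def get_all_related_concepts(start, depth=2):
    """Breadth-first expansion; returns {concept: distance from start}."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        concept = queue.popleft()
        if seen[concept] == depth:
            continue  # do not expand beyond the requested depth
        for neighbor in RELATED.get(concept, []):
            if neighbor not in seen:
                seen[neighbor] = seen[concept] + 1
                queue.append(neighbor)
    return seen


def explain_matches(products, concepts):
    """Return (title, matched tags) for every product sharing a tag with the concepts."""
    matches = []
    for title, tags in products.items():
        hits = sorted(set(tags) & set(concepts))
        if hits:
            matches.append((title, hits))
    return matches


products = {
    "All-weather waterproof hiking shoes": ["waterproof", "wear-resistant", "outdoor"],
    "Quick-drying water trekking shoes": ["quick-drying", "non-slip", "water sports"],
    "Fashionable casual shoes": ["stylish"],
}

concepts = get_all_related_concepts("waterproof", depth=2)
matches = explain_matches(products, concepts)
for title, hits in matches:
    print(f"{title}: matched {hits}")
```

Each result carries the tags that caused it to match, which is what makes this kind of retrieval easy to explain.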

Method 3: Hybrid approach

Finally, we combine the advantages of both approaches:

    # The start of this snippet is cut off in the source; the first field is
    # reconstructed from the Method 1 schema. The schema now also stores the
    # product's features and its URI in the knowledge graph.
    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
        FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=200),
        FieldSchema(name="description", dtype=DataType.VARCHAR, max_length=2000),
        FieldSchema(name="features", dtype=DataType.VARCHAR, max_length=500),
        FieldSchema(name="product_uri", dtype=DataType.VARCHAR, max_length=200),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim),
    ]
    collection = Collection(name=collection_name, schema=CollectionSchema(fields))

Let's first use vector search to get some preliminary results:

    # Search for shoes suitable for rainy day running
    # (the collection needs an index on "embedding" and must be loaded before searching)
    search_params = {"metric_type": "COSINE", "params": {"nprobe": 10}}
    results = collection.search(
        [get_embedding("Shoes suitable for running in rainy days")],
        "embedding",
        search_params,
        limit=20,
        output_fields=["title", "description", "features", "product_uri"],
    )

Then use the knowledge graph to filter and sort:

    # Keep only products that are truly waterproof and designed for running.
    # The ex: prefix is an assumed namespace; use the one the graph was built with.
    query = """
    PREFIX ex: <http://example.org/>
    SELECT ?product ?title ?description
    WHERE {
        ?product ex:hasFeature ?feature1 .
        ?product ex:hasFeature ?feature2 .
        ?product ex:name ?title .
        ?product ex:description ?description .
        FILTER (?product IN (%s) && ?feature1 IN (%s) && ?feature2 IN (%s))
    }
    """
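The three `%s` placeholders in the query's FILTER clause have to be filled with SPARQL-formatted lists: the candidate product URIs from the vector search, the waterproof-related feature URIs, and the running-related feature URIs. The article does not show this step, so the following is only a sketch; `to_sparql_list` and all the example URIs are hypothetical.

```python
def to_sparql_list(uris):
    """Format URIs as comma-separated <...> terms for a SPARQL IN (...) clause."""
    return ", ".join(f"<{uri}>" for uri in uris)


# Same FILTER shape as in the query above, reduced to the part that is templated
FILTER_TEMPLATE = "FILTER (?product IN (%s) && ?feature1 IN (%s) && ?feature2 IN (%s))"

# Hypothetical values: product URIs from the vector search, feature URIs from the graph
candidates = ["http://example.org/product/1", "http://example.org/product/2"]
waterproof = ["http://example.org/feature/waterproof",
              "http://example.org/feature/water-repellent"]
running = ["http://example.org/feature/running"]

filter_clause = FILTER_TEMPLATE % (
    to_sparql_list(candidates),
    to_sparql_list(waterproof),
    to_sparql_list(running),
)
print(filter_clause)
```

Plain string formatting like this is only reasonable when the URIs come from your own database and graph; values taken from user input would need validation before being placed in a query.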

This hybrid approach addresses the context pollution problem and ultimately returns shoes that are truly suitable for running on rainy days:

"GTX waterproof professional marathon running shoes"

"All-weather water-resistant racing running shoes"

"Anti-slip and waterproof cross-country running shoes"
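The two-stage flow can be summarized in a small in-memory sketch: stage one supplies candidates ranked by vector similarity, and stage two keeps only those whose knowledge-graph features satisfy every requirement. The ranking, the feature sets, and the `hybrid_search` helper below are made up for illustration and stand in for the Milvus search and the SPARQL filter.

```python
def hybrid_search(vector_hits, product_features, required_groups):
    """Keep hits that have at least one feature from every required group, in rank order."""
    kept = []
    for title in vector_hits:
        features = product_features.get(title, set())
        if all(features & group for group in required_groups):
            kept.append(title)
    return kept


# Stage 1 output: titles ranked by vector similarity (made-up ranking)
vector_hits = [
    "GTX waterproof professional marathon running shoes",
    "Fashionable casual shoes",
    "Anti-slip and waterproof cross-country running shoes",
    "Waterproof outdoor hiking shoes",
]

# Stage 2 input: features as recorded in the knowledge graph (hypothetical)
product_features = {
    "GTX waterproof professional marathon running shoes": {"waterproof", "running"},
    "Fashionable casual shoes": {"stylish"},
    "Anti-slip and waterproof cross-country running shoes": {"waterproof", "non-slip", "running"},
    "Waterproof outdoor hiking shoes": {"waterproof", "hiking"},
}

# A product must match one waterproof-like feature AND one running feature
required_groups = [{"waterproof", "water-repellent"}, {"running"}]

final = hybrid_search(vector_hits, product_features, required_groups)
print(final)
```

The casual shoes and the hiking shoes are dropped even though the vector search ranked them as similar, while the surviving items keep their original similarity order.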

Conclusion and practical suggestions

From the above comparison, we can see that the advantage of a vector database is that it is simple and fast to deploy. Milvus provides high-performance vector retrieval, which is particularly suitable for large product catalogs. The disadvantages are that the results are hard to explain and there is a risk of context pollution.

The advantage of knowledge graphs is that the results are highly controllable and explainable, and irrelevant content can be accurately filtered out. The disadvantage is that knowledge graphs need to be built and maintained, and query writing is complex.

The hybrid method uses Milvus for efficient retrieval and the knowledge graph for precise filtering, so it keeps retrieval fast while improving recommendation quality.

In practice, the same idea carries over to other applications. For content recommendation, consider both topic similarity and content relevance to avoid recommending content that is superficially similar but actually irrelevant. For customer service, make sure answers are not only relevant but also take into account the compatibility relationships between products.

Graph RAG is not only a combination of technologies, but also a leap forward in improving the intelligence of AI systems. Through Milvus's efficient vector retrieval and the relationship understanding of knowledge graphs, our AI is no longer a simple "keyword matching machine", but an "intelligent consultant" that truly understands user needs.

Last words

In 2025, AI innovation is surging, with new technologies appearing almost every day. As a technician who has lived through three waves of AI, I firmly believe that AI is not here to replace humans, but to free us from repetitive work so that we can focus on more creative things. Follow our official account, Pocket Big Data, to explore the possibilities of putting large models into practice together!

Knowledge Graph + Vector Database: Building a Smarter RAG System