Microsoft GraphRAG analysis: How knowledge graphs improve AI retrieval capabilities

Explore how Microsoft GraphRAG revolutionizes AI search technology through knowledge graphs.
Core content:
1. An analysis of GraphRAG and how it improves RAG retrieval capabilities
2. The evolution of knowledge graphs and their key role in RAG
3. A case study: performance comparison between baseline RAG and GraphRAG on The Hound of the Baskervilles
The retrieval step of most RAG pipelines relies on our ability to retrieve relevant documents from a vector database. Even though most vector databases use efficient similarity-search algorithms, and even with the best chunking, sorting, and re-ranking strategies, baseline RAG still struggles when the query does not closely match the source content (that is, it implicitly requires the query to be semantically similar to the corpus). Moreover, it cannot answer global questions, draw high-level conclusions, or capture the deeper meaning behind the relationships between data entities. Other RAG techniques focus on what to do when the corpus is larger than the LLM's context window, while advanced RAG systems add pre-retrieval and post-retrieval steps and use techniques such as query expansion to improve a given query. It is exactly this gap that GraphRAG, launched by Microsoft, addresses.
In this article, we will learn what knowledge graphs are, how they evolved, and why they are needed in RAG. We will implement baseline RAG and GraphRAG on Sir Arthur Conan Doyle's popular novel The Hound of the Baskervilles (https://www.gutenberg.org/cache/epub/2852/pg2852.txt) (Project Gutenberg License) and compare the results to understand how each strategy performs, and finally cover some caveats and shortcomings of GraphRAG.
What is RAG
RAG, or Retrieval-Augmented Generation, is a technique that enhances LLM responses by incorporating relevant information from external sources, thereby augmenting the model's pre-trained parametric memory without retraining the model itself.
Baseline RAG
The following code demonstrates a baseline RAG implementation. We have an embedding model that converts the novel into vectors and stores them in Azure AI Search. This can be used to perform hybrid search to get the top k (=3) most semantically relevant results for any query. We pass the results along with the query to the LLM to generate an appropriate response.
import os

from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate

# Chat model used for the generation step
llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    openai_api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
)

# Embedding model used to vectorize the novel
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME"),
    openai_api_version=os.getenv("AZURE_OPENAI_EMBEDDING_API_VERSION"),
)

# Azure AI Search index that stores the chunk vectors
vector_store = AzureSearch(
    index_name="hound-of-baskervilles",
    embedding_function=embeddings.embed_query,
    azure_search_endpoint=os.getenv("AZURE_SEARCH_ENDPOINT"),
    azure_search_key=os.getenv("AZURE_SEARCH_KEY"),
)

# Load the novel, split it into overlapping chunks, and index them
loader = TextLoader(book_path, encoding="utf-8")  # book_path: local path to the novel (downloaded later in this article)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1200, chunk_overlap=100)
docs = text_splitter.split_documents(documents)
vector_store.add_documents(documents=docs)

def retrieve(query):
    # Hybrid (keyword + vector) search for the top k=3 most relevant chunks
    docs = vector_store.hybrid_search(query=query, k=3)
    context = "\n\n".join(doc.page_content for doc in docs)
    print(context)
    return context

rag_prompt = ChatPromptTemplate.from_messages([
    ("human", """You are an assistant for question-answering tasks.
Only use the following pieces of context to answer the question.
If you don't know the answer or the answer can't be derived from the given context, just say that you don't know.
Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer: """),
])

def get_response(query):
    messages = rag_prompt.invoke({"context": retrieve(query), "question": query})
    response = llm.invoke(messages)
    return response.content

get_response("Who is Sir Henry Baskerville?")
Output:
Sir Henry Baskerville is the young baronet,
about thirty years old, who is the last of the Baskerville family.
He is the nephew of the late Sir Charles Baskerville.
He has recently arrived in England from Central America.
If you need a quick review of RAG, check out my article:
RAG 101: Introduction, the what, why and how of plain RAG:
(https://ai.gopubby.com/rag-101-introduction-aa1138b1dcf3)
The evolution of search engines
In any RAG approach, there are two steps: retrieval and generation. The generation step is more or less the same across approaches: we use an LLM to summarize all the data we get from the retrieval step. Retrieval is where most strategies differ. This step is crucial, because if we fail to provide the LLM with the proper documents, the generated response suffers as well. There is therefore a direct correlation between how well the retrieval step performs and the overall quality of the RAG response.
Retrieval is essentially searching within our documents, so in essence we are building a search engine just for our private data. It is no surprise that the retrieval strategy also follows similar trends to traditional search engines.
• Keyword Search – First came keyword-based search, where we look for exact occurrences of the keyword in the data. Even synonyms or variations of the keyword may not be identified, so the search text has to match the words in the corpus exactly. To make these lookups fast, an inverted index is used: a data structure holding a vocabulary of all unique words in the dataset along with the locations where each word occurs in the documents. Word frequencies can also be stored, allowing quick retrieval of documents matching the query (see the sketch after this list).
• Best Match Algorithms – These ranking algorithms use term frequency, document frequency, and document length to compute a score for each document against a query. BM25 is the most commonly used ranking function; it solves some pain points that simple keyword search cannot, thereby improving the quality of retrieved documents.
• PageRank – Originally developed by Google, it assigns a score to each page based on the quality of its incoming and outgoing links and its content. It was the natural next step for ranking millions of documents, although contextual relevance suffers somewhat and the chronological order of pages weighs heavily in the ranking. Document ranking and re-ranking is also a common strategy for enhancing the retrieved documents in baseline RAG.
• Semantic Search – Up to this point, all of these functions/algorithms/techniques focused only on which words to search for, rather than actually understanding the user's query. Semantic search interprets the intent behind the query and considers the relationships between words. Most search indexers today offer pure keyword, semantic, and hybrid search; hybrid search, as the name suggests, considers both keywords and the semantic meaning of the query.
• Knowledge Graphs – Although knowledge graphs existed for decades before they were used in search, intentionally or not, the evolution of search has always been about building a connected graph: PageRank and semantic search also rely heavily on graph-like data structures. In Google's use of a knowledge graph to enhance search results (https://en.wikipedia.org/wiki/Google_Knowledge_Graph), the graph is represented as a directed labeled graph where nodes represent entities and edges represent relationships between them. When a user query matches a specific node or edge, the most strongly connected neighboring nodes and edges can also be pulled in to provide more context for the response.
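To make the inverted index and BM25 ideas above concrete, here is a minimal, self-contained sketch over a toy corpus (this is an illustration, not any search engine's actual implementation): an index mapping each word to the documents and counts where it occurs, plus a small BM25 scorer built on top of it.

import math
from collections import defaultdict

# Toy corpus: doc_id -> text
docs = {
    0: "the hound of the baskervilles",
    1: "the adventures of sherlock holmes",
    2: "the hound howled on the moor",
}

# Inverted index: word -> {doc_id: term frequency}
index = defaultdict(lambda: defaultdict(int))
doc_len = {}
for doc_id, text in docs.items():
    words = text.lower().split()
    doc_len[doc_id] = len(words)
    for word in words:
        index[word][doc_id] += 1

print(dict(index["hound"]))  # {0: 1, 2: 1} -- a dictionary lookup, not a corpus scan

# BM25 ranking built on the same index (k1 and b are the usual defaults)
def bm25(query, k1=1.5, b=0.75):
    N = len(docs)
    avgdl = sum(doc_len.values()) / N
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        idf = math.log((N - len(postings) + 0.5) / (len(postings) + 0.5) + 1)
        for doc_id, tf in postings.items():
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len[doc_id] / avgdl))
            scores[doc_id] += idf * norm
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(bm25("hound of the baskervilles"))  # doc 0 ranks first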
Therefore, it is natural that the retrieval step of RAG evolves toward knowledge graphs as well.
What is a Knowledge Graph
It is a semantic representation of the relationships between entities. It gives structure to unstructured data so that machines can understand how entities are related to each other and what properties they share. All entities in a knowledge graph are represented as nodes, and all relationships between nodes are represented as edges. A node can be any object: a person, place, organization, or event. An edge can be any attribute, property, or relationship that connects two nodes.
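As a quick illustration, here is a minimal sketch of a directed labeled graph built with the networkx library. The entities and relationship labels below are hand-picked from the novel for demonstration, not the output of any extraction step:

import networkx as nx

# Nodes are entities; edges carry the relationship as a label
kg = nx.DiGraph()
kg.add_node("Sherlock Holmes", type="Person")
kg.add_node("James Mortimer", type="Person")
kg.add_node("Baskerville Hall", type="Place")
kg.add_edge("James Mortimer", "Sherlock Holmes", relationship="consults")
kg.add_edge("Sherlock Holmes", "Baskerville Hall", relationship="investigates")

# Walking a node's edges surfaces connected context for a matched entity
for _, target, data in kg.edges("Sherlock Holmes", data=True):
    print(f"Sherlock Holmes --{data['relationship']}--> {target}")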
Using LLM to build a knowledge graph
Traditionally, building a knowledge graph has been a manual and time-consuming task. It requires close collaboration between subject matter experts and data scientists, who review large amounts of data and work out how entities relate to one another. But with the advent of LLMs, we can automate almost every step of building a knowledge graph. As we saw before, the most important building blocks of any knowledge graph are its nodes. Let's write a simple prompt to extract names, places, organizations, and events.
Extract entities and their relationships
Let's try to get names, places, organizations, and events using a basic prompt. Eventually, we would like to give users the flexibility to specify their own entity types, since these are highly domain-specific and will greatly improve the extraction.
import os

from langchain_openai import AzureChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    openai_api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
)

entity_extraction_prompt = """
Given a document identify all entities from the text.
Entity types can be one of: Name, Place, Organization or Event
Document: {text}"""

def extract_entities(text):
    # Fill the prompt template with the document text and ask the LLM
    prompt = ChatPromptTemplate.from_messages([entity_extraction_prompt])
    messages = prompt.invoke({"text": text})
    response = llm.invoke(messages)
    return response.content
Using the first paragraph of the novel as input, we get the following identified entities.
extract_entities("""
Mr. Sherlock Holmes, who was usually very late in the mornings,
save upon those not infrequent occasions when he was up all
night, was seated at the breakfast table. I stood upon the
hearth-rug and picked up the stick which our visitor had left
behind him the night before. It was a fine, thick piece of wood,
bulbous-headed, of the sort which is known as a "Penang lawyer."
Just under the head was a broad silver band nearly an inch
across. "To James Mortimer, MRCS, from his friends of the
CCH," was engraved upon it, with the date "1884." It was just
such a stick as the old-fashioned family practitioner used to
carry—dignified, solid, and reassuring.
""" )
Output:
Entities identified in the text:
- Name: Sherlock Holmes
- Name: James Mortimer
- Organization: MRCS
- Organization: CCH
- Event: 1884
Not bad for a basic prompt. Let's also add _entity_types_ as a variable and give a few examples specifying the structure, so we get output in the desired format. This is exactly what GraphRAG does: in the GraphRAG code, entity and relationship extraction is handled by a single prompt:
-Goal-
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.
-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)
2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_strength>)
3. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.
4. When finished, output {completion_delimiter}
######################
-Examples-
######################
Example 1:
Entity_types: ORGANIZATION,PERSON
Text:
The Verdantis's Central Institution is scheduled to meet on Monday and Thursday, with the institution planning to release its latest policy decision on Thursday at 1:30 pm PDT, followed by a press conference where Central Institution Chair Martin Smith will take questions. Investors expect the Market Strategy Committee to hold its benchmark interest rate steady in a range of 3.5%-3.75%.
######################
Output:
( "entity" {tuple_delimiter}CENTRAL INSTITUTION{tuple_delimiter}ORGANIZATION{tuple_delimiter}The Central Institution is the Federal Reserve of Verdantis, which is setting interest rates on Monday and Thursday)
{record_delimiter}
( "entity" {tuple_delimiter}MARTIN SMITH{tuple_delimiter}PERSON{tuple_delimiter}Martin Smith is the chair of the Central Institution)
{record_delimiter}
( "entity" {tuple_delimiter}MARKET STRATEGY COMMITTEE{tuple_delimiter}ORGANIZATION{tuple_delimiter}The Central Institution committee makes key decisions about interest rates and the growth of Verdantis's money supply)
{record_delimiter}
( "relationship" {tuple_delimiter}MARTIN SMITH{tuple_delimiter}CENTRAL INSTITUTION{tuple_delimiter}Martin Smith is the Chair of the Central Institution and will answer questions at a press conference{tuple_delimiter}9)
{completion_delimiter}
######################
Example 2:
Entity_types: ORGANIZATION
Text:
TechGlobal's (TG) stock skyrocketed in its opening day on the Global Exchange Thursday. But IPO experts warn that the semiconductor corporation's debut on the public markets isn't indicative of how other newly listed companies may perform.
TechGlobal, a formerly public company, was taken private by Vision Holdings in 2014. The well-established chip designer says it powers 85% of premium smartphones.
######################
Output:
( "entity" {tuple_delimiter}TECHGLOBAL{tuple_delimiter}ORGANIZATION{tuple_delimiter}TechGlobal is a stock now listed on the Global Exchange which powers 85% of premium smartphones)
{record_delimiter}
( "entity" {tuple_delimiter}VISION HOLDINGS{tuple_delimiter}ORGANIZATION{tuple_delimiter}Vision Holdings is a firm that previously owned TechGlobal)
{record_delimiter}
( "relationship" {tuple_delimiter}TECHGLOBAL{tuple_delimiter}VISION HOLDINGS{tuple_delimiter}Vision Holdings formerly owned TechGlobal from 2014 until present{tuple_delimiter}5)
{completion_delimiter}
######################
Example 3:
Entity_types: ORGANIZATION,GEO,PERSON
Text:
Five Aurelians jailed for 8 years in Firuzabad and widely regarded as hostages are on their way home to Aurelia.
The swap orchestrated by Quintara was finalized when $8bn of Firuzi funds were transferred to financial institutions in Krohaara, the capital of Quintara.
The exchange initiated in Firuzabad's capital, Tiruzia, led to the four men and one woman, who are also Firuzi nationals, boarding a chartered flight to Krohaara.
They were welcomed by senior Aurelian officials and are now on their way to Aurelia's capital, Cashion.
The Aurelians include 39-year-old businessman Samuel Namara, who has been held in Tiruzia's Alhamia Prison, as well as journalist Durke Bataglani, 59, and environmentalist Meggie Tazbah, 53, who also holds Bratinas nationality.
######################
Output:
( "entity" {tuple_delimiter}FIRUZABAD{tuple_delimiter}GEO{tuple_delimiter}Firuzabad held Aurelians as hostages)
{record_delimiter}
( "entity" {tuple_delimiter}AURELIA{tuple_delimiter}GEO{tuple_delimiter}Country seeking to release hostages)
{record_delimiter}
( "entity" {tuple_delimiter}QUINTARA{tuple_delimiter}GEO{tuple_delimiter}Country that negotiated a swap of money in exchange for hostages)
{record_delimiter}
{record_delimiter}
( "entity" {tuple_delimiter}TIRUZIA{tuple_delimiter}GEO{tuple_delimiter}Capital of Firuzabad where the Aurelians were being held)
{record_delimiter}
( "entity" {tuple_delimiter}KROHAARA{tuple_delimiter}GEO{tuple_delimiter}Capital city in Quintara)
{record_delimiter}
( "entity" {tuple_delimiter}CASHION{tuple_delimiter}GEO{tuple_delimiter}Capital city in Aurelia)
{record_delimiter}
( "entity" {tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}PERSON{tuple_delimiter}Aurelian who spent time in Tiruzia's Alhamia Prison)
{record_delimiter}
( "entity" {tuple_delimiter}ALHAMIA PRISON{tuple_delimiter}GEO{tuple_delimiter}Prison in Tiruzia)
{record_delimiter}
( "entity" {tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}PERSON{tuple_delimiter}Aurelian journalist who was held hostage)
{record_delimiter}
( "entity" {tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}PERSON{tuple_delimiter}Bratinas national and environmentalist who was held hostage)
{record_delimiter}
( "relationship" {tuple_delimiter}FIRUZABAD{tuple_delimiter}AURELIA{tuple_delimiter}Firuzabad negotiated a hostage exchange with Aurelia{tuple_delimiter}2)
{record_delimiter}
( "relationship" {tuple_delimiter}QUINTARA{tuple_delimiter}AURELIA{tuple_delimiter}Quintara brokered the hostage exchange between Firuzabad and Aurelia{tuple_delimiter}2)
{record_delimiter}
( "relationship" {tuple_delimiter}QUINTARA{tuple_delimiter}FIRUZABAD{tuple_delimiter}Quintara brokered the hostage exchange between Firuzabad and Aurelia{tuple_delimiter}2)
{record_delimiter}
( "relationship" {tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}ALHAMIA PRISON{tuple_delimiter}Samuel Namara was a prisoner at Alhamia prison{tuple_delimiter}8)
{record_delimiter}
( "relationship" {tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}Samuel Namara and Meggie Tazbah were exchanged in the same hostage release{tuple_delimiter}2)
{record_delimiter}
( "relationship" {tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}Samuel Namara and Durke Bataglini were exchanged in the same hostage release{tuple_delimiter}2)
{record_delimiter}
( "relationship" {tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}Meggie Tazbah and Durke Bataglani were exchanged in the same hostage release{tuple_delimiter}2)
{record_delimiter}
( "relationship" {tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}FIRUZABAD{tuple_delimiter}Samuel Namara was a hostage in Firuzabad{tuple_delimiter}2)
{record_delimiter}
( "relationship" {tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}FIRUZABAD{tuple_delimiter}Meggie Tazbah was a hostage in Firuzabad{tuple_delimiter}2)
{record_delimiter}
( "relationship" {tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}FIRUZABAD{tuple_delimiter}Durke Bataglani was a hostage in Firuzabad{tuple_delimiter}2)
{completion_delimiter}
######################
-Real Data-
######################
Entity_types: {entity_types}
Text: {input_text}
######################
Output:
We get the output in the specified format:
'("entity"<|>SHERLOCK HOLMES<|>PERSON<|>Mr. Sherlock Holmes is a detective who often stays up all night and was seated at the breakfast table in the morning)\n##\n("entity"<|>JAMES MORTIMER<|>PERSON<|>James Mortimer is the owner of the stick, engraved "To James Mortimer, MRCS, from his friends of the CCH," with the date "1884")\n##\n("entity"<|>CCH<|>ORGANIZATION<|>CCH is an organization whose friends gifted James Mortimer a stick)\n|COMPLETE|'
Using GraphRAG
To show the true value of GraphRAG, let's remove the last two chapters of the book and try to answer questions using both baseline RAG and GraphRAG. To get started with GraphRAG, let's index the book into parquet files using the CLI and then load them into memory.
import os

import requests
from dotenv import find_dotenv, load_dotenv

book_path = "graphrag/input/book.txt"

# Download the novel from Project Gutenberg into GraphRAG's input folder
response = requests.get("https://www.gutenberg.org/cache/epub/2852/pg2852.txt")
if response.status_code == 200:
    os.makedirs("graphrag/input", exist_ok=True)
    with open(book_path, "wb") as file:
        file.write(response.content)
    print("File saved successfully")
else:
    print("Failed to fetch the file")

load_dotenv(find_dotenv())
!python -m graphrag index --root ./graphrag
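Indexing writes the graph artifacts (entities, relationships, community reports) as parquet files under the project's output folder, and the same CLI exposes local and global query modes. The exact artifact names and flags vary across graphrag versions, so treat the following as a rough sketch rather than the canonical invocation:

import pandas as pd

# Inspect the indexed graph (file names are version-dependent;
# older graphrag releases prefix them with "create_final_")
entities = pd.read_parquet("graphrag/output/entities.parquet")
relationships = pd.read_parquet("graphrag/output/relationships.parquet")
print(entities.head())

!python -m graphrag query --root ./graphrag --method local --query "Who is Sir Henry Baskerville?"
!python -m graphrag query --root ./graphrag --method global --query "What are the main themes of the novel?"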
A few things to remember about GraphRAG
1. Although the default prompts extract basic entities such as people, places, and organizations, it is better to provide a few examples that cover the entity types specific to your own domain, as this greatly improves extraction.
2. Although the GraphRAG paper argues that extra extraction rounds are worthwhile, since a follow-up prompt first asks the LLM a yes/no question about whether entities were missed and then suggests that many entities are missing, justifying a longer context, there is still no guarantee that the LLM will recognize all entities, especially those that are not nouns or that consist of multiple words.
3. There is no explicit entity deduplication step. Even if an entity is misread and duplicated as a separate node, the duplicates tend to sit close to each other and be densely connected, and since GraphRAG generates community summaries, the LLM will eventually pick up on this and produce appropriate summaries.
4. To distribute information throughout the graph, GraphRAG shuffles the community reports (after removing reports with weight 0) and sorts the remainder by relevance score. Only the top n reports that fit in the context window are used, so there is no guarantee that the same reports appear every time (when scores tie); the conclusion for a given question may be the same, but the source reports can be completely different.
Limitations of GraphRAG
1. The indexing cost is very high, as is obvious from the number of calls it makes to the LLM, and at query time the cost is also higher than baseline RAG's.
2. When retrieving documents, it is common practice to apply filters on the metadata to exclude documents that we know are not relevant in the user context or that the user cannot access. This acts as a security layer where users can only search through documents they have access to. However, since GraphRAG performs local summarization, this can become more complicated in terms of which entities, relationships, or community reports require RBAC (role-based access control) and how to build the graph for different roles.
3. As long as the text has not changed, its embeddings stay the same. So no matter how many times we re-embed the same document, baseline RAG will give the same answer to the same query (or, at minimum, the retriever will always return the same documents). GraphRAG, on the other hand, relies on LLM output for entity and relationship extraction and for generating community summaries and reports, so there is a high chance that regenerating the graph over the same corpus will yield different answers to the same query.