Building an enterprise-level knowledge graph with a large language model is easier than you think!

Master the secrets of building enterprise-level knowledge graphs to make complex relationships clear at a glance!
Core content:
1. The unique advantages of knowledge graphs in information retrieval
2. Combining RAG systems with knowledge graphs
3. Challenges and solutions for building knowledge graphs
When I first heard the term “Knowledge Graph”, it was indeed a bit daunting—not the concept itself, but the process of building it.
I tried to make a knowledge graph before, but failed.
Graphs are indeed one of the best ways to express complex relationships and are widely used in recommendation systems, fraud detection, etc. But what attracts me most is information retrieval.
So I began to explore how to use knowledge graphs to build a more powerful RAG system.
Of course, RAG does not necessarily have to rely on knowledge graphs, and it does not even need a database. As long as you can extract relevant content from massive information and pass it to the context of the large language model (LLM), RAG can work properly.
For example, you can use web search as a retrieval method for RAG, or you can use a vector database for semantic search.
If you choose to use a graph database to retrieve contextual information, that approach is usually called GraphRAG.
However, this article is not about GraphRAG (maybe I will write a dedicated article in the future); it focuses on how to use an LLM to build a knowledge graph. Before that, though, it is worth explaining why knowledge graphs can make RAG more powerful.
Why does RAG use knowledge graphs?
Knowledge graphs have unique strengths when it comes to retrieving more relevant information. Vector databases are sufficient in most cases, but they are not omnipotent.
The core retrieval method of the vector database is based on the "semantic similarity" of the text. Its general process is as follows:
We use a vector embedding model (such as OpenAI's text-embedding-3) to convert text into vectors. For example, although "Apple" and "Appam" (an Indian food) share most of their letters, they end up far apart in semantic space once converted into vectors. These vectors are then stored in a vector database such as Chroma.
When it comes to the retrieval stage, we use the same embedding model to convert the user's query into a vector, and then calculate the distance using methods such as cosine similarity to find the most similar content and return it.
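As a rough illustration of that flow (not code from the original article), here is a minimal sketch using OpenAI's text-embedding-3-small and plain cosine similarity; the model choice and sample strings are assumptions for demonstration only:
```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def embed(text: str) -> np.ndarray:
    # Convert text into a vector with an embedding model (model name is illustrative)
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


docs = ["Apple released a new iPhone.", "Appam is a South Indian pancake."]
doc_vectors = [embed(d) for d in docs]

query_vector = embed("latest smartphone from Apple")

# Rank stored documents by similarity to the query, highest first
ranked = sorted(
    zip(docs, (cosine_similarity(query_vector, v) for v in doc_vectors)),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[0][0])  # the most semantically similar document
```
A real vector database performs the same comparison, just with indexing structures that keep it fast at scale.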
This is essentially the only retrieval mechanism a vector database offers. You may have realized that retrieval quality is therefore driven mainly by the embedding model you choose; the database itself is rarely the issue (concurrency and speed are another matter).
Let’s look at an example:
Let's say you have a very large document that introduces the executive team of many companies.
For simple questions like "Which company is Mr. John Doe the CEO of?", vector search can give very accurate answers because the answer is often contained in a certain embedded text block.
But if you ask, “Who are the people who serve on multiple boards with John Doe?” that’s beyond the capabilities of a normal vector search.
The reason is that vector retrieval can only surface information that is explicitly mentioned in a retrieved chunk; it cannot reason across multiple sources. Knowledge graphs are different: they can perform structured reasoning over the entire dataset.
For example, a graph can connect person nodes to the boards they sit on across the whole dataset, so a simple query statement immediately returns the information you need.
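To make that concrete, here is a hedged sketch of how such a multi-hop question might look as a Cypher query run through the official Python driver; the node labels, relationship name, and property keys are assumptions about how the graph is modeled, and the connection details are placeholders:
```python
from neo4j import GraphDatabase

# Placeholder credentials; use your own instance's URI and password
driver = GraphDatabase.driver(
    "neo4j+s://<your-instance>.databases.neo4j.io",
    auth=("neo4j", "<password>"),
)

# Hypothetical schema: (:Person)-[:SERVES_ON_BOARD_OF]->(:Company)
CYPHER = """
MATCH (john:Person {name: 'John Doe'})-[:SERVES_ON_BOARD_OF]->(c:Company)
      <-[:SERVES_ON_BOARD_OF]-(other:Person)
WHERE other <> john
WITH other, collect(DISTINCT c.name) AS shared_boards
WHERE size(shared_boards) > 1
RETURN other.name AS person, shared_boards
"""

with driver.session() as session:
    for record in session.run(CYPHER):
        print(record["person"], record["shared_boards"])

driver.close()
```
A vector store has no equivalent of this traversal: it can only hand back chunks that happen to mention both people together.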
Now that we understand the advantages of knowledge graphs, let’s look at its biggest challenge — building them .
Building a knowledge graph used to be too difficult
A few years ago, a colleague introduced me to knowledge graphs. His idea was to build a unified, searchable graph for all our projects.
I spent a weekend learning about Neo4j and found the idea to be quite promising.
The problem was: with a pile of PDF, PPT, and Word documents on hand, how do we extract the entities (nodes) and relationships (edges) from them?
At that time, there was no good solution, so we could only manually organize these unstructured contents and then convert them into graph data models.
Although it is possible to read PDFs with PyPDF2 and then use keyword search to find nodes and edges, the results were poor and the process painfully slow. In the end, I gave up on the approach and labeled it "not worth the investment".
But now that LLMs have become part of our everyday toolkit, the situation is completely different.
Build a knowledge graph in a few minutes using LLM
Nowadays, extracting information from text or even images is no longer a difficult task.
Although there is still room for improvement in handling unstructured data, the development of LLM in the past few years has indeed opened up new possibilities.
In this section, we will try to use LLM to build a simple (possibly the simplest) knowledge graph and explore how to gradually optimize it towards enterprise-level applications.
This time we will use an experimental LangChain feature, LLMGraphTransformer, and store the graph data in Neo4j Aura, a cloud-hosted graph database.
If you are using LlamaIndex, you can take a look at KnowledgeGraphIndex instead.
Of course, Neo4j is not the only choice, other graph databases can also do the job.
Let's start by installing the necessary dependencies...
pip install neo4j langchain-openai langchain-community langchain-experimental
In this example, we’ll map a set of business leaders and their organizations into a graph database. If you want to follow along, you can check out the sample data I used — it’s a fictitious dataset I generated using AI.
Here is a surprisingly simple code example for generating a knowledge graph from an unstructured document:
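The sketch below is a minimal reconstruction of that step, not necessarily the author's exact code. It assumes your OpenAI key and Neo4j Aura credentials are available as environment variables, and the model choice and sample text are illustrative:
```python
import os

from langchain_core.documents import Document
from langchain_openai import ChatOpenAI
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_community.graphs import Neo4jGraph

# Connect to the Neo4j Aura instance (credentials come from your Aura console)
graph = Neo4jGraph(
    url=os.environ["NEO4J_URI"],
    username=os.environ["NEO4J_USERNAME"],
    password=os.environ["NEO4J_PASSWORD"],
)

# Any chat model works here; gpt-4o is just one option
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# The transformer uses the LLM to pull nodes and relationships out of raw text
llm_transformer = LLMGraphTransformer(llm=llm)

text = """
Jane Smith is the CEO of Acme Corp. She also serves on the board of Globex Inc.
John Doe, CTO of Globex Inc., previously worked at Initech.
"""
documents = [Document(page_content=text)]

# Extract a graph structure (nodes + edges) from the unstructured document
graph_documents = llm_transformer.convert_to_graph_documents(documents)

# Persist the extracted nodes and relationships to Neo4j
graph.add_graph_documents(graph_documents)

print(f"Nodes: {graph_documents[0].nodes}")
print(f"Relationships: {graph_documents[0].relationships}")
```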
The code above is pretty straightforward.
The most critical part is the step that builds the graph: the LLMGraphTransformer class uses the LLM we pass in to extract the graph structure from the document.
You can pass any LangChain Document to the convert_to_graph_documents method to extract a knowledge graph. The source can be plain text, a Markdown file, web page content, or even the result of another database query.
If this work were done manually, it would take months, which is exactly how we did it a few years ago.
You can log in to the Aura graph database console to visualize this graph, and the presentation may look like this:
In this process, the underlying API calls an LLM to automatically extract the relevant information from the text and construct the Python objects that Neo4j needs to represent nodes and edges.
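For reference, those objects look roughly like the hand-built sketch below; the names are fictitious, and this is only an illustration of the structures the transformer normally produces for you:
```python
from langchain_core.documents import Document
from langchain_community.graphs.graph_document import GraphDocument, Node, Relationship

# What the transformer assembles under the hood (values here are made up)
jane = Node(id="Jane Smith", type="Person")
acme = Node(id="Acme Corp", type="Organization")

ceo_of = Relationship(source=jane, target=acme, type="CEO_OF")

graph_doc = GraphDocument(
    nodes=[jane, acme],
    relationships=[ceo_of],
    source=Document(page_content="Jane Smith is the CEO of Acme Corp."),
)

# Neo4jGraph.add_graph_documents() writes these nodes and edges to the database
print(graph_doc.nodes, graph_doc.relationships)
```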
Now that we know how easy it is to build a knowledge graph with the extractor API, the next step is to discuss how to make it truly meet enterprise-level standards.
How do we make knowledge graphs meet enterprise standards?
We once gave up on building a knowledge graph because it was too complicated. Today the construction process is indeed much simpler, but the graph we just generated is still far from reliably usable in a business setting.
I found several obvious flaws and fixed two of them; both are worth highlighting here:
1. Improve control over the graph extraction process
If you actually run the example, you may notice that the automatically generated graph contains only two types of nodes: Person and Organization. Essentially, the extraction only identifies people and the companies they work for.
Note: using an LLM to extract a graph is essentially a probabilistic process, so your results may not match mine exactly.
But in fact, we can extract more information from the text, such as which university a certain executive graduated from, or their past work experience.
So can we tell the system in advance which entity types and relationships we want it to extract? Fortunately, the LLMGraphTransformer class supports exactly that.
You can initialize it as follows:
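Here is a minimal sketch of that constrained initialization; the relationship names below are illustrative choices, not values prescribed by the library:
```python
from langchain_experimental.graph_transformers import LLMGraphTransformer

llm_transformer = LLMGraphTransformer(
    llm=llm,  # the same chat model instance as before
    # Only these entity types will be extracted
    allowed_nodes=["Person", "Company", "University"],
    # Relationship names are illustrative; pick ones that fit your data
    allowed_relationships=["WORKS_AT", "STUDIED_AT", "SERVES_ON_BOARD_OF"],
    # Ask the LLM to also capture node properties (title, location, etc.)
    node_properties=True,
)
```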
In this version, we explicitly tell the transformer to recognize three entity types (people, companies, and universities) and specify the relationships that may exist between them. This information is critical for the LLM when it extracts entity relationships.
Also, the node_properties=True parameter tells the transformer to extract as many node properties as possible, even ones without an obvious connection to the defined relationships.
Explicitly setting entity types and relationships usually produces a more complete and accurate knowledge graph. But if we want to go further and make sure no important information slips through, we can also use the following technique:
2. Extract propositions before graph conversion
Text is a somewhat messy form of data, and text written by humans tends to be even messier.
We don't always put all the information in one place; even in formal technical documents, it is often scattered. In this article, for example, I have used both "Knowledge Graph" and "KG".
This phenomenon of "information scattered in multiple locations" makes it difficult for LLM to correctly understand the context.
LLMGraphTransformer first splits the text into chunks internally and then processes each chunk independently, which breaks the connections between pieces of information that live in different chunks.
For example, suppose the text explains at the very beginning that someone is a "CIO" (Chief Information Officer), but that explanation appears only in the first part. After chunking, when a later chunk mentions "CIO" again, the LLM may not know whether the abbreviation refers to "information", "investment", or something else.
To solve this problem, we can first run a proposition-extraction pass so that each chunk carries the context it needs.
The specific steps are as follows:
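Below is a hedged sketch of what that step can look like. The article does not name the exact Hub prompt it pulled; wfh/proposal-indexing is a commonly referenced proposition-extraction prompt, and the output schema and sample text are assumptions:
```python
from langchain import hub  # may require: pip install langchainhub
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

# Pull a proposition-extraction prompt from the LangChain Prompt Hub.
# "wfh/proposal-indexing" is one commonly used prompt; treat the choice as an assumption.
prompt = hub.pull("wfh/proposal-indexing")

llm = ChatOpenAI(model="gpt-4o", temperature=0)


class Propositions(BaseModel):
    """Structured output schema: a list of self-contained sentences."""
    sentences: list[str] = Field(description="Standalone, context-free propositions")


chain = prompt | llm.with_structured_output(Propositions)

text = (
    "Jane Smith joined Acme Corp in 2015 as CIO. "
    "Years later she left, and today the CIO role at Acme Corp is vacant."
)

# Different Hub prompts may use different input variable names; read it at runtime
input_key = prompt.input_variables[0]
result = chain.invoke({input_key: text})

for sentence in result.sentences:
    print(sentence)
# Expected flavor of output: each proposition spells out its own context, e.g.
# "Jane Smith joined Acme Corp in 2015 as Chief Information Officer."
```
Each proposition can then be fed to the graph transformer in place of the raw, chunked text.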
This code uses a prompt from LangChain's Prompt Hub. In the output, each sentence is independent, complete, and easy to understand, and no longer relies on outside context.
Doing this before building the knowledge graph can greatly reduce the risk of missing nodes or relationships.
Final Thoughts
I tried to build a knowledge graph before and failed; the cost of building it far outweighed the value it brought. But that was before LLMs.
I was genuinely surprised to find that an LLM can extract graph structure from plain text and store it in a database like Neo4j. Of course, the technology is not fully mature yet.
I don't think these features are ready for production use - unless you're willing to do some "tinkering".
The two methods mentioned in this article are effective means I use to improve the quality of knowledge graphs in practice. I hope they can be helpful to you as well.