How to build a multi-source knowledge graph extractor from scratch

Written by
Jasper Cole
Updated on: June 22nd, 2025

A new automated approach to building knowledge graphs, providing solutions for organizations with limited resources.

Core content:
1. The importance of knowledge graphs and their role in RAG applications
2. Traditional challenges and new opportunities in building knowledge graphs
3. Applications and challenges of large language models in automated knowledge graph construction

Introduction

Knowledge graphs are a powerful solution for representing complex and interconnected information. They model data through graphs consisting of entities (nodes) and relationships (edges), making it possible to represent and process connections between different facts.

Although knowledge graphs have long been used in search and recommendation applications, interest in them has risen significantly in recent years with the widespread adoption of retrieval-augmented generation (RAG) applications. GraphRAG is a technique that enriches the context provided to a large language model before generation by retrieving information from a knowledge graph, which helps improve the performance of RAG systems.

Vanilla RAG systems rely on dense vector storage to retrieve information by semantic similarity. As a result, they have difficulty with entity disambiguation, multi-hop question answering, and combining data scattered across multiple sources. GraphRAG enhances the retrieval process by introducing a knowledge graph that improves contextual understanding, captures relationships between different entities, and combines information from multiple sources.

While structuring information into knowledge graphs has multiple advantages, the process of creating knowledge graphs from large pools of data is often hampered by high costs and resource requirements. Until recently, building knowledge graphs required extensive manual annotation, domain expertise, and complex processes, making it accessible only to large organizations with ample budgets. Fortunately, rapid advances in the capabilities of large language models have opened up the possibility of automating the process of creating knowledge graphs at a fraction of the cost and time required for manual annotation.

Although knowledge graphs automatically extracted by LLMs may not be as good in quality as knowledge graphs manually annotated by experts, they may be the only option when large amounts of data need to be structured and the budget is limited. After all, in many cases, an imperfect but fairly complete and extensive knowledge graph is often better than no knowledge graph at all. For example, in Graph RAG applications, knowledge graphs are used as a supporting tool for retrieving relevant context. In fact, irrelevant information can still be ignored by downstream LLMs, while the details that are ultimately needed to generate answers can be obtained directly from the original source.

There are several approaches to leveraging large language models to automate the construction of knowledge graphs, each with trade-offs in cost and quality that depend on the domain and the specific characteristics of the text data to be structured. A popular architectural pattern is a multi-step pipeline consisting of an extraction phase and an aggregation phase. The extraction phase leverages a large language model to extract relation triples of the form (subject, relation, object), optionally specifying the types of the entities (subject and object) and of the relation. The aggregation phase focuses on unifying the extracted entities and relations, since large language models can extract the same entities and relations in varying textual forms. For example, in the same knowledge graph, an entity referring to Marie Curie might be extracted as "Marie Curie" or "Maria Salomea Skłodowska-Curie".

Entity disambiguation is even more critical when extracting information from multiple sources or when extending an existing knowledge graph: in these cases, the sources themselves may use different terminology to refer to the same underlying entities and relations, or may use different terminology than what is already in the existing knowledge graph.

Long-context models that can process very long or multiple sources in a single call may help alleviate this problem, but they still suffer from several limitations. Despite recent progress on long-context retrieval and needle-in-a-haystack tasks, LLM performance tends to degrade on very long contexts, especially for complex tasks. Moreover, models that accept million-token contexts typically support much shorter outputs, which may not be enough to hold all the relations extracted from a very long context (unless those relations are very sparse in the original sources). Even then, the model may still extract the same entity under minor name variations, especially if those variations appear in the original sources. Furthermore, relying on long contexts alone does not solve the problem of extending an existing knowledge graph: the graph's original sources are not always available, and even when they are, the entire knowledge graph would have to be re-extracted from scratch, incurring additional cost. Another point to consider is that extracting relations from smaller chunks makes it easier to localize where in the original source each relation came from. This additional information is particularly useful in RAG applications, since only the most relevant parts of the original source need to be included in the context, improving LLM performance and helping humans evaluate and validate the generated answers.

In this article, I will show how to create a simple and straightforward implementation of an agent workflow to extract and extend a consistent knowledge graph from multiple sources. The workflow consists of two phases: extraction and construction. In the extraction phase, a large language model (LLM) performs basic (subject, relation, object) triple extraction on a text snippet. In the construction phase, the LLM examines each relation triple extracted in the extraction phase and incrementally adds it to the knowledge graph being built. During this process, the LLM can decide to discard redundant triples whose information is already represented in the knowledge graph, or to refine triples to better align with existing relations. For example, if an entity is already included in the knowledge graph under a different name, the model can modify the name of the entity in the new triple to maintain consistency and ensure linkage.

I’ve shared the full code in this GitHub repository, and you can try out the workflow in this Colab Notebook.

https://github.com/GabrieleSgroi/knowledge_graph_extraction

https://colab.research.google.com/drive/1st_E7SBEz5GpwCnzGSvKaVUiQuKv3QGZ

Workflow


In this section, I will describe the main aspects of the architecture and workflow. Please refer to the code in the shared GitHub repository for specific details.

To address the challenge of building a consistent knowledge graph from different input documents, the process is divided into two main stages: extraction and construction. This is achieved through a single agent workflow that coordinates the extractor and builder LLMs. I use the term "workflow" following the nomenclature introduced by Anthropic in the blog post Building effective Agents to represent a controlled system of agents that are constrained by predetermined code paths.

The workflow architecture is as follows:

High-level overview of the agent workflow. Image courtesy of the author.

Extraction phase


First, the document is segmented into smaller text chunks. The size of the chunks can be adjusted based on the information density of the source document and the needs of other downstream tasks that the knowledge graph needs to satisfy. For example, if the knowledge graph is later used for contextual retrieval in a graph RAG application, the size of the chunks should be optimized for this task. In fact, these chunks will be linked with the extracted relations for subsequent retrieval.
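As a rough illustration, the chunking step can be as simple as the character-based splitter sketched below. This is a minimal sketch with illustrative defaults, not the repository's exact implementation, which may split on sentences or tokens instead.

def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with a small overlap.

    A simple character-based splitter; chunk_size and overlap should be tuned
    to the information density of the source and the downstream task.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks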

Each chunk is processed by a large language model to extract triples of the form:

(subject: entity_type, relation, object: entity_type)

Entity types can be chosen freely by the large language model, but in my tests, extraction quality improves significantly when the model is told in advance which entity types to extract. This also helps guide the extraction process to focus only on the required information. It is worth noting that the triples are directed: the first entry is the subject of the relation and the last entry is the object, consistent with the common subject-verb-object structure in English. The generated knowledge graph preserves this information in the form of a directed graph.

The prompt for the extractor model is as follows:

You are an expert assistant tasked with automating the creation of a knowledge graph from text. You will be provided with a piece of text and some allowed entity types: your task is to extract relations between entities of the specified type from the provided text.
Guidelines:
- Provide your output in the form of triples (entity1:type1,relation,entity2:type2)
- Only extract triples involving entities of the user-specified type, and do not extract triples involving entities of other types.
- Do not generate duplicate triplets
- Do not extract attributes or properties as entities
- Entity names should be concise and contain all necessary information to uniquely identify the entity
- Keep entity names consistent: the same entity should have the same name in all triples it appears in
- Output only the extracted triples, nothing else
Here are some examples:
---
<user>{user_instruction} Gold (Au) is a chemical element with the chemical symbol Au (from the Latin word aurum) and an atomic number of 79. Pure gold is a bright, slightly orange-yellow, dense, soft, plastic and ductile metal. Chemically, gold is a transition metal belonging to the 11th group of elements and is also one of the precious metals. In 2023, the world's largest gold producer is China, followed by Russia and Australia.
{allowed_types_prompt} 'CHEMICAL', 'COUNTRY'</user>
<assistant>(gold:CHEMICAL, is a kind of, chemical element:CHEMICAL), (gold:CHEMICAL, chemical symbol, Au:CHEMICAL), (gold:CHEMICAL, is a kind of, transition metal:CHEMICAL), (gold:CHEMICAL, is a kind of, precious metal:CHEMICAL), (gold:CHEMICAL, used for, jewelry:APPLICATION), (China:COUNTRY, largest producer of, gold:CHEMICAL), (Russia:COUNTRY, second largest producer of, gold:CHEMICAL), (Australia:COUNTRY, third largest producer of, gold:CHEMICAL)</assistant>
---
<user>{user_instruction} The lion (Panthera leo) is a large cat of the genus Panthera, native to Africa and India. It is sexually dimorphic; adult males are larger than females and have a prominent mane.
{allowed_types_prompt} 'ANIMAL', 'LOCATION'</user>
<assistant>(lion:ANIMAL, scientific name, Panthera leo:ANIMAL), (lion:ANIMAL, belongs to, genus Panthera:ANIMAL), (lion:ANIMAL, native to, Africa:LOCATION), (lion:ANIMAL, native to, India:LOCATION), (male lion:ANIMAL, larger than, lioness:ANIMAL)</assistant>
---

The relation triples are listed in the LLM's answer, and a regular-expression-based parser extracts them from the text. They are then validated to ensure that they conform to the required format and contain only allowed entity types; triples with a malformed format or with disallowed entity types are discarded.
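A sketch of such a parser is shown below. This is my own approximation of the idea, not necessarily the repository's implementation; it assumes entity names contain no commas or parentheses.

import re

# Matches "(subject:type, relation, object:type)"; simple entity names only.
TRIPLE_RE = re.compile(
    r"\(\s*([^,:()]+):\s*([^,:()]+),\s*([^,()]+),\s*([^,:()]+):\s*([^,:()]+)\)"
)

def parse_triples(answer: str, allowed_types: set[str]) -> list[tuple[str, str, str, str, str]]:
    """Extract (subject, subject_type, relation, object, object_type) tuples from
    the model's raw answer, discarding malformed triples and disallowed types."""
    allowed = {t.lower() for t in allowed_types}
    triples = []
    for match in TRIPLE_RE.finditer(answer):
        subj, subj_type, rel, obj, obj_type = (g.strip() for g in match.groups())
        if subj_type.lower() in allowed and obj_type.lower() in allowed:
            triples.append((subj, subj_type, rel, obj, obj_type))
    return triples

# Example on the gold snippet above:
# parse_triples("(gold:CHEMICAL, chemical symbol, Au:CHEMICAL)", {"CHEMICAL", "COUNTRY"})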

Construction phase

In the construction phase, the triples produced in the extraction phase are evaluated and integrated into the knowledge graph one by one. For each candidate triple, the LLM determines:
1. whether the relation needs to be added to the knowledge graph: if the information it conveys already exists, it is discarded;
2. whether the contents of the triple, including entity names and types, need to be modified to stay consistent with existing relations.

To make this decision, the LLM is provided with a set of extracted triples that are most similar to the triple currently being examined. While a long-context model might be able to keep the entire knowledge graph in its context, providing only the most similar relations makes the LLM’s task simpler and also reduces costs.

For simplicity, my implementation retrieves similar triples by computing similarity between neural embeddings of the triples' text. This retrieval method is not a core part of the workflow and can be replaced with whatever best fits the problem at hand. In particular, retrieval techniques that take the graph structure into account are likely to produce better results, especially for dense graphs or graphs with many high-degree nodes; however, this is beyond the scope of this tutorial.
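Below is a hedged sketch of this retrieval step. It uses the sentence-transformers package and the all-MiniLM-L6-v2 model, which are my own assumptions rather than what the repository necessarily uses; for clarity it also re-embeds all triples on every call, whereas a real implementation would cache the embeddings.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def triple_to_text(triple: tuple[str, str, str, str, str]) -> str:
    subj, subj_type, rel, obj, obj_type = triple
    return f"({subj}:{subj_type}, {rel}, {obj}:{obj_type})"

def most_similar_triples(candidate, existing_triples, k: int = 5):
    """Return the k existing triples most similar to the candidate triple."""
    if not existing_triples:
        return []
    texts = [triple_to_text(t) for t in existing_triples]
    embeddings = model.encode(texts + [triple_to_text(candidate)])
    existing_emb, cand_emb = embeddings[:-1], embeddings[-1]
    # Cosine similarity between the candidate and every existing triple.
    norms = np.linalg.norm(existing_emb, axis=1) * np.linalg.norm(cand_emb)
    sims = existing_emb @ cand_emb / np.clip(norms, 1e-12, None)
    top_idx = np.argsort(sims)[::-1][:k]
    return [existing_triples[i] for i in top_idx]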

The prompt for adding triples and building the knowledge graph is as follows:

You are an expert assistant helping to expand the knowledge graph with new information. You will be given a triple (subject:type, relation, object:type) and you need to check if the information conveyed by this triple already exists in the graph. To do this, you will be given a set of similar relations that already exist in the knowledge graph.
Guidelines:
- If the information conveyed by the triplet already exists in the graph, discard the triplet
- If an entity is not one of the allowed types, discard the triple. Note that the types in the proposed triple may be wrong
- Keep the graph consistent: if an entity mentioned in the new relation already exists in the graph, the triple should be modified to match the name and type of the existing entity
- Answer with the triple to be added to the graph, or with {discard_triplet_value} if the triple should be discarded. Share very brief thoughts before giving your answer.
Answer using the following format:
"""
{thought_start} Your explanation goes here{thought_end}
{add_key} (subject:type, relation, object:type)
"""
Don't write anything else after the triple. Here are some examples:
---
Similar extracted relations:
(Maria Salomea Skłodowska-Curie:person, known as, Marie Curie:person)
(Marie Curie:person, nationality, Polish:nationality)
(Marie Curie:person, spouse, Pierre Curie:person)
Allowed entity types: 'person', 'nationality'
Proposed triple: (Marie Curie:person, full name, Maria Salomea Skłodowska-Curie:person)
{thought_start} This information already exists in the graph as (Maria Salomea Skłodowska-Curie:person, known as, Marie Curie:person), so this triple should not be added. {thought_end}
{add_key} {discard_triplet_value}
---
Similar extracted relations:
(Marie Curie:person, discovered, polonium:chemical)
(Marie Curie:person, award, Nobel Prize in Chemistry:prize)
(Marie Curie:person, award, Nobel Prize in Physics:prize)
Allowed entity types: 'person', 'chemical', 'prize'
Proposed triple: (Maria Salomea Skłodowska-Curie:person, discovered, radium:chemical)
{thought_start} This relation is not in the graph and should be added to it. The subject "Maria Salomea Skłodowska-Curie" already exists in the graph as "Marie Curie" of type 'person', I will modify the triple accordingly. {thought_end}
{add_key} (Marie Curie:person, discovered, radium:chemical)


Finally, each relation contained in the knowledge graph is linked to the original paragraph it was extracted from. This is very useful for further applications (such as GraphRAG) that can benefit from combining semantic search with graph search or can benefit from having the original text in context.
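Putting the construction phase together, here is a hedged sketch of the loop. Helper names such as build_builder_prompt and call_llm are placeholders for illustration, not the repository's API; parse_triples and most_similar_triples are the sketches from earlier in this article. Each accepted triple becomes a directed edge in a networkx graph, annotated with the id of the chunk it was extracted from.

import networkx as nx

DISCARD = "DISCARD"  # placeholder for {discard_triplet_value}
ADD_KEY = "Add:"     # placeholder for {add_key}

def build_graph(candidates, allowed_types, call_llm):
    """Incrementally build a directed knowledge graph from candidate triples.

    `candidates` is a list of (triple, chunk_id) pairs produced by the extraction
    phase; `call_llm(prompt) -> str` stands in for whatever client is used
    (e.g. the Gemini API in the examples below).
    """
    graph = nx.DiGraph()
    accepted = []
    for triple, chunk_id in candidates:
        similar = most_similar_triples(triple, accepted, k=5)
        prompt = build_builder_prompt(triple, similar, allowed_types)  # hypothetical helper
        answer = call_llm(prompt)
        # Keep only the text after the add marker, e.g. "Add: (s:t, r, o:t)" or "Add: DISCARD".
        decision = answer.split(ADD_KEY, 1)[-1].strip()
        if decision.startswith(DISCARD):
            continue
        parsed = parse_triples(decision, allowed_types)
        if not parsed:
            continue  # malformed answer: skip rather than corrupt the graph
        subj, subj_type, rel, obj, obj_type = parsed[0]
        graph.add_node(subj, entity_type=subj_type)
        graph.add_node(obj, entity_type=obj_type)
        # Link the relation back to the chunk it came from, as described above.
        graph.add_edge(subj, obj, relation=rel, source_chunk=chunk_id)
        accepted.append(parsed[0])
    return graph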

Examples


In this section, I will show some examples produced with the workflow described in this article. All examples are generated using gemini-2.0-flash as the LLM engine, since it offers good comprehension, long-context support, and a generous free tier. I tried several "small" open-source models that can run in a free Colab notebook, but the results were not satisfactory and execution time became a bottleneck.

The source text for these examples consists of Wikipedia page summaries obtained with the wikipedia Python package. Visual representations of the knowledge graphs are created with the networkx package.
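For reference, fetching the source summaries and drawing a graph can look roughly like this. It is only a sketch: the page titles and plotting parameters are illustrative, and graph stands for the networkx.DiGraph produced by the workflow.

import wikipedia
import networkx as nx
import matplotlib.pyplot as plt

# Fetch the page summaries used as source documents (illustrative titles).
titles = ["Instagram", "Facebook", "WhatsApp"]
summaries = [wikipedia.summary(title) for title in titles]

def draw_graph(graph: nx.DiGraph) -> None:
    """Draw the knowledge graph with relation names as edge labels."""
    pos = nx.spring_layout(graph, seed=42)
    nx.draw_networkx_nodes(graph, pos, node_size=800, node_color="lightblue")
    nx.draw_networkx_labels(graph, pos, font_size=8)
    nx.draw_networkx_edges(graph, pos, arrows=True)
    nx.draw_networkx_edge_labels(
        graph, pos, edge_labels=nx.get_edge_attributes(graph, "relation"), font_size=7
    )
    plt.axis("off")
    plt.show()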

Example 1

The workflow identified the key cross-company connections. Furthermore, we can see how effective the knowledge graph is at capturing indirect relationships between entities that are never directly related in a single document. Using networkx, we can easily extract the shortest undirected path between any two entities in the same connected component of the knowledge graph. For example, the path between the node for Mike Krieger and the node for Mark Zuckerberg is as follows:

('mike krieger', 'launched', 'instagram'),
 ('facebook', 'acquired', 'instagram'),
 ('mark zuckerberg', 'created', 'facebook')
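
A path like the one above can be recovered with a few lines of networkx. The node names are illustrative and match the lower-cased entity names shown above; graph is again the DiGraph built by the workflow.

import networkx as nx

def shortest_relation_path(graph: nx.DiGraph, source: str, target: str):
    """Return the relations along the shortest undirected path between two entities."""
    path = nx.shortest_path(graph.to_undirected(), source=source, target=target)
    edges = []
    for u, v in zip(path, path[1:]):
        # The graph is directed, so the edge may be stored in either direction.
        data = graph.get_edge_data(u, v) or graph.get_edge_data(v, u)
        edges.append((u, data["relation"], v))
    return edges

# Example:
# shortest_relation_path(graph, "mike krieger", "mark zuckerberg")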


Example 2

Let's visualize another example by extracting relations from the summaries of the Wikipedia pages of some famous Studio Ghibli films: Howl's Moving Castle (film), The Boy and the Heron, and Spirited Away.

Knowledge graph extracted from the abstracts of the Wikipedia pages “Howl's Moving Castle (film)”, “The Boy and the Heron”, and “Spirited Away”. The extracted entity types are: 'person', 'movie', and 'company'.

Again, we can see how the main entities are related to each other and how indirect relationships appear in the knowledge graph.

Example 3

Another interesting example is to consider the sets of symptoms shared between different diseases. This could be useful, for example, to get a list of all diseases that present a given set of symptoms.

The following knowledge graph was created from the summaries of the Wikipedia pages "Chest pain" and "Shortness of breath".

Knowledge graph extracted from the Wikipedia pages “Chest pain” and “Shortness of breath”. Image courtesy of the author.
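With such a graph in hand, querying for diseases that share a set of symptoms is a simple neighborhood intersection. The sketch below assumes illustrative node names and that the relevant edges carry a relation label containing the word "symptom", which may differ from what the model actually extracts.

import networkx as nx

def diseases_with_symptoms(graph: nx.DiGraph, symptoms: list[str]) -> set[str]:
    """Return entities linked by a 'symptom'-like relation to every given symptom node."""
    candidates = None
    for symptom in symptoms:
        linked = {
            src
            for src, _, data in graph.in_edges(symptom, data=True)
            if "symptom" in data.get("relation", "").lower()
        }
        candidates = linked if candidates is None else candidates & linked
    return candidates or set()

# Example (illustrative node names):
# diseases_with_symptoms(graph, ["chest pain", "shortness of breath"])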

Conclusion


In this article, I showed how to create a knowledge graph extractor that maintains consistency when extracting entities from multiple sources. The approach consists of an agent workflow divided into two stages: the first extracts relations, and the second integrates them into the growing knowledge graph. Adding a validation step before integrating relations into the knowledge graph improves consistency and reduces formatting and naming conflicts between relations extracted from different documents and chunks.

As with all large language model-based workflows, this approach is subject to the capabilities and limitations of the underlying model. For example, hallucinations may lead to the inclusion of fictitious relations in the knowledge graph, or the model may fail to correctly identify two nodes with different names but referring to the same entity, resulting in uncorrected entity duplication.

Another limitation inherent in the proposed workflow structure is that triples that have been integrated into the knowledge graph cannot be modified. This may become particularly important for information integration and optimization as the knowledge graph scales up. One possible way to address this issue is to add a revision mechanism by extending the workflow to enable modification of existing triples during the construction phase.

One factor to consider is the increased cost of this workflow compared to a single-step knowledge graph extractor. In fact, the need to evaluate newly extracted relations before adding them to the graph during the construction phase significantly increases the amount of computation and the number of API calls. Therefore, it becomes important to evaluate whether the relative improvement in quality is worth the corresponding increase in cost. However, the total cost is still a fraction of that of manual annotation.

As a final consideration for downstream applications, it is worth emphasizing that the workflow discussed links relations in the knowledge graph to the original documents or paragraphs from which they were originally extracted. This is very useful for applications such as GraphRAG that benefit from extended context, while also making human evaluation and correction much easier. In fact, the full power of automated extraction techniques, such as the ones discussed in this article, is realized when combined with human supervision: the LLM can bear the burden of analyzing large amounts of textual information, while human supervisors can ensure the accuracy of facts and optimize the extraction results.

As large language models continue to improve in performance and decrease in cost, automated knowledge graph creation will become more common, unlocking more value from large amounts of unstructured text data.