Woter AI detection.Hurry - ends Jun 28th

New Year Sales :up to 80% OFF

AI Humanize AI Translator Bypass AI AI Rewriter AI Detector

PRICING

TRY FOR FREE

Another open source project: Using LLM to convert unstructured text into knowledge graph

Written by

Caleb Hayes

Updated on:June-24th-2025

Building a knowledge graph from unstructured text is a challenging task. It usually requires identifying key terms, clarifying their interrelationships, and using custom code or machine learning tools to extract this structured information.

We will create an end-to-end pipeline driven by a Large Language Model (LLM) that automatically transforms raw text into an interactive knowledge graph.

All code can be found in my GitHub repository:https://github.com/FareedKhan-dev/KG-Pipeline

Environment Configuration

As with any good project, we need the right tools. We'll use a few key Python libraries to get the job done. First, install them.

# Install the library (this cell only needs to be run once)
pip install openai networkx  "ipycytoscape>=1.3.1"  ipywidgets pandas

After running the installation command, you may need to restart the Jupyter kernel or runtime environment for the changes to take effect.

Once installed, let’s import all the required libraries into our script.

import  openai              # Used to interact with LLM
import  json                # used to parse LLM's response
import  networkx  as  nx      # Used to create and manage graph data structures
import  ipycytoscape        # for interactive graph visualization in notebook
import  ipywidgets          # for interactive elements
import  pandas  as  pd        # used to display data in tabular form
import  os                  # for accessing environment variables (safer for API keys)
import  math                # for basic math operations
import  re                  # for basic text cleaning (regular expressions)
import  warnings            # used to suppress potential deprecation warnings

Our toolbox is ready and all necessary libraries have been loaded into our environment.

What is a knowledge graph?

A knowledge graph is a network of interconnected entities and relationships that represents knowledge in a structured way and supports reasoning and discovery of knowledge. It consists of two main parts:

• Nodes/Entities: These are “things” – like ‘Marie Curie’, ‘Physics’, ‘Paris’, ‘Nobel Prize’. In our project, each unique subject or object we extract will become a node.
• Edges/Relationships: These are the connections between things, showing how they are related. Crucially, these connections have meaning and usually have a direction . For example: 'Marie Curie' — won → 'Nobel Prize'. The 'won' part is the relationship, defining the edge.

The image above shows a simple diagram with two nodes (e.g., "Marie Curie", "Radium") connected by a directed edge labeled "Discovery". There is also a small cluster next to it ("Paris" — Located → "Sorbonne University"). This intuitively demonstrates the concept of "node-edge-node".

Knowledge graphs are powerful because they organize information in a way that is closer to how we think about connections between things, making it easier to discover insights and even infer new facts.

Subject-Verb-Object (SPO) Triples

So how do we get these nodes and edges from plain text? We look for simple declarative facts, which can usually be constructed as Subject-Predicate-Object (SPO) triples.

• Subject: The object described by the fact (e.g., 'Marie Curie'), corresponding to a node in the graph.
• Predicate: The action or relationship connecting the subject and the object (e.g., 'discover'), corresponding to the label of the edge.
• Object: Something related to the subject (e.g., 'Radium'), corresponding to another node.

Example: The sentence "Marie Curie discovered radium" can be perfectly decomposed into triples: (Marie Curie, discovered, radium).

This maps directly to our graph structure:

• (Marie Curie) -[discovered]-> (radium).

The task of a large language model is to extract these basic subject-verb-object triplets from text.

Configuring LLM Connections

We need to configure the script so that it can communicate with the LLM API, which includes providing an API key and an API endpoint (URL).

We will use the NebiusAI LLM API, but you can also use Ollama or any other OpenAI module compatible LLM provider.

# If using the standard OpenAI
export  OPENAI_API_KEY= 'Your_openai_api_key_put here'

# If using a local model, such as Ollama
export  OPENAI_API_KEY= 'ollama' # For Ollama, can be any non-empty string 
export  OPENAI_API_BASE= 'http://localhost:11434/v1'

# If using other providers, such as Nebius AI
export  OPENAI_API_KEY= 'Your provider api key goes here'
export  OPENAI_API_BASE= 'https://api.studio.nebius.com/v1/' # Example URL

First, let's specify the LLM model we want to use. This will depend on the models supported by your API key and configured endpoint.

# --- Define LLM model ---
# Select the model available for the endpoint you configured.
# Example: 'gpt-4o', 'gpt-3.5-turbo', 'llama3', 'mistral', 'DeepSeek-ai/DeepSeek-Coder-V2-Lite-Instruct', 'gemma'
llm_model_name =  "deepseek-ai/DeepSeek-V3" # <-- *** Change this to your model ***

OK, we've identified our target model. Now let's grab our API key and base URL (if needed) from the environment variables we (hopefully) set up earlier.

# --- Retrieve Credentials ---
api_key = os.getenv( "OPENAI_API_KEY" )
base_url = os.getenv( "OPENAI_API_BASE" )  # None if not set (e.g. for standard OpenAI)

The client is now ready to communicate with the LLM.

Finally, we set several parameters that control the behavior of LLM:

• Temperature: Controls the randomness. Lower values mean more focused, more deterministic output (great for fact extraction!). We set it to 0.0 for maximum predictability.
• Max Tokens: Limits the length of LLM responses.

# --- Define LLM call parameters ---
llm_temperature =  0.0 # Lower temperatures give more deterministic factual output. 0.0 is best for extraction tasks. 
llm_max_tokens =  4096 # Maximum number of tokens for LLM response (adjusted based on model limits)

Define input text (raw material)

Now, we need text to be converted into a knowledge graph. We will use the biography of Marie Curie as an example.

unstructured_text =  """
Marie Curie, born Maria Skłodowska in Warsaw, Poland, was a pioneering physicist and chemist.
She conducted groundbreaking research on radioactivity. Together with her husband, Pierre Curie,
she discovered the elements polonium and radium. Marie Curie was the first woman to win a Nobel Prize,
the first person and only woman to win the Nobel Prize twice, and the only person to win the Nobel Prize
in two different scientific fields. She won the Nobel Prize in Physics in 1903 with Pierre Curie
and Henri Becquerel. Later, she won the Nobel Prize in Chemistry in 1911 for her work on radium and
polonium. During World War I, she developed mobile radiography units, known as 'petites Curies',
to provide X-ray services to field hospitals. Marie Curie died in 1934 from aplastic anemia, likely
caused by her long-term exposure to radiation."""

Next, we print out the text content and count its length.

print ( "--- Input text loaded---" )
print (unstructured_text)
print ( "-"  *  25 )
# Basic statistics visualization
char_count =  len (unstructured_text)
word_count =  len (unstructured_text.split())
print ( f"Total number of characters:  {char_count} " )
print ( f"Approximate number of words:  {word_count} " )
print ( "-"  *  25 )

#### Expected output (based on the original example)####
# --- Input text loaded ---
# Marie Curie, born Maria Skłodowska in Warsaw, Poland... (full text print)
# -------------------------
# Total number of characters: 1995 (example value)
# Approximate number of words: 324 (example value)
# -------------------------

We have text about Marie Curie, about 324 words. It may not be ideal for production environments, but it is enough to show the construction process of the knowledge graph.

Chunking

LLM usually has a limit on the length of text that can be processed at a single time (i.e., context length limit).

Our text about Marie Curie is relatively short, but for longer documents, we would definitely need to break it up into smaller pieces, or chunks . Each chunk contains a certain number of words to make it easier to process. Even for this text, chunking can sometimes help the LLM focus on specific parts.

We will define two parameters for this:

• Chunk Size: The maximum number of words we want each chunk to contain.
• Overlap: How many words should overlap between the end of one chunk and the beginning of the next. This overlap helps keep the context coherent and avoids abrupt cuts between chunks.

The above diagram shows the process of segmenting a complete text into three overlapping segments (chunks). Chunk 1, Chunk 2, and Chunk 3 are clearly labeled. The overlap between Chunk 1 and Chunk 2, and between Chunk 2 and Chunk 3 are highlighted. Chunk size and number of overlapping words are also indicated.

Let's set the desired chunk size and number of overlapping words.

# --- Block Configuration ---
chunk_size =  150 # The number of words in each chunk (adjust as needed)
overlap =  30 # Number of overlapping words (must be less than chunk_size)     

print ( f" Chunk size set to:  {chunk_size}  words" )
print ( f"Number of overlapping words is set to:  {overlap}  words" )

# --- Basic Authentication ---
if  overlap >= chunk_size  and  chunk_size >  0 :
    print ( f"Error: Number of overlapping words ( {overlap} ) must be less than chunk size ( {chunk_size} )." )
    # In a real script, this should raise an error or exit
    # raise SystemExit("Block configuration error.")
else :
    print ( "Block configuration is valid." )

### Expected Output###
# Chunk size is set to: 150 words
# The number of overlapping words is set to: 30 words
# Block configuration is valid.

The plan was to create blocks of 150 words with an overlap of 30 words between blocks.

First, we need to split the text into individual words.

words = unstructured_text.split()
total_words =  len (words)

print ( f"The text is split into  {total_words}  words." )
# Visualize the first 20 words
print ( f"First 20 words:  {words[: 20 ]} " )

### Expected Output###
# The text is split into 324 words.
# First 20 words: ['Marie', 'Curie,', 'born', 'Maria', 'Skłodowska', 'in', 'Warsaw,', 'Poland,', 'was', 'a', 'pioneering', 'physicist', 'and', 'chemist.', 'She', 'conducted', 'groundbreaking', 'research', 'on', 'radioactivity.']

The output confirms that our text has 324 words and shows the first few words. Now, let's apply the chunking logic.

We will iterate over the list of words, and each time we get chunk_size words, then back off overlap words to start the next block.

chunks = []
start_index =  0
chunk_number =  1

print ( f"Starting block processing..." )

while  start_index < total_words:
    end_index =  min (start_index + chunk_size, total_words)
    chunk_text =  " " .join(words[start_index:end_index])
    chunks.append({ "text" : chunk_text,  "chunk_number" : chunk_number})

    # print(f"Created chunk {chunk_number}: words {start_index} to {end_index-1}") # Uncomment to see detailed log

    # Calculate the starting index of the next block
    next_start_index = start_index + chunk_size - overlap

    # Ensure processing progress
    if  next_start_index <= start_index:
        if  end_index == total_words:
             break # The last part has been processed
        # If there is no progress and it is not at the end, go forward at least one word
        next_start_index = start_index +  1

    start_index = next_start_index
    chunk_number +=  1

    # Safe interrupt (optional)
    if  chunk_number > total_words:  # Simple safety measure
        print ( "Warning: The number of chunking loops exceeds the total number of words and has been interrupted." )
        break

print ( f"\nThe text was successfully split into  { len (chunks)}  chunks." )

#### Expected Output####
# Start chunking...
#
# The text is successfully split into 3 blocks.

The text is successfully segmented into 3 chunks. Let’s use Pandas DataFrame to see the size and content of these chunks.

print ( "--- Block Details---" )
if  chunks:
    # Create a DataFrame for better visualization
    chunks_df = pd.DataFrame(chunks)
    chunks_df[ 'word_count' ] = chunks_df[ 'text' ].apply( lambda  x:  len (x.split()))
    # In the Jupyter environment, display() will display the DataFrame in a more beautiful table format
    # If in a normal Python script, you can use print(chunks_df[['chunk_number', 'word_count', 'text']])
    try :
        display(chunks_df[[ 'chunk_number' ,  'word_count' ,  'text' ]])
    except  NameError:  # 'display' may be undefined in non-Jupyter environments
        print (chunks_df[[ 'chunk_number' ,  'word_count' ,  'text' ]])
else :
    print ( "No chunks created (text may be shorter than chunk size)." )
print ( "-"  *  25 )

(Here you would expect a table with 3 rows and columns chunk_number, word_count, textThe first two blocks word_count The first one is 150, and the last one is 84 (324 - (150-30) - (150-30) = 84). )

The table clearly shows our 3 chunks. Note that the first two chunks are exactly 150 words, and the last one contains the remaining 84 words. Now we have manageable snippets that we can submit to the LLM.

LLM Prompt Template

Prompt engineering is a critical step in the process of building a knowledge graph. Getting good results from LLM depends heavily on giving clear and precise instructions - the LLM Prompt Template .

We need to tell it explicitly what we want (SPO triples), and how we want it formatted (a specific JSON structure).

We will create two parts for our prompt:

1. System Prompt: Set the overall role and context for the LLM (e.g., “You are a knowledge graph extraction expert”).
2. User Prompt: Contains specific task instructions and the actual text block to be processed.

The image above shows two boxes. Box 1 is labeled "System prompt" and contains text like "You are an expert..." Box 2 is labeled "User prompt" and contains text like "Extract SPO triples... Rules: ... Text: {text_chunk} ... Required JSON format: ... Your JSON:". An arrow points from the text chunk data to the {text_chunk} placeholder in the user prompt box.

Here are the key rules we emphasize in our user tips :

• Extract **Subject-Predicate-Object** triples.
• Outputs only a valid JSON array [...]Do not include any additional text, explanations, or markdown code fences (such as json ... ).
• Each element in the array must be an object:{ "subject": "...", "predicate": "...", "object": "..." }.
• Keep the predicate concise (1-3 words, verbs preferred).
• All subject, predicate, and object values must be lowercase. This helps with subsequent standardization.
• Parse pronouns (e.g. 'she', 'her') and replace them with the names of the specific entities they refer to (e.g. 'marie curie').
• Be specific (e.g. if the text mentions 'nobel prize in physics', extract that, not just 'nobel prize').
• Try to capture all the different facts.

Let's define these prompts in Python.



# --- System Prompt: Sets the context/role for the LLM ---
extraction_system_prompt =  """
You are an AI expert specialized in knowledge graph extraction.
Your task is to identify and extract factual Subject-Predicate-Object (SPO) triples from the given text.
Focus on accuracy and adhere strictly to the JSON output format requested in the user prompt.
Extract core entities and the most direct relationship.
"""

# --- User Prompt Template: Contains specific instructions and the text ---
extraction_user_prompt_template =  """
Please extract Subject-Predicate-Object (SPO) triples from the text below.

VERY IMPORTANT RULES:
1. **Output Format:** Respond ONLY with a single, valid JSON array. Each element MUST be an object with keys "subject", "predicate", "object".
2. **JSON Only:** Do NOT include any text before or after the JSON array (eg, no 'Here is the JSON:' or explanations). Do NOT use markdown ````json ... ```` tags.
3. **Concise Predicates:** Keep the 'predicate' value concise (1-3 words, ideally 1-2). Use verbs or short verb phrases (eg, 'discovered', 'was born in', 'won').
4. **Lowercase:** ALL values for 'subject', 'predicate', and 'object' MUST be lowercase.
5. **Pronoun Resolution:** Replace pronouns (she, he, it, her, etc.) with the specific lowercase entity name they refer to based on the text context (eg, 'marie curie').
6. **Specificity:** Capture specific details (eg, 'nobel prize in physics' instead of just 'nobel prize' if specified).
7. **Completeness:** Extract all distinct factual relationships mentioned.

**Text to Process:**

{text_chunk}

Let's output and verify that its contents are as expected, including an example of what the user prompt will look like when we insert our first block of text.

print ( "--- System prompts---" )
print (extraction_system_prompt)
print ( "\n"  +  "-"  *  25  +  "\n" )

print ( "--- User prompt template (structure) ---" )
# Display the structure, replacing placeholders for clarity
print (extraction_user_prompt_template.replace( "{text_chunk}" ,  "[... text chunk goes here...]" ))
print ( "\n"  +  "-"  *  25  +  "\n" )

# Show example of the *actual* prompt that will be sent for the first chunk
print ( "--- Example of user prompt after filling (for block 1) ---" )
if  chunks:
    example_filled_prompt = extraction_user_prompt_template. format (text_chunk=chunks[ 0 ][ 'text' ])
    # For brevity, only a portion is shown
    print (example_filled_prompt[: 600 ] +  "\n[... the rest of the block text...]\n"  + example_filled_prompt[- 200 :])
else :
    print ( "No blocks available to create a populated prompt example." )
print ( "\n"  +  "-"  *  25 )

#### Expected Output####
# --- System prompts ---
# You are an AI expert specializing in knowledge graph extraction... (Complete system prompts)
# -------------------------
#
# --- User prompt template (structure) ---
# Please extract the subject-verb-object (SPO) triples from the following text.
# **Very important rule:**
# [... rules print here...]
# **Text to be processed:**
# [... text blocks go here...]
# **Your JSON output:**
# -------------------------
#
# --- Example of populated user prompt (for block 1) ---
# Please extract the subject-verb-object (SPO) triples from the following text.
# ... (rule) ...
# **Text to be processed:**
# Marie Curie, born Maria Skłodowska in Warsaw, Poland, was a pioneering physicist and chemist.
# She conducted groundbreaking research on radioactivity. Together with her husband, Pierre Curie,
# she discovered the elements polonium and radium. Marie Curie was the first woman to win a Nobel Prize,
# the first person and only woman to win the Nobel Prize twice, and the only person to win the Nobel Prize
# in two different scientific fields. She won the Nobel Prize in Physics in 1903 with Pierre Curie
# and Henri Becquerel. Later, she won the Nobel Prize in Chemistry in 1911 for her work on radium and
# polonium. During World War I, she developed mobile radiography units, known as 'petites Curies',
# [... the rest of the block text...]
# **Your JSON output:**
# -------------------------

The prompts are clearly structured and accurate. We are ready to send these to the LLM.

Get triples from LLM

Next, we'll extract the triples from the LLM API. We'll iterate over each block of text, format the user prompt with the block's text, send both the system prompt and the user prompt to LLM via the API, and then attempt to parse the JSON response it returns.

We will track the results of successful extractions as well as any failed blocks.

The diagram above shows a loop. Starts at "Text Block". Arrow points to "Format Prompt (System + User + Block)". Arrow points to "Send to LLM API". Arrow points to "Receive Response". Arrow points to "Parse JSON". Arrow points to "Validate Triplet". Arrow points to "Store Valid Triplet". An arrow points back to the beginning to process the next block. There is a small box next to it for "Handle Errors/Failures".

Let's initialize the list that will be used to store the results.

# Initialize a list to store results and failure records
all_extracted_triples = []
failed_chunks = []

# Assume the client has been correctly initialized according to the previous 'Configuring Our LLM Connection' section
# For example: client = openai.OpenAI(api_key=api_key, base_url=base_url)
# For code operability, add a simple client initialization here (need to be replaced with the actual one)
try :
    client = openai.OpenAI(api_key=api_key, base_url=base_url)
except  Exception  as  e:
    print ( f"Unable to initialize OpenAI client:  {e} " )
    print ( "Please make sure your API key and base URL are set correctly." )
    client =  None # Mark client as invalid

print ( f"Start   extracting triplets from { len (chunks)} chunks, using model ' {llm_model_name} '..." )
# We will process the blocks one by one in the following cells.

Start extracting triplets from 3 blocks, using model 'deepseek-ai/DeepSeek-V3'...

Ok, let’s tackle the first chunk (remember, the full notebook goes through all chunks, but for clarity we’ll only show the detailed steps for one chunk here).

chunk_index =  0 # process only the first chunk  

if  client  and  chunk_index <  len (chunks):  # Check if the client is valid
    chunk = chunks[chunk_index]
    print ( f"\n--- Processing chunk  {chunk[ 'chunk_number' ]} / { len (chunks)}  ---" )
    prompt = extraction_user_prompt_template. format (text_chunk=chunk[ 'text' ])
    raw_response =  None # Initialize the raw response variable
    parsed_data =  None # Initialize the parsed data variable
    triples_in_chunk = []  # Initialize the triple list of the current block

    try :
        print ( "1. Format user prompt..." )
        print ( "2. Send request to LLM..." )
        # Call LLM, including system prompts and user prompts
        res = client.chat.completions.create(
            model=llm_model_name,
            messages=[{ "role" :  "system" ,  "content" : extraction_system_prompt},
                      { "role" :  "user" ,  "content" : prompt}],
            temperature=llm_temperature,
            max_tokens=llm_max_tokens,
            # Try to require JSON output (if the model and API support it)
            response_format={ "type" :  "json_object" },
        )
        print ( "LLM response received." )
        print ( "3. Extract the original response content..." )
        raw_response = res.choices[ 0 ].message.content.strip()
        print ( f"\n--- Raw LLM output (chunk  {chunk[ 'chunk_number' ]} ) ---" )
        print (raw_response)  # Print the raw output to see
        print ( "-"  *  15 )

        print ( "\n4. Trying to parse JSON from response..." )
        # Try parsing JSON directly, since we requested the json_object format
        try :
            parsed_data = json.loads(raw_response)
            # Some models may wrap the list under a key, try to extract it
            if isinstance (parsed_data,  dict ):
                 # Find the first item in the dictionary that is a list
                potential_list =  next ((v  for  v  in  parsed_data.values()  if isinstance (v,  list )),  None )
                if  potential_list  is not None :
                    parsed_data = potential_list
                else :
                    # If no list is found, but the dictionary looks like a single triple, put it into a list
                    if all (k  in  parsed_data  for  k  in  [ 'subject' ,  'predicate' ,  'object' ]):
                         parsed_data = [parsed_data]
                    else :
                         # If the dictionary is not a triple and is not a wrapper containing a list, it is considered invalid
                         print ( "Warning: Received dictionary, but neither a triple nor a wrapper around a list of triples." )
                         parsed_data = []  # set to an empty list

            if not isinstance (parsed_data,  list ):
                print ( f"Warning: Parsing result is not a list but  { type (parsed_data) } . Trying to find a list." )
                parsed_data = []  # If it is not a list, it is considered invalid

            print ( f" Successfully parsed JSON list (or converted/found). Contains  { len (parsed_data)}  items." )
            # Optional: Print the parsed data structure
            # print(f" --- Parsed JSON data (chunk {chunk['chunk_number']}) ---")
            # print(json.dumps(parsed_data, indent=2, ensure_ascii=False)) # Use ensure_ascii=False to correctly display non-ASCII characters
            # print("-" * 15)

        except  json.JSONDecodeError  as  json_e:
            print ( f" Direct JSON parsing failed:  {json_e} " )
            print ( " Trying to extract JSON array using regular expression ..." )
            # If direct parsing fails (for example, the model adds text before and after the JSON), try to extract it using regular expressions
            match  = re.search( r'\[.*?\]' , raw_response, re.DOTALL)  # Search [...] structure
            if match :
                try :
                    parsed_data = json.loads( match .group( 0 ))
                    print ( f"The JSON list was successfully extracted and parsed by the regular expression. Contains  { len (parsed_data)}  items." )
                except  json.JSONDecodeError  as  regex_json_e:
                    print ( f" Failed to parse JSON from regular expression match:  {regex_json_e} " )
                    parsed_data = []  # parsing still failed
            else :
                print("   未找到符合 [...] 格式的 JSON 数组。")
                parsed_data = [] # 未找到匹配

        print("\n5. 验证结构并提取三元组...")
        ifisinstance(parsed_data, list):
             valid_triples_count = 0
             for item in parsed_data:
                 # 检查是否是字典且包含所有必需的键，并且值是字符串
                 ifisinstance(item, dict) andall(k in item andisinstance(item[k], str) for k in ['subject', 'predicate', 'object']):
                     # 添加来源块编号
                     item_with_chunk = dict(item, chunk=chunk['chunk_number'])
                     triples_in_chunk.append(item_with_chunk)
                     valid_triples_count += 1
                 else:
                     print(f"   警告：跳过无效项目：{item}")
             print(f"   在此块中找到 {valid_triples_count} 个有效三元组。")
             if triples_in_chunk:
                 # 使用 Pandas 显示提取的三元组（如果可用）
                 try:
                     print(f"   --- 提取的有效三元组 (块 {chunk['chunk_number']}) ---")
                     display(pd.DataFrame(triples_in_chunk))
                 except NameError:
                     print(pd.DataFrame(triples_in_chunk))
                 all_extracted_triples.extend(triples_in_chunk)
        else:
            print("   未能获取有效的 JSON 列表，无法提取三元组。")
            failed_chunks.append({'chunk_number': chunk['chunk_number'], 'error': '未能解析出有效的 JSON 列表', 'response': raw_response})

    except Exception as e:
        print(f"处理块 {chunk['chunk_number']} 时发生错误：{e}")
        failed_chunks.append({'chunk_number': chunk['chunk_number'], 'error': str(e), 'response': raw_response or'请求失败，无响应'})

    # 打印当前累计结果
    print ( f"\n--- The total number of triples currently extracted:  { len (all_extracted_triples)}  ---" )
    print ( f"--- Number of failed chunks so far:  { len (failed_chunks)}  ---" )
    print ( f"\nFinished processing this block." )

elif not  client:
    print ( "Error: LLM client not initialized. Unable to process block." )
else :
    print ( "Block index out of range or no block." )

After running the above loop, the program will start extracting entities and other information used to build the knowledge graph, thereby building the knowledge graph.

Here is an example of the process when processing a single block of data:

===== Processing data block 1/3 =====
1.  Formatting user prompts...
2.  Send a request to the large language model...
   Large language model response received.
3. Extract the raw response content...
=====

===== Large language model raw output (data block 1) =====
[
{ "subject": "marie curie", "predicate": "born as", "object": "maria skłodowska" },
{ "subject": "marie curie", "predicate": "born in", "object": "warsaw, poland" },
{ "subject": "marie curie", "predicate": "was", "object": "physicist" },

# [... more primitive triples...]

{ "subject": "marie curie", "predicate": "born to", "object": "family of teachers" }
]

=====

4.  Try parsing the JSON from the response...
   Successfully parsed a JSON list directly.
   ===== Parsed JSON data (data block 1) =====
   [
   {
   "subject": "marie curie",
   "predicate": "born as",
   "object": "maria sk\u0142odowska"
   },

# [... more parsed triples...]

{
"subject": "marie curie",
"predicate": "born to",
"object": "family of teachers"
}
]

=====

5.  Verify the structure and extract triples...
   18 valid triplets were found in this data block.
   ===== Valid triples extracted (data block 1) =====
   subject predicate object chunk
   0 marie curie born as maria skłodowska 1
   1 marie curie born in warsaw, poland 1
   2 marie curie was physicist 1

# [... more triplets to display in the dataframe...]

## 17 marie curie born to family of teachers 1

===== Total number of triplets extracted: 18 =====
===== Number of data blocks failed so far: 0 =====

The data block has been processed.

We sent our first chunk of data and received a response, which we parsed successfully into JSON format. The raw output shows that the Large Language Model (LLM) followed the instructions well and returned a list of dictionaries. We then verified the data and displayed it clearly in a table - 18 facts were extracted from this first chunk alone!

(Note: The above code only runs the first chunk. A full run would process all chunks and accumulate more triplets.)

Next, we summarize the overall results after processing all data chunks (based on the complete notebook run).

# ===== Extraction process summary (reflects the status after single data block demonstration/or complete run) =====
print ( f"\n===== Overall extraction summary=====\n" )
print ( f"Total number of defined data chunks:  { len (chunks)} " )
print ( f"Number of chunks processed (tried to process):  { len (chunks)} " )  # The chunks we loop over
print ( f"The total number of valid triplets extracted from all processed data blocks:  { len (all_extracted_triples)} " )
print ( f"Number of data chunks where API calls or parsing failed:  { len (failed_chunks)} " )

if  failed_chunks:
    print ( "\nFailed data block details: " )
    failed_df = pd.DataFrame(failed_chunks)
    display(failed_df[[ 'chunk_number' ,  'error' ]])  # Clearly display the failed data block
    # for failure in failed_chunks:
    # print(f" data chunk {failure['chunk_number']}: error: {failure['error']}")
print ( "-"  *  25 )

# Display all extracted triplets using Pandas
print ( "\n===== All extracted triples (before normalization) =====\n" )
if  all_extracted_triples:
    all_triples_df = pd.DataFrame(all_extracted_triples)
    display(all_triples_df)
else :
    print ( "Failed to extract any triples successfully." )
print ( "-"  *  25 )

The output is as follows:

===== Overall Extraction Summary=====
Total number of blocks defined: 3
Number of blocks processed (tried to process): 3
Total number of valid triplets extracted from all processed data blocks: 45 # <--- Total number of examples
Number of data blocks where API calls or parsing failed: 0
-------------------------

===== All extracted triplets (before normalization) =====
subject predicate object chunk
0 marie curie born as maria skłodowska 1
1 marie curie born in warsaw, poland 1

# [... more triplets from all chunks...]

## 44 marie curie had daughters irène 3

OK, after processing all the data chunks (in a full run), we have a merged list of all the triples found by the LLM. This is a good start, but you may notice some potential overlap or slight differences in representation. Now, it's time to clean up the data!

Normalization and deduplication

The raw output from LLM is already pretty good, but often needs further optimization. We will perform a few simple cleanup steps:

1. Normalize: Remove extra spaces at the beginning and end of the subject, predicate, and object. We have previously asked LLM to output lowercase, but we force it again here just in case.
2. Filter: Remove triples whose subject, predicate, or object is empty after cleaning (for example, LLM returns blank content).
3. De-duplicate: Remove identical triplets. These duplicates may come from overlapping data blocks or different expressions in the text.

Now we start initialization.

# Initialize list and tracking variables

normalized_triples = []
seen_ triples = set() # Used to track (subject, predicate, object) tuples
original _count = len(all_ extracted _triples)
empty_ removed _count = 0
duplicates_removed_count = 0

print(f"Start normalizing and de-duplicating {original_ count} triples...")

#### Output####

Starting to normalize and deduplicate 45 triples... # <--- Total number of examples

Now, we will iterate over the original all_extracted_triples list, apply the cleaning step, and keep only unique, valid triplets.

We will print out the first few transformations to show what is happening.

print ( "Processing triples (showing the first 5):" )

for  i, t  in enumerate (all_extracted_triples):
    # Extract the subject, predicate, and object; remove leading and trailing spaces and convert to lowercase; if it is not a string, set it to an empty string
    s, p, o = [t.get(k,  '' ).strip().lower()  if isinstance (t.get(k),  str )  else '' for  k  in  [ 'subject' ,  'predicate' ,  'object' ]]
    # Replace multiple spaces in the predicate with a single space
    p = re.sub( r'\s+' ,  ' ' , p)

    # Make sure that the subject, predicate, and object are not empty
    if all ([s, p, o]):
        key = (s, p, o)  # Create a key for checking for duplicates
        if  key  not in  seen_triples:  # If this triple is new
            normalized_triples.append({ 'subject' : s,  'predicate' : p,  'object' : o,  'source_chunk' : t.get( 'chunk' ,  '?' )})  # Add to the result list
            seen_triples.add(key)  # Record it to avoid duplication
            if  i <  5 :  # print the first 5 processing information
                print ( f"\n# {i+ 1 } :  {key} \nStatus: Reserved" )
        else :  # if it is a duplicate
            duplicates_removed_count +=  1
            if  i <  5 :  print ( f"\n# {i+ 1 } : repeat - skip" )
    else :  # If there is an empty part after cleaning
        empty_removed_count +=  1
        if  i <  5 :  print ( f"\n# {i+ 1 } : invalid - skip" )

print ( f"\nProcessing completed. Total:  { len (all_extracted_triples)} , Retained:  { len (normalized_triples)} , Duplicates:  {duplicates_removed_count} , Empty values:  {empty_removed_count} " )

After running the loop for triple generation, the output we get is:

Processing triplets for normalization (first 5 examples shown):

===== Example 1 =====
Original triple (chunk 1): {'subject': 'marie curie', 'predicate': 'born as', 'object': 'maria skłodowska', 'chunk': 1}
After normalization: Subject = 'marie curie', Predicate = 'born as', Object = 'maria skłodowska'
Status: Reserved (new unique triplet)

===== Example 2 =====
Original triple (chunk 1): {'subject': 'marie curie', 'predicate': 'born in', 'object': 'warsaw, poland', 'chunk': 1}
After normalization: Subject = 'marie curie', Predicate = 'born in', Object = 'warsaw, poland'
Status: Reserved (new unique triplet)

... Finished processing 45 triples. # <--- Total number of examples

As you can see from the above example, each triple is checked. If it is valid (not empty after cleaning) and we have not recorded this exact fact before, we keep it.

Let's summarize how many triples we had initially, how many we removed, and show the final cleaned-up list.

# ===== Normalization Summary =====
print ( f"\n===== Normalization and deduplication summary=====\n" )
print ( f"Number of original extracted triplets:  {original_count} \n" )
print ( f"Number of triplets removed because they contained empty/invalid parts:  {empty_removed_count} \n" )
print ( f"Number of duplicate triples removed:  {duplicates_removed_count} \n" )
final_count =  len (normalized_triples)
print ( f"The final number of unique, normalized triplets:  {final_count} \n" )
print ( "-"  *  25 )

# Use Pandas to display a sample of normalized triples
print ( "\n===== Final normalized triple=====\n" )
if  normalized_triples:
    normalized_df = pd.DataFrame(normalized_triples)
    display(normalized_df)
else :
    print ( "No valid triples remain after normalization." )
print ( "-"  *  25 )


#### Output####
===== Summary of Normalization and Deduplication=====
Number of originally extracted triplets:  45
Number of triplets removed because they contained empty/invalid parts:  0
Number of duplicate triplets removed:  3 # <--- Example: Some duplicates found
Final number of unique, normalized triplets:  42 # <--- Final number of examples
-------------------------

===== Final normalized triplet=====
        subject predicate                      object  source_chunk
0    marie curie born  as            maria skłodowska             1
1    marie curie born  in              warsaw, poland             1
# [... show only unique, clean triplets...]
41   marie curie had daughter named eve             3
-------------------------

Now we have a clean, unique list of facts (triples) that we can use to build our graph.

Creating a Graph Using NetworkX

Now let’s assemble the knowledge graph! We will use networkx This Python library creates a directed graph (DiGraph). Our cleaned triples will be mapped to the graph structure in the following way:

• Each unique subject becomes a node.
• Each unique object becomes a node.
• Each triple (subject, predicate, object) becomes a directed edge from the subject node to the object node , and the predicate serves as the label of this edge.

This diagram shows 2-3 SPO triplets on the left (e.g., (Marie Curie, discovered, radium), (Marie Curie, received, Nobel Prize in Physics)). The corresponding graph structure elements are shown on the right: nodes for "Marie Curie", "Radium", and "Nobel Prize in Physics". An edge from "Marie Curie" to "Radium" labeled "discovered". An edge from "Marie Curie" to "Nobel Prize in Physics" labeled "received". The arrows clearly show the mapping of the parts of the triples to the graph elements.

First, let's create an empty graph object.

# Create an empty directed graph
knowledge_graph = nx.DiGraph()

print ( "An empty NetworkX DiGraph has been initialized." )
# Visualize the initial empty graph state
print ( "===== Initial image information=====\n" )
try :
    # Try using a newer version of the method
    print (nx.info(knowledge_graph))
except  AttributeError:
    # Alternative method for compatibility with different NetworkX versions
    print ( f"Type:  { type (knowledge_graph).__name__} " )
    print ( f"Number of nodes:  {knowledge_graph.number_of_nodes()} " )
    print ( f"Number of edges:  {knowledge_graph.number_of_edges()} " )
print ( "-"  *  25 )

As expected, our graph is currently empty.

Now, we will go through normalized_triples list and add each triplet as an edge (and its corresponding node) to the graph.

We will periodically print updates to show the graph's growth.

print ( "Adding triples to NetworkX graph..." )

added_edges_count =  0
update_interval =  10 # The frequency of updating the printed graph information

if not  normalized_triples:
    print ( "Warning: No normalized triplets could be added to the graph." )
else :
    for  i, triple  in enumerate (normalized_triples):
        subject_node = triple[ 'subject' ]
        object_node = triple[ 'object' ]
        predicate_label = triple[ 'predicate' ]

        # When adding edges, nodes are automatically added, but explicit calls to add_node are also possible 
        # knowledge_graph.add_node(subject_node)
        # knowledge_graph.add_node(object_node)

        # Add a directed edge with a predicate as the 'label' attribute
        knowledge_graph.add_edge(subject_node, object_node, label=predicate_label)
        added_edges_count +=  1

        # ===== Visualizing Growth =====
        if  (i +  1 ) % update_interval ==  0 or  (i +  1 ) ==  len (normalized_triples):
            print ( f"\n=====   Graph information after adding the {i+ 1 } th triple ===== ( {subject_node}  ->  {object_node} )" )
            try :
                # Try using a newer version of the method
                print (nx.info(knowledge_graph))
            except  AttributeError:
                # Alternative method for compatibility with different NetworkX versions
                print ( f"Type:  { type (knowledge_graph).__name__} " )
                print ( f"Number of nodes:  {knowledge_graph.number_of_nodes()} " )
                print ( f"Number of edges:  {knowledge_graph.number_of_edges()} " )
            # For very large graphs, printing information too frequently may be slow, so adjust update_interval.

print ( f"\nComplete adding triples. A total  of {added_edges_count}  edges have been processed." )


#### Output####
Adding triples to NetworkX graph... ")

===== Add the graph information after the 10th triple ===== (marie curie -> only woman to win nobel prize twice)
Type: DiGraph
Number of nodes: 11
Number of sides: 10

===== Added graph information after the 20th triple ===== (pierre curie -> was professor of physics)
Type: DiGraph
Number of nodes: 24
Number of sides: 20

We iterate over the cleaned list of triplets and add each one as an edge to networkx In the picture.

The output shows that the number of nodes and edges in the graph is steadily increasing.

Let’s take a look at the final statistics and look at a sample of some of the nodes and edges in the graph we built.

# ===== Final Graph Statistics =====
num_nodes = knowledge_graph.number_of_nodes()
num_edges = knowledge_graph.number_of_edges()

print ( f"\n===== Final NetworkX graph summary=====\n" )
print ( f"Total number of unique nodes (entities):  {num_nodes} " )
print ( f"Total number of unique edges (relations):  {num_edges} " )

if  num_edges != added_edges_count  and isinstance (knowledge_graph, nx.DiGraph):
     print ( f"Note: {added_edges_count}  edges  were added  , but there are only {num_edges}  edges in the graph. DiGraph will overwrite edges with the same source and destination nodes. If you need to keep multiple edges with the same direction, use MultiDiGraph." )

if  num_nodes >  0 :
    try :
       density = nx.density(knowledge_graph)  # Graph density: measures the density of the graph
       print ( f "Density:  {density: .4 f} " )
       if  nx.is_weakly_connected(knowledge_graph):  # Weakly connected: Ignoring the direction of the edge, are all nodes in the graph reachable from each other?
           print ( "The graph is weakly connected (ignoring direction, all nodes are reachable)" )
       else :
           num_components = nx.number_weakly_connected_components(knowledge_graph)  # Number of weakly connected components
           print ( f"The graph contains  {num_components}  weakly connected components." )
    except  Exception  as  e:
        print ( f"Unable to calculate some graph indicators:  {e} " )  # Handle errors that may occur with empty graphs or small graphs
else :
    print ( "The graph is empty, unable to calculate metrics." )
print ( "-"  *  25 )

# ===== Node Sample =====
print ( "\n===== Node sample (first 10) =====\n" )
if  num_nodes >  0 :
    nodes_sample =  list (knowledge_graph.nodes())[: 10 ]
    display(pd.DataFrame(nodes_sample, columns=[ 'Node sample' ]))
else :
    print ( "There are no nodes in the graph." )

# ===== Edge Samples =====
print ( "\n===== Edge samples (first 10, with labels) =====\n" )
if  num_edges >  0 :
    edges_sample = []
    # Extract the first 10 edges and their data (including labels)
    for  u, v, data  in list (knowledge_graph.edges(data= True ))[: 10 ]:
        edges_sample.append({ 'Source node' : u,  'Target node' : v,  'Label' : data.get( 'label' ,  'N/A' )})
    display(pd.DataFrame(edges_sample))
else :
    print ( "There are no edges in the graph." )
print ( "-"  *  25 )

We have used networkx The graph structure was successfully built in memory. You can see the total number of unique entities (nodes) and relationships (edges) in the graph and get a rough idea of what they look like.

Using ipycytoscape to implement interactive graphs

Now, for the cool part - visualizing our graph! We will use ipycytoscape Create an interactive visualization directly in this notebook.

First, a quick check to see if we actually have a graph that we can visualize.

print ( "Preparing interactive visualization..." )

# ===== Check if graph is valid for visualization =====
can_visualize =  False
if 'knowledge_graph' not in locals ()  or not isinstance (knowledge_graph, nx.Graph):
    print ( "Error: 'knowledge_graph' not found or is not a NetworkX graph object." )
elif  knowledge_graph.number_of_nodes() ==  0 :
    print ( "NetworkX graph is empty and cannot be visualized." )
else :
    print ( f"The graph appears valid and can be visualized ( {knowledge_graph.number_of_nodes()}  nodes,  {knowledge_graph.number_of_edges()}  edges)." )
    can_visualize =  True

#### Output####
Preparing interactive visualizations... ")
The graph seems valid and can be visualized (35 nodes, 42 edges).

ipycytoscape Requires graph data in a specific format (a list of dictionaries containing node information, and another list of dictionaries containing edge information, similar to JSON).

Let's convert now networkx At the same time, we will also calculate the degree of the node (that is, how many connections each node has) so that we can adjust the node size based on the degree later.

cytoscape_nodes = []
cytoscape_edges = []

if  can_visualize:
    print ( "Converting nodes..." )
    # Calculate node degree for adjusting node size
    node_degrees =  dict (knowledge_graph.degree())
    max_degree =  max (node_degrees.values())  if  node_degrees  else 1 # Find the maximum degree to avoid division by zero

    for  node_id  in  knowledge_graph.nodes():
        degree = node_degrees.get(node_id,  0 )
        # Simple node size scaling logic (adjustable as needed)
        node_size =  15  + (degree / max_degree) *  50 if  max_degree >  0 else 15

        cytoscape_nodes.append({
            'data' : {
                'id' :  str (node_id),  # ID must be a string
                'label' :  str (node_id).replace( ' ' ,  '\n' ),  # Displayed label (replace spaces with newlines)
                'degree' : degree,  # store degree information
                'size' : node_size,  # Stores the calculated node size for style settings
                'tooltip_text' :  f"Entity:  { str (node_id)} \nDegree:  {degree} " # Tooltip information displayed when the mouse hovers
            }
        })
    print ( f" { len (cytoscape_nodes)}  nodes converted  ." )

    print ( "Switching sides..." )
    edge_count =  0
    for  u, v, data  in  knowledge_graph.edges(data= True ):
        edge_id =  f"edge_ {edge_count} " # unique edge ID
        predicate_label = data.get( 'label' ,  '' )  # Get the edge label (predicate)
        cytoscape_edges.append({
            'data' : {
                'id' : edge_id,
                'source' :  str (u),  # source node ID (must be a string)
                'target' :  str (v),  # target node ID (must be a string)
                'label' : predicate_label,  # label displayed on the edge
                'tooltip_text' :  f"Relationship:  {predicate_label} " # Tooltip information displayed when the mouse hovers
            }
        })
        edge_count +=  1
    print ( f" { len (cytoscape_edges)}  edges converted  ." )

    # Combine into the final data structure
    cytoscape_graph_data = { 'nodes' : cytoscape_nodes,  'edges' : cytoscape_edges}

    # Visualize the structure after transformation (first few nodes/edges)
    print ( "\n===== Cytoscape node data samples (first 2) =====\n" )
    # Use json.dumps to beautify the print
    import  json
    print (json.dumps(cytoscape_graph_data[ 'nodes' ][: 2 ], indent= 2 , ensure_ascii= False ))
    print ( "\n===== Cytoscape edge data samples (first 2) =====\n" )
    print (json.dumps(cytoscape_graph_data[ 'edges' ][: 2 ], indent= 2 , ensure_ascii= False ))
    print ( "-"  *  25 )
else :
     print ( "Skipping data conversion because graph is invalid." )
     cytoscape_graph_data = { 'nodes' : [],  'edges' : []}  # Make sure the variable exists

Data conversion is complete. We have organized the node and edge data into ipycytoscape The required format, which contains the calculated node size and useful hover tooltip information.

Next, we create the actual visualization widget object and load our graph data into it.

We can define the appearance of nodes and edges through a CSS-like style syntax. Let's define a nice, colorful, and interactive style. We'll resize nodes based on their degree, change color on hover or selection, and add labels to the edges.

Now let's render the widget. You should be able to see the interactive knowledge graph in the output area below the cell.

OK, the key information about this knowledge graph image is as follows:

• Central Entity: “marie curie” is the main node located in the center.
• Relationships: The orange arrows (edges) show the connection from “marie curie”.
• Predicate label: The text on the edge (e.g., “discovered,” “won,” “was”) defines the type of relationship.
• Connected entities: The nodes at the ends of arrows are objects or related entities (e.g., “radium”, “polonium”, “physicist”).
• SPO triples: Each arrow represents a (subject, predicate, object) fact extracted by LLM.
• Visual Summary: This diagram provides a quick, structured overview of facts about Marie Curie.
• Graph structure: This is a radial structure with Marie Curie at the center.

Future directions for exploration

This process is a great starting point, but there is always room for improvement and further exploration:

• More robust error handling: Enhance the stability of LLM calls by introducing a retry mechanism or improving the handling of consecutive failed data blocks.
• More advanced normalization techniques: Beyond simple string matching. One can implement entity linking (associating "Marie Curie" and "M. Curie" to the same real-world entity ID) or relationship clustering (grouping similar predicates, such as "was born in" and "born at", together).
• Explore Hint Engineering in depth: Experiment! Try different LLM models, tweak hint instructions, change the temperature parameter (which controls the randomness of the model output), and see how the results change.
• Establish an evaluation system: What is the quality of the extracted triples? You can implement evaluation methods to measure precision and recall.
• Create richer visualizations: Use different colors or shapes for different types of nodes (people, places, concepts). Add more interactive features or graph analysis results directly in the visualization.
• Use graph analysis to dig deeper information : networkx It can be used to find the most important nodes (through centrality analysis, which measures the importance of nodes in the network), discover paths between entities, or identify community structures in the graph.
• Data persistence storage: Save the graph! For large projects and more complex queries, the extracted triples or graph structures can be stored in a specialized graph database, such as Neo4j.

I hope this article can provide you with valuable reference.