Optimizing Knowledge Graph and LLM Interfaces: Breaking Through the Performance Bottleneck of Complex Reasoning

Written by Caleb Hayes
Updated on: June 9, 2025

Explore optimization strategies that combine knowledge graphs with large language models to improve performance on complex reasoning tasks. Core content: 1. Applications and limitations of large language models in natural language processing; 2. Hybrid approaches that integrate retrieval-augmented generation (RAG) with knowledge graphs; 3. The role of hyperparameter optimization in graph-enhanced RAG systems and the application of the Cognee framework.

 

Introduction: The challenge of fusing knowledge graphs and large language models

Retrieval-augmented generation (RAG) has emerged as a standard way to address the knowledge limitations of large language models. In a typical RAG pipeline, a dense retriever selects text passages relevant to a given query, the retrieved content is appended to the query, and the result is processed by the LLM. This design improves the factual accuracy of the output and allows the model to refer to external information sources. However, standard RAG systems often struggle with questions that involve multi-step reasoning or require structured access to relational knowledge; relying solely on dense or sparse document retrieval is not enough.

To address these challenges, hybrid approaches that integrate knowledge graphs (KGs) into RAG workflows have received increasing attention. These systems (sometimes called GraphRAG) use graphs to represent relational structure and support retrieval based on symbolic queries or multi-hop graph traversal. Graph-based retrieval gives the LLM access to explicit, structured context and has shown strong potential on tasks that require deeper reasoning.

However, both traditional and graph-based RAG systems face the challenge of hyperparameter sensitivity. Their performance depends heavily on a range of configuration choices, including chunk size, retriever type, top-k thresholds, and prompt templates. As systems become more modular and complex, the number of tunable parameters grows, and so do the interactions between them. Although hyperparameter optimization has been explored for standard RAG systems, its role in graph-enhanced systems has not been studied in depth.

This paper aims to fill that gap by conducting a structured hyperparameter optimization study of graph-based RAG systems, focusing on tasks that combine unstructured input, knowledge graph construction, retrieval, and generation. Our experiments are based on the Cognee framework, an open-source modular system that supports end-to-end graph construction and retrieval. Cognee's modular design allows pipeline components to be cleanly separated and configured independently, making it well suited to controlled optimization studies.

Related work

Progress and challenges of RAG systems

Retrieval-augmented generation (RAG) systems extend language models with a retrieval module so that outputs can be grounded in external knowledge. This basic two-stage architecture has become the de facto standard, and many improvements have been proposed over time. Recent work includes Self-RAG, which enables LLMs to reflect on their own output and trigger retrieval dynamically, and CRAG, which uses a retrieval evaluator to filter low-confidence documents and escalate to web search when needed.

Multi-hop question answering

Multi-hop question answering extends standard QA by requiring reasoning over multiple documents. Early datasets such as HotPotQA created such questions over Wikipedia via crowdsourcing. 2WikiMultiHopQA improves on this by leveraging Wikidata relations to enforce structured, verifiable reasoning paths. MuSiQue takes a bottom-up approach, composing multi-step questions from single-hop primitives and filtering out spurious shortcuts, providing a stronger benchmark for compositional reasoning.

Knowledge graph question answering

Knowledge graph question answering (KGQA) systems answer questions through structured reasoning over graphs, increasingly integrating LLMs to bridge symbolic and neural reasoning. RoG prompts LLMs to generate abstract relation paths that are instantiated by graph traversal before the final answer is generated. Other work introduces trainable subgraph retrievers and decomposes logical reasoning chains over subgraphs, showing measurable improvements in interpretability and performance.

GraphRAG

GraphRAG generalizes RAG to arbitrary graph structures, extending its use beyond knowledge bases. Early systems such as Microsoft's summarization pipeline used LLMs to build knowledge graphs, partitioned them with community detection, and summarized each component. Other variants use GNN-based subgraph selection, graph traversal agents, or personalized PageRank over schema-free graphs. These systems cover a wide range of tasks but share a common structure: dynamic subgraph construction followed by prompt-based reasoning.

 

RAG pipeline optimization

Optimizing RAG systems requires balancing retrieval coverage, generation accuracy, and resource constraints. Recent work has applied Bayesian optimization under budget constraints, treated context as a tunable variable, and introduced full-pipeline tuning via reinforcement learning. Multi-objective frameworks have also emerged to trade off accuracy, latency, and safety. Although the methods vary, they all aim to expose and control the critical degrees of freedom in modern RAG pipelines.

Cognee: Automated knowledge graph construction framework

Cognee is an open-source framework for end-to-end knowledge graph (KG) construction, retrieval, and completion. It supports heterogeneous inputs (such as text, images, and audio) from which entities and relationships are extracted, optionally guided by an ontology schema. The extraction process runs in a containerized environment organized around tasks and pipelines, and each stage can be extended through configuration or code.

The default pipeline includes ingestion, chunking, LLM-based extraction, and indexing into graph, relational, and vector storage backends. After indexing, Cognee provides built-in components for retrieval and completion. A unified interface supports vector search, symbolic graph queries, and hybrid graph-text methods. Completion builds on the same infrastructure, supporting prompt-based LLM interaction and structured graph queries.
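To make the shape of this pipeline concrete, here is a minimal sketch assuming Cognee's high-level async Python API (add, cognify, search). Exact function names, signatures, and default search modes vary between releases, so the snippet is illustrative rather than a definitive usage guide.

```python
import asyncio

import cognee

async def main():
    # Ingestion: register a raw document with the pipeline.
    await cognee.add("Alan Turing was born in London and studied at King's College, Cambridge.")

    # Graph construction: chunking, LLM-based entity/relation extraction,
    # and indexing into the graph, relational, and vector storage backends.
    await cognee.cognify()

    # Retrieval + completion through the unified search interface
    # (the default search mode and its arguments are version-dependent).
    results = await cognee.search("Where was Alan Turing born?")
    for result in results:
        print(result)

if __name__ == "__main__":
    asyncio.run(main())
```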

Cognee also includes a configurable evaluation framework for benchmarking retrieval and completion workflows. The framework is built around multi-hop question answering and uses established benchmarks (HotPotQA, TwoWikiMultiHop) to provide a structured evaluation environment for graph-based systems. Evaluation proceeds in sequential stages: corpus construction first, followed by answer generation conditioned on the context returned by the retrieval and completion components. The answers are then compared against gold references and scored with multiple metrics. The final output is a performance report with confidence scores.

Cognee's modularity enables targeted hyperparameter adjustments during the ingestion, retrieval, and completion stages. The evaluation framework provides structured, quantitative feedback, so the entire system can be treated as an objective function. This setup allows standard hyperparameter optimization algorithms to be applied directly.

Hyperparameter optimization settings

Optimization framework

Cognee exposes multiple configurable components that affect retrieval and generation behavior, including parameters related to preprocessing, retriever selection, prompt design, and runtime settings. To systematically evaluate the impact of these design choices, we developed a hyperparameter optimization framework called Dreamify.

Dreamify treats the entire Cognee pipeline as a parameterized process, covering ingestion, chunking, LLM-based extraction, retrieval, and evaluation. A single configuration defines the behavior of all stages. Each trial corresponds to a complete pipeline run, starting with corpus construction and ending with evaluation on the benchmark dataset. The output is a scalar score based on one of several metrics (F1, exact match, or LLM-based correctness), computed as the average over all questions in the dataset and taking values between 0 and 1.

Optimization is performed using the tree-structured Parzen estimator (TPE). The algorithm is well suited to search spaces that combine categorical and ordered integer-valued parameters. At this scale, grid search is impractical, and random search did not perform well in early tests. Although TPE is sufficient for our experiments, other optimization strategies remain to be explored in future work.
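As a concrete illustration of this setup, the sketch below drives a stand-in pipeline function with a TPE sampler. Optuna is used here purely for illustration; the article does not say which TPE implementation Dreamify uses, and the run_cognee_pipeline helper and the three parameters shown are placeholders for the full configuration described in the next section.

```python
import optuna

def run_cognee_pipeline(config: dict) -> float:
    # Placeholder for one full trial: ingest the corpus, build the graph,
    # retrieve context, generate answers, and return the selected metric
    # (EM, F1, or LLM-based correctness) averaged over the training questions.
    return 0.0  # dummy score so the sketch runs end to end

def objective(trial: optuna.Trial) -> float:
    # Sample one complete pipeline configuration (subset of parameters shown).
    config = {
        "chunk_size": trial.suggest_int("chunk_size", 200, 2000),
        "search_type": trial.suggest_categorical(
            "search_type", ["cognee_completion", "cognee_graph_completion"]
        ),
        "top_k": trial.suggest_int("top_k", 1, 20),
    }
    return run_cognee_pipeline(config)

# TPE handles the mixed categorical/integer space; 50 trials per experiment.
study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50)
print("Best configuration:", study.best_params, "score:", study.best_value)
```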

Pipeline behavior is deterministic for a fixed configuration, although some components (such as LLM-generated graph construction) exhibit slight variation between runs. These differences do not substantially affect the overall evaluation score within a single configuration. Trials are independent and reproducible.

Tunable parameters

The optimization process considers six core parameters, which affect document processing, retrieval behavior, prompt selection, and graph construction. Each parameter affects how information is segmented, retrieved, or used during answer generation.

Chunk size (chunk_size)

This parameter controls the number of tokens per chunk when documents are segmented before graph extraction. In the Cognee pipeline, it affects the structure of the generated graph and the granularity of the context available at retrieval time. The range used in this study (200-2000 tokens) was selected based on preliminary tests to balance extraction accuracy, retrieval specificity, and processing time.

Search type (search_type)

This parameter determines how context is selected for answer generation. The cognee_completion strategy uses vector search to retrieve text chunks and passes them directly to the language model. The cognee_graph_completion strategy retrieves knowledge graph nodes and their associated triples using a combination of vector similarity and graph structure. Retrieved nodes are briefly described, and the surrounding triples are formatted into structured text. This structured, relational presentation of the retrieved context may support more efficient multi-hop reasoning.

Top-K context size (top_k)

This parameter sets the number of items retrieved per query. With cognee_completion it controls the number of text chunks; with cognee_graph_completion it controls the number of graph triples. The retrieved context is passed to the language model for answer generation. In our experiments, values range from 1 to 20.

QA prompt template (qa_system_prompt)

This parameter selects the instruction template used for answer generation. Templates vary in style and specificity, ranging from concise prompts to more detailed instructions that encourage justification or structured output. Prompt selection can affect both answer format and factual accuracy.

Prompt templates (qa_system_prompt, graph_prompt)

These parameters control the instruction templates used during answer generation and graph construction. For question answering, we evaluated three prompt variants, differing mainly in tone and verbosity. While the core instructions are consistent, more constrained and direct prompts tend to produce output that aligns more closely with the expected answer format. This has a significant impact on evaluation scores, especially exact match and F1, and to a lesser extent on the correctness score. For graph construction, three prompts were also tested, differing in how they guide the LLM to extract entities and relationships from text, either in a single step or through more structured, progressive instructions. This choice affects the granularity and consistency of the generated graph structures used during retrieval.

Task getter type (task_getter_type)

This parameter controls how content is preprocessed for question answering during evaluation. While the system can support arbitrary pipeline variants, we focus on two representative configurations: in the first, document summaries are generated during graph construction and made available to the retriever; in the second, summary generation is omitted.
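Putting the six parameters together, the search space explored in this study can be summarized roughly as follows. The numeric ranges come from the text above; the identifiers for prompt templates and task getters are hypothetical placeholders, since the article does not name them.

```python
# Approximate search space for the six tunable parameters.
# Ranges follow the article; prompt and task-getter names are placeholders.
SEARCH_SPACE = {
    "chunk_size": range(200, 2001),          # tokens per chunk
    "search_type": ["cognee_completion", "cognee_graph_completion"],
    "top_k": range(1, 21),                   # retrieved text chunks or graph triples
    "qa_system_prompt": ["concise", "detailed", "structured"],      # 3 QA prompt variants
    "graph_prompt": ["single_step", "progressive", "structured"],   # 3 graph-extraction variants
    "task_getter_type": ["with_summaries", "without_summaries"],
}
```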

Experimental settings

We conducted a series of nine hyperparameter optimization experiments to evaluate the impact of configuration choices on Cognee's end-to-end performance. Each experiment corresponds to a different combination of benchmark dataset and evaluation metric. The datasets used are HotPotQA, TwoWikiMultiHop, and MuSiQue. Each experiment targets one of three metrics: exact match (EM), F1, or DeepEval's LLM-based correctness.

For each experiment, we created a filtered subset of the benchmark. Instances were randomly sampled and then manually reviewed before the experiment. We excluded examples that were ungrammatical, ambiguous, mislabeled, or not supported by the provided context; similar issues have been noted in prior literature. The resulting evaluation sets consist of 24 training instances and 12 test instances per dataset. This filtering step was performed once, before any tuning, to avoid selection bias.

In each trial, all context paragraphs in the training set were used to construct the knowledge graph. This produces a single merged graph per trial, which is then used to answer all training questions. The pipeline structure is consistent across all datasets and metrics.

Each experiment consisted of 50 trials. In each trial, the optimizer sampled a configuration and performed a complete pipeline run, including ingestion, graph construction, retrieval, and answer generation. The selected metric was computed over all training questions, and the resulting score was used as the trial's objective value. EM and F1 are computed deterministically; the DeepEval correctness score requires a separate LLM-based evaluation step.
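For reference, EM and token-level F1 are typically computed in the SQuAD style sketched below. The exact normalization rules used in Cognee's evaluation framework are not spelled out in the article, so treat this as a representative sketch rather than its actual implementation.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # Common QA normalization: lowercase, strip punctuation and articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    # 1.0 only if the normalized strings are identical.
    return float(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    # Token-level overlap between prediction and gold answer.
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```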

Experiments were run sequentially, without parallelization, and each trial took approximately 30 minutes. Final results report test-set performance using the best-performing configuration selected during training. In addition to point estimates, we report confidence intervals computed with nonparametric bootstrap resampling over individual question-answer pairs.
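A minimal sketch of that bootstrap procedure is shown below; the number of resamples, the confidence level, and the toy scores are illustrative choices, as the article does not report them.

```python
import random

def bootstrap_ci(per_question_scores, n_resamples=10_000, alpha=0.05, seed=0):
    # Nonparametric bootstrap over individual question-answer pairs:
    # resample the per-question scores with replacement, record the mean of
    # each resample, and take the empirical (alpha/2, 1 - alpha/2) quantiles.
    rng = random.Random(seed)
    n = len(per_question_scores)
    means = sorted(
        sum(rng.choices(per_question_scores, k=n)) / n
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Example: toy per-question F1 scores for a 12-question held-out set.
scores = [0.0, 1.0, 0.8, 0.5, 1.0, 0.0, 0.67, 1.0, 0.4, 1.0, 0.0, 0.75]
print(bootstrap_ci(scores))
```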

Results and discussion

Training set performance

[Figure 1: Running-maximum performance curves for (a) MuSiQue, (b) TwoWikiMultiHop, and (c) HotPotQA.]

Optimization yields consistent improvements across all datasets and metrics. While the baseline settings are reasonable and manually selected, they were not tuned for the specific evaluation conditions. Relative improvements are often large, especially for exact match, where several baselines are close to or exactly zero. This is mainly due to a mismatch in answer style: the system's default configuration is tuned for more conversational output, while the benchmarks expect shorter, drier answers. Given the strictness of EM as a metric, even factually correct responses are often penalized.

Although clear improvements are shown, these results should be interpreted with caution.

Held-out set performance

To evaluate generalization, we evaluated the best configuration from each experiment on the held-out test set. Gains over the baseline remain visible, but are less pronounced than in training. Most metrics dropped modestly, and in one case (F1 on TwoWikiMultiHop) test performance slightly exceeded the training score. These results suggest that task-specific optimizations generalize reasonably well when applied to unseen examples from the same benchmark.

Some of the variability can be attributed to the small size and uneven quality of the held-out benchmark QA instances, a limitation noted in prior literature. We used a simple training setup without early stopping or regularization, which may also explain some of the observed degradation. Nevertheless, the fact that improvements persist in most cases shows that even a basic optimization process can produce generalizable gains. While this is not the main focus of this study, the results suggest that future work could explore stronger tuning mechanisms, especially on larger or domain-specific datasets.

Discussion

The optimization process uses the tree-structured Parzen estimator (TPE), selected for its ability to navigate discrete and mixed parameter spaces. TPE was effective at identifying improved configurations, although trial-level performance was sometimes unstable. More stable or more expressive optimization strategies may yield more consistent results, and exploring these alternatives remains a direction for future work.

The experiments also highlighted the limitations of standard evaluation metrics. Exact match and F1 often penalize answers that are semantically correct but worded differently from the reference. LLM-based correctness scores are more tolerant of lexical variation but introduce their own inconsistencies: several nearly verbatim answers received less than full marks, indicating that LLM scorers introduce noise, especially around format sensitivity and implicit assumptions.

High-performing configurations often share parameter settings, particularly chunk size and retrieval method. However, most effects are nonlinear and task-specific, and no single configuration performs best across all benchmarks. This highlights the importance of empirical tuning in retrieval-augmented pipelines and shows that cross-task generalization requires adaptation, not just reuse.

Although full generalization is beyond the scope of this study, the results support the claim that systematic tuning is both feasible and useful in practice. The observed gains, though small in some cases, show that configuration-level changes alone can affect downstream performance. Retrieval-augmented systems benefit from targeted, task-aware tuning, and the performance-overfitting trade-off can be managed without significantly changing the architecture or increasing complexity.

Conclusion

We demonstrate that systematic hyperparameter tuning in graph-based retrieval-augmented generation systems can lead to consistent performance improvements. Cognee's modular architecture allowed us to isolate and vary configuration parameters across graph construction, retrieval, and prompting. Applied to three multi-hop question answering benchmarks, this setup let us examine how tuning affects standard evaluation metrics. Although improvements were observed on every task, their magnitude varied, and the gains were often sensitive to both the metric and the dataset.

Looking ahead, there are several natural directions for further work. Technically, the optimization process can be extended with alternative search algorithms, wider parameter spaces, or multi-objective criteria. Our evaluation focuses on well-known QA datasets, but custom benchmarks and domain-specific tasks would help probe generalization. Leaderboards for graph-enhanced RAG systems or shared benchmark infrastructure could also support progress in this area.

While QA-based metrics provide a practical means of evaluating pipeline performance, they do not fully capture the complexity of graph-based systems. The variability in results across configurations suggests that gains are unlikely to come from generic tuning alone. Instead, our results point to the potential of task-specific optimization strategies, especially in settings where domain structure plays a central role. We expect that future work at the intersection of academic and applied contexts will find further opportunities for targeted tuning.

More broadly, we find it useful to view this process through a cognitive lens, the idea that intelligence is embedded in a physical system. We see the development of frameworks such as Cognee as part of a wider shift toward systems that reflect this paradigm, and their optimization plays an important role in it. The cognition of these systems arises not only from their design, but from how they are tuned, measured, and adapted over time.