Agent Reasoning: Using Knowledge Graph Reasoning Tools for Deep Research—University of Oxford

Recent research from the University of Oxford explores how intelligent agents can use knowledge graphs for deep reasoning, improving their ability to solve complex problems.
Core content:
1. The innovation and working principle of the agentic reasoning framework
2. The role of knowledge graph agents in reasoning and their effects
3. The performance advantages of Agentic Reasoning compared with existing models
Summary
In this technical report, we introduce Agentic Reasoning, a framework that uses agents to enhance the reasoning capabilities of large language models (LLMs) by integrating external tools. Unlike traditional LLM-based reasoning methods that rely only on internal reasoning, Agentic Reasoning dynamically combines web search, code execution, and structured reasoning-context memory to solve complex problems that require deep research and multi-step logical deduction. Our framework introduces a Mind Map agent, which builds a structured knowledge graph to track logical relationships, thereby improving deductive reasoning. In addition, the integration of web search and coding agents enables real-time retrieval and computational analysis, enhancing reasoning accuracy and decision-making.
Evaluations on PhD-level scientific reasoning (GPQA) and domain-specific deep research tasks show that our approach significantly outperforms existing models, including leading retrieval-augmented generation (RAG) systems and closed-source LLMs. Furthermore, our results show that agentic reasoning improves expert-level knowledge synthesis, test-time scalability, and structured problem solving. The code is available at: https://github.com/theworldofagents/Agentic-Reasoning.
https://arxiv.org/abs/2502.04644
Core Overview
Background
Research problem: how to enhance the reasoning ability of large language models (LLMs) so that they can handle complex research questions and multi-step logical reasoning.
Research difficulties: existing methods perform well in structured domains but poorly on unstructured or subjective tasks; traditional methods offer little explanation of the reasoning process; and it remains unclear how to reason and synthesize knowledge effectively under uncertainty.
Related work: related efforts include models such as OpenAI's o1, Qwen-QwQ, and DeepSeek-R1, which acquire strong step-by-step reasoning capabilities through large-scale reinforcement learning but lack transparency in the reasoning process and robust multi-step reasoning.
Research Methods
This paper proposes the Agentic Reasoning framework to address the insufficient reasoning capabilities of LLMs. Specifically:
Overall framework: Agentic Reasoning enhances the reasoning capabilities of LLMs by using agents to integrate external tools. The framework dynamically combines web search, code execution, and structured reasoning-context memory to solve complex problems that require deep research and multi-step logical reasoning.
Mind Map agent: builds a structured knowledge graph to track logical relationships and improve deductive reasoning. It converts the raw reasoning chain into a structured knowledge graph and generates concise topic summaries using community clustering and an LLM.
Web search agent: retrieves relevant information from the Internet to supplement the model's knowledge. It extracts the web content most relevant to the current reasoning context and summarizes it with an LLM.
Coding agent: delegates coding tasks to a dedicated code LLM, which generates code, executes it, and returns the result. The coding agent formats each coding request to ensure seamless integration with the main reasoning model.
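As a rough illustration, the dispatch between the main reasoning model and its tool agents can be sketched as follows. This is a minimal sketch under stated assumptions: the tool-call markers (`<SEARCH>`, `<CODE>`, `<MINDMAP>`), the stub agent functions, and the loop structure are all illustrative inventions, not the paper's actual implementation.

```python
# Hypothetical sketch of the Agentic Reasoning control loop: the main LLM emits
# special tool-call markers mid-reasoning; the loop dispatches each call to the
# matching agent and splices the result back into the reasoning context.
# All names (markers, agent functions) are illustrative, not the paper's API.

def web_search_agent(query: str) -> str:
    # Would retrieve pages and summarize the most relevant content with an LLM.
    return f"[search summary for: {query}]"

def coding_agent(request: str) -> str:
    # Would delegate to a code-specialized LLM, execute the code, return output.
    return f"[execution result for: {request}]"

def mind_map_agent(question: str) -> str:
    # Would query the structured knowledge graph built from the reasoning chain.
    return f"[graph answer for: {question}]"

AGENTS = {"SEARCH": web_search_agent, "CODE": coding_agent, "MINDMAP": mind_map_agent}

def run(llm_step, problem: str, max_steps: int = 8) -> str:
    context = problem
    for _ in range(max_steps):
        step = llm_step(context)          # one chunk of reasoning from the main LLM
        if step.startswith("FINAL:"):
            return step[len("FINAL:"):].strip()
        for marker, agent in AGENTS.items():
            tag = f"<{marker}>"
            if tag in step:               # e.g. "<SEARCH> optimal PEEP value"
                query = step.split(tag, 1)[1].strip()
                step += "\n" + agent(query)
        context += "\n" + step            # splice tool output into the context
    return context
```

The key design point the paper describes is that tool results are interleaved into the ongoing chain of thought rather than appended at the end, so later reasoning steps can condition on them.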
Experimental design
Dataset: evaluation is performed on the GPQA dataset, which contains PhD-level scientific questions in physics, chemistry, and biology. Experiments use the high-quality Diamond set (198 questions) and the broader Extended set (546 questions).
Experimental setup: we compared different methods on the GPQA dataset, including direct reasoning models, retrieval-augmented reasoning models, and agentic reasoning models. We also evaluated deep research tasks, inviting PhD experts in finance, medicine, and law to formulate professional research questions.
Parameter configuration: the experiments compared different LLMs and tool agents, with configurations including models such as Qwen2.5-32B, QwQ-32B, and Llama3.3-70B.
Results and Analysis
Performance on the GPQA dataset: Agentic Reasoning achieves 88.1%, 58.3%, and 79.6% accuracy in physics, chemistry, and biology, respectively, significantly outperforming existing retrieval-augmented generation (RAG) models and closed-source LLMs.
Comparison with human experts : On the GPQA extended set, Agentic Reasoning outperforms human experts in all subjects, with 75.2% in physics, 53.1% in chemistry, and 72.8% in biology.
Deep Research Tasks : Agentic Reasoning performs well in deep research tasks in the fields of finance, medicine, and law, with higher accuracy than Gemini Deep Research Service.
Test-time scalability : Increasing the number of tool invocations can improve performance on the same problem, but too many tool invocations may indicate that the problem itself is challenging or ambiguous.
The role of the Mind Map: the Mind Map is particularly effective at clarifying complex logical relationships and strengthening deductive reasoning, helping the model solve problems that traditional LLMs often get wrong.
Overall conclusion
This paper proposes the Agentic Reasoning framework to enhance the reasoning ability of LLMs by integrating external tool agents such as the Mind Map, web search, and coding agents. Experimental results show that Agentic Reasoning performs well in complex problem solving and deep research, significantly outperforming existing models. The framework improves logical coherence, factual accuracy, and deep research capabilities, laying a foundation for applying AI systems to expert-level problem solving. Future work will explore extending the framework to multimodal data and real-time adaptability, further enhancing AI's ability to cope with complex real-world challenges.
Paper Evaluation
Advantages and innovations
Introducing external tool-use agents: Agentic Reasoning enhances the reasoning capabilities of large language models (LLMs) by integrating external tool-use agents such as web search and code execution.
Structured knowledge graph : The Mind Map agent builds a structured knowledge graph to track logical relationships and improve the ability of deductive reasoning.
Real-time retrieval and computational analysis : The integration of web search and code agents enables real-time retrieval and computational analysis, enhancing the accuracy of reasoning and decision-making capabilities.
Multi-step reasoning : The framework allows LLMs to plan and execute multi-step strategies, autonomously identify and retrieve necessary data, dynamically adapt to real-time information, and perform quantitative analysis to generate precise results.
Extensive evaluation: we evaluate our approach on PhD-level scientific reasoning (GPQA) and domain-specific deep research tasks, and show that it significantly outperforms existing models, including leading retrieval-augmented generation (RAG) systems and closed-source LLMs.
Improving expert-level knowledge synthesis: results show that agentic reasoning improves expert-level knowledge synthesis, test-time scalability, and structured problem solving.
Shortcomings and reflections
Challenges of tool selection : Research has found that too many tool choices can reduce performance and increase the risk of selecting inappropriate tools. In addition, inaccuracies in the output of external tools can also negatively impact the overall response quality.
Non-textual modality handling : While additional tools are not significantly beneficial for language-based reasoning, tools for handling non-textual modalities such as financial data, medical images, and genetic data are essential. Developing specialized tools for different data modalities can further enhance LLM reasoning capabilities.
Scalability of test-time reasoning: although reasoning chains with more tool invocations tend to produce better results on a given problem, problems that require excessive tool usage often indicate inherent ambiguity or inaccuracy in the initial reasoning. How to optimize tool usage during reasoning therefore requires further research.
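One practical way to act on this observation is to track tool invocations per run and surface heavy usage as a signal that the problem may be ambiguous. The sketch below is an illustrative assumption, not a mechanism from the paper; the class name and threshold are hypothetical.

```python
# Illustrative sketch (not from the paper): a budget guard that counts tool
# invocations during one reasoning run and flags runs whose usage exceeds a
# soft limit as likely ambiguous or ill-posed, rather than silently continuing.
class ToolBudget:
    def __init__(self, soft_limit: int = 10):
        self.soft_limit = soft_limit
        self.calls = []                      # tool names, in invocation order

    def record(self, tool: str) -> None:
        self.calls.append(tool)

    def flagged(self) -> bool:
        # Heavy tool usage correlates with ambiguous problems in the paper's
        # analysis, so surface it for human or model review.
        return len(self.calls) > self.soft_limit
```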
Key questions and answers
Question 1: How does the Mind Map agent in the Agentic Reasoning framework work? What are its main functions?
The Mind Map agent is responsible for building and managing the real-time reasoning context of the reasoning model in the Agentic Reasoning framework. Specifically, the work of the Mind Map agent includes the following aspects:
Structured knowledge graph construction: the Mind Map agent converts the raw reasoning chain into a structured knowledge graph, using a graph-construction LLM to extract entities from the reasoning chain and identify the semantic relationships between them.
Topic Summary Generation : By applying a community clustering algorithm to the knowledge graph, the Mind Map agent clusters the reasoning contexts into different groups and generates concise topic summaries for each group using LLM.
Knowledge graph query: the Mind Map agent allows the knowledge graph to be queried with specific questions, such as "Who was Jason's mother's great-grandfather?" It uses standard retrieval-augmented generation (RAG) techniques to retrieve relevant information from the knowledge graph and return the results.
Contextual support : The Mind Map agent provides contextual reasoning support to external tools, enabling them to generate more context-aware responses. In addition, when the reasoning model is uncertain about its claims or loses the thread during reasoning, it can query the Mind Map for relevant information and continue reasoning based on the retrieved answers.
These capabilities make Mind Map agents particularly effective in clarifying complex logical relationships and enhancing deductive reasoning, helping models solve problems where traditional LLMs often make mistakes.
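The bookkeeping behind these capabilities can be sketched with a toy graph structure. Everything below is a simplified stand-in: the paper uses an LLM to extract triples and community clustering plus LLM summaries for topic groups, whereas this sketch hard-codes the triples and uses connected components in place of community detection; the entity names are invented for the example query.

```python
# Minimal stdlib sketch of the Mind Map agent's graph bookkeeping: store
# (entity, relation, entity) triples, cluster the graph into topic groups
# (connected components stand in for community clustering), and answer
# simple one-hop queries. The triples here are illustrative.
from collections import defaultdict

class MindMap:
    def __init__(self):
        self.edges = defaultdict(dict)       # entity -> {neighbor: relation}

    def add_triple(self, head: str, relation: str, tail: str) -> None:
        self.edges[head][tail] = relation
        self.edges[tail][head] = f"inverse:{relation}"

    def communities(self):
        # Connected components as a stand-in for community clustering; the
        # paper would then summarize each cluster with an LLM.
        seen, groups = set(), []
        for node in list(self.edges):
            if node in seen:
                continue
            stack, group = [node], set()
            while stack:
                n = stack.pop()
                if n in group:
                    continue
                group.add(n)
                stack.extend(self.edges[n])
            seen |= group
            groups.append(sorted(group))
        return groups

    def query(self, entity: str, relation: str):
        # One-hop lookup; multi-hop questions chain several lookups.
        for tail, rel in self.edges[entity].items():
            if rel == relation:
                return tail
        return None

mm = MindMap()
mm.add_triple("Jason", "mother", "Mary")
mm.add_triple("Mary", "great-grandfather", "Henry")
# "Who was Jason's mother's great-grandfather?" as two chained one-hop queries:
answer = mm.query(mm.query("Jason", "mother"), "great-grandfather")  # "Henry"
```

The multi-hop question from the text decomposes into chained one-hop lookups; in the real system the retrieval step over the graph is RAG-based rather than an exact relation match.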
Question 2: How does the Agentic Reasoning model perform on the GPQA dataset? What are its advantages over other models?
On the GPQA dataset, Agentic Reasoning significantly outperforms existing retrieval-augmented generation models and closed-source LLMs. The specific results are as follows:
Accuracy : Agentic Reasoning achieved 88.1%, 58.3% and 79.6% accuracy in physics, chemistry and biology, respectively. In comparison, other models such as Qwen2.5-32B, QwQ-32B and RAG-QwQ-32B achieved accuracies of 57.0%, 39.8% and 73.7%, respectively.
Comparison with human experts : On the GPQA extended set, Agentic Reasoning outperforms human experts in all subjects, 75.2% in physics, 53.1% in chemistry, and 72.8% in biology. This shows that Agentic Reasoning has a significant advantage in handling expert-level scientific reasoning tasks.
Case Study : The Agentic Reasoning model excels at handling complex medical decision-making problems. For example, the model can automatically execute code to calculate the optimal FiO2 for a patient, perform a web search to retrieve the most accurate PEEP value, and synthesize the results to determine the best treatment plan.
These advantages show that Agentic Reasoning significantly improves the accuracy and efficiency of reasoning by integrating external tool agents, especially when dealing with complex, expert-level problems.
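The coding-agent step in the case study (executing code to compute a clinical value) can be illustrated with a minimal delegation sketch. This is an assumption-laden toy: the `generate_code` stub stands in for a code-specialized LLM, the arithmetic is a placeholder rather than any clinical formula, and real deployments would sandbox execution.

```python
# Hypothetical sketch of coding-agent delegation: format the request for a
# code-specialized LLM (stubbed here), execute the generated code while
# capturing its printed output, and hand the result back to the reasoning
# model. The stub and the arithmetic are illustrative, not the paper's
# prompts or any clinical formula.
import contextlib
import io

def generate_code(request: str) -> str:
    # Stand-in for a dedicated code LLM; a real agent would prompt a model
    # with the formatted request and return its generated program.
    return "print(round(120 / 4, 1))"

def coding_agent(request: str) -> str:
    code = generate_code(request)
    buffer = io.StringIO()
    namespace: dict = {}
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)                # sandboxing omitted for brevity
    return buffer.getvalue().strip()

result = coding_agent("compute the ratio 120 / 4")  # "30.0"
```

Capturing stdout (rather than a return value) keeps the contract between the reasoning model and the coding agent a plain string, which is what gets spliced back into the reasoning context.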
Question 3: How does the Agentic Reasoning framework perform in deep research tasks? What improvements does it offer over existing deep research systems?
The Agentic Reasoning framework performs well in deep research tasks, as shown below:
Accuracy : In deep research tasks in the fields of finance, medicine, and law, Agentic Reasoning has higher accuracy than Gemini Deep Research Service. This shows that Agentic Reasoning has a significant advantage in generating high-quality research reports.
Task completion: Agentic Reasoning can automate many hours of manual investigation, significantly improving productivity in knowledge-intensive fields, and can handle complex research tasks more efficiently than existing deep research systems.
Structured reasoning : Agentic Reasoning builds a structured knowledge graph through the Mind Map agent, which enhances the logic and coherence of reasoning. This makes the model perform better when dealing with complex logical relationships and abstract concepts.
Test-time scalability: the Agentic Reasoning framework can improve performance on the same problem by increasing the number of tool calls, demonstrating its flexibility and scalability on complex problems.
These improvements demonstrate that Agentic Reasoning not only excels in generating high-quality research reports, but can also handle complex research tasks more effectively through structured reasoning and tool enhancements.