Measuring the difficulty of problems that agents solve independently

This article explores a new perspective on the difficulty of problem solving and examines agents' ability to solve problems independently.
Core content:
1. The impact of the size of the necessary exploration space on problem-solving difficulty
2. Agent problem-solving strategies and their limitations
3. Ideas and methods for constructing quantitative indicators
This article attempts to answer one question: which problems are easier for an LLM/agent to solve, and which are harder? More than one factor affects the answer; this article introduces only one of the main dimensions: the size of the exploration space necessary to solve the problem.
This article does not give a quantitative indicator, because one that applies to all scenarios is hard to define. However, the ideas here do offer a starting point for constructing quantitative indicators in specific domains.
1. The exploration space needed to solve the problem
The concepts discussed in this article are somewhat abstract, so let's start with a concrete example:
Consider building a railway through an area full of mountains and canyons. Ideally we would go straight through, tunneling through mountains and bridging valleys, but engineering and technical limitations mean we cannot tunnel everywhere or bridge everywhere, and we also face budget constraints. We need not find the absolute best route, but we must at least stay within budget and avoid obviously poor solutions. This example is, of course, oversimplified from an expert's standpoint; in this section, read it from a layman's perspective.
What counts as an acceptable solution? A few criteria: (1) the route as a whole must be connected, with no gaps or stretches the railway cannot pass; (2) each segment must be constructible and structurally stable; (3) among the plausible low-cost candidates, the one with the best overall cost should be selected.
If we regard this as a program-generation task, the analogous requirements are: (1) each part is implementable and executable; (2) the inputs and outputs of the parts connect properly; (3) the program as a whole completes the target task; and (4) the program's complexity and execution cost should be close to optimal.
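To make the analogy executable, here is a minimal sketch that models the route search as a graph problem; the terrain graph, costs, and budget are all invented for illustration. Nodes are waypoints, edges are segments that are actually constructible (criterion 2), and an acceptable solution is a connected path (criterion 1) within budget, from which we pick the cheapest (criterion 3).

```python
# Hypothetical terrain graph: node -> list of (next_node, segment_cost).
# Only constructible segments appear as edges, encoding criterion (2).
GRAPH = {
    "A": [("B", 4), ("C", 2)],
    "B": [("D", 5)],
    "C": [("B", 1), ("D", 8)],
    "D": [],
}
BUDGET = 9  # stay within budget, per criterion (3)

def acceptable_routes(start, goal, budget):
    """Enumerate connected routes (criterion 1) whose total cost fits the budget."""
    stack = [(start, [start], 0)]
    while stack:
        node, path, cost = stack.pop()
        if node == goal:
            yield path, cost
            continue
        for nxt, c in GRAPH[node]:
            if nxt not in path and cost + c <= budget:
                stack.append((nxt, path + [nxt], cost + c))

# Among the acceptable routes, choose the cheapest overall, per criterion (3).
best = min(acceptable_routes("A", "D", BUDGET), key=lambda rc: rc[1])
print(best)  # (['A', 'C', 'B', 'D'], 8)
```

The set of paths this enumeration visits is exactly the "exploration space" discussed below; on a real terrain graph it grows combinatorially.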
Most current agents tackle new tasks in one of two ways: (1) progressively decompose the complex task into simpler subtasks and solve them one by one (XAgent, in 2023, followed this idea with a two-layer planning approach, though it did not handle the connections between subtasks well); (2) AutoGPT-style exploration: start from an initial state and keep exploring until a successful path is found.
In abstract terms, solving a problem means continuously exploring or decomposing the task within the solution space until one finds a solution that is feasible overall, achievable locally, and whose parts string together well. Better still if the cost is also low.
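As a minimal sketch of this loop (the task names and the `is_primitive`/`decompose`/`connects` helpers are invented stand-ins for what an LLM or planner would decide), the decomposition half looks roughly like this; a fuller version would also explore alternative decompositions when a branch fails:

```python
# Toy decompose-and-check loop. The three helpers below are stubs for
# decisions a real LLM/planner would make.
def is_primitive(task):
    return "/" not in task            # leaf tasks contain no sub-structure

def decompose(task):
    return task.split("/", 1)         # naive two-way split, illustration only

def connects(a, b):
    return True                       # stub: do a's outputs feed b's inputs?

def solve(task):
    """Return an ordered plan of primitive steps, or None if infeasible."""
    if is_primitive(task):
        return [task]                 # locally achievable: solve directly
    left, right = decompose(task)
    plan_l, plan_r = solve(left), solve(right)
    if plan_l is None or plan_r is None:
        return None                   # some part is not achievable
    if not connects(plan_l[-1], plan_r[0]):
        return None                   # parts do not string together; backtrack
    return plan_l + plan_r            # feasible overall

print(solve("fetch data/clean data/train model"))
# ['fetch data', 'clean data', 'train model']
```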
The minimum exploration space a problem requires, or the minimum exploration range under common solution strategies, thus becomes a measure of how difficult the problem is to solve.
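As one illustration of how such a measure might be operationalized in a concrete domain (the graphs, the budget, and the choice of BFS as the baseline searcher are all assumptions of this sketch, not a general proposal): count how many states a fixed baseline search must expand before reaching the first acceptable solution.

```python
# Toy difficulty proxy: states a baseline BFS expands before the first
# acceptable solution. A larger required exploration space -> a higher count.
from collections import deque

def expansions_until_solution(graph, start, goal, budget):
    queue = deque([(start, 0)])
    seen, expanded = {start}, 0
    while queue:
        node, cost = queue.popleft()
        expanded += 1
        if node == goal:
            return expanded           # difficulty proxy for this instance
        for nxt, c in graph.get(node, []):
            if nxt not in seen and cost + c <= budget:
                seen.add(nxt)
                queue.append((nxt, cost + c))
    return None                       # no acceptable solution at all

easy = {"A": [("B", 1)], "B": [("G", 1)], "G": []}
hard = {"A": [("B", 1), ("C", 1)], "B": [("D", 1)], "C": [("D", 1)],
        "D": [("G", 5)], "G": []}
print(expansions_until_solution(easy, "A", "G", 10))  # 3
print(expansions_until_solution(hard, "A", "G", 10))  # 5
```

The absolute numbers mean little; what matters is comparing instances within one domain under the same baseline searcher.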
2. Limitations of LLM/Agent
Whether for a human or an LLM/agent, the size of the exploration space that can be managed is limited. At present, relatively speaking, a human who works carefully and keeps external notes can handle a larger exploration space.
The space an LLM/agent can handle is smaller, due to several factors: (1) the long-context capability of the LLM (or of the LLM the agent relies on) is not sufficient; (2) the LLM is not accustomed to carrying out long chains of reasoning and then backtracking to try other routes; (3) unlike a human, the LLM cannot prune failed trial paths nearly losslessly to compress the context size of its working set; (4) the LLM is also worse at switching back and forth in complex scenarios where multiple lines of work advance in parallel.
Of these, (2)-(4) could in principle be handled by application code layered on top of the LLM, but so far the application layer has not produced an effective, general solution, and we often still have to wait for progress at the model layer.
Not only is the exploration space an LLM/agent can manage smaller than a human's, the agent also fails to account for its own weak exploration ability while exploring. When solving a problem, an LLM generally does not consider building a minimal MVP first, confirming that the core functionality works, and only then fleshing out the non-essential details. This comes up often when using AI coding tools on a greenfield project: the LLM produces a fairly complete implementation of every part right from the start, quickly hits the limit of the exploration space it can manage, and is overwhelmed by the unnecessary complexity it has piled up before it has explored a single complete path to success. It is as if the LLM were a climber with poor strength and technique who insists on imitating others up mountains beyond its ability.
Because today's LLM/agent handles complexity poorly and lacks awareness of how much complexity it can handle, AI coding still needs experienced people to design the exploration plan up front, ensuring that each subtask stays within the complexity the model can manage and that work proceeds in a correct, feasible direction without backtracking and trial-and-error exploration. This, too, reflects the current LLM/agent's inability to independently handle new, complex problems.
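To make "design exploration plans for it" concrete, here is a hypothetical sketch of budgeted planning: pre-split the ordered work so that no chunk exceeds a complexity budget the model is known to handle. The estimator and the budget value are stand-ins; a real planner would also verify that chunk boundaries have clean, checkable interfaces.

```python
# Hypothetical: pre-split tasks so each chunk fits the agent's budget.
def plan_within_budget(steps, estimate, budget):
    """Greedily pack ordered steps into chunks whose estimated
    complexity stays within `budget`."""
    chunks, current, load = [], [], 0
    for step in steps:
        cost = estimate(step)
        if cost > budget:
            raise ValueError(f"step {step!r} is itself too complex; split it")
        if load + cost > budget:      # close the chunk before overflowing
            chunks.append(current)
            current, load = [], 0
        current.append(step)
        load += cost
    if current:
        chunks.append(current)
    return chunks

steps = ["parse config", "build schema", "write migration",
         "wire API", "add tests"]
# Stand-in estimator: complexity ~ word count; a real one might use file
# counts, dependency fan-out, or the agent's past success rates.
print(plan_within_budget(steps, lambda s: len(s.split()), budget=4))
# [['parse config', 'build schema'], ['write migration', 'wire API'],
#  ['add tests']]
```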
3. How to optimize
The first direction is to keep improving LLMs' long-context capabilities, including: (1) reducing cost; (2) extracting the information needed at the current step more accurately from longer and more complex contexts, including contexts left over after backtracking from wrong paths; and (3) better memory schemes and multi-branch exploration schemes suited to agent scenarios.
Another obvious area for improvement is making the LLM aware of the scale of the exploration space it can currently manage, so that on complex problems it first avoids introducing unnecessary complexity, completes the task end to end, and only then refines the solution. At the model layer this likely requires RL to make the LLM familiar with its own capabilities, though the amount of RL training needed may be very large. At the application layer, other means may help, such as prompts that stop the LLM from introducing too much unnecessary complexity too early.
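As an illustration of that prompt-level mitigation, a system-prompt fragment along these lines might be used; the wording here is invented, not a tested recipe:

```python
# Illustrative system-prompt fragment for an MVP-first coding agent.
MVP_FIRST_PROMPT = """\
Before writing any code, list the minimal set of components needed for
an end-to-end working version (the MVP). Implement only that path first:
- no optional features, caching, or configuration until the MVP runs
- stub every component you are not currently implementing
- after the MVP passes its smoke test, propose refinements one at a time
"""
# An agent framework would prepend this to the task description, e.g.:
# messages = [{"role": "system", "content": MVP_FIRST_PROMPT},
#             {"role": "user", "content": task}]
```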
Related Papers
Below are some papers related to the main topic of this article, for reference only.
A Survey on Large Language Models for Automated Planning
https://arxiv.org/abs/2502.12435
Self-Guiding Exploration for Combinatorial Problems
https://arxiv.org/abs/2405.17950
Enhancing LLM Reasoning with Reward-guided Tree Search
https://arxiv.org/abs/2411.11694