STEVE: Using AI to train AI, for a smarter computer-use assistant that operates the GUI for you

Written by

Audrey Miles

Updated on: July 9, 2025

Recently, I read a very interesting paper called "STEVE: A Step Verification Pipeline for Computer-use Agent Training".

Over the past few months, most of us have seen demos of AI controlling a computer to do our work for us, from Zhipu, Manus, and others. Most of these systems act by calling APIs, though. When no API is available, can the AI understand and operate the graphical interface directly?

This paper is about making AI agents better at operating graphical user interfaces (GUIs) the way humans do: clicking, typing, and dragging. The field has progressed quickly, but training such agents is still hard: the data is expensive, the actions are complex, and evaluation is difficult.

In practice, many people do not know how to use APIs, and sometimes no web API exists at all. In those cases, GUI operation becomes the other route to automation.

The contribution of this paper is a new training method, STEVE, which effectively improves the capabilities of AI agents.

Why is it so difficult for an agent to operate a GUI?

Put simply, letting AI operate a GUI means letting it look at the screen and then click, type, and scroll like a human. This sounds simple, but there are several major difficulties:

1. Difficulty in identifying UI elements: Modern software interfaces vary enormously. Traditional OCR or detection methods struggle to understand components such as buttons and menus accurately, especially when switching between applications, where non-standard UI designs may appear.
2. Multi-step task planning: Some operations, such as "moving file A to folder B", take several steps, so the AI needs long-horizon planning. This is not a single "click"; it involves several stages such as path planning and interface recognition.
3. Expensive training data: Earlier methods relied on "behavior cloning", in which the AI imitates human operation trajectories. This data is very expensive, and it is far from perfect: human trajectories may contain errors or useless steps, which makes what the AI learns unstable.
4. Environmental complexity: Different operating systems, software interfaces, display resolutions, and so on all affect the AI's performance. An AI trained on Windows 10 may fail on Windows 11.

What new things did they come up with?

The core idea of STEVE is to use GPT-4o as a "referee" to evaluate the operations of the AI agent step by step.

The whole process is as follows:

1. Data collection: They first let some suboptimal AI agents collect operational data covering various tasks (middle left part of the figure).

STEVE requires a large amount of high-quality GUI interaction data. The richness and accuracy of the data directly affect the ability of the final agent. The data collection stage includes the following types of data:

First, all UI components and their bounding boxes are extracted from the DOM elements of web pages. An OCR (optical character recognition) model then verifies the text content of these UI elements, so that the data can be cleaned and filtered for accuracy.
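The DOM-plus-OCR cleaning step described above can be sketched roughly as follows. This is only an illustration of the idea, not the paper's pipeline: the `UIElement` type, the `filter_elements` function, the 0.8 similarity threshold, and the dictionary standing in for a real OCR model are all my own assumptions.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class UIElement:
    tag: str    # DOM tag, e.g. "button"
    text: str   # text content taken from the DOM
    bbox: BBox  # bounding box taken from the page layout


def text_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def filter_elements(
    elements: List[UIElement],
    ocr_fn: Callable[[BBox], str],  # hypothetical: returns the text OCR reads in a box
    min_similarity: float = 0.8,    # assumed threshold, not from the paper
) -> List[UIElement]:
    """Keep elements whose DOM text agrees with what OCR sees on screen."""
    kept = []
    for el in elements:
        left, top, right, bottom = el.bbox
        if right <= left or bottom <= top:
            continue  # degenerate box: the element is not actually visible
        if not el.text.strip():
            continue  # nothing to verify against OCR
        if text_similarity(el.text, ocr_fn(el.bbox)) >= min_similarity:
            kept.append(el)
    return kept


# Toy data: a dictionary plays the role of the OCR model.
fake_ocr = {
    (0, 0, 80, 30): "Submit",   # matches the DOM text
    (0, 40, 80, 70): "Advert",  # an overlay hides the real label
}
elements = [
    UIElement("button", "Submit", (0, 0, 80, 30)),
    UIElement("button", "Login", (0, 40, 80, 70)),
    UIElement("a", "Hidden link", (10, 10, 10, 10)),
]
kept = filter_elements(elements, lambda box: fake_ocr.get(box, ""))
```

In this toy run only the "Submit" button survives: the "Login" entry is dropped because the screen shows different text, and the hidden link is dropped because its box has zero area.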

In addition to web page data, the authors also use existing UI parsing tools in a Windows virtual machine environment to capture screenshots of desktop application interfaces and collect accessibility (A11y) data for UI elements such as folders, txt files, Word, Excel, and so on. Some screenshots are also annotated, so that the model can correctly identify the various interactive components on a computer.

Understanding interface elements is not enough. STEVE also needs AI agents to learn how to operate a computer. The research team therefore had some weaker "suboptimal agents" perform various tasks on the Windows desktop and recorded their operation trajectories. These tasks include:

• File operations (rename, move, delete)
• Browser tasks (searching, clicking, scrolling)
• Application operations (opening software, adjusting settings)

These suboptimal agents are not perfect and often make mistakes when performing tasks. This is exactly what STEVE needs, as its goal is to teach AI agents how to distinguish between correct and incorrect actions.

Finally, the following types of data were collected:

2. Step verification: GPT-4o checks the correctness of each step and marks it with a "right/wrong" label (the right part of the figure).

The specific process is as follows:

• The AI agent performs an action (e.g. clicks a button).
• Screenshots taken before and after the action are recorded and passed to GPT-4o.
• GPT-4o evaluates whether this step is correct and gives a binary label of “correct” or “wrong”.

In this way, each trajectory is broken down into a series of labeled operation steps, which allows the AI agent to learn the task execution process more accurately rather than relying solely on the final task completion.
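The verification loop above can be sketched as follows. I am assuming a generic `judge` callable in place of the real GPT-4o call, whose prompt and API usage this article does not cover; `Step`, `LabeledStep`, `verify_trajectory`, and the toy judge are hypothetical names for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    action: str               # e.g. "click 'New folder'"
    screenshot_before: bytes  # raw image bytes before the action
    screenshot_after: bytes   # raw image bytes after the action


@dataclass
class LabeledStep:
    step: Step
    is_correct: bool          # the binary label assigned by the judge


# A judge takes the task description and one step and returns True/False.
# In STEVE this role is played by GPT-4o; here it is just a callable.
Judge = Callable[[str, Step], bool]


def verify_trajectory(task: str, steps: List[Step], judge: Judge) -> List[LabeledStep]:
    """Label every step of one trajectory independently."""
    return [LabeledStep(step, bool(judge(task, step))) for step in steps]


def split_by_label(labeled: List[LabeledStep]) -> Tuple[List[Step], List[Step]]:
    """Separate steps into positive and negative training examples."""
    positives = [item.step for item in labeled if item.is_correct]
    negatives = [item.step for item in labeled if not item.is_correct]
    return positives, negatives


def toy_judge(task: str, step: Step) -> bool:
    """Stand-in judge: a step counts as 'correct' if the screen changed.
    A real judge would reason about the task and both screenshots."""
    return step.screenshot_before != step.screenshot_after


steps = [
    Step("click 'New folder'", b"desktop", b"desktop+folder"),
    Step("click empty area", b"desktop+folder", b"desktop+folder"),
]
labeled = verify_trajectory("Create a folder on the desktop", steps, toy_judge)
positives, negatives = split_by_label(labeled)
```

The point of the design is that both lists are kept: the wrong steps are not thrown away, because the training stage can use them as negative examples.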

3. KTO training: With Kahneman-Tversky Optimization (KTO), the AI agent can learn from both "right" and "wrong" feedback instead of only from correct examples.

Compared with traditional reinforcement learning (RL) methods, STEVE does not require a hand-designed, complex reward function; it judges the quality of each operation directly through GPT-4o, which greatly reduces the difficulty of training. At the same time, GPT-4o's visual ability helps AI agents identify UI elements more accurately, and the method can use the error data directly, over multiple training rounds, to improve the agent's operation accuracy.
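To make the KTO step more concrete, here is a minimal per-example sketch of the loss as I understand it from the published KTO formulation. The hyperparameter values (`beta`, `lambda_d`, `lambda_u`) are placeholders of my own, not the settings used for STEVE, and the KL reference point is simply passed in as a number rather than estimated from a batch.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kto_loss(
    policy_logp: float,     # log pi_theta(y | x) under the model being trained
    ref_logp: float,        # log pi_ref(y | x) under the frozen reference model
    desirable: bool,        # True for a step labeled "right", False for "wrong"
    kl_estimate: float,     # reference point z0 (batch-level KL estimate in practice)
    beta: float = 0.1,      # assumed value
    lambda_d: float = 1.0,  # weight for desirable examples (assumed value)
    lambda_u: float = 1.0,  # weight for undesirable examples (assumed value)
) -> float:
    """KTO loss for a single labeled example (lower is better)."""
    reward = policy_logp - ref_logp  # implicit reward: log-ratio to the reference
    if desirable:
        value = lambda_d * sigmoid(beta * (reward - kl_estimate))
        return lambda_d - value
    value = lambda_u * sigmoid(beta * (kl_estimate - reward))
    return lambda_u - value


# Raising the probability of a "right" step lowers its loss,
# while raising the probability of a "wrong" step increases its loss.
right_low = kto_loss(-3.0, -2.0, desirable=True, kl_estimate=0.0)
right_high = kto_loss(-1.0, -2.0, desirable=True, kl_estimate=0.0)
wrong_low = kto_loss(-3.0, -2.0, desirable=False, kl_estimate=0.0)
wrong_high = kto_loss(-1.0, -2.0, desirable=False, kl_estimate=0.0)
```

Unlike preference-pair methods, each example here needs only its own binary label, which is exactly the form of feedback that the step verifier produces.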

What is the experimental effect?

The authors ran many experiments, and the results are convincing:

• WinAgentArena evaluation: Their AI agent surpassed the previous best method, OmniParser, in task completion rate in the Windows environment, especially on tasks such as file management and web browsing, with the success rate increased by up to 22%.
• UI localization capability: Compared with traditional supervised fine-tuning (SFT), their method identifies and clicks UI elements more accurately, especially on high-resolution screens, reducing false clicks.
• Cost and efficiency: Compared with the OmniParser approach, their AI agent is 10 times faster at inference and 100 times cheaper.
• Versatility: The AI agents trained by STEVE are not limited to Windows tasks; they can also be extended to web page automation and even to some GUI-based mobile application interactions.

What is the significance of this method?

The most important point of this research is that the authors found a more efficient and scalable way to train AI agents that operate computers. Compared with traditional behavior cloning or reinforcement learning, their step verification method filters out valid data more quickly, allowing the AI to improve faster. In the future, this method could be used not only for desktop applications but also in wider areas such as web automation and mobile UI interaction.

This method also matters for large-scale AI training. Because GPT-4o evaluates the agent's operations, every step gets quick feedback, unlike traditional reinforcement learning, which must wait until the task is finished to know whether it went right or wrong. This means the AI can learn to operate the GUI efficiently in less time, which shortens the training cycle and makes real-world deployment more feasible.