Microsoft open-sources OmniParser V2: DeepSeek can be turned into an AI Agent

Microsoft's latest open-source framework turns large AI models into high-performance visual agents that can operate a computer.
Core content:
1. Microsoft released the OmniParser V2.0 framework to improve the accuracy of AI models in detecting UI elements
2. Combined with GPT-4o, OmniParser V2.0 delivers a large accuracy gain on the high-resolution ScreenSpot Pro benchmark
3. The open-source OmniTool helps turn large models into agents capable of screen understanding and action execution
Microsoft has released V2.0, the latest version of its visual agent parsing framework OmniParser, which can turn models such as DeepSeek-R1, GPT-4o, and Qwen-2.5VL into AI Agents that can operate a computer.
Compared with V1, V2 detects smaller interactive UI elements with higher accuracy and faster inference, cutting latency by 60%. On the high-resolution agent benchmark ScreenSpot Pro, V2 combined with GPT-4o reached an accuracy of 39.6%, compared with only 0.8% for GPT-4o on its own, a dramatic overall improvement.
In addition to V2, Microsoft also open-sourced OmniTool, a Docker-based Windows system that covers screen understanding, localization, action planning, and execution. It is the key tool for turning large models into agents.
Open source address: https://huggingface.co/microsoft/OmniParser-v2.0
GitHub: https://github.com/microsoft/OmniParser/
OmniTool: https://github.com/microsoft/OmniParser/tree/master/omnitool
A brief introduction to OmniParser V2
At present, the key difficulty in turning a large model into an agent is that the model must reliably identify interactive icons in the user interface, understand the semantics of the various elements in a screenshot, and accurately associate intended actions with the corresponding regions on the screen.
V2 parses the user interface from pixel space into structured elements that the large model can understand and operate on. This is somewhat similar to tokenization in natural language processing, but for visual information. The large model can then perform retrieval-based prediction of the next action over the parsed set of interactive elements.
For example, when a large model needs to complete a complex web page task, V2 can help it identify elements such as buttons and input boxes on the page and understand what they do, for instance a login button or a search box.
The large model can then more accurately predict the next action to perform, such as clicking the login button or entering keywords into the search box.
Simply put, you can think of V2 as the "eyes" of the large model, allowing it to better understand and operate complex user interfaces.
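To make this concrete, here is a minimal sketch (not OmniParser's actual API; the element data and helper function are hypothetical) of how a parsed element set could be serialized into a prompt so the model only has to choose among element IDs rather than reason over raw pixels.

```python
# Hypothetical output of a screen parser: each interactive element gets an ID,
# a bounding box in pixels, and a short functional description.
parsed_elements = [
    {"id": 0, "bbox": [880, 40, 960, 72], "description": "login button"},
    {"id": 1, "bbox": [300, 38, 700, 70], "description": "search input box"},
    {"id": 2, "bbox": [964, 40, 996, 72], "description": "icon for accessing more options"},
]

def build_action_prompt(task: str, elements: list[dict]) -> str:
    """Serialize the parsed element set so an LLM can choose one element and an action."""
    lines = [f"Task: {task}", "Interactive elements on screen:"]
    for el in elements:
        lines.append(f"  [{el['id']}] {el['description']} at {el['bbox']}")
    lines.append('Answer with JSON: {"element_id": <int>, "action": "click" | "type", "text": <str or null>}')
    return "\n".join(lines)

prompt = build_action_prompt("search for 'OmniParser'", parsed_elements)
print(prompt)
# The prompt is then sent to DeepSeek-R1 / GPT-4o / Qwen-2.5VL, which picks an
# element ID and an action instead of reasoning over raw screenshot pixels.
```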
OmniTool is an integrated tool that works out of the box. It can turn models such as DeepSeek-R1, GPT-4o, and Qwen-2.5VL into agents, and it consists of three major parts: V2, OmniBox, and the Gradio UI.
V2 has been introduced above. OmniBox is a lightweight, Docker-based Windows 11 virtual machine. Compared with a traditional Windows virtual machine, OmniBox takes up 50% less disk space while providing the same computer-use API.
Users can quickly build and run a test environment for GUI automation tasks with lower resource consumption, which is convenient for developers with limited hardware.
The Gradio UI provides an interactive interface that lets developers easily interact with V2 and the large model, and quickly test and verify automated tasks.
The Gradio UI is easy to use: start the OmniBox and Gradio servers on your local machine, then open the interface in a browser.
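For orientation, the snippet below is a minimal Gradio sketch of this pattern, not omnitool's actual server code: the agent call is a hypothetical stub, and only Gradio's standard Interface/launch API is used.

```python
import gradio as gr

def run_agent(task: str) -> str:
    # Hypothetical stub: in omnitool this is where a screenshot of the OmniBox VM
    # would be parsed by OmniParser V2 and the chosen model would plan/execute actions.
    return f"(demo) would parse the current screen and plan actions for: {task}"

demo = gr.Interface(
    fn=run_agent,
    inputs=gr.Textbox(label="Task, e.g. 'open Settings and enable dark mode'"),
    outputs=gr.Textbox(label="Planned actions"),
    title="GUI-agent demo (illustrative, not the official omnitool UI)",
)

if __name__ == "__main__":
    demo.launch()  # then open the printed local URL in a browser
```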
OmniParser Core Architecture
The core idea of OmniParser is to convert the visual information of the user interface into structured data that is easy to understand and operate on. This process is fairly complex, however, and requires several modules working together.
First, OmniParser needs to identify all interactive elements from the user interface screenshot, such as buttons, icons, and input boxes. These elements are the basis for users to interact with the interface, so accurately detecting them is a crucial first step.
Next, OmniParser must not only identify the location of these elements, but also understand their functionality and semantics. For example, an icon with three dots may mean "more options", while a magnifying glass icon may represent "search". This deep understanding of functionality enables the large model to more accurately predict the actions that users may need to perform.
To achieve these goals, OmniParser uses a multi-stage parsing process. In the first stage, the interactive region detection module uses deep learning to identify all possible interaction points in user interface screenshots. The module's training dataset contains 67,000 unique screenshots from popular web pages, each annotated with bounding boxes of interactive regions extracted from the DOM tree.
By training the model on this data, OmniParser is able to identify interactive elements on the screen with extremely high accuracy and assign a unique identifier to each element.
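As a rough sketch of what this detection stage could look like in code, the snippet below runs a generic YOLO-style detector over a screenshot and assigns an ID to each box; the checkpoint path and confidence threshold are assumptions, not OmniParser's published configuration.

```python
from ultralytics import YOLO  # generic YOLO inference API

# Assumption: a fine-tuned UI-element detector exported as a .pt checkpoint.
detector = YOLO("weights/icon_detect.pt")  # hypothetical path

results = detector("screenshot.png", conf=0.05)[0]  # low threshold: small UI elements

elements = []
for idx, box in enumerate(results.boxes):
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    elements.append({
        "id": idx,                      # unique identifier for this element
        "bbox": [int(x1), int(y1), int(x2), int(y2)],
        "confidence": float(box.conf[0]),
    })

print(f"detected {len(elements)} candidate interactive regions")
```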
But simply identifying the location of interactive elements is not enough. In a complex user interface, a button may be similar in shape and color to other buttons yet have a completely different function. OmniParser therefore includes a built-in functional semantics module.
The goal of this module is to generate text describing the function of each detected icon. Microsoft built a dataset of 7,185 icon-description pairs and fine-tuned a BLIP-v2 model on it so that it describes the semantics of common application icons more accurately.
For example, instead of just describing an icon as “a circular icon with three dots,” it can now understand and generate a description like “an icon for accessing more options.”
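A minimal sketch of this captioning step is shown below, using the standard Hugging Face BLIP-2 API rather than Microsoft's fine-tuned checkpoint: the public Salesforce/blip2-opt-2.7b model is substituted here, and the crop coordinates are illustrative.

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumption: the fine-tuned icon-caption weights are not named here;
# the public BLIP-2 base checkpoint stands in for them.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

screenshot = Image.open("screenshot.png")

def describe_icon(bbox: list[int]) -> str:
    """Crop one detected element and generate a short functional description."""
    crop = screenshot.crop(tuple(bbox))
    inputs = processor(images=crop, return_tensors="pt")
    out_ids = model.generate(**inputs, max_new_tokens=20)
    return processor.decode(out_ids[0], skip_special_tokens=True).strip()

print(describe_icon([964, 40, 996, 72]))  # e.g. "an icon for accessing more options"
```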
The third important module of OmniParser is the structured representation and action generation module. It integrates the outputs of the first two modules into a structured, DOM-like representation of the UI, which contains not only the screenshot with superimposed bounding boxes and unique IDs but also a semantic description of each icon.
This helps models such as DeepSeek-R1, GPT-4o, and Qwen-2.5VL understand screen content more easily and focus on action prediction. For example, when the task is "click the settings button", OmniParser provides not only the bounding box and ID of the settings button but also its functional description, which significantly improves the accuracy and robustness of the model.
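To illustrate the hand-off, here is a hedged sketch (field names, the annotation style, and the example data are assumptions, not OmniParser's exact schema) of how detection boxes and icon captions might be merged into a DOM-like structure and overlaid on the screenshot before both are passed to the model.

```python
import json
from PIL import Image, ImageDraw

# Hypothetical outputs of the two earlier modules (see the detection and captioning sketches).
boxes = [{"id": 0, "bbox": [880, 40, 960, 72]}, {"id": 1, "bbox": [300, 38, 700, 70]}]
captions = {0: "login button", 1: "search input box"}

# 1) Merge into a DOM-like structured representation.
ui_tree = [
    {"id": b["id"], "bbox": b["bbox"], "description": captions[b["id"]], "interactable": True}
    for b in boxes
]

# 2) Overlay bounding boxes and IDs on the screenshot so the model can ground its answer.
img = Image.open("screenshot.png").convert("RGB")
draw = ImageDraw.Draw(img)
for el in ui_tree:
    draw.rectangle(el["bbox"], outline="red", width=2)
    draw.text((el["bbox"][0], el["bbox"][1] - 12), str(el["id"]), fill="red")
img.save("screenshot_annotated.png")

# 3) Both artifacts are handed to the model together with the task description.
print(json.dumps(ui_tree, indent=2))
```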