Microsoft open-sources OmniParser V2, turning DeepSeek-R1 into a computer-using AI agent

Written by

Jasper Cole

Updated on: July 12, 2025

Microsoft has released and open-sourced OmniParser V2, announced on its official website, which can turn any LLM into an agent capable of using a computer. It enables models such as GPT-4o, DeepSeek R1, Sonnet 3.5, and Qwen to understand what is on the screen and take relevant actions.

OmniParser is a general-purpose screen parsing tool that converts UI screenshots into a structured format to improve existing LLM-based UI agents.

The training datasets include:

An interactable icon detection dataset, built from popular web pages and automatically annotated to highlight clickable and actionable regions;
An icon description dataset, designed to associate each UI element with its corresponding function.

The model repository contains a version of YOLOv8 and a Florence-2 base model, both fine-tuned on the datasets above.
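
The two models suggest a two-stage pipeline: a detector proposes interactable regions, and a captioner describes each one. The following is a minimal sketch of how such outputs might be merged into a structured element list that an LLM can read. The names `UIElement`, `merge_outputs`, and `to_prompt` are hypothetical and are not part of the OmniParser API; the boxes and captions are made-up sample values standing in for real model output.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class UIElement:
    """One parsed screen element (hypothetical structure, for illustration)."""
    element_id: int
    box: Box
    description: str

    def center(self) -> Tuple[int, int]:
        """Return the pixel at the middle of the box, a natural click target."""
        x1, y1, x2, y2 = self.box
        return ((x1 + x2) // 2, (y1 + y2) // 2)


def merge_outputs(boxes: List[Box], captions: List[str]) -> List[UIElement]:
    """Pair each detected box with its caption and assign a numeric ID."""
    if len(boxes) != len(captions):
        raise ValueError("expected one caption per detected box")
    return [UIElement(i, b, c) for i, (b, c) in enumerate(zip(boxes, captions))]


def to_prompt(elements: List[UIElement]) -> str:
    """Serialize elements as plain text that an LLM can refer to by ID."""
    lines = []
    for e in elements:
        cx, cy = e.center()
        lines.append(f"[{e.element_id}] {e.description} at ({cx}, {cy})")
    return "\n".join(lines)


# Sample values standing in for detector and captioner output.
boxes = [(100, 40, 200, 80), (800, 900, 960, 980)]
captions = ["Search button", "Submit form"]
elements = merge_outputs(boxes, captions)
print(to_prompt(elements))
```

Giving every element a numeric ID lets the model answer with something like "click element 1" instead of guessing raw pixel coordinates.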

What’s new in OmniParser V2?

A larger, cleaner icon caption and grounding dataset.
A 60% latency improvement compared to V1.
Average latency: 0.6 seconds/frame on an A100, 0.8 seconds/frame on a single 4090.
Strong performance: 39.6 average accuracy on ScreenSpot Pro.
Agents only need one tool: OmniTool. Control a Windows 11 VM with OmniParser plus a vision model of your choice. OmniTool supports the following large language models out of the box: OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL), and Anthropic Computer Use.
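
Pairing a screen parser with a model of your choice amounts to a perceive, decide, act loop. The sketch below shows that loop with stub functions. `parse_screen`, `choose_action`, and `run_agent` are hypothetical placeholders and not OmniTool's actual interface; the stub "model" simply clicks the first element whose description mentions the goal, where a real setup would prompt an LLM.

```python
from typing import Callable, Dict, List, Optional

Element = Dict[str, object]  # e.g. {"id": 0, "description": "Start menu"}
Action = Dict[str, object]   # e.g. {"type": "click", "target": 0}


def parse_screen(screen: List[str]) -> List[Element]:
    """Stub parser: a real system would run a screen parser on a screenshot."""
    return [{"id": i, "description": d} for i, d in enumerate(screen)]


def choose_action(goal: str, elements: List[Element]) -> Optional[Action]:
    """Stub model: a real system would send the goal and elements to an LLM."""
    for e in elements:
        if goal.lower() in str(e["description"]).lower():
            return {"type": "click", "target": e["id"]}
    return None  # nothing relevant on screen


def run_agent(
    goal: str,
    screens: List[List[str]],
    decide: Callable[[str, List[Element]], Optional[Action]] = choose_action,
    max_steps: int = 5,
) -> List[Action]:
    """Perceive, decide, act: handle one screen per step until no action remains."""
    history: List[Action] = []
    for screen in screens[:max_steps]:
        action = decide(goal, parse_screen(screen))
        if action is None:
            break
        history.append(action)  # a real agent would execute the action in the VM here
    return history


# Each inner list stands in for the element descriptions parsed from one screenshot.
screens = [["Start menu", "Settings icon"], ["Wi-Fi toggle", "Settings panel"]]
actions = run_agent("settings", screens)
print(actions)
```

Because `decide` is passed in as a parameter, swapping one model for another only changes that single function, which mirrors the idea of using one parsing tool with several interchangeable models.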

https://huggingface.co/microsoft/OmniParser-v2.0
https://www.microsoft.com/en-us/research/articles/omniparser-v2-turning-any-llm-into-a-computer-use-agent/
https://github.com/microsoft/OmniParser/tree/master
Demo: http://hf.co/spaces/microsoft/OmniParser-v2