16.3K stars! Microsoft open-sources its AI Agent tool OmniParser, turning AI into your computer-operation expert

Microsoft has open-sourced OmniParser, an AI tool that turns AI into your computer-operation expert, now with over 16.3K GitHub stars!
Core content:
1. OmniParser's core capability: converting UI screenshots into a structured format so AI can accurately understand and operate interface elements
2. V2 performance gains: 0.6 seconds per frame on an A100, 0.8 seconds on an RTX 4090, and an average accuracy of 39.6% on the ScreenSpot Pro benchmark
3. Environment setup and hands-on use: from cloning the project and creating a Python environment to installing core dependencies and downloading the models
Microsoft officially released and open-sourced OmniParser V2
OmniParser can turn any Large Language Model (LLM) into an AI assistant that can use computers! The project has received over 16.3k stars on GitHub.
Today I'll walk you through this tool end to end, from environment setup to hands-on use.
The following video demonstrates how to use OmniParser AI Agent to automate the posting of X posts:
OmniParser Core Capabilities
OmniParser is like giving AI a pair of "smart eyes". It can convert UI screenshots into a structured format, allowing AI to accurately understand and operate every element on the interface.
The V2 version has stronger performance: the processing speed reaches 0.6 seconds per frame on A100 and only takes 0.8 seconds on RTX 4090.
In the ScreenSpot Pro benchmark test, it achieved an average accuracy of 39.6%.
Supports mainstream large models, including OpenAI (GPT-4V), DeepSeek (R1), Claude 3.5 Sonnet, Qwen (2.5VL), and Anthropic Computer Use.
With the new OmniTool, you can even directly control your Windows 11 virtual machine!
Environment Preparation
First clone the project locally:
git clone https://github.com/microsoft/OmniParser.git
cd OmniParser
Create and activate the Python environment:
conda create -n "omni" python==3.12
conda activate omni
Install core dependencies:
pip install --upgrade huggingface_hub
pip install gradio==4.14.0
pip install httpx==0.26.0
pip install httpcore==1.0.2
pip install anyio==4.2.0
pip install -r requirements.txt
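Before moving on, you can sanity-check that the pinned versions resolved correctly. This is a small helper of my own (not part of the OmniParser repo); the versions mirror the pip commands above.

```python
from importlib.metadata import version, PackageNotFoundError

# Pinned versions from the install commands above
EXPECTED = {
    "gradio": "4.14.0",
    "httpx": "0.26.0",
    "httpcore": "1.0.2",
    "anyio": "4.2.0",
}

def check(pkg: str, want: str) -> str:
    """Report whether an installed package matches the pinned version."""
    try:
        got = version(pkg)
    except PackageNotFoundError:
        return f"{pkg}: not installed"
    return f"{pkg}: OK" if got == want else f"{pkg}: found {got}, expected {want}"

if __name__ == "__main__":
    for pkg, want in EXPECTED.items():
        print(check(pkg, want))
```

If any line reports a mismatch, re-run the corresponding pip command before starting the demo.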
Download the Model Files
Create a download_models.py script:
from huggingface_hub import hf_hub_download
from pathlib import Path

def download_omniparser_models():
    """Download the OmniParser V2 model files."""
    try:
        base_path = Path("weights")
        base_path.mkdir(exist_ok=True)
        files = [
            "icon_detect/train_args.yaml",
            "icon_detect/model.pt",
            "icon_detect/model.yaml",
            "icon_caption/config.json",
            "icon_caption/generation_config.json",
            "icon_caption/model.safetensors",
        ]
        print("Starting model file download...")
        for file in files:
            print(f"Downloading: {file}")
            hf_hub_download(
                repo_id="microsoft/OmniParser-v2.0",
                filename=file,
                local_dir=base_path,
            )
        # Rename icon_caption to icon_caption_florence (the name the demo expects)
        icon_caption_path = base_path / "icon_caption"
        icon_caption_florence_path = base_path / "icon_caption_florence"
        if icon_caption_path.exists():
            if icon_caption_florence_path.exists():
                import shutil
                shutil.rmtree(icon_caption_florence_path)
            icon_caption_path.rename(icon_caption_florence_path)
        print("\nAll files downloaded!")
    except Exception as e:
        print(f"\nError during download: {str(e)}")
        print("Please check your network connection and try again")

if __name__ == "__main__":
    download_omniparser_models()
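After running the script with python download_models.py, the weights/ directory should contain the files below (note that the script renames icon_caption to icon_caption_florence). A quick check like this, again my own helper rather than part of the repo, confirms nothing is missing:

```python
from pathlib import Path

# Files expected under weights/ after the download script finishes
EXPECTED_FILES = [
    "icon_detect/train_args.yaml",
    "icon_detect/model.pt",
    "icon_detect/model.yaml",
    "icon_caption_florence/config.json",
    "icon_caption_florence/generation_config.json",
    "icon_caption_florence/model.safetensors",
]

def missing_files(base: str = "weights") -> list:
    """Return the expected model files that are not present under base."""
    base_path = Path(base)
    return [f for f in EXPECTED_FILES if not (base_path / f).exists()]

if __name__ == "__main__":
    gaps = missing_files()
    print("All model files in place." if not gaps else f"Missing: {gaps}")
```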
Running the Demo Locally
python gradio_demo.py
After running, open the browser to access the local service (usually http://127.0.0.1:7860).
After uploading any interface screenshot and waiting for a short processing time (usually no more than 1 second), you can see the detailed analysis results, including interactive area annotations and function descriptions.
The effect is as follows. Given an input screenshot:
OmniParser outputs the annotated icon results:
along with structured JSON containing each element's recognized content and its specific coordinate values:
With these structured recognition results in hand, the possibilities are endless!
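To make that concrete, here is a minimal sketch of turning one parsed element into a pixel click point. The field names (content, bbox) and the normalized-coordinate convention are assumptions for illustration; check the actual keys your OmniParser version emits.

```python
def bbox_to_click_point(bbox, screen_w, screen_h):
    """Map a normalized [x1, y1, x2, y2] box to its pixel center point."""
    x1, y1, x2, y2 = bbox
    return int((x1 + x2) / 2 * screen_w), int((y1 + y2) / 2 * screen_h)

# A hypothetical element shaped like a parsed result (illustrative only)
element = {"content": "Post button", "bbox": [0.90, 0.85, 0.98, 0.95]}

print(bbox_to_click_point(element["bbox"], 1920, 1080))
```

Feeding center points like this to an automation library is exactly what the cross-platform example below does.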
Cross-platform automation cases
Here we will implement a cross-platform automation solution: deploy the OmniParser service on the server, and then implement automation through the macOS client.
Server deployment:
from fastapi import FastAPI, UploadFile
from PIL import Image
import io
import uvicorn

app = FastAPI()

@app.post("/analyze")
async def analyze_screen(image: UploadFile):
    # Read the uploaded image
    image_data = await image.read()
    pil_image = Image.open(io.BytesIO(image_data))
    # Process the image with OmniParser
    # (add OmniParser's processing logic here)
    return {"elements": [...]}  # Return the identified element information

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
Client implementation (macOS):
import io
import pyautogui
import requests
from PIL import ImageGrab

def capture_screen():
    """Take a screenshot of the full screen."""
    return ImageGrab.grab()

def convert_coordinates(omni_coords):
    """Convert OmniParser coordinates to PyAutoGUI coordinates."""
    # Adjust the conversion logic to your screen size and scaling
    return omni_coords

def click_element(coords):
    """Perform a click at the given coordinates."""
    pyautogui.click(coords[0], coords[1])

def main():
    # Take a screenshot
    screenshot = capture_screen()
    # Encode the PIL image as PNG bytes before uploading
    buffer = io.BytesIO()
    screenshot.save(buffer, format="PNG")
    buffer.seek(0)
    # Send it to the OmniParser server
    files = {'image': ('screenshot.png', buffer, 'image/png')}
    response = requests.post('http://ubuntu-server:8000/analyze', files=files)
    # Perform the automated operations
    elements = response.json()['elements']
    for element in elements:
        coords = convert_coordinates(element['coords'])
        click_element(coords)

if __name__ == "__main__":
    main()
How this works:
- The server is responsible for image parsing: The OmniParser service deployed on the server is dedicated to image recognition tasks
- Client-side execution: The script on macOS is responsible for taking screenshots, sending requests, and performing actual mouse operations
- Cross-platform collaboration: Seamless cooperation between both ends through HTTP API
On this basis, we can further expand:
- Connect to large models such as GPT-4V to achieve natural language control
- Add more automated operations, such as keyboard input, drag and drop, etc.
- Realize operation recording and playback functions
- Add error handling and retry mechanism
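As a starting point for the last item, here is a minimal retry helper with exponential backoff. It is my own sketch, not part of the OmniParser repo; you could wrap the requests.post call from the client script in it.

```python
import time

def with_retry(fn, retries=3, backoff=0.5, exceptions=(Exception,)):
    """Call fn(); on failure, retry with exponential backoff, re-raising the last error."""
    for attempt in range(retries):
        try:
            return fn()
        except exceptions:
            if attempt == retries - 1:
                raise
            # Wait backoff, 2*backoff, 4*backoff, ... between attempts
            time.sleep(backoff * (2 ** attempt))

# Example: retry the analyze request from the client script
# response = with_retry(
#     lambda: requests.post('http://ubuntu-server:8000/analyze', files=files, timeout=10),
#     exceptions=(requests.RequestException,),
# )
```

Catching only requests.RequestException (rather than every exception) keeps genuine bugs from being silently retried.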