ByteDance's open source GUI Agent tool: UI-TARS full analysis, another Manus alternative
Updated on: July 10, 2025
Recommendation
ByteDance's open source UI-TARS opens a new era of GUI automation and explores a new realm of AI human-computer interaction.
Core content:
1. The background and significance of ByteDance's open source GUI Agent model UI-TARS
2. The core features of UI-TARS: perception, action and reasoning
3. Highlights of UI-TARS in technological breakthroughs: enhanced GUI screenshot perception, unified action modeling, etc.
Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)
It has been reported online that ByteDance's internal Dev Infra team has built a Dev Agent product for company-internal use, with functionality similar to Manus. The agent carries out research, development, data analysis, and other tasks by integrating the intranet knowledge base with a variety of internal tools. The project is currently in an experimental stage and is only being tested by employees of that department; it is an internal tool and is not available to external users.
Today, let's talk about another open source project: UI-TARS.
UI-TARS is an open source GUI Agent model launched by ByteDance that can control the computer interface through natural language. The tool represents a major step forward in artificial intelligence and human-computer interaction, giving users a new, more natural way to operate computer systems.

UI-TARS stands for User Interface - Task Automation and Reasoning System. It is an innovative native GUI agent model designed to enhance interaction with graphical user interfaces through advanced AI capabilities. Unlike traditional modular systems, UI-TARS integrates perception, reasoning, grounding, and memory into a single unified vision-language model (VLM), achieving end-to-end task automation without relying on pre-defined workflows or human intervention.
Comprehensive GUI understanding: UI-TARS interprets multiple input types, such as text and images, to build a complete picture of the user interface.
Dynamic interaction: The model actively observes the ever-changing GUI environment and responds to changes in real time.
High-density information processing: Complex layouts and multi-element interfaces are handled effectively, with accurate metadata extraction.
Unified action space: Standardized action definitions across desktop, mobile, and web platforms (a minimal sketch of such an abstraction follows this list).
Precise grounding and interaction: Large-scale action-trace training enables precise localization of, and interaction with, specific GUI elements.
Platform-specific actions: Support for additional actions such as hotkeys, long presses, and platform-specific gestures.
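To make the idea of a unified action space more concrete, here is a minimal sketch of how cross-platform actions might be represented. The action names, fields, and coordinates below are illustrative assumptions, not UI-TARS's actual action schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class ActionType(Enum):
    CLICK = "click"
    TYPE = "type"
    SCROLL = "scroll"
    HOTKEY = "hotkey"          # desktop-oriented
    LONG_PRESS = "long_press"  # mobile-oriented
    FINISHED = "finished"

@dataclass
class Action:
    kind: ActionType
    target: Optional[Tuple[float, float]] = None  # normalized (x, y) screen coordinates
    text: Optional[str] = None                    # payload for TYPE / HOTKEY actions

# The same abstract steps can be dispatched to desktop, mobile, or web executors:
steps = [
    Action(ActionType.CLICK, target=(0.42, 0.17)),  # click the compose button
    Action(ActionType.TYPE, text="hello world"),    # type the message
    Action(ActionType.HOTKEY, text="ctrl+enter"),   # submit
    Action(ActionType.FINISHED),
]
```

Keeping the action vocabulary abstract like this is what lets semantically equivalent actions be reused across platforms, with only the executor differing per device.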
Demo task: send a tweet with the content "hello world".
System 1 and System 2 reasoning: Combines fast, intuitive responses with deliberate, high-level planning to handle complex tasks.
Task decomposition and reflection: Supports multi-step planning, reflection, and error correction to ensure robustness of task execution.
Decision-making based on "thinking": The model generates an explicit "thinking" step before each action, connecting perception and action through well-considered decisions (an illustrative parse of such an output follows below).
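The UI-TARS repository defines the exact prompt and response schema; purely as an illustration, the sketch below assumes a response of the form "Thought: ... Action: ..." and shows how a client might split it into its two parts. The example string and the parse_step helper are hypothetical.

```python
import re
from dataclasses import dataclass

@dataclass
class Step:
    thought: str  # the model's reasoning before acting
    action: str   # the GUI action it decided to take

def parse_step(output: str) -> Step:
    """Split an assumed 'Thought: ... Action: ...' response into its two parts."""
    match = re.search(r"Thought:\s*(.*?)\s*Action:\s*(.*)", output, re.S)
    if match is None:
        raise ValueError("response does not follow the assumed Thought/Action layout")
    return Step(thought=match.group(1).strip(), action=match.group(2).strip())

example = ("Thought: The compose box is open, so I can type the tweet text now.\n"
           "Action: type(content='hello world')")
print(parse_step(example))
```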
Technological breakthroughs
UI-TARS achieves technological breakthroughs in several areas:
1. Enhanced GUI screenshot perception: Trained on a large-scale dataset specifically designed to extract metadata such as element type, bounding box, and text content.
2. Unified action modeling: Semantically equivalent actions are standardized across platforms to improve multi-step execution.
3. System 2 reasoning: Various reasoning patterns (such as task decomposition, long-term consistency, milestone identification, trial and error, and reflection) are injected into the model.
4. Reflective online trace learning: The data bottleneck is addressed by automatically collecting, filtering, and reflectively refining new interaction traces (a toy sketch of this loop follows the benchmark results below).
Demo task: get the current weather in SF using the web browser.
UI-TARS demonstrated strong performance across multiple evaluations:
In the OSWorld benchmark, UI-TARS-72B achieved a score of 24.6 at 50 steps and 22.7 at 15 steps, better than Claude's 22.0 and 14.9.
In AndroidWorld, UI-TARS achieved a score of 46.6, surpassing GPT-4o's 34.5.
In VisualWebBench, UI-TARS-72B scored 82.8, higher than GPT-4o's 78.5.
On ScreenSpot Pro, UI-TARS achieved a state-of-the-art (SOTA) score of 38.1.
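As a purely illustrative toy example of the reflective online trace learning idea mentioned above (item 4), the sketch below keeps successful interaction traces and salvages failed ones by trimming the bad tail. The Trace structure and the trimming heuristic are assumptions for illustration; the actual pipeline uses the model itself to critique and refine traces.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

@dataclass
class Trace:
    task: str
    steps: List[str]   # serialized (thought, action, observation) records
    succeeded: bool

def reflectively_refine(trace: Trace) -> Optional[Trace]:
    """Toy 'reflection': trim the trailing failed step so the useful prefix can be reused.
    A real pipeline would have the model critique and rewrite the bad step instead."""
    if len(trace.steps) < 2:
        return None  # nothing worth salvaging
    return replace(trace, steps=trace.steps[:-1], succeeded=True)

def filter_traces(traces: List[Trace]) -> List[Trace]:
    """Keep successful traces as-is; salvage failed ones through reflection when possible."""
    kept: List[Trace] = []
    for trace in traces:
        if trace.succeeded:
            kept.append(trace)
        else:
            refined = reflectively_refine(trace)
            if refined is not None:
                kept.append(refined)
    return kept

raw_traces = [
    Trace("open settings", ["click menu", "click settings"], succeeded=True),
    Trace("send tweet", ["click compose", "click wrong button"], succeeded=False),
]
print(filter_traces(raw_traces))  # the failed trace is kept only after its bad tail is trimmed
```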
These results demonstrate UI-TARS's strong capabilities in perception, grounding, and GUI task execution.

ByteDance also provides the UI-TARS Desktop application, a GUI agent app built on the UI-TARS vision-language model that lets users control their computer with natural language. UI-TARS Desktop requires some configuration before use.

All UI-TARS related resources are open source: [UI-TARS](https://github.com/bytedance/UI-TARS) and [UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop).

While UI-TARS represents a significant advance in the field of GUI agents, future development points toward integrating active and lifelong learning, allowing agents to drive their own learning autonomously through ongoing real-world interaction. This would minimize human intervention while maximizing generalization.

UI-TARS is ByteDance's revolutionary innovation in the field of GUI agents, achieving performance beyond existing systems by integrating perception, action, reasoning, and memory into a scalable, adaptive framework. Its open source release not only pushes the boundaries of AI-driven automation but also makes it an accessible resource for further exploration and development. UI-TARS represents a shift from rule-based systems to adaptive native models, laying a solid foundation for the future development of GUI agents.
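Since UI-TARS Desktop is typically pointed at a deployed UI-TARS model endpoint, one way to sanity-check such an endpoint is to call it directly. The sketch below assumes an OpenAI-compatible server (for example, one served via vLLM) at a local URL; the base URL, API key, model name, and screenshot path are all assumptions for illustration, not fixed values from the project.

```python
import base64
from openai import OpenAI  # pip install openai

# Assumption: a UI-TARS checkpoint served behind an OpenAI-compatible API (e.g. via vLLM)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")  # hypothetical endpoint

with open("screenshot.png", "rb") as f:  # hypothetical screenshot of the current screen
    screenshot_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="ui-tars",  # hypothetical name of the deployed model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Get the current weather in SF using the web browser"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)  # expected: the model's thought plus its next GUI action
```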