Analysis of CUA technology behind OpenAI's intelligent agent Operator

Written by
Caleb Hayes
Updated on: July 17, 2025
Recommendation

Explore the secrets behind the CUA technology that powers Operator, OpenAI's latest AGI L3-level agent.

Core content:
1. An introduction to OpenAI's Operator agent and its functions
2. An analysis of CUA's technical challenges: security, click operations, and reasoning ability
3. How LLM technology is applied in agents, and its prospects

Zhipu's GLM-PC agent and UI-TARS from ByteDance and Tsinghua were just released. Today, OpenAI also released Operator, its first AGI L3-level agent: an agent that can perform tasks on the internet for you, using its own browser to view web pages and interact with them through typing, clicking, and scrolling.


Operator is powered by a new model called Computer-Using Agent (CUA), which combines the visual capabilities of GPT-4o with advanced reasoning capabilities achieved through reinforcement learning, and is trained to interact with GUIs - the buttons, menus, and text boxes that people see on screen.
Based on the user’s instructions, CUA operates through an iterative cycle that integrates perception, reasoning, and action.

So what technical challenges need to be addressed to build an open source Computer-Using Agent?

  1. Security: Isolate the operating system in a secure, controlled environment

  2. Click things: Enable the AI to click precisely so it can manipulate UI elements

  3. Reasoning: Let the AI decide what to do next (or when to stop) based on what it sees

  4. Deploy LLM: Host open source models in a cost-effective manner

  5. Streaming display: Find a low-latency way to display and record sandbox video

Challenge 1: Security

The ideal environment for running AI agents should be easy to use, performant, and secure. Giving an AI agent direct access to your personal computer and file system is very dangerous! It may delete files or perform other irreversible actions.
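
A safer pattern is to give the agent a disposable desktop inside a container or VM rather than your own machine. Below is a minimal Python sketch using Docker; the image name my-agent-desktop is a placeholder for any image that bundles a virtual display (e.g. Xvfb plus a VNC server) and the applications the agent needs.

import subprocess

# Start a disposable desktop for the agent. --rm discards all changes on exit;
# the only thing exposed to the host is the VNC port used to watch the agent.
container_id = subprocess.check_output([
    "docker", "run", "-d", "--rm",
    "-p", "5900:5900",        # VNC port for observing the sandbox
    "my-agent-desktop",       # placeholder image: Xvfb + x11vnc + the apps the agent uses
]).decode().strip()
print(f"sandbox container: {container_id}")

# When the task is finished (or the agent misbehaves), tear the sandbox down.
subprocess.run(["docker", "stop", container_id], check=True)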

Challenge 2: Click on things

When interfaces were text-based, LLM-driven "computer use" was fairly simple, and a great deal of progress could be made using text commands alone.
But some applications simply cannot be used without a mouse, so for an agent to use a computer fully, this capability is necessary.
Vision LLMs can output the exact pixel coordinates of elements in a reference input image; both Gemini and Claude have this capability.
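
As a rough sketch of this approach, the snippet below sends a screenshot to a vision model through an OpenAI-compatible API, asks for click coordinates, and clicks them with pyautogui. The model name and the reply format are assumptions for illustration; Gemini, Claude, and grounding models like OS-Atlas each have their own coordinate conventions, so real code needs model-specific parsing.

import re
import base64
import pyautogui
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible vision endpoint

def click_element(screenshot_path: str, instruction: str) -> None:
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder for whichever vision model you use
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": f"Reply with only the pixel coordinates (x, y) to click in order to: {instruction}"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ]}],
    ).choices[0].message.content
    x, y = map(int, re.findall(r"\d+", reply)[:2])  # naive parsing; validate before clicking
    pyautogui.click(x, y)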

Challenge 3: Reasoning

The power of LLM-based agents is that they can decide between multiple actions and use the latest information to make intelligent decisions.

Over the past year, LLMs' ability to make these decisions has steadily improved. The first approach was simply to prompt the LLM to output an action in a given text format, then add the result of that action to the chat history before calling the LLM again. Later approaches are roughly the same, supplementing the system prompt with fine-tuning; this general capability is called function calling.
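
A minimal sketch of that prompt-act-observe loop is shown below. The JSON action format and the llm/execute callables are assumptions for illustration, not OpenAI's actual CUA interface.

import json

SYSTEM_PROMPT = ('You control a computer. Reply with exactly one JSON action per turn, '
                 'e.g. {"action": "click", "x": 100, "y": 200} or {"action": "done"}.')

def run_agent(llm, execute, task, max_steps=20):
    # llm(history) returns the assistant's reply string; execute(action) returns an observation string.
    history = [{"role": "system", "content": SYSTEM_PROMPT},
               {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = llm(history)
        action = json.loads(reply)
        if action.get("action") == "done":
            break
        observation = execute(action)  # perform the click/type/scroll in the sandbox
        history.append({"role": "assistant", "content": reply})
        history.append({"role": "user", "content": f"Result: {observation}"})
    return history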

To combine vision and tool use in a single agent built from open source models, you can try the following (a sketch of how they fit together follows the list):

  • Llama-3.2-90B-Vision-Instruct: views the sandbox display and decides what steps to take next

  • Llama-3.3-70B-Instruct: takes Llama 3.2's decision and reformulates it in a tool-usage format

  • OS-Atlas-Base-7B: a tool the agent can call to perform click operations based on a prompt
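
One plausible way to wire these three models together for a single step is sketched below; describe_screen, decide_action, and locate_click are hypothetical callables wrapping Llama-3.2-90B-Vision-Instruct, Llama-3.3-70B-Instruct, and OS-Atlas-Base-7B respectively, since each would be its own API call.

def agent_step(screenshot, history, describe_screen, decide_action, locate_click):
    observation = describe_screen(screenshot)                # Llama 3.2 Vision: what is on screen and what should happen next?
    decision = decide_action(observation, history)           # Llama 3.3: restate that decision as a structured tool call
    if decision.get("tool") == "click":
        x, y = locate_click(screenshot, decision["target"])  # OS-Atlas: ground the target element to pixel coordinates
        return {"action": "click", "x": x, "y": y}
    return decision                                          # e.g. type, scroll, or stop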

Challenge 4: Deploy LLM

The agent needs to run fast, so LLM inference is best run in the cloud with a provider that works out of the box.

For hosting Llama 3.2 and 3.3, OpenRouter, Fireworks AI, and the official Llama API are all good choices.
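
As an illustration, the snippet below calls a hosted Llama 3.3 endpoint through OpenRouter's OpenAI-compatible API; the base URL and model identifier follow OpenRouter's conventions, and switching to Fireworks AI or the official Llama API mostly means changing those two values.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct",  # provider-specific model name
    messages=[{"role": "user", "content": "The screen shows a login form. What should the agent do next?"}],
)
print(response.choices[0].message.content)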

Challenge 5: Streaming Display

In order to see what the AI is doing, you want to get real-time updates of the sandbox screen.

server:
ffmpeg -f x11grab -s 1024x768 -framerate 30 -i $DISPLAY -vcodec libx264 -preset ultrafast -tune zerolatency -f mpegts -listen 1 http://localhost:8080
client:
ffmpeg -reconnect 1 -i http://servername:8080 -c:v libx264 -preset fast -crf 23 -c:a aac -b:a 128k -f mpegts -loglevel quiet - | tee output.ts | ffplay -autoexit -loglevel quiet -i -
The first command uses FFmpeg's built-in HTTP listener to create a simple video streaming server; its limitation is that it can serve only one client at a time. The second command captures the stream and simultaneously writes it to a .ts file and displays it in a GUI.

This works fine over the internet. Because the server accepts only one connection, the client uses tee to split the data stream so that it can be both saved and displayed.

Demo: the full process of OpenAI's Operator agent completing a PDF-merging task.

https://blog.jamesmurdza.com/how-i-taught-an-ai-to-use-a-computer
https://openai.com/index/computer-using-agent/