A Look at Current To C GUI Agent Products

Exploring the new breakthroughs in To C GUI Agent products and the experience of using Zhipu's AutoGLM Meditation Edition.
Core content:
1. What is distinctive about AutoGLM's browser operation on PC
2. Comparison with similar products and the advantages of Zhipu's product
3. Discussion of how GUI Agent models are implemented, and future prospects
Today, Zhipu released the AutoGLM Meditation Edition. Its PC version can directly operate the user's browser to visit websites whose content cannot be obtained through search engines or compliant crawlers. In my opinion, this is the first serious To C GUI Agent product: it supports both Windows and Mac, and I find it easy to use. So this is a good opportunity to talk about related products.
First of all, this article is not sponsored content for Zhipu, and I was not invited to Zhipu's launch event today (unlike some other KOLs). So this article reflects only my personal opinion.
1. Similar products
On the one hand, there are earlier products such as ByteDance's UI-TARS-desktop, but it currently has only a Mac version and looks very much like an academic prototype. Its user experience still falls short of the AutoGLM PC version. This does not mean that Zhipu's results significantly exceed those of the academic prototype; rather, it reflects how much effort the development team has put into making the product usable by all kinds of users. So this is a measure of product experience, not of technical performance.
On the other hand, traditional RPA software is also moving towards AI, for example by creating a one-off RPA workflow directly from a prompt and then executing it, which is similar to the experience of this type of product. However, such AI RPA products are not aimed at DeepResearch scenarios, but at assisted and automated operation scenarios.
In addition, I have heard that some To B teams' solutions have similar functions.
1.1. User experience of Zhipu AutoGLM Meditation Edition
Many users may want to compare it with ODR (OpenAI DeepResearch), but my trial suggests it cannot be counted on as a substitute. Rather than ODR, I think it is closer to ChatGPT's Operator. Its real advantage is that it can directly operate the browser and visit websites that search engines or ODR's crawlers cannot access.
From my observation of the model itself, its intelligence is not comparable to the o3 behind ODR. For users who already have access to other DeepResearch products, it is more like a simple one-off AI RPA tool, so do not set expectations for its DeepResearch function too high. But if your research relies heavily on sources such as Xiaohongshu, JD.com, and Zhihu, it may be a better fit.
To operate the browser, you need to install its Chrome extension. While running, it occupies one browser window and may open multiple tabs in it. It does not affect other windows of the same browser, and its window can be placed in the background. Note, however, that when the user needs to log in or take some other action, the prompt appears only in that browser window; there is no prompt in the AutoGLM application.
2. Implementation discussion
2.1. About GUI Agent Model
I haven't studied the implementations of AutoGLM and UI-TARS-desktop in detail, but they presumably read the web page's DOM and may also process screenshots of the page. At present, this kind of software is still browser-based and cannot operate arbitrary applications on the PC.
Many entrepreneurs currently hope to build similar functions. However, I think this type of model depends heavily on its data synthesis; in my view it is close to a state of "data is the model". A major problem here is the lack of data and the high cost of synthesis.
I studied data synthesis in this area a year ago. At that time, the cost of synthesis seemed almost unrealistic to me. Now that models' image understanding has improved, data synthesis in this area looks more promising. But I still think this cannot be done well with a small amount of fine-tuning data. To understand and adapt to common applications and websites, data of this kind will likely need to be added during pre-training or post-training. Relying solely on fine-tuning by external third parties may not be promising.
At present, I think VLMs have not yet entered the reasoning era. Their capabilities seem to come mostly from what was baked in during training. If they could analyze GUI elements during the reasoning process to better understand the interface of an application or web page, success rate and generalization should improve further. But so far I have not seen this ability in the visible thinking output of any VLM.
2.2. About User Account
Currently, AutoGLM uses the user's local browser, which differs from OpenAI's Operator. One benefit is that users do not have to log in to their accounts from another browser, which reduces the probability of being flagged by the website.
There is no unified solution for how this kind of GUI Agent logs in to website accounts, and I see no motivation for websites to actively support these Agents. So the current approach of using the user's local browser and cookies seems to be a good transitional solution.
2.3. About Long Context
At present, overseas models support Long Context well, while domestic models still have some way to go. Operating the browser in particular adds more information to the context, which further increases the pressure on the model.
In the short term, a better approach may be to split some tasks into independent contexts and return only their results. Each independent sub-task then acts as a tool for the main process. This allows multiple requests to run in parallel and also reduces the pressure on the main process's context.
However, this is somewhat less agent-like and less scalable. Once the models' Long Context capability improves, this step may no longer be necessary.
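The idea of splitting sub-tasks into independent contexts can be sketched roughly as follows. This is only a minimal illustration and not how AutoGLM or any other product is implemented: `SubTaskContext`, `run_subtask`, and `run_main` are hypothetical names, and the "browser steps" are simulated with plain strings instead of real model or browser calls.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class SubTaskContext:
    """Private context for one sub-task; it is never merged into the main context."""
    goal: str
    messages: list[str] = field(default_factory=list)


def run_subtask(goal: str) -> str:
    """Hypothetical sub-agent: works in its own context and returns only a short result."""
    ctx = SubTaskContext(goal=goal)
    # Stand-in for many browser steps (page dumps, screenshots, tool output).
    for step in range(1, 4):
        ctx.messages.append(f"step {step}: large page dump for '{goal}'")
    # Only this summary leaves the sub-task; ctx.messages is discarded.
    return f"result for '{goal}' ({len(ctx.messages)} steps)"


def run_main(goals: list[str]) -> list[str]:
    """Main process: treats each sub-task as a tool call and keeps only the results."""
    main_context: list[str] = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() preserves input order, so results line up with goals.
        for result in pool.map(run_subtask, goals):
            main_context.append(result)
    return main_context


if __name__ == "__main__":
    for line in run_main(["check price on site A", "check reviews on site B"]):
        print(line)
```

The main context grows by one short line per sub-task, regardless of how many steps each sub-task took internally. The trade-off is the one noted above: the main process cannot see or react to what happened inside a sub-task.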
2.4. About the cost issue
At present, the cost of this type of GUI Agent is still not low. First, many operations involve many steps; second, many of those steps require processing and analyzing interface images. The total inference cost is therefore still relatively high. Compared with last year, the cost per completed task has actually fallen because the success rate has risen, but the model's unit inference cost has not dropped significantly.
In addition to inference cost, the user's waiting time and the time a cloud browser is occupied are costs that cannot be ignored. Current solutions are slow and still consume a lot of time.
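To make the point about success rate and inference cost concrete, here is a rough back-of-the-envelope sketch. It assumes a deliberately simple model that is my own simplification: a failed run is retried from scratch, and every attempt succeeds independently with the same probability. All the numbers are hypothetical and are not measurements of any product.

```python
def expected_cost_per_success(steps: int, cost_per_step: float, success_rate: float) -> float:
    """Expected spend to get one successful run, assuming failed runs are retried
    from scratch and each attempt succeeds independently with the same probability."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = steps * cost_per_step
    # Expected number of attempts until the first success is 1 / success_rate.
    return cost_per_attempt / success_rate


if __name__ == "__main__":
    # Hypothetical numbers: same per-step price, only the success rate changes.
    last_year = expected_cost_per_success(steps=30, cost_per_step=0.02, success_rate=0.3)
    this_year = expected_cost_per_success(steps=30, cost_per_step=0.02, success_rate=0.6)
    print(f"last year: {last_year:.2f}, this year: {this_year:.2f}")
```

Under these assumptions, doubling the success rate halves the expected cost of a completed task even though the price of each step is unchanged.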
3. Outlook
Although its current usability is not particularly good, I think AutoGLM is still significant as the first To C GUI Agent product seriously made available to users.
From what I have heard, model capabilities in this area are expected to improve significantly over the next year. However, it is still unknown whether they will generalize well to software and web pages they have not seen, and how much they will cost to use.
In any case, Zhipu, or rather OpenAI's Operator, has set a benchmark, and it now waits for other model vendors around the world to formally join the competition.
Since model capabilities in this area probably depend strongly on data coverage of common applications, domestic and overseas models are likely to have different focuses, and it may be difficult to rely on overseas models in domestic application scenarios. This holds not only for PCs but also for mobile phones.
Whether applications, websites, and apps can make good use of these GUI models seems to have become a strategic decision point. Online websites and applications may be a different matter, but for purely on-device software that needs no Internet connection, it is probably better to let these models become familiar with the software. In effect, this is also a form of user training: if the cost of using the software falls, more users will use it.