First-hand experience: Doubao's screen-sharing calls in internal testing

Written by Jasper Cole
Updated on: July 12, 2025
Recommendation

Explore the Doubao large model's real-time screen-sharing and analysis capabilities, and see how AI understands and helps handle daily tasks.

Core content:
1. A hands-on look at Doubao's internal test of the screen-sharing call feature
2. Examples of the AI analyzing screen content in real time and offering help
3. Doubao's limitations and challenges in handling dynamic video content

In October last year, ByteDance held an event around the Doubao model.

During the event, they not only cut the price of the large model but also demonstrated the parsing capabilities of the Doubao Vision large model. I was surprised when I saw it: in the future, I could call the AI, share my screen, and it would help me make sense of everything, and fast.

However, what was shown then felt more like a showpiece demo, and it was never released to the public. Three months passed, during which I watched them run round after round of internal testing, and today I finally got access to try it.

What exactly is this ability?

Simply put, when you are on a voice call with Doubao, a "Share Screen" button appears on the interface; after you tap it, Doubao can see your desktop in real time, analyze what is on it, offer suggestions, and help solve concrete problems.
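
To make the mechanism concrete, here is a minimal sketch of the general pattern such a feature implies: capture the screen at a short interval and hand each frame, plus the user's question, to a vision-capable model. This is not Doubao's implementation; analyze_frame is a hypothetical placeholder for whatever multimodal API you have access to, and mss is just one common screenshot library.

```python
import time

import mss          # cross-platform screenshot library (pip install mss)
import mss.tools


def analyze_frame(png_bytes: bytes, question: str) -> str:
    """Hypothetical placeholder: send one screenshot plus a text prompt
    to a vision-capable model and return its answer. Doubao's service is
    not a public API here, so wire this to whatever model you actually use."""
    raise NotImplementedError


def share_screen_loop(question: str, interval_s: float = 2.0) -> None:
    """Grab the primary monitor every few seconds and ask the model about it."""
    with mss.mss() as sct:
        monitor = sct.monitors[1]  # index 0 is the combined virtual screen
        while True:
            shot = sct.grab(monitor)
            png = mss.tools.to_png(shot.rgb, shot.size)  # encode raw pixels as PNG
            print(analyze_frame(png, question))
            time.sleep(interval_s)


# share_screen_loop("Can you see what's on my screen?")
```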

I am used to shortcut keys, so I just press Option + S to bring up the function. Of course, you can also click the button on the interface; it mainly comes down to personal habit.

First, I asked it: Can you see my screen?

It said: I can't see it. I thought, what's going on? After fiddling with it for a minute, I hung up the call and started it again, and it worked. I guess it was a glitch caused by system latency.

Next, I decided to try it out properly. At the time, three windows were open on my screen: the first was a chat with Kimi, the second was my WeChat group chat, and the third was a document I was writing.

I asked it: Can you see what's on my screen?

Doubao replied: There is a friend named Song Jingwei in your WeChat group. Judging from his name, he seems to be from Focus Media. He said in the group that 60 is the time to fight hard, and other friends are also discussing this.

I was a little surprised that Doubao could see in such detail.

I then asked: "Can you see what Kimi and I are talking about?" It said: "Your Kimi window is open, and it looks like you have questions to ask. How about I help you think them through and tell Kimi?" Indeed, I had Kimi's chat window open but hadn't typed anything. Was it offering to operate Kimi for me?

Then I glanced at the document on the right side of the screen and asked again: Can you see my document? It said: Yes, you are writing about your experience with Doubao... Indeed, I was writing this very experience document.

Finally, I opened Safari and searched Baidu; it said: I see you are searching for something, do you need help? I can help you think about it.

The interaction is very fast, like having a pair of eyes fixed on the computer screen. Still, just looking at pictures, reading chat logs, and reading text felt fairly basic, so I tested its math capabilities.

Writing a complex problem out by hand was clearly beyond me, so I handed the job to Kimi. After Kimi finished writing the problem, I copied it into a memo and asked Doubao to solve it over screen sharing.

Doubao took only about 3 seconds to give the answer: 18 yuan. Besides the answer, it also explained its calculation process in detail, and everything went very smoothly.

After experiencing this, a question suddenly occurred to me:

Since Doubao can recognize flat content on the screen, can it handle three-dimensional or dynamic content, such as short videos?

So I opened a short video I had shot a few days earlier, about one minute long, and asked Doubao whether it could help me "watch" it. As expected, Doubao replied that it couldn't watch it directly. That's reasonable; the video is dynamic, and parsing it in real time is probably hard for it.

However, I didn't give up.

I opened a WeChat Channels video, waited for it to finish playing, and then asked: what was that video about? Doubao said: you are watching a WeChat Channels video, and the content is mainly a blind-date scene.

There are some pain points in this process: if the video is too long, Doubao only listens for about a minute before it automatically stops and starts summarizing, even though the video may not have finished playing by then.
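
The one-minute cut-off reads like a bounded observation window: sample frames while the video plays, stop after a fixed budget, then summarize whatever was collected. Here is a rough sketch of that pattern, assuming the same hypothetical model call style as before (summarize_frames is a made-up placeholder, not a Doubao API).

```python
import time

import mss
import mss.tools


def summarize_frames(frames: list[bytes]) -> str:
    """Hypothetical placeholder: ask a multimodal model to summarize
    a sequence of sampled screenshots."""
    raise NotImplementedError


def watch_playing_video(max_seconds: float = 60.0, every_s: float = 2.0) -> str:
    """Sample the screen while a video plays, stop after max_seconds,
    then summarize the collected frames, mirroring the cut-off behaviour
    described above."""
    frames: list[bytes] = []
    deadline = time.monotonic() + max_seconds
    with mss.mss() as sct:
        monitor = sct.monitors[1]  # primary display
        while time.monotonic() < deadline:
            shot = sct.grab(monitor)
            frames.append(mss.tools.to_png(shot.rgb, shot.size))
            time.sleep(every_s)
    return summarize_frames(frames)
```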

So, in total I tested four things: reading social chat records, operating Kimi, looking at pictures, and watching videos.

There are already many products built around screen sharing, operating the computer desktop, and analyzing content and vision, such as Highlight AI.

It is a very powerful desktop AI tool that I have used for a long time. It can directly operate applications such as WeChat and Notion, the interaction is very smooth, and it can be driven by voice and custom shortcut keys. I can also have it extract content from WeChat official accounts or translate the text on the screen.

Google's ScreenAI mainly analyzes icons, pictures, and charts on the screen and generates summaries. I can also use it to discuss the layout of a web page or answer questions about charts. It is well suited to processing visual information.

There are also OmniParser and ChatGPT. Although these tools have different focuses, they all revolve around screen content sharing, operation, and analysis.

For a user in China, I think the main drawback of these tools is network problems: the experience is not good enough, and there are frequent freezes or even interruptions during use. The arrival of Doubao's screen-sharing voice assistant happens to solve exactly this problem.

I have been wondering: what kinds of scenarios is this capability of Doubao's best suited to?

Later, I felt that the use of AI assistants cannot be defined by specific scenarios, because they are essentially more like an all-around agent, and an agent should have broader capabilities rather than be confined to one scenario.

In addition, I believe the challenge for AI assistants has shifted from "capability" to "interaction." This new interaction mode can be seen as a major upgrade over the graphical user interface (GUI).

Why do I say so?

In the past, using a computer mainly meant clicking icons, pressing buttons, and digging through menus. That approach seems intuitive, but as functions multiply the screen gets cluttered, the learning cost rises, and everything becomes cumbersome to use.

In addition, every time we want to complete a task we have to do it manually, clicking here and there, which is passive and time-consuming. For example, when handling several tasks at once, we have to switch back and forth between writing documents, looking up information, and viewing files, which is inefficient and makes for a choppy experience.

The emergence of AI assistants changed all that.

It can understand our language instructions. If we want it to do something, we just need to ask it directly, without having to remember complicated operating steps. If I want to record an idea, I just need to say "please write it down for me"; if I encounter a problem, I can say "help me solve it". The whole process is simple and efficient.

More importantly, the AI assistant can complete tasks in the background automatically, without us having to stare at the screen the whole time. It can understand my intent, break a complex task down into multiple steps, and complete them one by one.
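
"Break a complex task into steps and complete them one by one" is, at its core, a plan-then-execute loop. Here is a toy sketch of that idea; plan_steps and run_step are hypothetical stand-ins for model and tool calls, not anything Doubao exposes.

```python
def plan_steps(request: str) -> list[str]:
    """Hypothetical: ask a language model to split a natural-language
    request into an ordered list of concrete steps."""
    raise NotImplementedError


def run_step(step: str) -> str:
    """Hypothetical: carry out one step (a tool call, a search, an edit)
    and return what happened."""
    raise NotImplementedError


def handle_request(request: str) -> list[str]:
    """Plan first, then execute each step in order; the user can walk away
    while this runs in the background."""
    results = []
    for step in plan_steps(request):
        results.append(run_step(step))
    return results
```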

It left a deep impression on me: in the past, when I asked questions, I had to type them out word for word, and sometimes my train of thought broke before I finished typing.

Now, with voice input, I can get the whole question out in one breath. Even if my phrasing is incomplete, the AI assistant understands what I mean and helps me work through the problem step by step.

In my opinion, product managers building consumer-facing (to-C) AI should think hard about one proposition: how to further optimize the interaction, shifting the way users engage from traditional buttons to more natural conversation. Perhaps only then can we truly move from "tool" to "intelligent partner."

What do you think?