After DeepSeek revealed the secrets of o1, Xiaohongshu seems to have cracked the technical route of o3, and open-sourced it too.

Written by Caleb Hayes
Updated on: June 18, 2025

Recommendation

The Xiaohongshu team cracked the o3 technical route and delivered a notable breakthrough in AI agent technology!

Core content:
1. OpenAI o3 model technical route and feature analysis
2. DeepSeek R1 and o1 technical comparison and open source impact
3. How the Xiaohongshu team cracked the o3 puzzle and built an agent framework that thinks while looking at pictures

Yang Fangxian, Founder of 53A, Tencent Cloud Most Valuable Expert (TVP)

Since OpenAI became "CloseAI", we can only watch them release increasingly powerful models (GPT-3.5 -> GPT-4 -> 4o -> o1 -> 4.5 -> o3 -> 4.1), while the technical route behind their most advanced models stays hidden.

For everyone else, this becomes a decryption task: someone has to solve the puzzle first, and once an open-source version appears, the rest of the field can rush to follow.

It has been a while since I posted anything about algorithms, and many regular readers of this account are probably non-technical, so this article will try to explain the topic in the simplest, most accessible way possible.

DeepSeek R1 was open-sourced at the beginning of this year, model and paper together, and became a global hit with intelligence on par with o1.

After R1, various R1-VL models were open-sourced, largely to ride the hype. Why do I say that? Because they really just swap the backbone for a VL model; everything else is trained exactly like a pure language model.

So how strong is OpenAI o3, and how does it differ from these R1-VL models? o3 is actually an agent that can autonomously call tools to complete a task: it thinks while looking at the picture. The image is not only fed in on the first turn; intermediate images generated along the way are also fed back into the VL model as new information. The figure below shows a partial example of such a thinking trace:

What answer did DeepSeek R1 give to the o1 puzzle?

You only need to reward the correctness of the final result after thinking; there is no need for complex machinery such as Monte Carlo tree search or per-step rewards. With this outcome reward plus reinforcement learning on top of a strong base model, the model learns to think deeply before answering, which unlocks stronger intelligence.
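As a rough illustration, an outcome-only reward of this kind can be as simple as the sketch below (the function and exact matching rule are illustrative, not taken from the R1 paper):

```python
# A minimal sketch of an R1-style outcome reward: the only training signal
# is whether the final answer is correct, with no step-level rewards and no
# tree search. (Names and matching rule are illustrative assumptions.)
def outcome_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0
```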

A paper came out the other day whose core contributors are from the Xiaohongshu team. Why do I say they "seem to have solved the puzzle of o3"?

The core capability of o3 is tool calling, which is what lets it think while looking at the image. Each round of thinking produces an action; the observation returned by the tool call is appended to the model's input, which then produces the next action, and so on.
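Conceptually, the loop looks something like the sketch below. The function names (vlm_generate, run_tool) are hypothetical placeholders, not APIs from o3 or the paper:

```python
# A minimal sketch of the think -> act -> observe loop described above.
# vlm_generate and run_tool are hypothetical placeholders for a VLM call
# and a tool executor; they are assumptions, not real APIs.
def agent_loop(question, image, max_turns=8):
    context = [("image", image), ("text", question)]
    for _ in range(max_turns):
        step = vlm_generate(context)            # model emits a thought plus an action
        if step.action == "answer":
            return step.content                 # a final answer ends the loop
        observation = run_tool(step.action, step.arguments)
        context.append(("text", step.content))  # keep the thought in context
        context.append(("image", observation))  # feed the new image back in
    return None
```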

The motivation in the paper starts from an observation: when people face a visual problem, they don't just glance at the picture once and then reason about it purely in words in their heads. They think while looking. Some experts are even natural "visual thinkers" who reason mainly in images and visualizations.

As mentioned earlier, the way most VLMs process images is still too crude. The image is encoded into embeddings by the vision encoder once, at input time, and from then on it is just static background: all of the model's subsequent "thinking" is essentially pure text CoT. This is inevitably lossy for complex visual information.
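In pseudocode, that "one-shot" pipeline looks roughly like this (vision_encoder, tokenizer, and language_model are generic stand-ins, not any specific model's API):

```python
# A rough sketch of how most VLMs handle an image: encode it a single time,
# then do all remaining reasoning over text tokens only.
# (All three functions are generic placeholders, i.e. assumptions.)
image_tokens = vision_encoder(image)                   # image is encoded once, up front
prompt_tokens = tokenizer(question)
answer = language_model(image_tokens + prompt_tokens)  # pure text CoT from here on
```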

If we can make a VLM truly think while looking at the picture, it may evolve reasoning strategies that are unique to the visual domain and completely different from text CoT.

But there is a hard problem here: a VLM itself cannot directly output image tokens, so how do we make the model think while looking at the image?

The answer DeepEyes gives is: an agent framework. The model is equipped with a single image tool, image zoom-in. By outputting bounding-box coordinates, the model can call this tool and take a closer look at whatever region of the image it considers interesting or important, following its own "will".
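A minimal sketch of what such a zoom-in tool could look like is below, using PIL for the crop; the actual tool interface and crop resolution in DeepEyes may differ:

```python
# A minimal sketch of an image zoom-in tool: crop the bounding box the model
# predicted and resize the crop so fine details become visible to the vision
# encoder on the next turn. (target_size is an illustrative assumption.)
from PIL import Image

def image_zoom_in(image: Image.Image, bbox: tuple, target_size=(448, 448)) -> Image.Image:
    """bbox is (left, upper, right, lower) in pixel coordinates, as output by the model."""
    crop = image.crop(bbox)
    return crop.resize(target_size)
```

The returned crop is then appended to the conversation as a new image observation, so the model literally gets to "look again" at the region it asked about.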

Why choose the image zoom-in tool?

First, it applies to virtually any visual understanding or reasoning task. Second, it relies on the model's own localization output (the bounding box), whose accuracy can be improved through end-to-end RL training.

Finally, how should the agent be rewarded during RL training? The design that works best is to grant an extra bonus only when the problem is solved correctly and the tool was successfully called, so that the reward for solving the problem correctly without calling the tool is smaller than the reward for calling the tool and solving it correctly. This gives the most stable training and the best results; a sketch of such a reward is shown below.
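The following is one way to write that reward down. The base and bonus values are illustrative assumptions, not the numbers used in the paper:

```python
# A minimal sketch of the tool-use reward scheme described above:
# wrong answers earn nothing; correct answers earn a base reward; correct
# answers that also used the tool earn a bonus on top. (Values are assumptions.)
def agent_reward(answer_correct: bool, tool_called_successfully: bool) -> float:
    if not answer_correct:
        return 0.0                 # no credit for wrong answers
    reward = 1.0                   # base reward for a correct answer
    if tool_called_successfully:
        reward += 0.5              # extra bonus: correct AND used the tool
    return reward
```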

Their model performs very well and can go head-to-head with o3 in certain scenarios.

In the end, the mystery is solved, and it may look easy in hindsight, just like R1. But making a seemingly simple idea actually work involves far more pitfalls than most people imagine.