Just now, OpenAI released o3 and o4-mini, bringing explosive improvements in multimodal reasoning!

Core content:
1. The breakthrough progress of o3 and o4-mini in visual reasoning
2. Practical examples: reading notebook text and solving a maze
3. Performance benchmarks: o3 and o4-mini surpass previous models on multiple visual tasks
Original article: https://openai.com/index/thinking-with-images/
On April 16, 2025, OpenAI released its latest artificial intelligence models, o3 and o4-mini, which make breakthrough progress in visual reasoning. According to OpenAI, these models can not only "see" images but also "think" with images in their reasoning chain, significantly improving how they process visual information. They can crop, scale, rotate, and otherwise manipulate user-uploaded images without relying on separate dedicated models. In addition, they can be combined with tools such as web search, Python data analysis, and image generation to deliver a multimodal intelligent experience, giving users an unprecedented way to interact.
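For developers, this capability is reachable through OpenAI's API. Below is a minimal sketch, assuming the official openai Python SDK and an OPENAI_API_KEY in the environment; the model name's availability on a given account, the image URL, and the prompt are placeholders, not part of OpenAI's announcement:

```python
# Minimal sketch: send an image to a multimodal reasoning model via the
# Chat Completions interface of the official openai SDK (pip install openai).
# The image URL and prompt are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="o4-mini",  # assumes this model is available on your account
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does the handwritten note in this photo say?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/notebook.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```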
The development of this technology may change the way we interact with AI, making it more intuitive and closer to the way humans process visual information. Below, we will introduce the practical applications, performance and future development directions of these models in detail.
Image Reasoning in Practice
OpenAI demonstrated the powerful capabilities of o3 and o4-mini in visual reasoning through a series of examples. These examples not only reflect the technical strength of the model, but also demonstrate its application potential in real-world scenarios.
Example 1: Reading Notebook Text
In one example, the model analyzed a photo of a notebook whose text was upside down. By rotating the image and cropping to the text area, the model successfully read the content: "February 4 - Completed Roadmap". The entire reasoning process took only 20 seconds, demonstrating the model's efficiency at processing complex visual information (see the sketch below).
(Note: The original article contains an image showing the text in the notebook. It is recommended to visit the original article to view it.)
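To make the rotate-and-crop step concrete, here is a minimal sketch of the same manipulation using Pillow. It illustrates the technique only and is not OpenAI's internal code; the file name and crop coordinates are hypothetical:

```python
# Rotate an upside-down photo, then crop to the text region, so the text
# can be read or passed to OCR. File name and crop box are placeholders.
from PIL import Image

img = Image.open("notebook.jpg")                   # photo with inverted text
upright = img.rotate(180)                          # flip it right side up
text_region = upright.crop((100, 200, 900, 400))   # (left, upper, right, lower)
text_region.save("notebook_text.png")              # region ready for reading
```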
Example 2: Solving a Maze
Another striking example is solving a maze. The user uploaded an image of a maze, and the model completed its reasoning in 1 minute and 44 seconds, not only finding the correct path but also drawing it with a red line and generating an image of the solved maze. The process involves image processing techniques such as thresholding and dilation, demonstrating the model's capability on complex visual tasks (see the sketch below).
(Note: The original article contains images of the maze and its solution path. It is recommended to visit the original article for viewing.)
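To illustrate the pipeline the article names (thresholding and dilation), here is a hedged sketch using OpenCV plus a breadth-first search for the path. The file names, start/goal cells, and the BFS pathfinder are illustrative assumptions, not OpenAI's actual method:

```python
# Classic maze-solving pipeline: threshold the image so walls become white,
# dilate the walls for a safety margin, then BFS over the remaining free cells.
import cv2
import numpy as np
from collections import deque

img = cv2.imread("maze.png", cv2.IMREAD_GRAYSCALE)          # hypothetical file
_, walls = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY_INV)  # dark walls -> 255
walls = cv2.dilate(walls, np.ones((3, 3), np.uint8))            # thicken walls
free = walls == 0                                               # passable cells

def bfs(free, start, goal):
    """Breadth-first search over passable cells; returns a list of (row, col)."""
    queue, prev = deque([start]), {start: None}
    while queue:
        cur = queue.popleft()
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nxt[0] < free.shape[0] and 0 <= nxt[1] < free.shape[1]
                    and free[nxt] and nxt not in prev):
                prev[nxt] = cur
                queue.append(nxt)
    return None  # no path found

path = bfs(free, (5, 5), (395, 395))        # hypothetical start/goal pixels
out = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
for r, c in path or []:
    out[r, c] = (0, 0, 255)                 # draw the solved path in red (BGR)
cv2.imwrite("maze_solved.png", out)
```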
These examples show that o3 and o4-mini are able to handle a wide variety of vision tasks, from simple text recognition to complex path planning, providing users with powerful tools.
Performance Benchmarks
To evaluate the performance of o3 and o4-mini, OpenAI compared them with the earlier GPT-4o and o1 models across multiple visual task benchmarks. All tests were run at the high "reasoning effort" setting to ensure the results reflect each model's maximum potential. (Note: the original article includes the full benchmark table; see the source link above.)
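For readers reproducing this setup, reasoning effort is a request-time parameter in the openai Python SDK for o-series models; a minimal sketch of setting it to high (model availability on a given account is an assumption):

```python
# Request high reasoning effort, as used in the benchmark runs.
# reasoning_effort accepts "low", "medium", or "high" for o-series models.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="o3",  # assumes this model is available on your account
    reasoning_effort="high",
    messages=[{"role": "user", "content": "Walk through your reasoning step by step."}],
)
print(response.choices[0].message.content)
```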
Key Observations
- Significant improvement: o3 and o4-mini surpass GPT-4o and o1 across all tests, especially on the MathVista and V* benchmarks.
- V* benchmark breakthrough: o3 achieves 96.3% accuracy on the V* benchmark, almost completely solving this visual search task and marking a major advance in visual reasoning technology.
- Reasoning without browsing: the models achieve these gains through image thinking alone, without relying on external browsing, demonstrating the strength of their intrinsic reasoning capabilities.
These results demonstrate that o3 and o4-mini set new industry benchmarks on visual reasoning tasks, providing more powerful tools for academic research and practical applications.
Limitations and Future Directions
Although o3 and o4-mini have achieved impressive results, they still have some limitations:
- Excessively long reasoning chains: the models may perform redundant or unnecessary image manipulations and tool calls.
- Perception errors: the models can still misread visual details, leading to incorrect answers.
- Reliability: across multiple attempts at the same task, the models may follow different visual reasoning processes, some of which lead to incorrect results.
Future Plans
OpenAI says it is working to optimize these models to address the above issues. Specific plans include:
- Simplify the reasoning process: reduce redundant operations and make the reasoning chain more concise and efficient.
- Improve accuracy: improve perception and reduce errors to ensure more reliable output.
- Enhance reliability: optimize the model architecture to ensure consistent results across multiple inferences.
- Multimodal research: continue exploring multimodal reasoning to further enhance the models' combined capabilities across vision, text, and other data types.
These improvements will enable o3 and o4-mini to play a role in a wider range of application scenarios in the future, such as education, scientific research, and creative design.
Conclusion
OpenAI's o3 and o4-mini models open a new chapter in artificial intelligence through image thinking. They can not only handle complex visual tasks but also combine with a variety of tools to deliver a multimodal intelligent experience. Despite some limitations, OpenAI's ongoing research and optimization plans suggest that future models will be more efficient and reliable.