Just now, OpenAI released o3 and o4-mini, bringing explosive improvements in multimodal reasoning!

Core content:
1. The breakthrough progress of o3 and o4-mini in visual reasoning
2. Practical examples: reading notebook text and solving a maze
3. Performance benchmarks: o3 and o4-mini surpass previous models on multiple visual tasks
Original article: https://openai.com/index/thinking-with-images/
On April 16, 2025, OpenAI released its latest artificial intelligence models, o3 and o4-mini, which make breakthrough progress in visual reasoning. According to OpenAI, these models can not only "see" images but also "think" with images in their reasoning chain, significantly improving how they process visual information. They can crop, scale, rotate, and otherwise manipulate user-uploaded images without relying on separate dedicated models. In addition, they can be combined with tools such as web search, Python data analysis, and image generation to deliver a multimodal intelligent experience, giving users an unprecedented way to interact.
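For developers, this capability is reachable through OpenAI's API. Below is a minimal sketch, assuming the official openai Python SDK and an OPENAI_API_KEY in the environment; the model name's availability on a given account, the image URL, and the prompt are placeholders, not part of OpenAI's announcement:

```python
# Minimal sketch: send an image to a multimodal reasoning model via the
# Chat Completions interface of the official openai SDK (pip install openai).
# The image URL and prompt are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="o4-mini",  # assumes this model is available on your account
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does the handwritten note in this photo say?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/notebook.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```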
The development of this technology may change the way we interact with AI, making it more intuitive and closer to the way humans process visual information. Below, we will introduce the practical applications, performance and future development directions of these models in detail.
Image Reasoning in Practice
OpenAI demonstrated the powerful capabilities of o3 and o4-mini in visual reasoning through a series of examples. These examples not only reflect the technical strength of the model, but also demonstrate its application potential in real-world scenarios.
Example 1: Reading Notebook Text
In one example, the model analyzed a photo of a notebook whose text was upside down. By rotating the image and cropping to the text area, the model successfully read the content: "February 4 - Completed Roadmap". The entire reasoning process took only 20 seconds, demonstrating the model's efficiency at processing complex visual information (see the sketch below).
(Note: The original article contains an image showing the text in the notebook. It is recommended to visit the original article to view it.)
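To make the rotate-and-crop step concrete, here is a minimal sketch of the same manipulation using Pillow. It illustrates the technique only and is not OpenAI's internal code; the file name and crop coordinates are hypothetical:

```python
# Rotate an upside-down photo, then crop to the text region, so the text
# can be read or passed to OCR. File name and crop box are placeholders.
from PIL import Image

img = Image.open("notebook.jpg")                   # photo with inverted text
upright = img.rotate(180)                          # flip it right side up
text_region = upright.crop((100, 200, 900, 400))   # (left, upper, right, lower)
text_region.save("notebook_text.png")              # region ready for reading
```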
Example 2: Solving a Maze
Another striking example is solving a maze. The user uploaded an image of a maze, and the model completed its reasoning in 1 minute and 44 seconds, not only finding the correct path but also drawing it with a red line and generating an image of the solved maze. The process involves image processing techniques such as thresholding and dilation, demonstrating the model's capability on complex visual tasks (see the sketch below).
(Note: The original article contains images of the maze and its solution path. It is recommended to visit the original article for viewing.)
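To illustrate the pipeline the article names (thresholding and dilation), here is a hedged sketch using OpenCV plus a breadth-first search for the path. The file names, start/goal cells, and the BFS pathfinder are illustrative assumptions, not OpenAI's actual method:

```python
# Classic maze-solving pipeline: threshold the image so walls become white,
# dilate the walls for a safety margin, then BFS over the remaining free cells.
import cv2
import numpy as np
from collections import deque

img = cv2.imread("maze.png", cv2.IMREAD_GRAYSCALE)          # hypothetical file
_, walls = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY_INV)  # dark walls -> 255
walls = cv2.dilate(walls, np.ones((3, 3), np.uint8))            # thicken walls
free = walls == 0                                               # passable cells

def bfs(free, start, goal):
    """Breadth-first search over passable cells; returns a list of (row, col)."""
    queue, prev = deque([start]), {start: None}
    while queue:
        cur = queue.popleft()
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nxt[0] < free.shape[0] and 0 <= nxt[1] < free.shape[1]
                    and free[nxt] and nxt not in prev):
                prev[nxt] = cur
                queue.append(nxt)
    return None  # no path found

path = bfs(free, (5, 5), (395, 395))        # hypothetical start/goal pixels
out = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
for r, c in path or []:
    out[r, c] = (0, 0, 255)                 # draw the solved path in red (BGR)
cv2.imwrite("maze_solved.png", out)
```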
These examples show that o3 and o4-mini are able to handle a wide variety of vision tasks, from simple text recognition to complex path planning, providing users with powerful tools.
Performance Benchmarks
To evaluate the performance of o3 and o4-mini, OpenAI compared them with the earlier GPT-4o and o1 models across multiple visual task benchmarks. All tests were run at the high "reasoning effort" setting to ensure the results reflect each model's maximum potential. (Note: the original article includes the full benchmark table; see the source link above.)
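For readers reproducing this setup, reasoning effort is a request-time parameter in the openai Python SDK for o-series models; a minimal sketch of setting it to high (model availability on a given account is an assumption):

```python
# Request high reasoning effort, as used in the benchmark runs.
# reasoning_effort accepts "low", "medium", or "high" for o-series models.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="o3",  # assumes this model is available on your account
    reasoning_effort="high",
    messages=[{"role": "user", "content": "Walk through your reasoning step by step."}],
)
print(response.choices[0].message.content)
```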
Key Observations
- Significant improvement: o3 and o4-mini surpass GPT-4o and o1 across all tests, especially on the MathVista and V* benchmarks.
- V* benchmark breakthrough: o3 achieves 96.3% accuracy on the V* benchmark, almost completely solving this visual search task and marking a major advance in visual reasoning technology.
- Reasoning without browsing: the models achieve these gains through image thinking alone, without relying on external browsing, demonstrating the strength of their intrinsic reasoning capabilities.
These results demonstrate that o3 and o4-mini set new industry benchmarks on visual reasoning tasks, providing more powerful tools for academic research and practical applications.
Limitations and Future Directions
Although o3 and o4-mini have achieved impressive results, they still have some limitations:
- Excessively long reasoning chains: the models may perform redundant or unnecessary image manipulations and tool calls.
- Perception errors: the models can still misread visual details, leading to incorrect answers.
- Reliability: across multiple attempts at the same task, the models may follow different visual reasoning processes, some of which lead to incorrect results.
Future Plans
OpenAI says it is working to optimize these models to address the above issues. Specific plans include:
- Simplify the reasoning process: reduce redundant operations and make the reasoning chain more concise and efficient.
- Improve accuracy: improve perception and reduce errors to ensure more reliable output.
- Enhance reliability: optimize the model architecture to ensure consistent results across multiple inferences.
- Multimodal research: continue exploring multimodal reasoning to further enhance the models' combined capabilities across vision, text, and other data types.
These improvements will enable o3 and o4-mini to play a role in a wider range of application scenarios in the future, such as education, scientific research, and creative design.
Conclusion
OpenAI's o3 and o4-mini models open a new chapter in artificial intelligence through image thinking. They can not only handle complex visual tasks but also combine with a variety of tools to deliver a multimodal intelligent experience. Despite some limitations, OpenAI's ongoing research and optimization plans suggest that future models will be more efficient and reliable.