The four models released in the past two days

Written by

Audrey Miles

Updated on: July 9, 2025

Gemini 2.5 Pro: A comprehensive upgrade of the new generation of "thinking AI"

Google DeepMind released Gemini 2.5 Pro on March 25, 2025. Google positions the model as a "thinking AI": it reasons through a problem before it responds, which is intended to improve its handling of complex tasks. The release is a significant step for Google's AI lineup and widens the range of applications the model can support.

1. Technical highlights: comprehensive improvement in performance and capabilities

Gemini 2.5 Pro is upgraded in several areas:

• Performance advantage: Gemini 2.5 Pro leads the LMArena leaderboard by a significant margin and posts top scores on math, science, and coding benchmarks. On "Humanity's Last Exam," a test designed to probe the frontier of human knowledge, it scored 18.8% without tool use, a record at the time of release. These results point to strong knowledge recall and problem solving.

• Breakthrough in reasoning capabilities: Gemini 2.5 Pro pairs an enhanced base model with improved training techniques, which strengthens its information analysis, logical reasoning, and context-aware decision making. In practice, the model goes beyond understanding its input and can analyze it in more depth to solve complex problems.

• Evolution of coding capabilities: On SWE-Bench Verified, Gemini 2.5 Pro scored 63.8% with a custom agent setup, a significant improvement over its predecessor. It can also generate executable code from a single-line prompt and build visual web applications and video games on its own, which lowers the barrier to programming and makes it easier to turn ideas into working software.

• Multimodal and long context support: Gemini 2.5 Pro inherits the multimodal features of the Gemini series and can parse text, audio, images, video, and entire code repositories. It ships with a context window of 1 million tokens, with an expansion to 2 million tokens planned, so it can take in longer documents and more complex scenarios in a single request.
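To make the 1 million token figure concrete, here is a minimal sketch of a budget check that estimates whether a set of documents would fit in the window. The 4 characters per token ratio and the output reserve are rough assumptions for illustration, not properties of Gemini's tokenizer; a real application would count tokens with the provider's own tooling.

```python
# Rough context-budget check.
# CHARS_PER_TOKEN is a heuristic assumption, not Gemini's actual tokenizer ratio.
CHARS_PER_TOKEN = 4
CONTEXT_WINDOW_TOKENS = 1_000_000  # window size quoted in the article


def estimate_tokens(text: str) -> int:
    """Approximate the token count from character length (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(documents: list[str], reserved_for_output: int = 8_192) -> bool:
    """Return True if all documents plus an output reserve fit in the window."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total + reserved_for_output <= CONTEXT_WINDOW_TOKENS


# Two placeholder "files" of 2.0M and 1.5M characters: about 875k estimated tokens.
codebase = ["x" * 2_000_000, "y" * 1_500_000]
print(fits_in_context(codebase))
```

Under these assumptions, roughly 4 million characters of text is the upper bound before any room is reserved for the model's reply.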

2. New capability: Pre-response logical reasoning

The biggest highlight of Gemini 2.5 Pro is its "pre-response logical reasoning" capability, which enables it to:

• Work out potential solutions in advance: Before responding to a problem, Gemini 2.5 Pro generates and evaluates possible solutions.

• Select the optimal reasoning path: By comparing candidate solutions, Gemini 2.5 Pro can choose the most promising reasoning path and solve the problem faster and more accurately.

• Reduce computing resource consumption: Because candidates are evaluated up front, Gemini 2.5 Pro can avoid unnecessary computation and run more efficiently.

Pre-response reasoning makes Gemini 2.5 Pro more efficient and accurate on complex problems, and the advantage is most visible in scenarios that require a quick response.

DeepSeek V3-0324: Improved coding capabilities and further reasoning

DeepSeek V3-0324 is a large language model from DeepSeek that brings significant improvements in coding, reasoning, and Chinese writing. The upgrade improves the model's performance and also lowers the barrier for developers, opening up new possibilities across AI application scenarios.

1. Upgrade content: Multiple key capabilities improved

The main upgrades of DeepSeek V3-0324 include:

• Excellent reasoning ability: It performs well on benchmarks such as MMLU-Pro, GPQA, and AIME, which shows it can handle complex reasoning tasks. The model is therefore more reliable on problems that require deep thinking and analysis.

• Strong coding ability: It is especially good at front-end web development and generates concise, efficient code, a high share of which runs as written. In user tests, V3-0324 generated hundreds of lines of error-free web page code in a single pass, including dynamic responsive layouts and interactive effects.

• Excellent Chinese writing skills: Its style and content quality are high, which makes it well suited to producing polished Chinese text and a useful aid for content creators.

• Accurate function calling: The model calls functions more accurately, which improves task completion and is critical for automated workflows.
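Function calling is easier to picture with the application side in view. The sketch below shows how a program might dispatch a tool call emitted by a model. The JSON shape (`name` plus `arguments`), the `get_weather` tool, and the `TOOLS` registry are all hypothetical and simplified for illustration; they are not DeepSeek's actual wire format.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would query a weather service."""
    return f"Sunny in {city}"


# Registry mapping tool names the model may emit to local Python callables.
TOOLS = {"get_weather": get_weather}


def dispatch(tool_call_json: str) -> str:
    """Parse a model-emitted tool call and run the matching local function."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    # Keyword arguments come straight from the model, so accuracy matters here:
    # a wrong name or a malformed argument makes the call fail.
    return TOOLS[name](**call["arguments"])


# Assumed example of what a model might emit for "What's the weather in Hangzhou?"
print(dispatch('{"name": "get_weather", "arguments": {"city": "Hangzhou"}}'))
```

The dispatcher only succeeds when the model names an existing tool and supplies well-formed arguments, which is why accuracy in this step matters for automated workflows.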

2. New capabilities: greatly improved code generation

The most significant change in DeepSeek V3-0324 is its improved code generation.

• Executable code generation: It can generate code that runs directly, which is especially effective in front-end web development and greatly reduces debugging time.

• Complex code generation: It can handle advanced coding tasks, such as generating code for complex web applications and large software systems, which reduces development difficulty.

• Large single-pass output: It can generate hundreds or even thousands of lines of code in one pass, which improves development efficiency and shortens the development cycle.

3. Example of generated front-end code (image not included)

Qwen2.5-VL-32B-Instruct: A Leap in Visual Intelligence

Alibaba's open-source Qwen2.5-VL-32B-Instruct model performs well in multimodal understanding and mathematical reasoning.

1. Technical highlights:

• Reinforcement learning optimization: Reinforcement learning significantly improves both the model's ability to solve complex mathematical problems and the overall user experience.
• Visual comprehension: Beyond recognizing common objects, it can efficiently analyze text, charts, icons, and similar elements in images.
• Agent capabilities: It can be used directly as a visual agent that operates computers and mobile phones.
• Video understanding capabilities: It can understand videos up to 1 hour long and accurately locate key segments.
• Structured output capability: It supports structured output for invoices, forms, and other data, which suits finance, commerce, and similar fields.
• Architecture update: Dynamic FPS sampling lets the model understand videos at various sampling rates, and a window attention mechanism speeds up training and inference.
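The sampling side of dynamic FPS can be sketched as follows. This is an illustrative assumption about the general idea, not Qwen's implementation: `sample_frames` is a hypothetical helper that picks frames at a chosen rate and keeps each frame's timestamp, so the same clip sampled at different rates still lines up on one time axis.

```python
def sample_frames(
    duration_s: float, native_fps: float, target_fps: float
) -> list[tuple[int, float]]:
    """Return (frame_index, timestamp_seconds) pairs sampled at target_fps."""
    if native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    total_frames = int(duration_s * native_fps)
    if total_frames == 0:
        return []
    n_samples = int(duration_s * target_fps)
    samples = []
    for k in range(n_samples):
        t = k / target_fps  # absolute time of the k-th sample
        # Nearest native frame, clamped to the last available frame.
        idx = min(int(round(t * native_fps)), total_frames - 1)
        samples.append((idx, t))
    return samples


# The same 2-second, 30 fps clip sampled at two different rates.
print(sample_frames(2.0, 30.0, 2.0))
print(sample_frames(2.0, 30.0, 1.0))
```

Because every sample carries an absolute timestamp, a downstream model can tell that frame 30 sits at the 1-second mark regardless of which sampling rate produced it.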

2. Future applications:

• Multimodal AI agent deployment: The 32B parameter size is considered well suited to deploying multimodal AI agents.
• Fine-grained image understanding and reasoning: It has advantages in tasks such as image parsing, content recognition, and visual logic deduction.
• Solving complex mathematical problems: It solves mathematical problems with significantly higher accuracy.

GPT-4o: A native multimodal innovation

OpenAI's GPT-4o is natively multimodal and performs especially well in image generation.

1. Technical highlights:

• Natively multimodal: It can process and understand text, images, and audio together and generate outputs in any combination.
• Precise text rendering: It excels at rendering text accurately in images, which is ideal for logos, menus, and invitations.
• Multi-turn generation: It builds on the images and text already in the chat context, keeping results consistent across turns.
• Detailed instruction following: It can handle prompts involving up to 10-20 different objects.
• Style adaptability: It can generate or convert images in a variety of styles, from photorealism to stylized illustration.

2. Future applications:

• Design and branding: Generate logos, posters, and advertisements with precise text placement.
• Education and visualization: Create scientific charts, infographics, and historical images.
• Game development: Keep characters consistent across different design iterations.
• Marketing and content creation: Produce social media assets, event invitations, and digital illustrations.

3. Example output (image not included)