ChatGPT o3 vs DeepSeek R1 performance comparison, which one is better?

Written by
Audrey Miles
Updated on:July-10th-2025
Recommendation

The latest AI performance showdown, ChatGPT o3 and DeepSeek R1, which one is better?

Core content:
1. The core capabilities and market positioning of ChatGPT o3 and DeepSeek R1
2. Performance comparison in various fields: mathematical science reasoning, programming engineering capabilities
3. How to use ChatGPT in China, and related resource recommendations

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)

ChatGPT o3 and DeepSeek R1

•  ChatGPT o3 focuses on "deep reasoning" capabilities, optimizing the efficiency of solving mathematics, programming and scientific problems by dynamically adjusting the reasoning intensity (low/medium/high). For the first time, the basic version (o3-mini) is open to free users, aiming to expand the user base and lower the threshold for using AI.

•  DeepSeek R1 takes "cost revolution" as its core selling point, adopts an open source ecosystem and extremely compressed training costs (only US$5.6 million), adapts to domestic chips (such as Huawei Ascend), and focuses on small and medium-sized developers and the enterprise market. It is called the "Pinduoduo of the AI ​​world."

Performance comparison

1.  Mathematical and Scientific Reasoning

•  AIME 2024 Mathematics Competition : o3-mini’s accuracy rate under high reasoning intensity is 87.3% vs R1’s 79.8%; but in low-intensity mode, R1 (71.5%) surpasses o3 (60%).

 •  Doctoral-level scientific questions (GPQA) : o3 has a maximum accuracy of 79.7%, slightly better than R1’s 71.5%; however, R1 has a lower error rate in unstructured data processing.

 •  Interdisciplinary comprehensive capabilities : o3 achieved 87.5% accuracy in the ARC-AGI test (the human level threshold is 85%), while DeepSeek did not disclose similar data.

2.  Programming and engineering skills

•  Code generation (SWE-bench) : o3 scored 71.7 vs R1’s 71.6, but the code generated by R1 has better execution integrity and stability (such as no "penetration" problem).

 •  Competitive Programming (Codeforces) : o3 Elo score is 2727, significantly higher than R1 (specific value not disclosed).

3.  Anti-hallucination and reasoning stability

•  Bayesian reasoning experiment : o3-mini had the highest accuracy rate (88%) under the prompt condition, and the reasoning process was concise and logically clear; R1 had the correct conclusion but the process was lengthy and confusing, and the number of words used was 3-10 times that of o3. 

•  Security audit : o3 filters harmful content through deep alignment technology, while R1 has a jailbreak attack vulnerability.

How to use ChatGPT in China

To use chatgpt in China, you usually go through a mirror website or share a house. You can follow me and send " share a house " to get detailed information.