OpenAI releases GPT-4.5, and it feels more human! Karpathy's first-hand review: there are surprises, but the improvement is subtle

Written by
Audrey Miles
Updated on: July 15th, 2025

OpenAI has released GPT-4.5, and early evaluations show it is more human-like! Its emotional understanding and its ability to give advice have improved greatly.

Core content:
1. GPT-4.5 is released; Sam Altman says it is the first model that feels like talking to a thoughtful person
2. GPT-4.5 excels at understanding users' emotions and needs, and gives gentler, more constructive responses
3. Beyond the emotional-intelligence upgrade, GPT-4.5 has deeper knowledge and more comprehensive capabilities, but reasoning is not its strong suit

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)


 

At 4:00 a.m. Beijing time, OpenAI held a 14-minute livestream: GPT-4.5 is finally out! I got up at 4:00 a.m. to bring everyone the news as soon as possible.

Without further ado, here are Sam Altman's impressions of GPT-4.5:


Sam:

GPT-4.5 is ready!

The good news: it's the first model that feels to me like talking to a thoughtful person. Several times I leaned back in my chair, amazed at getting genuinely good advice from an AI.

The bad news:  this is a large and expensive model. We really wanted to roll it out to Plus and Pro users at the same time, but our user base is growing so fast that we’re running out of GPUs. We’re adding tens of thousands of GPUs next week, then rolling it out to the Plus user tier. (Hundreds of thousands are coming, and I’m sure you’ll use up every one we can deploy.)

This isn't how we like to operate, but it's hard to perfectly predict the growth surges that lead to GPU shortages.

A word of caution: this is not a reasoning model, and it won't crush benchmarks. It's a different kind of intelligence, and there's a magic to it I haven't felt before. Really excited for you to try it!



Think that sounds underwhelming? Let's see what GPT-4.5 actually looks like (the launch video is attached at the end of the article):

At the start of the launch event, OpenAI showed an example. When a user expressed a negative emotion, such as "My friend canceled on me again; I'm so angry I want to send a message telling them off", GPT-4.5 showed surprising understanding and emotional intelligence:

  •  The old model (o1) responded by drafting the angry, scolding message as instructed. It completed the task, but it came across as cold and even added fuel to the fire.
  •  GPT-4.5's reply: it not only suggested gentler, more constructive wording, it also "heard" the user's real need behind the words – the user probably just needed to vent and be comforted, not to actually fall out with their friend!

This kind of subtle emotional understanding and nuanced response is one of GPT-4.5's highlights! It is no longer a cold machine; it better understands our true intentions and emotional needs.

More knowledge and more comprehensive abilities

In addition to the emotional-intelligence upgrade, GPT-4.5's knowledge base and capabilities have also improved significantly. At the launch event, OpenAI compared how the GPT-series models answer the question "Why is the ocean salty?":

  •  GPT-1: completely confused.
  •  GPT-2: somewhat closer, but still a wrong answer.
  •  GPT-3.5 Turbo: gave the correct answer, but the explanation was stiff and padded with redundant detail.
  •  GPT-4 Turbo: a good answer, but a bit "showy" and not concise enough.
  •  GPT-4.5: a perfect answer! Concise, clear, and well organized. The opening sentence – "The ocean is salty because of rain, rivers, and rocks" – is catchy and engaging!

Stronger, faster, safer

According to OpenAI, behind these advances is a comprehensive set of technical upgrades to GPT-4.5:

  •  A stronger model: larger model size and more compute yield more powerful language understanding and generation.
  •  An innovative training mechanism: a new training scheme lets such a huge model be fine-tuned with a smaller resource footprint.
  •  Multi-round optimization: multiple rounds of iterative training, combining supervised fine-tuning and reinforcement learning from human feedback (RLHF), continuously improve performance.
  •  Multi-data-center pre-training: to make full use of compute, GPT-4.5 was even pre-trained across multiple data centers! The scale is staggering!
  •  Low-precision training and inference optimization: low-precision training and a new inference system keep the model both fast and good.
  •  A safer model: rigorous safety and preparedness evaluations ensure the model can be shared with the world safely and securely.

Performance

At the conference, OpenAI also demonstrated the performance of GPT-4.5 on various benchmarks:

GPQA (reasoning-intensive science benchmark): a significant improvement! Still behind OpenAI's o3-mini (a model that can think before answering), but very close!

AIME 2024 (American Invitational Mathematics Examination): little improvement relative to the reasoning models

SWE-bench Verified (agentic coding benchmark): only about 7% higher than GPT-4o

SWE-Lancer (agentic coding benchmark that leans more on world knowledge): surpasses o3-mini!

Multilingual MMLU (multilingual language-understanding benchmark): improved by less than 4%

MMMU (multimodal understanding benchmark): multimodal capability improved by about 5%

Andrej Karpathy reviews GPT-4.5

I'm sure everyone looks forward to every GPT iteration as much as I do, and GPT-4.5 has certainly whetted our appetite – after all, it has been about two years since GPT-4 was released! Andrej Karpathy, co-founder of AI giant OpenAI, got early access to GPT-4.5, and he personally shared an in-depth take on it.

GPT-4.5: another step of stacking up compute?

Karpathy noted in his tweet that he had long looked forward to GPT-4.5, because this upgrade offers a qualitative measure of the slope of improvement you get by scaling up pre-training compute (in plain terms, by training a bigger model).

He shared a key piece of information: every +0.5 in the GPT version number roughly corresponds to a 10x increase in pre-training compute!
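Taking this rule of thumb literally, the relative pre-training compute between two GPT versions is about 10^(2·Δv). A minimal sketch of that arithmetic (my own illustration; the actual compute figures are not public):

```python
def relative_pretraining_compute(v_from: float, v_to: float) -> float:
    """Rule of thumb from Karpathy's tweet: each +0.5 in the GPT
    version number ~= 10x pre-training compute, so the ratio is
    10 ** (2 * (v_to - v_from)). Purely illustrative."""
    return 10 ** (2 * (v_to - v_from))

# GPT-4 -> GPT-4.5: one "+0.5" step, i.e. ~10x the compute
print(relative_pretraining_compute(4.0, 4.5))   # -> 10.0
# GPT-3.5 -> GPT-4.5: two steps, i.e. ~100x the compute
print(relative_pretraining_compute(3.5, 4.5))   # -> 100.0
```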

To help everyone grasp what "0.5" means more intuitively, Karpathy also reviewed the history of the GPT series:

  •  GPT-1: barely able to generate coherent text; still very early days.
  •  GPT-2: like a "toy" – limited capabilities and fairly confused.
  •  GPT-2.5: OpenAI "skipped" it and went straight to GPT-3, a far more exciting leap.
  •  GPT-3.5: crossed an important threshold, finally reaching a level that could ship as a product, triggering OpenAI's "ChatGPT moment"!
  •  GPT-4: it does feel better, but Karpathy admits the improvement is subtle. He recalls a hackathon where everyone tried to find prompts on which GPT-4 was clearly better than GPT-3.5. There were differences, but it was hard to find a single slam-dunk example.

The improvement from GPT-4 was more of a "quiet improvement":

  •  More creative word choice
  •  Better grasp of a prompt's subtleties
  •  Analogies that make more sense
  •  A more interesting model overall
  •  World knowledge and understanding of niche domains expanding at the margins
  •  Slightly fewer hallucinations (made-up content)
  •  A better overall vibe

Like "a rising tide lifts all boats", everything improved by roughly 20%.

GPT-4.5: Subtle improvements, but still exciting

Expecting the same kind of subtle improvement he saw with GPT-4, Karpathy tested GPT-4.5 (he got access a few days early). This time, GPT-4.5's pre-training compute is 10x that of GPT-4!

Yet Karpathy found himself right back at that hackathon of two years ago: everything is better, and that's great, but the nature of the improvement is still hard to pin down.

Still, this is interesting and exciting, because it once again qualitatively measures the slope of capability gains you get "for free" simply by pre-training a bigger model. It shows that stacking up compute still yields visible progress, though the progress is more restrained and refined.

Note: GPT-4.5 is not a reasoning model

Karpathy stressed that GPT-4.5 is trained only with pre-training, supervised fine-tuning, and RLHF (reinforcement learning from human feedback), so it is not yet a true "reasoning model".

This means that on tasks demanding strong reasoning (math, code, etc.), GPT-4.5's gains may not be significant. In those areas, reinforcement-learning training that teaches the model to "think" is critical, and it works well even on older base models (e.g., models with GPT-4-level capability).

For now, OpenAI's most advanced model in this respect is still the full o1. It is speculated that OpenAI may run reinforcement-learning training on top of GPT-4.5 to give it the ability to "think", pushing its reasoning performance further.

GPT-4.5’s strengths: EQ, not IQ

Although the reasoning gains are limited, Karpathy believes we can still expect GPT-4.5 to improve on tasks that do not lean heavily on reasoning. He sees these tasks as tied more to emotional intelligence (EQ) than to IQ, with bottlenecks such as:

  •  World knowledge
  •  Creativity
  •  Ability to draw analogies
  •  General comprehension
  •  Sense of humor

Therefore, Karpathy paid most attention to these aspects when testing GPT-4.5.

Karpathy's fun "LM Arena Lite" experiment

To show the differences between GPT-4 and GPT-4.5 on these "EQ"-related tasks more concretely, Karpathy launched a fun "LM Arena Lite" experiment.

He carefully selected 5 interesting/humorous prompts to test the models on the abilities above. He posted screenshots of the prompts and of GPT-4's and GPT-4.5's responses on X, each paired with a poll asking everyone to vote for the better response, similar to the following questions and voting format:
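As an aside, arena-style head-to-head votes like these are usually aggregated into a leaderboard with an Elo-style rating. A minimal sketch of that update rule (my own illustration, not Karpathy's or LM Arena's actual code):

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32) -> tuple[float, float]:
    """Update both ratings after one matchup.
    score_a is 1.0 if A's response won the vote, 0.0 if it lost, 0.5 for a tie."""
    e_a = elo_expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# Both models start at 1000; say "GPT-4.5" wins one head-to-head vote.
r_45, r_4 = elo_update(1000, 1000, 1.0)
print(r_45, r_4)  # -> 1016.0 984.0
```

With enough votes over many prompt pairs, the ratings converge toward each model's win tendency, which is the basic idea behind arena leaderboards.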