Milestone: the GPT-4.5 large model has officially passed the Turing test!

Written by
Clara Bennett
Updated on: July 2, 2025

The field of artificial intelligence has reached a major breakthrough: the GPT-4.5 model was mistaken for a human 73% of the time, far exceeding other models.

Core content:
1. GPT-4.5 and LLaMa-3.1 passed the Turing test for the first time, being mistaken for a human 73% and 56% of the time, respectively
2. Experimental design: each judge held 8 rounds of 5-minute conversations, each with one human and one AI witness simultaneously, and had to decide which was the human
3. The judges' most common strategy: small talk about daily activities, with verdicts based on language style and interaction dynamics

An important milestone! Researchers from the University of California, San Diego recently provided the first empirical evidence that artificial systems (LLaMa-3.1-405B and GPT-4.5) have passed a standard three-party Turing test.
  • GPT-4.5 was judged to be human 73% of the time, significantly higher than the proportion of real human participants selected.
  • LLaMa-3.1 was judged human 56% of the time under the same prompts, which was not significantly different from human participants.

The Turing test was proposed by Alan Turing in 1950 to determine whether a machine can demonstrate intelligence similar to a human's. In the test, a human judge communicates simultaneously with a human and a machine through a text interface; if the judge cannot reliably identify the human, the machine is considered to have passed. For the past 75 years, the Turing test has remained a central topic in artificial intelligence and one of the benchmark criteria for measuring machine intelligence.

The Turing Test game interface, showing an ongoing conversation between an interrogator (green) and two witnesses (grey). The timer at the top shows the time remaining in the game.
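To make the three-party setup concrete, here is a minimal sketch of one round in Python. The `human_reply` and `ai_reply` helpers are hypothetical stand-ins for the study's chat interface, not code from the paper.

```python
import random
import time

# Minimal sketch of one round of the three-party Turing test.
# human_reply and ai_reply are hypothetical stand-ins for the real
# chat interface used in the study; they are not taken from the paper.

def human_reply(message: str) -> str:
    return input(f"[human witness] {message}\n> ")

def ai_reply(message: str) -> str:
    # Placeholder for a call to the AI witness (e.g. an LLM with a PERSONA prompt).
    return "not much, just unwinding after class tbh"

def run_round(duration_s: int = 300) -> bool:
    """Run one 5-minute round; return True if the interrogator picks the human."""
    responders = [human_reply, ai_reply]
    random.shuffle(responders)                 # hide which label is the human
    witnesses = list(zip("AB", responders))

    deadline = time.time() + duration_s
    while time.time() < deadline:
        question = input("Interrogator, ask both witnesses (blank to stop): ")
        if not question:
            break
        for label, reply in witnesses:
            print(f"Witness {label}: {reply(question)}")

    verdict = input("Which witness is the human, A or B? ").strip().upper()
    human_label = next(label for label, fn in witnesses if fn is human_reply)
    return verdict == human_label
```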

Research Methods

  • Four AI systems were tested: GPT-4.5, LLaMa-3.1-405B, GPT-4o, and ELIZA. GPT-4.5 and LLaMa-3.1 are large-language-model-based systems, while GPT-4o and ELIZA served as baselines.

  • Two prompt types were tested: NO-PERSONA and PERSONA, the latter containing more detailed instructions requiring the model to respond in a specific persona.

  • The study was conducted with two independent populations: UCSD psychology undergraduates and participants recruited via the Prolific platform.

  • Experimental design: in each round, the interrogator converses simultaneously with one human and one AI system and must judge which is the human. Each interrogator completed a total of 8 rounds, and each conversation lasted 5 minutes. After each conversation the interrogator gave a verdict and explained the reasoning behind it. The researchers recorded these verdicts and analyzed each AI system's win rate (a rough sketch of the resulting condition grid follows this list).
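As a rough, non-authoritative way to scan the design above, the conditions can be laid out as a model-by-prompt-by-population grid. The field names below are my own shorthand, and the assumption that the baseline systems ran only without the PERSONA prompt is inferred from the results reported later, not from the study's code.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative encoding of the study's conditions; the field names are my own
# shorthand, not identifiers from the paper or its materials.

@dataclass(frozen=True)
class Condition:
    model: str        # which AI witness
    prompt: str       # PERSONA vs NO-PERSONA
    population: str   # which participant pool

MODELS = ["GPT-4.5", "LLaMa-3.1-405B", "GPT-4o", "ELIZA"]
PROMPTS = ["PERSONA", "NO-PERSONA"]
POPULATIONS = ["UCSD undergraduates", "Prolific"]

ROUNDS_PER_INTERROGATOR = 8     # rounds completed by each interrogator
ROUND_LENGTH_MINUTES = 5        # length of each conversation

# Assumption based on the reported results: the baselines (GPT-4o, ELIZA)
# appear only in the NO-PERSONA configuration.
conditions = [
    Condition(m, p, pop)
    for m, p, pop in product(MODELS, PROMPTS, POPULATIONS)
    if not (m in {"GPT-4o", "ELIZA"} and p == "PERSONA")
]

for c in conditions:
    print(c)
```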

Experimental Results

  • Win rate: GPT-4.5-PERSONA achieved a win rate of 73% and LLAMA-PERSONA 56%, while GPT-4.5-NO-PERSONA and LLAMA-NO-PERSONA reached only 36% and 38%, respectively. The baselines ELIZA and GPT-4o-NO-PERSONA had the lowest win rates, at 23% and 21% (a sketch of how a win rate and its bootstrap confidence interval can be computed follows this list).

  • Judge accuracy: the judges' accuracy in the ELIZA games was significantly above chance, indicating that they could distinguish humans from ELIZA. In the GPT-4.5-PERSONA and LLAMA-PERSONA games, however, accuracy was not significantly above chance, indicating that judges could not reliably distinguish humans from these models.

  • Tactics and rationale: the most common tactic interrogators used was small talk with witnesses, asking about their daily activities or personal details. The rationale most often cited was a witness's language style and interaction dynamics, such as whether the language sounded more "human" or whether the conversation flowed naturally.
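As a rough illustration of how such figures can be derived, a win rate is just the fraction of games in which the AI was judged to be the human, and a 95% interval can be estimated by bootstrap resampling. The per-game verdicts below are fabricated for the example (73 out of an assumed 100 games) and are not the study's data.

```python
import random

# Hedged sketch: computing a win rate and a 95% bootstrap confidence interval
# from per-game verdicts. The verdict list below is made up for illustration.

def win_rate(verdicts: list[int]) -> float:
    """verdicts[i] == 1 if the interrogator judged the AI to be the human."""
    return sum(verdicts) / len(verdicts)

def bootstrap_ci(verdicts: list[int], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    rng = random.Random(seed)
    rates = sorted(
        win_rate([rng.choice(verdicts) for _ in verdicts])
        for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Fabricated example: the AI was judged human in 73 of 100 games.
example = [1] * 73 + [0] * 27
rate = win_rate(example)
lo, hi = bootstrap_ci(example)
print(f"win rate = {rate:.2f}, 95% bootstrap CI = [{lo:.2f}, {hi:.2f}]")
# If the interval excludes 0.50, the rate differs significantly from chance.
```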

Left: The win rate for each AI witness, i.e., the proportion of games in which interrogators judged the AI system, rather than the actual human witness, to be the human. Error bars represent 95% bootstrap confidence intervals. An asterisk next to a bar indicates whether the win rate differs significantly from chance (50%). Right: Interrogators' confidence in their verdict for each witness type when choosing between the actual human and the AI model. Each point represents a separate game. Points farther to the left or right indicate higher confidence that the witness is an AI or a human, respectively. Error bars represent 95% bootstrap confidence intervals around the mean.
A breakdown of the strategies used by interrogators: the percentage of games in which each strategy was used is shown on the left, and the average accuracy of games using each strategy is shown on the right, with 95% confidence intervals. Participants frequently engaged in small talk, asking witnesses about personal details, activities, or opinions. Interrogators who used unusual utterances or typical large language model (LLM) "jailbreaking" techniques were the most accurate.
The full PERSONA prompts used to guide LLM-based AI agents in how to respond to interrogator messages in the Prolific study.