Milestone, the GPT-4.5 large model officially passed the Turing test!

The field of artificial intelligence has ushered in a major breakthrough. The GPT-4.5 model was mistaken for a human at a rate of 73%, far exceeding other models.
Core content:
1. GPT-4.5 and LLaMa-3.1 passed the Turing test for the first time, with the rates of being mistaken for a human reaching 73% and 56% respectively
2. Experimental design: The judge had 8 rounds of 5-minute conversations with humans and AI to determine which one was a human
3. The most common strategy used by the judge: chatting and asking about daily activities, judging language style and interaction dynamics
GPT-4.5 was judged to be human 73% of the time, significantly higher than the proportion of real human participants selected. LLaMa-3.1 was judged human 56% of the time under the same prompts, which was not significantly different from human participants.
The Turing test was proposed by Alan Turing in 1950 to determine whether a machine can demonstrate intelligence similar to that of a human. In the test, a human judge simultaneously communicates with two humans and a machine through a text interface. If the judge cannot reliably identify the human, the machine is considered to have passed the test. The Turing test has been an important topic in the field of artificial intelligence for the past 75 years and is one of the important criteria for measuring artificial intelligence.
The Turing Test game interface, showing an ongoing conversation between an interrogator (green) and two witnesses (grey). The timer at the top shows the time remaining in the game.
Research Methods
Four AI systems were selected: GPT-4.5, LLaMa-3.1-405B, GPT-4o, and ELIZA. Among them, GPT-4.5 and LLaMa-3.1 are systems based on large language models, while GPT-4o and ELIZA are used as baseline models.
Two prompt types were tested: NO-PERSONA and PERSONA , the latter containing more detailed instructions requiring the model to respond with a specific personality.
The study was conducted in two independent populations: UCSD psychology undergraduates and participants of the Prolific platform.
The experimental design is that in each round of dialogue, the referee has a dialogue with two humans and an AI system, and the referee needs to judge which one is the human. The experiment was carried out for a total of 8 rounds, and each round of dialogue lasted for 5 minutes . The referee gave a judgment after the dialogue and explained the reason. The researchers recorded the referee's judgment results and analyzed the winning rate of the AI system.
Experimental Results
Win rate : GPT-4.5-PERSONA has a win rate of 73% , LLAMA-PERSONA has a win rate of 56% , while GPT-4.5-NO-PERSONA and LLAMA-NO-PERSONA have a win rate of 36% and 38%, respectively. The baseline models ELIZA and GPT-4o-NO-PERSONA have the lowest win rates, at 23% and 21%, respectively.
Judge Accuracy : The judges’ accuracy in the ELIZA game was significantly above chance, indicating that the judges were able to distinguish between humans and ELIZA. However, in the GPT-4.5-PERSONA and LLAMA-PERSONA games, the judges’ accuracy was not significantly above chance, indicating that they could not reliably distinguish between humans and these models.
Tactics and rationale : The most common tactic used by referees was to engage in small talk with witnesses, asking about their daily activities or personal details. The rationale most often cited by referees was the witness's language style and interaction dynamics, such as whether more "human" language was used or whether the conversation flowed smoothly.