Just now, OpenAI released GPT-4.1, 1 million contexts, super code capabilities

Written by

Audrey Miles

Updated on:June-28th-2025

At 1:30 this morning , OpenAI announced the opening of GPT-4.1 , which can be used in ChatGPT starting today .

GPT-4.1 is a model specifically designed for coding tasks and instruction execution. It has very high reasoning efficiency and is a very good choice to replace o3 and o4-mini for daily coding needs .

GPT-4.1 is the latest model released by OpenAI . One of its biggest highlights is that it supports 1 million tokens context. This is also the first time that OpenAI has released a long window model.

Compared to previous models, GPT-4.1 , GPT-4.1Mini , and GPT-4.1Nano are able to process contexts of up to 1 million tokens , which is 8 times that of GPT-4o .

OpenAI tested long texts on LongContextEvals. The test results showed that the three models in the GPT-4.1 series can find the target text at any depth in the corpus, whether it is the beginning, middle or end, and even in a context of up to 1 million tokens , the model can still accurately locate the target text.

OpenAI also conducted tests on Multi-Round Coreference , testing the model's ability to understand and reason in long contexts by creating synthetic dialogues.

In these conversations, the user and the assistant alternate between conversations where the user might ask the model to generate a poem on a certain topic, then another poem on a different topic, then perhaps a short story on a third topic. The model needs to find specific content in these complex conversations, such as "a second short story on a certain topic."

Test results show that GPT-4.1 is significantly better than GPT-4o when processing data up to 128K tokens , and can still maintain high performance in contexts up to 1 million tokens .

In the coding ability test, the SWEBench evaluation puts the model in a Python code base environment, allowing it to explore the code base, write code, and test cases. The results show that GPT-4.1 has an accuracy of 55% , while GPT-4o is only 33% .

In terms of multi-language coding ability testing, the Aderpolyglot benchmark covers multiple programming languages and different format requirements. GPT-4.1 has doubled its differential performance compared to GPT-4o , and is more efficient in handling multi-language programming tasks, code optimization, and version management.

In the instruction-following ability test, OpenAI built an internal evaluation system to simulate the usage scenarios of API developers and test the model's ability to follow complex instructions. Each sample contains complex instructions belonging to different categories and divided into difficulty levels. In the difficult subset evaluation, GPT-4.1 far exceeds GPT-4o .

In the video MME benchmark of the multimodal processing test , GPT4.1 understood 30-60 minutes of uncaptioned videos and answered multiple-choice questions, achieving a score of 72% , reaching the current best level and achieving a major breakthrough in video content understanding.

In terms of price, the GPT-4.1 series has improved performance while being more competitively priced. GPT-4.1 is 26% cheaper than GPT-4o , and GPT-4.1Nano , as the smallest, fastest and cheapest model, costs only 12 cents per million tokens .

GPT-4.1 is currently available to Plus , Pro , and Team users via “ More Models ” in the model selector . Enterprise and Education users will gain access in the coming weeks.

OpenAI also launched GPT-4.1-mini for all users in ChatGPT , replacing GPT-4o-mini .