ChatGPT's text-to-image no longer uses the DALL·E model?

OpenAI's ChatGPT can now generate images natively, no longer relying on the DALL·E model, and its image generation capabilities have improved significantly.
Core content:
1. ChatGPT's new capability: generating images directly, without DALL·E
2. Image generation is more accurate, better matches user requirements, and supports editing details
3. Hands-on comparison: how ChatGPT and Jimeng AI differ in image detail and clarity
Last night, OpenAI updated ChatGPT's text-to-image capability.
To be precise, this upgrade is a small revolution. Previously, ChatGPT called on DALL·E to generate images; now the function is built directly into ChatGPT itself.
The new capability makes the images ChatGPT generates more accurate. What does accurate mean? By the official definition, it means meeting your requirements: if you ask for a cat wearing glasses, it will think it over first, then draw a more detailed cat wearing glasses.
The other new capability is image editing: if you are not satisfied with any detail, just tell it and it will revise the image.
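For readers who want to script this rather than use the chat interface, here is a minimal sketch using OpenAI's Python SDK Images API. The model name "gpt-image-1" and the base64 response field are assumptions based on OpenAI's API documentation rather than anything stated in this article, so check the current API reference before relying on them.

```python
# Minimal sketch: generating and saving an image via OpenAI's Images API.
# Assumes the `openai` Python SDK is installed and OPENAI_API_KEY is set;
# the model id "gpt-image-1" is an assumption -- verify against current docs.
import base64
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="gpt-image-1",  # assumed model id for native image generation
    prompt="a cat wearing glasses, detailed, studio lighting",
    size="1024x1024",
)

# The Images API can return base64-encoded image data in data[0].b64_json.
image_bytes = base64.b64decode(result.data[0].b64_json)
with open("cat_with_glasses.png", "wb") as f:
    f.write(image_bytes)
```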
Several examples were shown during the official livestream. Two researchers took a photo with Altman and then asked ChatGPT to turn the ordinary photo into an anime-style painting; ChatGPT handled it easily.
The team also asked ChatGPT to add text to a generated image, such as writing "Feel The AGI" on the picture. ChatGPT did that smoothly as well.
After reading a pile of introductory articles, I felt the coverage was a bit exaggerated, so after meeting a friend in the morning, I tried it myself in the afternoon. The question was: how do you tell apart the capabilities of different models?
I asked Qwen to write a prompt for me:
Imagine a cyberpunk-style scene, with neon lights flashing, tall buildings everywhere, large advertising screens on the upper floors, hover cars running on the streets, drones flying in the sky, a purple moon hanging in the sky, and pedestrians on the ground wearing high-tech clothes. Looking down at the entire city from a high altitude, the picture should be high-definition and the more details the better.
Once it was written, I gave the same prompt to GPT and to Jimeng AI. In less than 20 seconds, GPT produced an image. Comparing it with Jimeng AI's image, each model interprets the prompt differently: both have a cyberpunk feel, but they differ in the details.
Frankly, both images are blurry. Jimeng AI is easier to work with, though: clicking its detail-repair and ultra-clear functions noticeably improves clarity.
GPT falls a little short here. I asked it to generate a higher-resolution image, but it simply produced a different image, which still did not meet my expectations.
Caption: left, ChatGPT; right, Jimeng AI
So, in terms of control over image clarity, GPT may be slightly weaker. It has its own advantages, though; for example, when I asked for a 1:1 image, it offered two different versions and asked me which one I thought was better and which I preferred.
I tried several more prompts, but the result was still the same.
I also tried its new "world knowledge" ability. OpenAI says this lets the model draw on knowledge of the real world when generating images, so the results better match user requirements and real-world logic.
Put simply, when the AI draws, it takes real-world details into account, such as geography, cultural background, and physical laws: a snowy mountain scene won't sprout tropical plants, and a mobile phone won't suddenly appear in an ancient setting.
So, I asked Qwen to write another prompt for me:
Generate a diagram that uses the action of two people standing on skateboards pushing each other to explain Newton's third law. The diagram should be intuitive and clearly show the relationship between the action and reaction forces.
How would I rate the result? It's passable. It does show the relationship between two people on skateboards pushing each other, and it adds arrows and English labels; but why does it feel more like a function for generating illustrated PPT slides?
I then ran several more rounds, generating a skull and a full-body skeleton respectively. On a scale of 10, I would give it at most 6 points, because ByteDance's and Tencent's text-to-image models can already do most of this.
Sam Altman spoke highly of the product, saying it was hard to believe the images were AI-generated. He believed everyone would like it and looked forward to users creating more imaginative content with it.
His goal is to avoid generating offensive content as much as possible. He believes it is right to hand creative freedom and control to users, while also watching actual usage and listening to public feedback.
Altman hopes everyone understands that they are trying to balance freedom and responsibility, and to ensure AI develops in line with people's expectations and ethical standards. Familiar talking points, all of it.
Rather than its current generation capabilities, I think what deserves more attention is why OpenAI replaced the DALL·E model at all. After all, DALL·E was first released by OpenAI in January 2021. As an established model, couldn't it simply have kept iterating and growing more powerful?
In fact, the key is that DALL·E's core architecture is autoregressive.
What is an autoregressive model?
It uses its own earlier outputs to predict what comes next. For images, it breaks the picture down into a sequence of tokens (similar to words in text) and then generates the image one token at a time, much like writing an essay.
For example:
If you want to draw a cat, you first draw the head, then the eyes based on the shape of the head, then the nose based on the eyes and the head, step by step, skipping nothing. That is how an autoregressive model works.
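To make the token-by-token idea concrete, here is a toy, illustrative sketch of autoregressive image-token decoding. The `TinyImageTokenModel` class, the vocabulary size, and the sequence length are invented for illustration; this is not OpenAI's or DALL·E's actual implementation.

```python
# Toy sketch of autoregressive image generation: each image token is sampled
# conditioned on all previously generated tokens, one step at a time.
# Everything here (model, vocab size, sequence length) is illustrative only.
import random

VOCAB_SIZE = 1024      # hypothetical size of the image-token codebook
NUM_TOKENS = 256       # hypothetical 16x16 grid of image tokens

class TinyImageTokenModel:
    """Stand-in for a real transformer that predicts the next image token."""

    def next_token_distribution(self, prompt: str, tokens_so_far: list[int]) -> list[float]:
        # A real model would run a forward pass over the prompt plus the
        # previously generated tokens; here we just return a uniform distribution.
        return [1.0 / VOCAB_SIZE] * VOCAB_SIZE

def generate_image_tokens(model: TinyImageTokenModel, prompt: str) -> list[int]:
    tokens: list[int] = []
    for _ in range(NUM_TOKENS):
        probs = model.next_token_distribution(prompt, tokens)
        # Sample the next token; earlier tokens are fixed and cannot be revised,
        # which is exactly the "hard to correct later" drawback described below.
        next_token = random.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens  # a decoder (e.g. a VQ decoder) would then map tokens to pixels

tokens = generate_image_tokens(TinyImageTokenModel(), "a cat wearing glasses")
print(len(tokens), "image tokens generated sequentially")
```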
The advantage is that it preserves detail; the drawbacks are obvious: first, generation is slow, and second, if something goes wrong early on, it is hard to correct later. So OpenAI chose to replace it with a new model.
So, what does the replacement model look like? The answer is non-autoregressive models, which change the core architecture.
This framework first understands the structure and details of the whole picture, like a student who first listens to the teacher explain the problem, works out the outline of the whole image, and then fills in the details bit by bit. To draw a cat, for example, it first sketches the cat's overall shape, then refines the fur and eyes.
It achieves this with an encoder-decoder architecture. In simple terms, the encoder is responsible for "reading the question", i.e. understanding the text you type; the decoder is responsible for "answering", i.e. generating the image from that understanding.
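As a rough illustration of the parallel, refine-the-whole-picture idea described above, here is a toy sketch of non-autoregressive decoding with iterative refinement. The `TinyEncoder` and `TinyParallelDecoder` classes and the number of refinement passes are invented for illustration and do not describe OpenAI's actual architecture.

```python
# Toy sketch of non-autoregressive generation: the decoder proposes all image
# tokens at once from the encoded prompt, then refines them over a few passes.
# All classes and constants here are illustrative only.
import random

VOCAB_SIZE = 1024      # hypothetical image-token codebook size
NUM_TOKENS = 256       # hypothetical 16x16 grid of image tokens
REFINE_STEPS = 4       # hypothetical number of refinement passes

class TinyEncoder:
    """Stand-in for the encoder that 'reads the question' (the text prompt)."""
    def encode(self, prompt: str) -> list[float]:
        random.seed(hash(prompt) % (2 ** 32))
        return [random.random() for _ in range(64)]  # fake prompt embedding

class TinyParallelDecoder:
    """Stand-in for the decoder that 'answers' by emitting all tokens at once."""
    def propose(self, prompt_embedding: list[float]) -> list[int]:
        # Draft every image token in one shot, instead of one by one.
        return [random.randrange(VOCAB_SIZE) for _ in range(NUM_TOKENS)]

    def refine(self, tokens: list[int], prompt_embedding: list[float]) -> list[int]:
        # Re-predict a subset of tokens given the full current draft, so early
        # "mistakes" can still be revised -- unlike autoregressive decoding.
        revised = list(tokens)
        for i in random.sample(range(NUM_TOKENS), k=NUM_TOKENS // 4):
            revised[i] = random.randrange(VOCAB_SIZE)
        return revised

def generate(prompt: str) -> list[int]:
    embedding = TinyEncoder().encode(prompt)
    decoder = TinyParallelDecoder()
    tokens = decoder.propose(embedding)
    for _ in range(REFINE_STEPS):
        tokens = decoder.refine(tokens, embedding)
    return tokens

print(len(generate("a cat wearing glasses")), "image tokens, drafted in parallel")
```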
The advantages: first, it no longer generates the image strictly step by step, token by token, so it is more efficient; second, overall quality is stronger, especially in complex scenes, where it handles the relationships between multiple objects better and produces more realistic images.
For example, in a scene with a cup, a book, and a lamp on a table, it can lay out object positions and lighting more naturally without looking messy. It also understands complex text instructions better, so the generated image and the description stay basically consistent and logical.
Another feature is flexibility: the model can be integrated into a multimodal system. Plugged into GPT-4o, for example, it can not only look at images but also combine audio or existing images to generate more varied content.
So this move by OpenAI is essentially a small self-revolution.
During Spring Festival, DeepSeek released a new image model, Janus-Pro. Those who followed it will know that it uses a non-autoregressive framework; the Janus-Pro-7B model in the Janus series scores 80% on GenEval, even exceeding DALL-E 3's 61%.
I looked it up: the non-autoregressive approach was first proposed at ICLR 2018, originally for neural machine translation (NMT), to speed up inference.
According to the literature, Microsoft published follow-up research on May 13, 2022, and in China, from around 2023, companies such as Alibaba, iFlytek, Kunlun Wanwei, and CloudWalk Technology adopted the technique.
So, did OpenAI see how mature this approach has become in real applications in China and start to reflect on itself?