Misalignment of AI products

AI technology is developing rapidly, so why can't users feel its power? This article takes a close look at the design misalignment behind today's AI products.
Core content:
1. Three leaps in AI model capabilities: from multi-turn dialogue to multimodality to multi-hour autonomous work
2. Outdated design concepts in existing AI applications: most apps are still stuck at the multi-turn dialogue stage
3. The systematic misalignment of AI products: models evolve but apps fail to keep up, resulting in a poor user experience
I have always been a staunch supporter of AI. When friends ask me what AI can actually do, I give very concrete answers: it can help you plan your schedule, research materials, and even handle trivial but time-consuming chores like placing orders and haggling over prices. I also provide prompt templates tailored to their scenarios to lower the barrier to trying it.
But having recommended AI to hundreds of people over the past two years, I have found that most people's feedback after trying it is lukewarm. The most common response is: "It's okay, but I don't find it very useful." Sometimes they add: "I might as well do it myself."
This contrast has not dampened my enthusiasm for evangelizing AI, but it has puzzled me: models keep getting more powerful, so why is it so hard for ordinary users to feel it? Beyond "AI is a tool that takes learning," is there a deeper reason?
Recently, I systematically compared the user experience of several mainstream AI clients (Claude, Gemini, ChatGPT, and DeepSeek), and gradually realized that the mismatch in how "useful" an AI feels stems not only from users and models, but from problems in the product design that connects the two.
AI model capabilities are indeed evolving at an astonishing rate. As described below, models started with multi-turn conversation, then gained multimodal capabilities, and the most advanced can now interact with various tools to carry out hours of autonomous work.
However, most current apps are still built around the multi-turn chat paradigm, which falls far short of what the underlying LLMs can do. When AI intelligence is delivered through an ill-suited interaction medium, users end up frustrated.
For example, the Claude App is designed for short conversations; once it switches to a background task, the task gets interrupted. So no matter how powerful Claude 4 is at executing tasks for hours in the background, that power goes to waste. It is like putting an F1 engine in a Santana. Is it powerful? Yes. Is it pleasant to drive? About the same as a Peugeot.
Unfortunately, these usage details constitute the entirety of the user's perception. Most users don't know the app is at fault; they simply conclude that AI isn't very useful. This is the systematic misalignment of AI products, and it is the topic this article examines in detail.
#01
From Multi-Turn and Multi-Modal to Multi-Hour Agency
These three leaps, from "able to speak" to "able to see and hear" to "able to act," have gradually pushed AI from a question-answering tool toward an intelligent assistant. OpenAI is moving in this direction, and so are Google and Anthropic.
The problem is that most of the AI apps we use today are still stuck in the interaction logic of two years ago. It is as if the Santana's engine has been gradually upgraded to an F1 engine while the brakes and suspension stayed the same. This is the fundamental reason many people cannot feel how powerful AI is: the model is evolving, but the app has not kept up.
Multi-Turn: The Beginning of the Chatbot
Multi-turn conversation is the most basic capability of all mainstream models today.
One important reason for ChatGPT's success is that it is not a search box that answers questions one at a time, like Google's smart search, but a system that can hold a continuous conversation around a task. A key technology behind this is Supervised Fine-Tuning (SFT), which uses human-annotated multi-turn conversation data to teach the model how to carry context across turns and answer follow-up questions.
Claude is also good at this. It is very good at summarizing and citing context, such as helping you read a paper, summarize a long document, or polish an article for multiple rounds. It does a great job.
At this stage, building an app is very simple: you basically just maintain the chat history and wrap a shell around the API. Every company's experience is similar; the only thing that needs real attention is support for large context windows.
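The "shell around the API" pattern can be sketched in a few lines. In the toy example below, `call_model` is a hypothetical stand-in for any chat-completion API; the point is that the app's only real state is the message history it replays every turn.

```python
# Minimal sketch of a multi-turn chat wrapper. `call_model` is a
# hypothetical placeholder for a real chat API (e.g. an HTTP request);
# the app itself only maintains the message history.

def call_model(messages):
    """Placeholder for a real chat-completion API call."""
    last_user = messages[-1]["content"]
    return f"(model reply to: {last_user})"

class ChatSession:
    def __init__(self, system_prompt="You are a helpful assistant."):
        self.messages = [{"role": "system", "content": system_prompt}]

    def send(self, user_text):
        self.messages.append({"role": "user", "content": user_text})
        reply = call_model(self.messages)  # full history goes out each turn
        self.messages.append({"role": "assistant", "content": reply})
        return reply

session = ChatSession()
print(session.send("Summarize this paper for me."))
print(len(session.messages))  # system + user + assistant = 3
```

Because the full history is resent on every turn, the practical ceiling of such an app is exactly the model's context window, which is why window size is the one detail that matters at this stage.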
For example, the Gemini 2.5 series supports a context window on the order of 1M tokens, which matters for many applications. Yet its web page and client freeze once the user enters a few thousand tokens (meaning less than 1% of the model's capacity is usable), making it almost unusable. This is a typical case of the app failing to keep up.
Designing the product as a chatbot was fine in 2023, when AI was mostly used for chatting. But the model is no longer a simple chatbot; it is a Copilot-style system that can handle structured tasks. If the app stays stuck in the old mindset, much of the model's potential is wasted.
Multi-Modal: From speaking to seeing and hearing
The second leap is multimodality. Today, mainstream models all claim to support multimodality, but they vary greatly.
Gemini 2.5 is currently the most thorough in this regard: it can natively view images, listen to audio, and understand video. And this is not superficial viewing; it can genuinely reason over, combine, analyze, and summarize the material. The technical route behind it is to use different tokenizers combined with projection layers to map each modality's information (images, audio, text) into a shared representation space, so the model can process actions in a video or tone of voice in audio the same way it reads text.
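The projection-layer idea can be illustrated with a toy NumPy sketch. All dimensions and matrices below are made up for illustration; this is the general shape of the technique, not Gemini's actual architecture.

```python
import numpy as np

# Toy illustration of the shared-representation idea: each modality has
# its own encoder output size, and a per-modality linear projection maps
# it into one shared space. All shapes here are invented for the demo.

rng = np.random.default_rng(0)
D_SHARED = 64  # dimensionality of the shared representation space

# Hypothetical per-modality feature sizes and projection matrices.
modal_dims = {"text": 128, "image": 512, "audio": 256}
projections = {m: rng.normal(size=(d, D_SHARED)) for m, d in modal_dims.items()}

def project(modality, features):
    """Map modality-specific features into the shared space."""
    return features @ projections[modality]

# Three "tokens" from three modalities all land in the same 64-dim space,
# so a single transformer stack could attend over them jointly.
text_tok  = project("text",  rng.normal(size=(1, 128)))
image_tok = project("image", rng.normal(size=(1, 512)))
audio_tok = project("audio", rng.normal(size=(1, 256)))

sequence = np.concatenate([text_tok, image_tok, audio_tok], axis=0)
print(sequence.shape)  # one unified token sequence: (3, 64)
```

Once everything lives in one space, "watching a video" and "reading text" become the same operation from the model's point of view, which is what makes native multimodal reasoning possible.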
OpenAI's route is similar, but it has no single unified model that can both reason (like o3) and process video, audio, and images (like GPT-4o-realtime). Its highlight is treating images as objects for tool calls.
For example, o3 can write Python code to crop and zoom an image to identify details, then pass the processed result back to the model for tokenization and further analysis. This approach greatly improves its multimodal capability and even enables unusual feats, such as guessing a photo's location, that only o3 can pull off.
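The crop-and-zoom step is easy to picture in code. The Pillow snippet below is a rough sketch of what such a tool call might do; the image, crop box, and zoom factor are all illustrative, not taken from o3's actual tooling.

```python
from PIL import Image

# Sketch of the crop-then-zoom operation an image tool call might
# perform: cut out a region of interest and upscale it so fine detail
# becomes legible before the result goes back to the model.

def crop_and_zoom(img, box, zoom=4):
    """Crop `box` = (left, top, right, bottom) and enlarge it `zoom`x."""
    region = img.crop(box)
    w, h = region.size
    return region.resize((w * zoom, h * zoom), Image.LANCZOS)

# In-memory demo image standing in for an uploaded photo.
photo = Image.new("RGB", (1024, 768), color=(200, 200, 200))
detail = crop_and_zoom(photo, box=(100, 100, 200, 150), zoom=4)
print(detail.size)  # the 100x50 region becomes (400, 200)
```

The key design point is the feedback loop: the enlarged crop is re-tokenized and analyzed again, so the model effectively gets to "look closer" at whatever part of the image it chooses.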
Claude currently has only basic multimodal support: it can do image recognition but cannot process audio or video.
From a user-experience perspective, however, the most advanced model, Gemini, is actually the worst served: its web page and client do not support uploading video or audio at all, only images. This is the typical case of a model living in 2026 while the product is still in 2023. A product that fails to match its model's capabilities naturally cannot differentiate on user experience.
Multi-Hour Agency: AI becomes a real assistant
The third change is that AI models have begun to run continuously and complete tasks autonomously. We can call this stage Multi-Hour Agency: AI can maintain context and orchestrate tools to work through a task that takes tens of minutes or even hours, without you having to nudge it along at every step.
This is the real precondition for AI becoming genuinely useful. Many important jobs, such as researching news in a field, planning a complete trip, analyzing a database, or generating well-structured code, are beyond the scope of a question-answering bot. They essentially require a system that can think, call tools to gather missing information, execute step by step, and even dynamically adjust its plan.
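The think, call tools, execute, adjust loop described above can be sketched minimally. The "planner" and "tools" below are stand-ins, not any vendor's agent framework; the point is that the loop, not the user, drives progress until the task is done.

```python
# Minimal sketch of an agent loop: the model proposes an action, a tool
# executes it, and the result feeds back into the context until the task
# is finished. The planner and tools here are illustrative stand-ins.

def planner(context):
    """Hypothetical model call: choose the next action from the context."""
    if "search_results" not in context:
        return ("search", "news about topic X")
    if "summary" not in context:
        return ("summarize", context["search_results"])
    return ("finish", context["summary"])

TOOLS = {
    "search": lambda q: f"3 articles found for '{q}'",
    "summarize": lambda text: f"summary of [{text}]",
}

def run_agent(max_steps=10):
    context = {}
    for _ in range(max_steps):  # the loop, not the user, drives each step
        action, arg = planner(context)
        if action == "finish":
            return arg
        result = TOOLS[action](arg)
        key = "search_results" if action == "search" else "summary"
        context[key] = result
    return None  # gave up: step budget exhausted

print(run_agent())
```

Stretch each tool call out to minutes instead of microseconds and this same structure is what "multi-hour agency" means: the loop must survive in the background, which is exactly what most current apps fail to guarantee.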
Claude 4 reportedly can run for seven hours to complete a particularly complex task, and o3 can call many tools to carry out complex tasks in stages.
These capabilities come from continuous tuning of RLHF (reinforcement learning from human feedback), function calling, external tool access, long context, and so on. The model itself is ready to take over a complex workflow, but the app is not.
For example, no matter how powerful the Claude model is, the iOS App, and even the Mac App, disconnects as soon as the screen turns off, and the chat history cannot be recovered.
From multi-turn dialogue to multimodal understanding to long-horizon task execution, the model's capabilities have been built up layer by layer. The app's capabilities have barely moved. The model is no longer a robot that answers questions but a digital assistant that completes tasks with you; yet at the product-design level, the client still treats it as a search engine with lower latency and a more natural tone.
So the question is not whether AI is smart enough, but whether we have built a product structure that can carry that intelligence. Most of the time, users are not evaluating the model; they are evaluating the shell the model has been wrapped in. And many companies, including large ones, spend almost no effort on that layer.
#02
Comparison of OpenAI, Claude, and Gemini products
Claude: The model is solid, but the app is a work in progress
The Claude 3.7/4 series of models is very strong, especially at long-text reading and at writing code without getting lazy; it is even more stable than o3 there. It has earned glowing reviews on Cursor and is the go-to model for many people. But the experience of Claude.ai, the consumer product, is hard to describe politely.
Claude's client has a fatal problem: as soon as you switch apps, the reasoning is interrupted. Not "the task is paused and will reconnect"; the entire conversation effectively vanishes. It won't tell you it was interrupted; the task state simply goes blank and the chat shows up as "Untitled" in the history.
This is triggered whether you turn off the screen on iOS or close the laptop lid on Mac. Fundamentally, it happens because Claude's consumer products have not moved past the chatbot mindset of treating the app as a thin wrapper over the API. The architecture is therefore highly client-dependent: maintaining the stream and preserving session state are both left to the user's device.
That is fine for a short question-and-answer session, but completely unsustainable for complex tasks. The iOS App's implementation is also rudimentary; the phone gets hot whenever model output runs long. So no matter how powerful the model is, users come away saying one thing: it's not easy to use.
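The architectural difference is easy to sketch in the abstract. In the pattern below, a sketch under assumed names and not any vendor's real API, tasks live on the server keyed by an ID, so a client that disconnects (screen off, app switched away) can simply re-fetch the result later. A client-held stream, by contrast, loses everything with the connection.

```python
import uuid

# Sketch of a server-side task store: the task lives on the server keyed
# by an ID, so the client's connectivity doesn't matter. All names here
# are illustrative, not any vendor's actual API.

SERVER_TASKS = {}  # task_id -> {"status": ..., "result": ...}

def submit_task(prompt):
    task_id = str(uuid.uuid4())
    SERVER_TASKS[task_id] = {"status": "running", "result": None}
    # ...the server keeps working regardless of client connectivity...
    SERVER_TASKS[task_id] = {"status": "done", "result": f"answer to: {prompt}"}
    return task_id

def poll_task(task_id):
    """Called by the client after reconnecting; state was never lost."""
    return SERVER_TASKS[task_id]

tid = submit_task("research topic X")
# ...client disconnects, screen turns off, hours pass...
print(poll_task(tid)["status"])  # the result is still there
```

The ChatGPT behavior described later in this article is consistent with the server-side pattern; Claude's "Untitled and empty" failure mode is consistent with the client-held pattern.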
The one real differentiator here is that the Claude Desktop App is currently the only mainstream client with MCP (Model Context Protocol) integration. You can use MCP to connect local resources directly to a consumer-grade AI platform, billed through the subscription rather than per token, which is quite practical.
Gemini: The model is very powerful and the app experience is like a demo
Google's Gemini is a more extreme example: the model capabilities are ridiculously strong, but the App is ridiculously poor.
AI Studio is Google's debugging suite for developers. Inside it, Gemini is the model with the largest token window and the most robust mixed video + audio + image + text analysis I have seen so far.
Uploading a million-word document is no problem, and running a ten-minute paper summary won't disconnect. Give it 100 boring, repetitive processing tasks and Gemini will complete them all without slacking. Its multimodal capabilities, tool use, and especially instruction following are top of the industry. Personally, I'd even say it has left Claude and GPT, second-tier by comparison, far behind.
The problem is that all of this can only be experienced in the web version of AI Studio, which is, after all, a developer tool. You have to keep the web page in the foreground the whole time, the phone disconnects when the screen locks, the system prompt is cleared every round, there is no personalization, and chat history is just files saved and shared on Google Drive, which is equally bare-bones.
For consumer users, Google mainly promotes the Gemini App. But this app is... an outrageous product; it feels as if the product department built it specifically to spite the AI department.
Your Gemini 2.5 has a 1M-token context window? Fine, the UI freezes once the user enters a prompt of around 10k tokens, putting you back on the same starting line as every other AI. Your Gemini 2.5 is uniquely good at processing video and audio, a capability the competition lacks? Fine, users aren't allowed to upload video or audio files in the UI at all; now it's no different from any other AI product.
System prompts for Gemini 2.5 only arrived in mid-2025 (and the web version is still buggy, while the mobile version hasn't shipped). Even when I finally find a scenario where the Gemini App applies, the intelligence on display is still far below what the same model shows in AI Studio: it is more reluctant to use search to broaden its answers and more inclined to just answer from memory. I don't know what negative optimizations were baked into its system prompt.
So for many people, me included, the first reaction after using the Gemini App was: "That's it?" In reality they may not have touched even 10% of the model's capability. You have to study prompting and explore AI Studio on your own just to begin tapping its potential, which is a non-starter for 99% of users.
ChatGPT: The one with the most mature product team
In contrast, OpenAI is far ahead of the other two in product experience. This is somewhat counterintuitive: when we think of GPT, the first association is the oldest LLM line with the strongest models in the industry, so we subconsciously assume OpenAI competes mainly on models and has no time to polish products.
In fact, OpenAI's lead on models is in jeopardy. Although o3's tool use is still top-notch, its instruction following is not as good as the other two, and there are considerable gaps in context window length, multimodal capability (audio and video understanding), and price.
ChatGPT's product experience, by contrast, is far ahead. It may be the only product that exposes 70% to 80% of its underlying model's capability.
Let’s look at a few scenarios:
Asynchronous task execution: an important scenario is that while out with our phones, we suddenly think of some research we want AI to do. We type something like "research XXX" into the app, then minimize it and turn off the screen (you can also kill the app to simulate this).
ChatGPT keeps researching in the background; when you unlock the screen and reopen the app, the research is finished and the results are on screen. Claude fails 100% of the time in this scenario: the chat is still in the history, but the title is "Untitled" and the content is empty. The Gemini App most likely fails with the entire chat vanishing, though occasionally the conversation inexplicably reappears an hour later with correct content.
This is a difference in product-design philosophy. Only OpenAI positioned ChatGPT as a tool that handles long-running tasks for the user in the background. Claude emphasizes this in its API but has not implemented it in its consumer products, and Gemini's thinking is similar.
iPhone photo analysis: if the user has enabled RAW capture on the iPhone, photos are saved as .dng files rather than .jpeg or .heic. Intentional or accidental, this is a very common situation, and the difference is hard to spot in the iPhone photo album.
Upload such an image directly and Gemini reports a server-disconnection error (of all things), while Claude reports an unsupported file type; not perfect, but at least the error message is accurate. ChatGPT, however, knows to convert the file to JPEG first and then upload it. The conversion is technically trivial and the engineering cost is tiny; whether it gets done depends entirely on product craft: whether the team actually uses its own app, hits the common pitfalls, and fixes the details.
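The convert-before-upload step really is simple. The snippet below is an illustration of the idea with Pillow, not ChatGPT's actual pipeline; note that decoding DNG specifically would additionally require a RAW decoder (for example rawpy), which Pillow alone does not provide, so the demo uses a transparent PNG as the "awkward" input.

```python
import io
from PIL import Image

# Sketch of a "normalize before upload" step: re-encode whatever the
# user picked into a plain JPEG the backend is guaranteed to accept.
# Illustrative only; a real pipeline would need a RAW decoder for .dng.

def to_jpeg_bytes(image, quality=90):
    """Re-encode a PIL image as JPEG bytes, flattening any alpha channel."""
    if image.mode != "RGB":
        image = image.convert("RGB")  # JPEG has no alpha channel
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()

# Demo: a PNG with transparency standing in for an awkward upload.
picked = Image.new("RGBA", (64, 64), (255, 0, 0, 128))
payload = to_jpeg_bytes(picked)
print(payload[:3])  # JPEG streams start with b'\xff\xd8\xff'
```

A dozen lines of glue code like this is the whole difference between "upload failed" and "it just works," which is the article's point about product craft.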
Massive text input: select a huge block of text (say 150,000 words) and paste it into the app or web page. Gemini freezes outright after you press send; if you have the patience to wait a minute or two it may recover, and if you instead background the mobile app, the entire chat disappears as in the previous test. Claude and ChatGPT report that the input is too long and refuse to process it, but shorten the text a little and both handle it normally.
There are many other details, such as whether system prompts can be set on mobile, whether Deep Research shows progress updates in Live Activities, and how deep the personalization goes, which I won't analyze one by one.
OpenAI is not without problems either. There are still gaps between web and app functionality; for example, Deep Research over GitHub and SharePoint sources is only supported on the web, and there is no MCP support as of now. But overall, OpenAI is the only company that gives product design and model capability equal weight, and the experience leaves little to complain about.
Could it be that the product is still being iterated?
Of course, I understand why some products are more restrained. Some will say the Gemini App lacks video analysis, and the Claude App stays silent after task interruption, because these products are still at the MVP stage: the strategic choice was to ship the model first and let users start using it.
That explanation sounds plausible, but if the "MVP" lasts more than a year, the core functions never ship, and even the basics, system prompts, uninterrupted tasks, and correct errors on file upload, aren't done well, then it is not an MVP; the product is simply not taken seriously. Users can tell strategic restraint from under-resourced neglect.
Another argument goes: most people don't use complex functions, and doing too much disrupts the product rhythm, so keeping it simple is right. This underestimates what AI products are for. AI's real value is not replacing a search engine or a knowledge Q&A tool, but helping users handle tasks they cannot do, or have no time to do, themselves: long documents, cross-modal materials, complex planning. A product that cannot take on such tasks is destined to be dismissed by users as nothing special, or even useless.
In short, whatever the task's complexity, users don't want their input wasted and don't want the app to crash silently. This is not about advanced features; it is basic reliability, and many apps today fail even at that.
#03
Reasons and opportunities
In Gemini's case, AI Studio and the Gemini App are very likely two products made by different orgs, reporting to different VPs. Under that structure, the Gemini App's product manager may not even know what the model's biggest highlight is. After a round of competitive research showing that ChatGPT and Claude both support image upload but not video, the conclusion becomes: then we don't need video either. Never mind that video understanding is Gemini's biggest advantage. (Pure speculation, not necessarily true.)
Stranger still, AI Studio actually does a better job. Why? Because it is built for developers, often by the engineers themselves, which keeps it close to the model. It is more a debugging tool than a product, and this "design without design" releases the model's capabilities better than an App that has a product manager but no resources behind it.
Claude's problem is structural in another way. Anthropic is essentially a B2B company; the API is its main business, accounting for 85% of revenue. The consumer client is just a feature-parity display window: if others have it, we must have it too.
So the Claude App feels careless: it barely works, with no notice when a task is interrupted, no saving of task state, and an iOS app that overheats on long outputs. Nobody really cares what users do with it, so long as people can run a quick test and learn that the model is good.
OpenAI, by contrast, is the only company that must stand firmly on both the consumer and enterprise legs. ChatGPT is its flagship product, accounting for 73% of revenue. Just as important, it is a comparatively small company with a short reporting chain, and the product and model teams are tightly coupled. It is hard to imagine an OpenAI product manager not knowing that their model can recognize video. OpenAI can integrate these capabilities precisely because its organizational structure allows it.
So, back to the question this article opened with: why do AI models keep getting more powerful while users still don't find them useful? The most painful answer may be that the hard part is not building the product but escaping the limits of company structure. Which also means the opportunity is still open.
Right now the major model makers are competing on whose model is larger, more multimodal, and cheaper, but almost no one is seriously competing on product experience. Behind this lie structural obstacles and a blind spot: the assumption that a strong model automatically yields a good experience, and that high capability alone retains users. The experience gap between ChatGPT and the Gemini App has already falsified that assumption to a large degree.
Not every team can absorb a new capability, and not every capability automatically becomes a good experience. This structural misunderstanding has not been fully discussed in the industry, and it gives third-party teams a very realistic entry point.
If we know the Claude 4 model is very stable but the app keeps breaking, can we build a more reliable asynchronous-task app on top of the API?
If we know Gemini 2.5 is the best at video analysis but the app doesn't even support video upload, can we package AI Studio's sample code into a lightweight client for a vertical market?
If we know every app is still thinking inside a chat box, can we step out of the conversation paradigm entirely and design a new front end built around multi-hour task scheduling?
These are all innovation paths that require no model building. And they are not speculative future products; they are user needs that exist today and that no one has seriously served.
So let's return to the beginning. AI is not hard to use; it's that the AI most people encounter has been packaged in the wrong shape. The model is smart, but the app can't keep up. This experience gap is not a technical gap; it is the result of a long-running disconnect between product design and organizational decision-making.
We have entered an era where models are no longer scarce but experience is. The next watershed for AI products may lie in whether you can spot the opportunities in these gaps.