Six common misconceptions in building large-model applications

Get insights into common pitfalls in building large model applications and improve the success rate of AI projects.
Core content:
1. An overview of six common misconceptions in building large-model applications
2. The misuse of generative AI, illustrated with real cases
3. The importance and challenges of user experience (UX) design in AI applications
Introduction
Artificial intelligence (AI) is changing the way we live and work at an unprecedented pace. From voice assistants to smart cockpits, from medical diagnosis to financial risk prediction, AI's application scenarios keep expanding and show enormous potential and value. "Despite this, we are still in the early stages of building large-model applications, and we often make mistakes when facing complex applications." This article therefore lists six common misconceptions in building large-model applications (if you have taken part in developing such applications, you have probably run into some of them).
Myth 1: Forcing generative AI onto every problem
Whenever a new technology emerges, you can almost hear a collective sigh from senior engineers. Generative AI is no exception: it seems to have boundless power, which only intensifies the urge to apply it to everything. Someone once proposed using generative AI to optimize household energy consumption: feed a list of the household's high-energy activities and the hourly electricity prices into a large model, then ask it to produce the schedule that saves the most money. Experiments showed this approach could cut electricity bills by about 30%. But think about it: what if you simply scheduled the high-energy activities for the hours when electricity is cheapest, such as doing laundry and charging the car after 10 pm?
In practice, the plans the model produced were roughly what a greedy strategy would give you anyway. And even where a greedy heuristic falls short, there are cheaper, more reliable optimization methods, such as linear programming. Similar situations abound: a large company wants to use generative AI to detect abnormal network traffic; another wants to forecast customer call volume; a hospital wants to determine whether a patient is malnourished.
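To make the comparison concrete, here is a minimal sketch of the greedy baseline in Python. The activities, durations, hourly prices, and the one-activity-per-hour simplification are invented for illustration; the article does not specify any of them.

```python
# A minimal sketch of the greedy baseline: hand the cheapest remaining hours to each
# high-energy activity. Prices, activities, and the "one activity per hour" rule are
# illustrative simplifications, not details from the article.
hourly_price = {hour: 0.30 if 8 <= hour < 22 else 0.12 for hour in range(24)}  # $/kWh, invented

activities = [
    {"name": "laundry", "hours_needed": 2},
    {"name": "ev_charging", "hours_needed": 4},
    {"name": "dishwasher", "hours_needed": 1},
]

# Sort hours from cheapest to most expensive, then assign them greedily in list order.
cheapest_hours = sorted(hourly_price, key=hourly_price.get)

schedule, cursor = {}, 0
for activity in activities:
    slots = cheapest_hours[cursor:cursor + activity["hours_needed"]]
    schedule[activity["name"]] = sorted(slots)
    cursor += activity["hours_needed"]

print(schedule)  # {'laundry': [0, 1], 'ev_charging': [2, 3, 4, 5], 'dishwasher': [6]}
```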
There is certainly value in exploring new approaches and understanding what is possible, as long as you are clear that you are not solving a problem, you are testing a solution. "'We solved the problem' and 'We used generative AI' are two very different headlines, but many people prefer the latter."
Myth 2: Mistaking "bad product" for "bad AI"
At the other extreme, many teams tried generative AI, got poor user feedback, and wrote the technology off entirely. Yet other teams have succeeded in similar scenarios. On closer inspection, the main problem is usually not the AI but the product. Many people have told me that the technical part of their AI application is not the hard part; the user experience (UX) is: What should the interface look like? How does it fit seamlessly into the user's workflow? How is human oversight introduced? UX has always been hard, and it is even harder with generative AI. We know it is changing reading, writing, learning, teaching, work, entertainment... but we don't yet know what the "future" looks like.
Here are a few examples that look simple but produced very counterintuitive user feedback:
A friend of mine built a meeting-minutes summarization tool. At first they obsessed over summary length: did users prefer a three-sentence or a five-sentence summary? Later they discovered that users barely cared about the summary at all; they only cared about the action items from the meeting that involved them.
A company built a chatbot for job matching and found that users don't want the "correct" answer, they want a "helpful" one. For example, a user asks, "Am I a good fit for this position?" and the bot replies, "Not at all." That may be correct, but it doesn't help. What users actually want to know is: where is the gap, and how can I close it?
Another company built an AI bot that answers tax questions. At first, users were lukewarm and found it useless. Investigation revealed that "users don't like typing": facing a blank input box, they had no idea what the bot could do or what to ask.
So the team added a few recommended questions to each interaction for users to click. Users became more willing to try, trust built up with use, and feedback turned positive. "Now that everyone uses the same models and AI technology is becoming homogenized, the real differentiation lies mainly in product design."
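As a rough illustration of that pattern, here is a minimal sketch of bundling suggested questions with each reply so users never face a blank box. The starter questions, function name, and response shape are invented; the article does not describe the actual implementation.

```python
# A minimal sketch of the "suggested questions" pattern: every reply ships with a few
# clickable follow-ups. Questions and response shape are illustrative assumptions.
STARTER_QUESTIONS = [
    "Which tax forms do I need as a freelancer?",
    "Can I deduct my home office?",
    "When is the filing deadline this year?",
]

def render_reply(answer: str, suggestions: list[str]) -> dict:
    """Bundle the answer with up to three suggested follow-ups for the UI to show as buttons."""
    return {"answer": answer, "suggested_questions": suggestions[:3]}

reply = render_reply("You may qualify for the home-office deduction if...", STARTER_QUESTIONS)
print(reply["suggested_questions"])
```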
Myth 3: Starting with something too complicated
Common examples include:
Using an agent framework when calling the API directly would do (see the sketch after this list);
Agonizing over which vector database to use when a simple keyword-based search would solve the problem;
Insisting on fine-tuning when prompting alone is enough;
Using semantic caching.
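To illustrate the first point, here is a minimal direct-call sketch. It assumes the OpenAI Python SDK (openai>=1.0) and an OPENAI_API_KEY environment variable; the model name is a placeholder, so substitute whichever provider and model you actually use.

```python
# A minimal "just call the API" sketch, assuming the OpenAI Python SDK (openai>=1.0)
# and an OPENAI_API_KEY environment variable. Model name is a placeholder.
from openai import OpenAI

client = OpenAI()

def answer(question: str) -> str:
    """One prompt, one completion: often all a first version needs."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("Summarize this meeting note: ..."))
```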
There are so many new technologies now that it’s easy to want to use all the cool tools right away. But introducing these complex tools too early will bring two problems:
Abstracting too early hides key details and makes the system hard to understand and debug;
It introduces additional bugs.
Tool authors make mistakes too. Typos have been found in the default prompts shipped in several frameworks' code. And if the framework you use updates its prompts without telling you, your app's behavior may change without you knowing.
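One way to guard against silent prompt changes, sketched below: pin a hash of the prompt you audited and fail loudly when the rendered prompt no longer matches. The hash value is a placeholder, and how you obtain the rendered prompt depends on whether your framework exposes it at all.

```python
# Notice silent prompt changes in a dependency: hash the rendered prompt and fail
# loudly when it no longer matches the version you audited. Hash value is a placeholder.
import hashlib

REVIEWED_PROMPT_SHA256 = "<sha256 of the prompt you audited>"  # placeholder

def check_prompt_unchanged(rendered_prompt: str) -> None:
    digest = hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest()
    if digest != REVIEWED_PROMPT_SHA256:
        raise RuntimeError("Default prompt changed since last review; re-audit before shipping.")

# Usage (illustrative): check_prompt_unchanged(framework_rendered_prompt)
```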
Of course, "abstraction is good," but it should be introduced once things have matured. In this early stage of AI engineering, best practices are still taking shape, so be extremely careful with any abstraction tool you adopt.
Myth 4: Demos are easy, optimization is hard
For many companies building large-model applications, it took one month to reach 80% of the experience they wanted, and then another four months to get from 80% to 95%. The rapid early progress led them to badly underestimate the difficulty of the later optimization, especially reducing hallucinations. A startup building an e-commerce AI sales assistant put it this way: going from 0 to 80% took about as long as going from 80% to 90%. The problems they ran into include:
Accuracy vs. latency: more planning and self-correction means more steps in the pipeline and higher latency;
Tool calls: the model has trouble telling similar tools apart;
Tone of voice: it is hard to get the model to consistently follow a system prompt like "speak like a luxury brand concierge";
Intent understanding: accurately grasping what the customer really needs is hard;
Testing: the space of possible request combinations is nearly infinite, making a complete test suite impractical.
In the UltraChat paper, Ding et al. also pointed out: "It is easy to go from 0 to 60, but it is extremely difficult to go from 60 to 100." This is one of the earliest painful lessons learned by AI product developers. "It is easy to make a demo, but it is difficult to make a product."
Besides hallucinations, latency, the accuracy/latency tradeoff, tool use, prompts, and testing, there are also:
API instability: one team reported that 10% of their API requests timed out, and I have run into this repeatedly when trying agent applications myself (a retry sketch appears at the end of this section);
Compliance issues: copyright of model output, data access and sharing, user privacy, security risks introduced by the retrieval system, and unclear provenance of training data;
Safety issues: the product may be misused or produce offensive content.
Remember to factor these potential obstacles into product milestones and resource planning. A friend calls this "cautious optimism." Remember: "Many cool demos never turn into great products."
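For the API-instability point above, here is a minimal sketch of a retry-with-timeout wrapper. The timeout, retry count, backoff, and the exception being caught are illustrative choices; replace them with whatever your SDK actually raises and whatever limits suit your product.

```python
# A minimal sketch of retrying timed-out API calls with exponential backoff.
# Timeout, retry count, and backoff are illustrative; tune them for your provider,
# and catch your SDK's actual timeout exception instead of the generic TimeoutError.
import time

def call_with_retries(call, max_retries: int = 3, timeout_s: float = 30.0, base_delay_s: float = 1.0):
    """Run call(timeout_s); on timeout, wait and retry, doubling the delay each attempt."""
    for attempt in range(max_retries):
        try:
            return call(timeout_s)
        except TimeoutError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay_s * (2 ** attempt))

# Usage (illustrative): result = call_with_retries(lambda t: my_client.complete(prompt, timeout=t))
```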
Myth 5: Abandoning human evaluation
To evaluate AI applications automatically, many people turn to "AI evaluating AI" (LLM-as-a-judge). A common mistake is to rely entirely on the AI judge with no human evaluation at all. "An AI judge is certainly useful, but it is not necessarily reliable." Its effectiveness depends on the model, the prompts, and the application scenario behind it. A poorly designed AI judge can produce misleading scores, and it needs to be continuously improved like any other AI application. Good model products almost always keep a human evaluation loop, manually reviewing some samples (anywhere from 30 to 1,000) every day. The main reasons are:
Comparing human and AI scores: if human scores drop while AI scores rise, it's time to review the judge model;
A deeper understanding of user behavior: this can give you ideas for optimization;
Spotting hidden shifts in user behavior in the data: especially shifts tied to current events that automated checks won't catch.
The reliability of human evaluation in turn depends on clear annotation guidelines. Good guidelines also help you improve your prompts: if people can't understand them, neither will the model. And the same guidelines can later be used to construct fine-tuning data.
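As a rough sketch of pairing an AI judge with a small daily human sample, something like the following; the judge stub, the 1-5 scale, the sample size, and the correlation check are illustrative assumptions rather than anything prescribed by the article.

```python
# A minimal sketch of pairing LLM-as-a-judge scores with a daily human sample.
# The judge stub, the 1-5 scale, and the sample size are illustrative assumptions;
# replace the stub with a real call to your judge model.
import random
from statistics import correlation  # Python 3.10+

def llm_judge_score(question: str, answer: str) -> float:
    # Placeholder: call your judge model here and map its verdict to a 1-5 score.
    return 3.0

def sample_for_human_review(interactions: list, n: int = 50) -> list:
    """Pick a daily sample (the article cites 30 to 1,000) for manual annotation."""
    return random.sample(interactions, min(n, len(interactions)))

def judge_vs_human_drift(human_scores: list[float], judge_scores: list[float]) -> float:
    """Low or falling correlation (humans trending down while the judge trends up)
    means the judge itself needs review, not just the application."""
    return correlation(human_scores, judge_scores)
```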
For some projects, "looking at the data for 15 minutes" is enough to surface key problems. Manually inspecting data is often the most valuable work, and also the least glamorous.
Myth 6: Crowdsourcing the product direction
In the early rush to catch the generative AI wave, many companies had no clear idea which application direction to focus on, so they "crowdsourced" ideas from across the whole company: "We've hired so many smart people, let them tell us what to do." The result was millions of text-to-SQL models, Slack bots, and countless code plugins.
Of course, listening to employees' suggestions is right. But individuals tend to focus on the issues that most affect their own daily work, not the issues with the highest ROI for the company. "Without the guidance of an overall strategy, it is easy to end up with a series of low-impact, fragmented projects, and to draw the wrong conclusion that 'generative AI has no value.'"