From burning cash to real-world deployment, it's time for large models to be "accepted"

Large model technology is undergoing a price revolution. How did ByteDance catch up?
Core content:
1. The release of ByteDance's Doubao visual understanding model and its ultra-low pricing
2. The competitive landscape of China's large model market and ByteDance's rapid growth
3. The market performance and traffic advantages of ByteDance's Doubao large model
At the Volcano Engine Force Conference held on December 18, 2024, ByteDance officially released the Doubao visual understanding model, announcing a price of 0.003 yuan per thousand tokens, meaning one yuan buys the processing of 284 720P images.
Compared with Claude 3.5 Sonnet at 0.021 yuan per thousand tokens, Qwen-VL-Max at 0.02 yuan, and GPT-4o at 0.0175 yuan, Doubao's visual understanding model is roughly 85% cheaper than prevailing industry prices.
Before that, in May, ByteDance released the Doubao large model, whose flagship model was priced at 0.0008 yuan per thousand tokens in the enterprise market: 0.0008 yuan buys the processing of more than 1,500 Chinese characters, 99.3% cheaper than the industry average, shifting large model pricing from the fen (cent) level down to the li (a tenth of a cent) level.
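The pricing claims above are simple per-token arithmetic. The following back-of-the-envelope sketch (a reader's check using only the figures quoted in this article; the tokens-per-image number is derived, not officially published) reproduces them:

```python
# Yuan-per-thousand-token prices as quoted in this article.
PRICES = {
    "Doubao visual understanding": 0.003,
    "Claude 3.5 Sonnet": 0.021,
    "Qwen-VL-Max": 0.02,
    "GPT-4o": 0.0175,
}

def discount_vs(base: str, other: str) -> float:
    """Percentage saved by using `base` instead of `other`."""
    return (1 - PRICES[base] / PRICES[other]) * 100

for rival in ("Claude 3.5 Sonnet", "Qwen-VL-Max", "GPT-4o"):
    saving = discount_vs("Doubao visual understanding", rival)
    print(f"vs {rival}: {saving:.1f}% cheaper")

# "284 720P pictures for one yuan" implies roughly this many tokens per image:
tokens_per_yuan = 1 / 0.003 * 1000          # about 333,333 tokens
tokens_per_image = tokens_per_yuan / 284    # about 1,174 tokens per 720P image
print(f"implied tokens per 720P image: {tokens_per_image:.0f}")
```

The savings come out between roughly 83% and 86% depending on the rival model, consistent with the article's "85% cheaper" headline figure.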
This move forced Alibaba Cloud to cut the prices of its three flagship Tongyi Qianwen models again, with reductions of up to 90%. Baidu Intelligent Cloud went further, announcing that two flagship models of its Wenxin (ERNIE) family, ERNIE Speed and ERNIE Lite, would be completely free.
How did ByteDance catch up in the AI large model market in less than a year and a half? How far have multimodal large models developed? What new trends will large model technology see next?
01.
Has "King of Involution" Doubao come from behind to take the lead?
2023 is the year of the "big explosion" of domestic large models.
Starting in March 2023, many large companies and startups launched self-developed large model products: Alibaba's Tongyi Qianwen 1.0, Tencent Hunyuan, 360 Zhinao, Huawei Pangu, iFlytek Spark, SenseTime SenseNova, Baichuan's large model, and Zhipu AI's GLM were all born that year.
A latecomer to AI, ByteDance only assembled its large model R&D team in January 2023, releasing the "Skylark" large model and the public beta of its AI chat product "Doubao" that August.
Timeline of domestic large-scale model development Source: First New Voice
Doubao arrived late, but nothing could hold back its rapid growth.
According to data from the QbitAI (Quantum Bit) think tank, Doubao's cumulative user base for 2024 had exceeded 160 million by the end of November; that month, Doubao averaged 800,000 new downloads per day and nearly 9 million daily active users, second only to OpenAI's ChatGPT worldwide and first in China.
As for the Doubao general-purpose model released in May this year, data published by ByteDance shows that by mid-December its average daily token usage had exceeded 4 trillion, a 33-fold increase over the seven months since its initial release.
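To put that growth figure in perspective, a quick calculation (assuming, for simplicity, steady compound growth over the period) shows what a 33-fold rise in seven months implies month over month:

```python
# A 33-fold increase in daily token usage over roughly seven months,
# under a simplifying assumption of steady compound growth.
growth_total = 33   # 33-fold increase, per the article
months = 7
monthly_factor = growth_total ** (1 / months)
print(f"average month-over-month growth: x{monthly_factor:.2f}")
```

In other words, usage would have had to grow by roughly 65% every month to reach that figure.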
Doubao's growth is inseparable from the strong push of ByteDance, the "King of Involution".
The first is competing on traffic.
According to App Growing statistics, as of November 15, ten domestic large model products including Kimi, Doubao, and Xingye had placed more than 6.25 million ads, with total spending of 1.5 billion yuan. Kimi and Doubao were the two heaviest advertisers, at roughly 540 million yuan and 400 million yuan respectively.
Image source: App Growing
At present, buying traffic is the most direct and fastest way to launch an AI product. Among the various advertising channels, ByteDance's Ocean Engine (the company's advertising platform, covering marketing inventory such as Toutiao, Douyin, and Xigua Video) is all but indispensable.
This has allowed Doubao, backed by ByteDance, to squeeze maximum advantage from its traffic pool. On Douyin, ByteDance has blocked ad placements for virtually all AI applications other than Doubao, reserving the channel for its own app. Whether heavy traffic spending can be converted into a super app remains uncertain, but it has at least brought Doubao visible user growth.
The second is competing on products.
From chat assistants and video tools to entertainment and office applications, ByteDance has launched more than a dozen AI apps, covering almost every major AI product direction. In October this year, ByteDance also launched the Ola Friend earbuds, which support voice interaction with Doubao, and is reportedly developing AI glasses as well.
On the one hand, this saturation-style development lets the Doubao large model iterate faster by drawing on its many AI applications; on the other, the hope is that AI hardware terminals will broaden the model's usage scenarios, closing the loop on a "Doubao+" industrial chain.
Beyond that, Doubao is also expanding into new application scenarios in pursuit of greater success at the application layer.
Reportedly, the Doubao large model has partnered with 80% of mainstream automobile brands and been integrated into many smart terminals such as phones and PCs, covering about 300 million devices. Calls to the Doubao large model from smart terminals have grown 100-fold in half a year. Over the past three months, Doubao call volume has grown 39-fold in information-processing scenarios, 16-fold in customer service and sales, 13-fold in hardware terminals, and 9-fold in AI tools, with significant growth in learning and education as well.
It is fair to say that a rich internal ecosystem, sustained resource investment, vast amounts of high-quality data, and abundant application scenarios, all connected to AI and to one another, are the secret behind Doubao becoming the industry's "King of Involution".
02.
The second half begins with multimodal competition
Since OpenAI unveiled Sora, making it possible to "generate a video from a single sentence", and Google released Gemini, which can seamlessly understand, operate on, and combine different types of information, major domestic companies have followed suit, rolling out multimodal AI applications spanning video, music, and voice.
For example, starting in May this year, Vidu, Kuaishou's Keling, ByteDance, Zhipu's Qingying, SenseTime's Vimi, and others successively released text-to-video models; in September, MiniMax officially released its video model video-01, Alibaba Cloud unveiled the new Tongyi Wanxiang video generation model at the Yunqi Conference, and Meitu announced that its MiracleVision large model had completed its video generation upgrade; in November, Tencent's Hunyuan large model officially launched video generation capabilities, and Kimi, from Moonshot AI, was revealed to be internally testing an AI video generation feature called "Kimi Creative Space"... The "multi" in multimodality is becoming a new development direction.
This feature can be accessed in the Tencent Yuanbao app under AI Applications → AI Video. Image source: Tencent Youtu Lab
The Doubao visual understanding model released by Volcano Engine this time is said to have the following main capabilities:
Stronger content recognition capabilities: Not only can it identify basic elements such as object categories and shapes in images, but it can also understand the relationships between objects, spatial layout, and the overall meaning of the scene.
Stronger understanding and reasoning abilities: Not only can it better recognize content, but it can also perform complex logical calculations based on the recognized text and image information.
More detailed visual description capabilities: it can describe the content of an image in finer detail and produce descriptions in a variety of writing styles.
Following GPT-4's milestone breakthrough in language, the industry widely sees "vision" as the next breakout track. After all, roughly 80% of the sensory information humans take in is visual, and future large models should likewise draw on more types of senses in the search for a path to AGI.
Tan Dai, president of Volcano Engine, also said in an interview that launching the visual understanding model effectively unlocks a large class of scenarios. Compared with past AI limited to text dialogue, combining chat with deep reasoning, visual understanding of images, and other capabilities lets the model handle the vast, mixed information of the real world and assist humans with a series of complex tasks.
For example, in tourism, it can help travelers read foreign-language menus and explain the background of buildings in photos; in education, it can help students polish essays and learn popular science; in office work, beyond recognizing content, the model can help users analyze data relationships in charts and work through code logic.
Doubao Visual Understanding Model Educational Scenario Application Case Image Source: Volcano Engine Force Conference
Beyond the visual understanding model, Volcano Engine also released or upgraded several other models. For example, Doubao General Model Pro has been fully benchmarked against GPT-4o; the music model has been upgraded from generating a simple 60-second structure to producing a complete 3-minute work; and the text-to-image model version 2.1 has been integrated into Jimeng AI and the Doubao app...
Evidently, although the Doubao model family arrived later than comparable products on the market, it has been updated at a relatively fast pace, and its latest capabilities have quickly reached ordinary users through applications such as Jimeng AI and the Doubao app.
At present, the focus of the AI market is gradually shifting from "large models" to "large models +". Beyond conventional AI text dialogue applications, multimodality is becoming the new direction.
03.
It's time for large models to be "accepted"
At the 2024 World Artificial Intelligence Conference, Baidu founder Robin Li said in his speech that "the hundred-model war that broke out in China in 2023 actually caused a huge waste of social resources, especially computing power." Indeed, from R&D costs to application operating costs, every step of a large model's growth requires real-money support.
As the industry returns to rationality, more and more AI companies are realizing that competing on parameter counts, token counts, cluster sizes, and prices is not meaningful in itself; the commercialization of large models is the issue that most needs attention.
By type of end user, the business models of AI large models fall into to C and to B.
to C: aimed at individual consumers, including free and paid subscription models. Free offerings include Tencent Yuanbao and Baidu's Wenxin Yiyan (version 3.5); paid subscriptions include Baidu's Wenxin Yiyan (version 4.0) and OpenAI's ChatGPT (version 4.0);
to B: aimed at enterprises, including API call authorization and SaaS models. Under API call authorization, enterprise customers integrate AI capabilities into their own applications or services and are typically charged by call count or data volume, as with Alibaba's Tongyi Qianwen and Zhipu AI; under the SaaS model, large model vendors provide software as a service so customers need not install or maintain anything, as with Google Cloud AI. In practice, large model vendors usually mix multiple business models.
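The to-B distinction above, usage-metered API billing versus a flat SaaS fee, can be illustrated with a short, purely hypothetical sketch. The class names and the SaaS fee are invented for illustration; only the 0.0008 yuan per thousand tokens rate comes from this article:

```python
# Hypothetical comparison of the two to-B billing models described above:
# usage-based API metering vs. a flat SaaS subscription.
from dataclasses import dataclass

@dataclass
class ApiMeteredPlan:
    price_per_1k_tokens: float  # yuan per thousand tokens

    def monthly_bill(self, tokens_used: int) -> float:
        return tokens_used / 1000 * self.price_per_1k_tokens

@dataclass
class SaasPlan:
    monthly_fee: float  # flat yuan per month, independent of usage

    def monthly_bill(self, tokens_used: int) -> float:
        return self.monthly_fee

api = ApiMeteredPlan(price_per_1k_tokens=0.0008)  # Doubao-style rate from this article
saas = SaasPlan(monthly_fee=2000.0)               # invented flat fee for illustration

# Below this monthly usage, the metered plan is cheaper than the flat fee.
breakeven_tokens = saas.monthly_fee / api.price_per_1k_tokens * 1000
print(f"break-even at {breakeven_tokens:,.0f} tokens/month")
```

At per-token prices this low, the break-even point sits in the billions of tokens per month, which is one reason usage-based pricing dominates for smaller enterprise customers.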
Image source: AI drawing
The current fierce competition for multimodal large models will drive many industries to reshape their production processes, and will inevitably trigger a new round of upgrades and competition in the following areas:
Audiovisual creation: When the big model shifts from single-modal generation to multi-modal generation, the application of AIGC lowers the threshold for professional creation. This will change the production model of the audiovisual media industry, shape a new content production paradigm, and achieve the goals of improving creation efficiency, expanding creation space, and improving the quality of works.
Emotional companionship: built on the latest AI models such as GPT-4o and Gemini 1.5 Pro, future AI companions will greatly enhance the interactive experience through streaming speech recognition, multimodal AI, and affective computing. In other words, multimodal large models will give machines emotional value, meeting users' diverse companionship needs by deeply analyzing their emotions and behaviors.
Intelligent industrial manufacturing: multimodal large models are expected to complement and integrate with the specialized small models commonly used today, deeply empowering every link of industrial manufacturing. As scene data accumulates and is integrated, their perception and understanding capabilities will keep improving, meeting personalized manufacturing needs and driving industrial transformation.
In short, the core of competition in the field of AI has changed from the battle of whether or not to have large models to the battle of applications. At this stage, the competition is no longer about macro concepts, but about implementation capabilities and commercialization progress.
With the continuous iteration and upgrading of domestic large models, coupled with the gradual easing of domestic GPU supply issues and policy guidance, the demand for domestic large model training and inference computing power is expected to be gradually released. This will not only further accelerate the implementation of large models, but also bring new industry opportunities to the AI era.