Large model token consumption may be a muddled ledger

Written by
Silas Grey
Updated on: July 12, 2025
Recommendation

An in-depth look at the resource-consumption problem of large model applications, providing an important reference for technical decision makers.

Core content:
1. How resource consumption in large model applications compares with that of Web applications
2. The key factors that drive large model token consumption
3. How to effectively control token consumption and avoid wasting resources


If you are deploying a large model application, be sure to warn your CEO in advance: large model applications are far less controllable than Web applications in terms of resource costs.

For classic Web applications such as e-commerce, gaming, travel, new energy, education, and healthcare, CPU consumption is controllable and positively correlated with the number of online users and session duration. If computing resources spike, it is usually because of a marketing campaign or an unexpected traffic burst; after servers are elastically scaled out, consumption returns to normal once traffic stabilizes. The resources consumed by the backend are traceable and controllable. Token consumption in large models is not.


Table of contents

01 What factors affect large model token consumption?

02 The hidden sources of large model token consumption

03 The Agent's resource consumption ledger is even more complicated

04 A preliminary look at controlling abnormal token consumption

05 Conclusion


01 What factors affect large model token consumption?

According to an article from QbitAI (Quantum Bit) [1], when the input is "the distance between two paths in a tree", DeepSeek falls into endless thinking. The author measured a thinking time of up to 625 seconds, with the output reaching 20,000 words. The prompt is not some complex, meaningless garbled string; it looks like a perfectly ordinary question. If there is anything to fault, it is that the phrasing is incomplete.



This endless, repetitive thinking is a torment for the model itself and a waste of computing resources. If abused by attackers, it is tantamount to a DDoS attack on the reasoning model. So, apart from the number of online users and session duration, what other factors affect large model token consumption?


This article uses DeepSeek as an example; the billing rules and cost factors of other large model APIs are similar.


According to the official DeepSeek pricing documentation [2], the API call fee depends on the following parameters (a rough cost-estimation sketch follows the list):


  • Model type: V3 and R1 have different unit prices per million tokens; R1 is priced higher than V3 because it has reasoning capability.
  • Number of input tokens: billed per million tokens; the more you use, the higher the fee.
  • Number of output tokens: billed per million tokens; the more you use, the higher the fee, and the output unit price is higher than the input unit price.
  • Whether the cache is hit: the unit price for a cache hit is lower than for a cache miss.
  • Peak vs. off-peak hours: the unit price is lower during off-peak hours.
  • Chain of thought: counted toward the number of output tokens.
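
To make the arithmetic concrete, here is a minimal cost-estimation sketch. The `estimate_cost` helper and its default prices are illustrative placeholders, not DeepSeek's actual price sheet; substitute the figures from the official pricing page.

```python
# Rough sketch of how the billing parameters above combine into a per-request cost.
# All prices are placeholders (per million tokens); replace with official rates.

def estimate_cost(input_tokens: int,
                  output_tokens: int,
                  cache_hit_tokens: int = 0,
                  price_per_m_input_miss: float = 0.27,   # placeholder cache-miss input price
                  price_per_m_input_hit: float = 0.07,    # placeholder cache-hit input price
                  price_per_m_output: float = 1.10,       # placeholder output price
                  off_peak_discount: float = 1.0) -> float:
    """Estimate the cost of one API call in currency units."""
    miss_tokens = input_tokens - cache_hit_tokens
    cost = (miss_tokens * price_per_m_input_miss
            + cache_hit_tokens * price_per_m_input_hit
            + output_tokens * price_per_m_output) / 1_000_000
    return cost * off_peak_discount

# Example: a reasoning call whose chain of thought inflates the output tokens.
print(estimate_cost(input_tokens=2_000, output_tokens=8_000, cache_hit_tokens=1_500))
```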


In addition, during web search, the search request and the processing of the returned data (before content generation) are also counted as token usage. In practice, every invocation of the large model consumes tokens.


Under these billing rules, large model resource consumption is related to the following factors:


  • Length of the user's input text: the longer the input, the more tokens are consumed. As a rough rule of thumb, one Chinese character, one English word, one number, or one symbol counts as one token.

  • Length of the model's output text: the longer the output, the more tokens are consumed. Taking DeepSeek as an example, the unit price of output tokens is four times that of input tokens.

  • Size of the conversation context: with multi-turn context, the model must re-read the previous rounds of conversation before generating content, which significantly increases the input and therefore the token count (see the sketch after this list).

  • Task complexity: complex tasks may require more tokens. For example, generating long texts (such as translating and interpreting papers) and performing complex reasoning (such as math and science problems) consume more tokens, and a multimodal or complex agent usually consumes more tokens than a conversational chatbot.

  • Special characters, formats, and tags: these may increase token consumption. For example, HTML tags, Markdown formatting, and special symbols are split into multiple tokens.

  • Language and encoding: these also affect token consumption. For example, Chinese usually consumes more tokens than English because Chinese characters may require more encoding space.

  • The model itself: for the same model family, a higher-parameter version may produce more detailed output and thus consume more tokens, just as a taller, stronger person burns more energy per unit of exercise. Likewise, an unoptimized or poorly optimized inference layer is more likely to produce invalid, low-quality output and consume more tokens, whereas a well-trained athlete who controls their breathing rhythm burns less energy for the same exercise.

  • Whether deep thinking is enabled: the output token count includes all tokens of the chain of thought plus the final answer, so turning on deep thinking consumes more tokens.

  • Whether web search is enabled: web search requires the model to query external knowledge bases or websites, and the retrieved content is consumed as input tokens; the generated output, including external links and knowledge-base content, is consumed as output tokens.

  • Whether semantic caching is enabled: since cache hits and misses have different unit prices, semantic caching reduces resource consumption; further optimizing the caching algorithm yourself can reduce it even more.
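
To illustrate the context-size factor, here is a small sketch of how resending the full conversation history makes cumulative input tokens grow roughly quadratically with the number of turns. The per-turn token counts are illustrative assumptions.

```python
# Why multi-turn context inflates input tokens: if every request carries the full
# history, cumulative input grows roughly quadratically with the number of turns.

def cumulative_input_tokens(turn_lengths):
    """Sum of input tokens across turns when each request resends the full history."""
    total, history = 0, 0
    for user_tokens in turn_lengths:
        history += user_tokens          # this turn's message joins the history
        total += history                # the whole history is sent as input
    return total

turns = [200] * 10                      # ten turns of ~200 tokens each
print(cumulative_input_tokens(turns))   # 11,000 tokens vs. 2,000 without history
```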

02 The hidden sources of large model token consumption

Beyond the factors listed above, many hidden factors can lead to abnormal resource consumption in large model applications.


Code logic vulnerabilities

  • Uncontrolled looping calls: a misconfigured retry mechanism causes a single user session to generate repeated calls (see the sketch after this list).
  • Missing caching: caching is not enabled for frequently repeated questions, so tokens are spent regenerating similar answers over and over.
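
Here is a minimal sketch of the two mitigations above: cap retries with exponential backoff and memoize identical prompts. `call_model` is a hypothetical placeholder for whatever client function actually invokes the model.

```python
# Bounded retries plus an exact-match response cache, so a repeated question
# does not trigger a fresh generation and a failure cannot retry forever.

import hashlib
import time

_cache: dict[str, str] = {}

def cached_call(prompt: str, call_model, max_retries: int = 3) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:                       # cache hit: zero tokens consumed
        return _cache[key]
    for attempt in range(max_retries):      # hard cap, never retry indefinitely
        try:
            answer = call_model(prompt)
            _cache[key] = answer
            return answer
        except Exception:
            time.sleep(2 ** attempt)        # exponential backoff between retries
    raise RuntimeError(f"model call failed after {max_retries} retries")
```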


Prompt engineering defects

  • Redundant context carrying: if the full conversation history is carried on every request, the token count of a single request rises sharply; the longer the conversation, the greater the consumption (a trimming sketch follows this list).
  • Inefficient instruction design: unstructured prompts reduce the efficiency of model generation.
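
As a sketch of trimming redundant context, the snippet below keeps the system prompt plus only the most recent turns that fit a token budget. The characters-per-token heuristic is a crude assumption; a provider tokenizer would give accurate counts.

```python
# Trim the conversation history to a token budget before each request,
# so old turns stop inflating the input.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)            # rough heuristic, not exact

def trim_history(messages: list[dict], budget: int = 3000) -> list[dict]:
    """Keep the system prompt plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept, used = [], sum(estimate_tokens(m["content"]) for m in system)
    for msg in reversed(rest):               # newest turns first
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))
```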


Ecosystem dependency risks

  • Plugin call black holes: plugin call depth is not limited, and a single query triggers multiple repeated chained calls (a depth-limit sketch follows this list).
  • Third-party service fluctuations: vector database response delays lead to timeout retries, which indirectly increase token consumption.
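
A minimal sketch of limiting plugin call depth, assuming hypothetical `run_tool` and `needs_another_tool` hooks in the agent loop:

```python
# Cap tool/plugin call depth so one query cannot fan out into an unbounded chain.

MAX_TOOL_DEPTH = 5

def run_with_depth_limit(query, run_tool, needs_another_tool):
    result, depth = query, 0
    while needs_another_tool(result):
        if depth >= MAX_TOOL_DEPTH:
            raise RuntimeError("tool-call depth limit reached; aborting chain")
        result = run_tool(result)             # each hop may consume more tokens
        depth += 1
    return result
```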


Data pipeline flaws

  • Defects introduced during data preprocessing: data cleaning, preprocessing, and standardization are conventional means of improving input quality; for example, typos, missing values, and noise in user input can be completed and corrected through these steps. However, the completion and correction process can itself introduce input defects that lead to abnormal resource consumption.

03 The Agent's resource consumption ledger is even more complicated

Speaking of Agents, we have to mention MCP, which has recently become popular.


In January, we published "10 Questions About MCP | Quickly Understand the Model Context Protocol".

In March, we will also release "A Brief Analysis of MCP Monetization". You are welcome to follow the Higress official account.

In the interaction between large models and third-party data, APIs, and systems, MCP replaces fragmented integration approaches with a single standard protocol [3]. It is an evolution from N x N to one-for-all: there is no longer a need to write and maintain separate interface code for each external system, and AI systems can obtain the data they need in a simpler, more reliable way.


Before MCP appeared, agents had to rely on tools to connect to external systems. The more complex the planned task, the more external systems were called and the more often they were called, which brought high engineering costs. Take the Higress AI Agent flowchart below as an example: when a user sends the message "I want coffee near Wudaokou in Beijing, please recommend a place", the agent needs to call the Amap and Dianping APIs through tools. If a model self-correction step is introduced, the number of calls rises further.


The emergence of MCP will accelerate the rise of a wave of MCP server providers.


For example, Firecrawl officially adopted the MCP protocol in January this year through its integration with the Cline platform. Users can call Firecrawl's fully automatic web-crawling capabilities through its MCP server, avoiding page-by-page crawling of target sites and speeding up agent development. Yesterday, OpenAI released the Responses API and open-sourced the Agents SDK. I believe MCP and OpenAI will be the two main story lines along which agents reshape the labor market.


We will come to understand more and more the view that "AI targets the operating expenses of the enterprise rather than the budget for traditional software."


Back to Agents: compared with a conversational chatbot, an Agent's planning and execution process is more complicated and consumes more tokens. The diagram below, made by Zhihu author @tgt, shows that from the moment input arrives, the Agent's planning, memory, calls to external systems, and execution of output all invoke the large model and consume tokens. If a self-correction step is added before the final output to improve quality, the token cost rises further.


Manus, which has recently become popular, has demonstrated many user cases with impressive execution results, but behind them lies enormous computing-power consumption. In general, as Agents mature, the number of calls to foundation models will increase dramatically.


04 A preliminary look at controlling abnormal token consumption

Since the factors behind model resource consumption are numerous and complex, no single product or solution can address them all; a complete engineering system is needed before, during, and after an incident. We are still in the early days of managing token consumption, so the following is only a starting point; I believe we will see more and more practices for leaning out large model costs.

(1) Before an abnormal call occurs: preventive measures

a. Establish a real-time monitoring and threshold-warning system
    • Monitoring system: deploy a resource-monitoring dashboard to track basic observability signals such as metrics, logs, traces, and token usage in real time, so that when an abnormal call occurs the fault can be located quickly and the data can be used for flow control. [4]
    • Access control: classify permissions and control access by user identity (such as API keys) and provide consumer authentication, for example by limiting high-frequency calls, to avoid resource spikes caused by malicious or erroneous operations (a rate-limiting sketch follows). [5]
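
As a sketch of the access-control idea, the snippet below enforces a per-API-key call and token quota over a fixed window. In production this logic usually sits in the gateway; the limits here are illustrative placeholders.

```python
# Fixed-window call and token quota per API key, checked before the request
# is forwarded to the model. Limits are illustrative placeholders.

import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_CALLS_PER_WINDOW = 30
MAX_TOKENS_PER_WINDOW = 50_000

_usage = defaultdict(lambda: {"window_start": 0.0, "calls": 0, "tokens": 0})

def allow_request(api_key: str, estimated_tokens: int) -> bool:
    """Return True if this key may call the model within the current window."""
    now = time.time()
    record = _usage[api_key]
    if now - record["window_start"] > WINDOW_SECONDS:
        record.update(window_start=now, calls=0, tokens=0)    # start a new window
    if (record["calls"] + 1 > MAX_CALLS_PER_WINDOW
            or record["tokens"] + estimated_tokens > MAX_TOKENS_PER_WINDOW):
        return False                                          # throttle this key
    record["calls"] += 1
    record["tokens"] += estimated_tokens
    return True
```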

b. Data preprocessing
    • Format checks: before calling the model, check the format, length, and sensitive words of the user input, and filter out invalid or abnormal requests (such as overly long text or special-symbol attacks) to reduce wasted tokens (an input-validation sketch follows this list).
    • RAG optimization: use metadata for structured filtering before vector retrieval so the target documents are found precisely and only the relevant information is extracted, shortening the input and reducing token usage.
    • Semantic caching: cache large model responses in an in-memory database, delivered as a gateway plugin, to improve inference latency and cost. The gateway layer automatically caches each user's conversation history and fills it into the context of subsequent requests, so the model understands the conversation while the token cost of cache misses is reduced. [6]
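
A minimal sketch of the format checks described above, with illustrative length limits and blocklist terms:

```python
# Reject requests that are empty, overly long, or contain blocked terms
# before they ever reach the model. Limits and blocklist are placeholders.

MAX_INPUT_CHARS = 8_000
BLOCKED_TERMS = {"ignore previous instructions"}   # placeholder blocklist

def validate_input(text: str) -> tuple[bool, str]:
    """Return (ok, reason); only inputs passing all checks are sent to the model."""
    stripped = text.strip()
    if not stripped:
        return False, "empty input"
    if len(stripped) > MAX_INPUT_CHARS:
        return False, "input exceeds length limit"
    lowered = stripped.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return False, "input contains blocked terms"
    return True, "ok"
```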

c. Parameter tuning
    • Temperature tuning: adjust the model's parameters to control its output behavior. For example, the temperature parameter affects the randomness of the output; lowering it makes the output more deterministic and reduces unnecessary token generation. DeepSeek officially recommends a temperature of 0.0 for code generation and math problem solving and 1.3 for general dialogue (see the call sketch after this list).
    • Output length presets: when calling the model, preset a maximum output length and, based on the task, tell the model the approximate range expected. For example, when generating a summary, cap the output at no more than 4K tokens to prevent overly long text. DeepSeek supports a maximum output length of 8K.
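
The sketch below applies both settings through DeepSeek's OpenAI-compatible API. The model name, base URL, and recommended temperature values follow DeepSeek's public documentation at the time of writing; verify them against the current docs before relying on this.

```python
# Set temperature and cap output length on a DeepSeek chat call.

from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "Summarize the user's text in under 200 words."},
        {"role": "user", "content": "<long article text here>"},
    ],
    temperature=1.3,   # general dialogue; use 0.0 for code generation or math
    max_tokens=1024,   # hard cap on output tokens to bound cost
)

print(response.choices[0].message.content)
print(response.usage)  # prompt_tokens / completion_tokens for cost tracking
```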


(2) When an abnormal call occurs: real-time processing

a. Alerting, rate limiting, and blocking
    • Alerting: set dynamic baseline thresholds for core indicators such as token consumption, call frequency, and failure rate; once a threshold is exceeded, an alert is triggered (a threshold-check sketch follows this list).
    • Rate limiting and circuit breaking: when a sudden spike in token consumption or an abnormal failure rate is detected, rate limiting or even blocking is triggered automatically, keyed by URL parameters, HTTP request headers, client IP addresses, consumer names, or cookie keys, to protect core functionality and contain the blast radius. [7]
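
A minimal sketch of a dynamic-baseline alert on token consumption; the rolling window and three-sigma rule are illustrative choices rather than a recommendation from the article.

```python
# Compare the current window's token usage against a rolling baseline and
# flag large deviations for alerting or rate limiting.

from collections import deque
from statistics import mean, pstdev

class TokenAnomalyDetector:
    def __init__(self, window: int = 60):
        self.history = deque(maxlen=window)   # e.g. per-minute token totals

    def observe(self, tokens_this_minute: int) -> bool:
        """Record a sample and return True if it looks anomalous."""
        anomalous = False
        if len(self.history) >= 10:           # need some history for a baseline
            baseline = mean(self.history)
            spread = pstdev(self.history) or 1.0
            anomalous = tokens_this_minute > baseline + 3 * spread
        self.history.append(tokens_this_minute)
        return anomalous                      # True => trigger an alert
```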

b. Abnormal call tracing and isolation
    • Temporary ban: locate the source of abnormal calls (such as specific users, IP addresses, or API endpoints) through log analysis, and temporarily ban the abnormal requester to prevent further waste of resources.


(3) After an abnormal call occurs: recovery and optimization

a. Data compensation and code repair
    • Reduce statistical errors: correct statistics skewed by delayed data updates (such as missing token-consumption records), and recalibrate the data through offline computing jobs to keep the billing and monitoring systems accurate.
    • Code review and fixes: review the code that calls the large model and fix any logical errors or vulnerabilities, for example by checking for loops around model calls that could cause runaway token consumption.

b. Attack tracing and defense-strategy upgrades
    • Analyze abnormal call logs: identify whether the incident was an adversarial attack (such as a poisoning attack or maliciously generated requests), update blacklist rules, and deploy input-filtering models.
    • Strengthen identity authentication: for example, two-factor authentication to prevent resource abuse caused by leaked API keys.
    • Improve automated alerting and handling: refine alert rules so alarms are more accurate and timely, and streamline the exception-handling process to respond to abnormal token consumption more efficiently.

c. Long-term optimization measures
    • Tiered token management: allocate tokens with different permissions to different business lines to reduce the exposure risk of core-service tokens.
    • Automated testing and drills: regularly simulate abnormal token scenarios (such as expiration and invalidation) to verify that fault-tolerance mechanisms work.

05 Conclusion

In the past, we invested a great deal of time and energy in improving infrastructure resource utilization. Today, every company working on AI Infra is focused on resource efficiency, from the underlying hardware, the model layer, and the inference-optimization layer up to the gateway layer. This will be a long race between engineering and algorithms.