Why do large models use tokens? Why not UTF-8?

An in-depth look at how large models process language, and at what separates tokens from UTF-8 encoding.
Core content:
1. What tokens are and the role they play in large models
2. A comparison of tokens and UTF-8 encoding
3. The advantages of tokens and how they are applied in large models
A token, also called a word unit, is the most basic unit of processing in a large language model. When a user sends text to a large model, the text is first converted into tokens and fed to the model, which predicts the next token; the predicted tokens are then converted back into readable text to form the output. This article explains what tokens are, why they are used, and what their advantages and disadvantages are. Most importantly, it explains why large models establish their own token-based encoding scheme instead of reusing the widely adopted UTF-8.
The essence of tokens
A token is essentially an integer representation of a word, and tokenization is a core technique in natural language processing: it converts human-readable text into a numeric form that computers handle easily, paving the way for large models to understand and generate language. Table 1 shows the tokens corresponding to a small vocabulary. From the table, the token corresponding to "I" is 1 and the token corresponding to "large model" is 3. By analogy, if the user input is "I like technology", it becomes [1, 2, 4] after conversion to tokens. Likewise, if the model outputs [1, 2, 3], that sequence is converted back into the human-readable text "I like large model".
| word | token |
|---|---|
| I | 1 |
| like | 2 |
| large model | 3 |
| technology | 4 |

Table 1 Token IDs for a small example vocabulary
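As a concrete illustration of the mapping in Table 1, the whole scheme amounts to two dictionary lookups. This is a toy sketch; the vocabulary and IDs are the made-up ones from the table, not those of any real tokenizer:

```python
# Toy vocabulary mirroring Table 1; real tokenizers learn vocabularies of ~100k entries.
vocab = {"I": 1, "like": 2, "large model": 3, "technology": 4}
id_to_word = {i: w for w, i in vocab.items()}

def encode(words):
    """Map a list of words to a list of token IDs."""
    return [vocab[w] for w in words]

def decode(token_ids):
    """Map a list of token IDs back to readable text."""
    return " ".join(id_to_word[i] for i in token_ids)

print(encode(["I", "like", "technology"]))  # [1, 2, 4]
print(decode([1, 2, 3]))                    # "I like large model"
```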
As noted above, a token is essentially an integer representation of a word, which makes tokenization a form of text encoding. So why do large models use tokens rather than UTF-8? After many years of development, UTF-8 is widely supported and has become the de facto standard for text encoding, so why not reuse such a mature technology?
1. Large models focus on semantics rather than character composition
The purpose of a character encoding such as UTF-8 is to represent and store text symbols in a computer; it is a universal standard for representing characters. The goal of a large model, however, is to understand and generate natural language, which is more than a pile of characters: what matters is the meaning of, and the relationships between, characters, words, and sentences. To a large model, individual characters carry little meaning on their own; they only become meaningful once they form words with semantics. Although encodings such as UTF-8 can represent every character, they say nothing about the relationships between words or the structure and meaning of sentences. If a large model took raw characters as input, it would have to learn the mapping from characters to semantics from scratch, which would be inefficient and difficult.
For example, when a large model encounters the word "apple", it needs to understand the word as a whole; the individual characters it is made of are meaningless in isolation. Moreover, "apple" can refer to the fruit or to a technology company. This kind of semantic ambiguity and context dependence cannot be expressed by a character encoding. Large models need to map words into a more abstract semantic space in order to perform complex tasks such as reasoning and generation.
2. Token encoding is more efficient
A token encodes a whole word as the basic unit, which greatly reduces sequence length. Take the sentence "A word is a concept in natural language in the field of artificial intelligence" as an example. In its original Chinese form the sentence is 20 characters long, but encoded with the Qwen2.5 tokenizer it is only 10 tokens. Assuming each token ID occupies 3 bytes, that is 30 bytes, whereas UTF-8 needs 3 bytes per Chinese character, or 60 bytes in total. The sequence length is cut in half.
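One way to reproduce this kind of comparison is with a Hugging Face tokenizer. The sketch below is illustrative: the `Qwen/Qwen2.5-7B-Instruct` checkpoint is an assumption (the article does not say which Qwen2.5 variant it used), and the article's numbers were measured on the original Chinese sentence rather than the English translation used here, so the counts printed will differ.

```python
# A minimal sketch comparing token count against raw UTF-8 byte length.
from transformers import AutoTokenizer

# Assumed checkpoint; the article only says "the Qwen2.5 encoder".
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

text = "A word is a concept in natural language in the field of artificial intelligence"

token_ids = tokenizer.encode(text)       # list of integer token IDs
utf8_bytes = len(text.encode("utf-8"))   # size of the raw UTF-8 encoding

print(f"tokens: {len(token_ids)} -> ~{3 * len(token_ids)} bytes at 3 bytes per ID")
print(f"UTF-8 : {utf8_bytes} bytes")
```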
3. Tokens can improve computing efficiency
A shorter sequence directly shrinks the attention matrix in the self-attention mechanism, whose size grows with the square of the sequence length, and therefore directly reduces the model's computational complexity. At the same time, shorter sequences mean less computation overall, which speeds up both training and inference.
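To make the scaling concrete: self-attention builds an n × n score matrix, so halving the sequence length cuts that matrix to a quarter of its size. A back-of-the-envelope sketch, using the sequence lengths from the example above:

```python
# Rough size of the self-attention score matrix (one head, one layer),
# ignoring constant factors: it grows with the square of the sequence length.
def attention_matrix_entries(seq_len: int) -> int:
    return seq_len * seq_len

for n in (20, 10):  # character-level vs. token-level lengths from the example above
    print(f"sequence length {n:2d} -> {attention_matrix_entries(n):3d} score entries")
# 20 -> 400 entries; 10 -> 100 entries: half the length, a quarter of the matrix.
```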
In summary, tokens greatly improve the computational efficiency of large models, which is why token-based encoding is used so widely in them.
Disadvantages of tokens
So are tokens perfect? Clearly not; they have their own evident shortcomings.
1. No unified standards
Because there is no unified standard, the input formats of models trained with different tokenizers are incompatible: different tokenizers produce different outputs for the same input. Table 2 shows the tokens produced for the input "A word is a concept in natural language in the field of artificial intelligence" by the tokenizers of several models. This limits the portability and interoperability of models to some extent.
| model | token count | token IDs |
|---|---|---|
|  | 13 | 31892, 6753, 3221, 47243, 60319, 89497, 1616, 116258, 108329, 58159, 22912, 49062, 41276 |
| gpt-3.5 | 25 | 6744, 235, 24186, 21043, 17792, 49792, 45114, 118, 27327, 19817, 228, 35722, 253, 9554, 37026, 61994, 73981, 78244, 16325, 9554, 48044, 162, 25451, 26203, 113 |
| Qwen2.5 | 10 | 99689, 23305, 20412, 104455, 104799, 99795, 102064, 101047, 46944, 101290 |
| DeepSeek-R1 | 9 | 4055, 1673, 389, 33574, 29919, 4377, 7831, 60754, 9574 |

Table 2 Encoding results of different models for "A word is a concept in natural language in the field of artificial intelligence"
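A sketch of how such a comparison can be produced, assuming the `tiktoken` package for the GPT-3.5 tokenizer and the Hugging Face `Qwen/Qwen2.5-7B-Instruct` checkpoint; since the exact tokenizers behind Table 2 are not stated, the IDs printed here need not match the table:

```python
# Encode the same sentence with two unrelated tokenizers and compare the IDs.
import tiktoken
from transformers import AutoTokenizer

text = "A word is a concept in natural language in the field of artificial intelligence"

gpt35 = tiktoken.encoding_for_model("gpt-3.5-turbo")
qwen = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # assumed checkpoint

gpt35_ids = gpt35.encode(text)
qwen_ids = qwen.encode(text)

print("gpt-3.5:", len(gpt35_ids), gpt35_ids)
print("Qwen2.5:", len(qwen_ids), qwen_ids)
# The two ID sequences differ in both length and values, so a model trained
# with one tokenizer cannot consume IDs produced by the other.
```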
For multilingual models, designing a tokenizer and vocabulary that handle many languages effectively is an even harder problem. Different languages have different character sets, vocabulary structures, and grammatical rules, requiring more sophisticated designs to balance efficiency and performance.
2. The granularity of the tokenizer affects model performance
The granularity of a tokenizer refers to how many characters it groups together to form one word unit. For example, "artificial intelligence" can be treated as a single unit, or as two units, "artificial" and "intelligence"; that choice is the granularity. The finer the granularity, the better the tokenizer handles spelling errors, visually similar characters, and morphological variation (roots, affixes), but the sequences it produces are much longer and computation is less efficient. Conversely, the coarser the granularity, the shorter the sequences, the higher the computational efficiency, and the easier it is for the model to learn word-level semantic relationships; however, morphological variation becomes hard to handle. For example, "run", "running", and "ran" are treated as entirely different words, and low-frequency and rare words are handled poorly.
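A small self-contained sketch of the two extremes (the word-level vocabulary here is invented for illustration, not taken from any real tokenizer):

```python
text = "run running ran"

# Character-level granularity: a long sequence, but no symbol is ever unknown.
char_tokens = list(text)
print(len(char_tokens), char_tokens)  # 15 tokens, including spaces

# Word-level granularity: a short sequence, but "run", "running" and "ran"
# get three unrelated IDs even though they share the root "run".
word_vocab = {"run": 0, "running": 1, "ran": 2}
word_tokens = [word_vocab[w] for w in text.split()]
print(len(word_tokens), word_tokens)  # 3 tokens: [0, 1, 2]
```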
3. Vocabulary size affects model performance
A small vocabulary means the model has relatively few parameters, which reduces storage and compute requirements, speeds up training and inference, and makes deployment easier in resource-constrained environments. However, many words will then be treated as out-of-vocabulary (OOV) words, causing a serious loss of input information; the model cannot effectively understand the semantics of the text, and performance drops. A small vocabulary may also limit the model's ability to capture fine-grained semantic information, for example making it unable to distinguish synonyms or understand technical terms.
A large vocabulary can accommodate more words and reduce the OOV rate, letting the model handle a richer range of language phenomena and, in theory, understand more complex semantics. However, it directly causes a sharp increase in parameter count, a larger model, slower training and inference, and higher demands on computing resources. An overly large vocabulary can also make the model more prone to overfitting the training data, reducing generalization. In addition, when the vocabulary contains many low-frequency words, learning becomes harder: the model must assign parameters to these words and learn their representations, yet they appear only a limited number of times in the training data and are therefore difficult to learn well.
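The parameter cost of the vocabulary is easy to estimate: the input embedding alone is a vocab_size × hidden_dim matrix (an untied output projection doubles that). A rough sketch, assuming a hidden size of 4096, an illustrative figure not taken from the article:

```python
# Embedding parameters grow linearly with vocabulary size.
def embedding_params(vocab_size: int, hidden_dim: int = 4096) -> int:
    return vocab_size * hidden_dim

for vocab_size in (32_000, 100_000, 250_000):
    params = embedding_params(vocab_size)
    print(f"vocab {vocab_size:>7,} -> {params / 1e9:.2f}B embedding parameters")
# 32k -> ~0.13B, 100k -> ~0.41B, 250k -> ~1.02B: a larger vocabulary adds
# parameters that rarely seen words never get enough data to train well.
```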
Summary
This article introduced what tokens (word units) are and why large models use them. Used well, tokens can significantly improve model performance, but using them well is not easy: granularity and vocabulary size both affect performance and must be balanced for the scenario at hand, and the tokenizers of different models are not interchangeable.