What are Tokens and Embeddings in LLM?

Written by
Caleb Hayes
Updated on: June 30th, 2025

Explore the concepts of Token and Embedding in LLM and understand the key steps of NLP tasks.

Core content:
1. The definition of Tokens and their importance in NLP
2. Common Tokenization methods and their application examples
3. Code examples demonstrating the output results of different Tokenization methods



01


Introduction



The context length of GPT-4 Turbo is 128K tokens.

The context length of Claude 2.1 is 200K tokens.

So here comes the question...

What are the tokens mentioned above?

Let's take a look at a simple example: input the sentence "It's over 9000!"

We can represent this as ["It's", "over", "9000!"]. Each individual element in the array is called a Token.

In natural language processing, a Token is the smallest unit of analysis we define. What counts as a Token depends on the tokenization method you use; there are many such methods, and creating Tokens is usually the first step in most NLP tasks.





02

Tokenization Methods in NLP

Let's take a look at some common methods for tokenizing input text through the following code examples.

# Example string for tokenization
example_string = "It's over 9000!"

# Method 1: White Space Tokenization
# This method splits the text based on white spaces
white_space_tokens = example_string.split()

# Method 2: WordPunct Tokenization
# This method splits the text into words and punctuation
from nltk.tokenize import WordPunctTokenizer
wordpunct_tokenizer = WordPunctTokenizer()
wordpunct_tokens = wordpunct_tokenizer.tokenize(example_string)

# Method 3: Treebank Word Tokenization
# This method uses the standard word tokenization of the Penn Treebank
from nltk.tokenize import TreebankWordTokenizer
treebank_tokenizer = TreebankWordTokenizer()
treebank_tokens = treebank_tokenizer.tokenize(example_string)

white_space_tokens, wordpunct_tokens, treebank_tokens

The output after running the above code is as follows:

(["It's", 'over', '9000!'], ['It', "'", 's', 'over', '9000', '!'], ['It', "'s", 'over', '9000', '!'])

Each of the above methods has its own way of breaking sentences into tokens. You can create your own method if you like, but the basic principles are the same.

So why do we need to tokenize the input text? The reasons can be summarized as follows:

Break complex text down into manageable units.

Present text in a format that is easier to analyze or manipulate.

Prepare text for specific language tasks such as syntactic parsing and named entity recognition.

Unify the preprocessing of text and the creation of structured training data in NLP applications.

Most NLP systems perform some operations on these tokens to complete a specific task. For example, we can design a system that processes a string of tokens and predicts the next token. We can also convert tokens into speech representations as part of a text-to-speech system, or use them for many other NLP tasks such as keyword extraction and translation.





03


How to use these tokens?

In the previous section, we introduced different methods for breaking text into Tokens. So the next question is how do we make good use of these Tokens?
  • Feature Extraction: Tokens are used to extract features that are input into machine learning models. Features may include the tokens themselves, the frequency of the tokens, the position of the tokens in the sentence, etc. For example, in sentiment analysis, the presence of certain tokens may strongly indicate positive or negative sentiment.
  • Vectorization: In many NLP tasks, tokens are converted into numerical vectors using techniques such as Bag of Words (BoW), TF-IDF (Term Frequency-Inverse Document Frequency), or word embeddings (e.g., Word2Vec, GloVe). This process converts text data into numbers that machine learning models can understand and process (see the sketch after this list).
  • Sequence modeling: In tasks such as language modeling, machine translation, and text generation, tokens are used in sequence models such as recurrent neural networks (RNNs), long short-term memory networks (LSTMs), or Transformers. These models learn to predict the next token in a sequence by understanding the context and the likelihood of each token appearing.
  • Training the model: During the training phase, the model receives tokenized text and corresponding labels or targets (such as categories for classification tasks or the next token for language models). The model learns the patterns and associations between tokens and the desired output.
  • Context Understanding: Advanced models like BERT and GPT use tokens to understand the context and generate embeddings that capture the meaning of words in a specific context. This is crucial for tasks where the same word may have different meanings depending on its usage.
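To make the vectorization point concrete, here is a minimal sketch using scikit-learn's TfidfVectorizer (this assumes scikit-learn is installed; the example sentences are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy documents (made-up examples)
docs = ["It's over 9000!", "It's not over yet", "the power level is over 9000"]

# Learn a vocabulary and convert each document into a TF-IDF weighted vector
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)        # sparse matrix of shape (3, vocabulary size)

print(vectorizer.get_feature_names_out()) # the learned vocabulary
print(X.toarray().round(2))               # one numeric vector per document

Each document becomes a row of numbers, which is exactly the kind of input a machine learning model can work with.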

If you are a beginner, don't worry about these keywords you just read. In simple terms, we convert text input into independent units called "Tokens". This way, it will be easier to convert them into "numbers" that computers can understand later.





04


Tokens in ChatGPT

What do tokens look like in LLMs like ChatGPT? The tokenization methods used for LLMs are different from those used in general NLP tasks.
Broadly, this is called "subword tokenization": the tokens we create are not necessarily complete words, unlike in the whitespace-based tokenization we saw earlier. This is exactly why one word does not equal one token.
When they say the context length of GPT-4 Turbo is 128K tokens, it’s not exactly 128K words, but a number close to it.
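As a quick illustration of why tokens and words don't line up exactly, here is a small sketch using OpenAI's tiktoken library (assuming you have it installed; cl100k_base is the encoding used by the GPT-4 family of models):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Tokenization splits uncommon words into smaller pieces."

token_ids = enc.encode(text)
tokens = [enc.decode([tid]) for tid in token_ids]

print(len(text.split()), "words ->", len(token_ids), "tokens")
print(tokens)

Common words stay whole, while rarer words are split into several subword tokens, so the token count is usually a bit higher than the word count.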

Why use such a different and complex tokenization approach?

The reasons can be summarized as follows:

Subword tokens can represent language more flexibly and compactly than complete words.

They help in processing large vocabularies, including rare and unknown words.

Using smaller subunits is more computationally efficient.

They help to better understand the context.

They adapt better to other languages, which may be structured very differently from English.







05


Byte Pair Encoding

Many open-source models, such as Meta's Llama 2 and the earlier GPT models, use a version of the Byte Pair Encoding (BPE) approach. In practice, BPE analyzes large amounts of text to identify the character pairs that occur most frequently.
Let’s take a simple example using the Tokenizer in GPT-2.
from transformers import GPT2Tokenizer

# Initialize the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "It's over 9000!"

# Tokenize the text
token_ids = tokenizer.encode(text, add_special_tokens=True)

# Output the token IDs
print("Token IDs:", token_ids)

# Convert token IDs back to raw tokens and output them
raw_tokens = [tokenizer.decode([token_id]) for token_id in token_ids]
print("Raw tokens:", raw_tokens)
The output is as follows:
Token IDs: [1026, 338, 625, 50138, 0]
Raw tokens: ['It', "'s", ' over', ' 9000', '!']

So what is a Token ID? Why is it a number?

Let's analyze how this process works.
  • Building the vocabulary
Building the vocabulary is part of the BPE algorithm. The steps are as follows (a toy sketch of the merge loop appears a little further below):
  • Start with characters: The vocabulary initially consists of individual characters such as letters and punctuation marks.
  • Find common character pairs: Scan the training data (a large corpus of text) to find the character pairs that appear most frequently. For example, if "th" appears frequently, it becomes a candidate for merging.
  • Merge and create new tokens: The most frequent character pairs are merged to form new tokens. This process is repeated, each time identifying and merging the next most frequent pair. The vocabulary grows from single characters to common pairs and eventually to larger structures such as common words or parts of words.
  • Limit the vocabulary size: The vocabulary size is capped (50,257 tokens in GPT-2). Once this limit is reached, merging stops, resulting in a fixed-size vocabulary that includes characters, common pairs, and larger subword tokens.
  • Assigning Token IDs
The common methods for assigning a corresponding Token ID to each Token are as follows:
  • Indexing the vocabulary: Assign a unique numeric index or ID to each unique token in the final vocabulary. This approach is simple, just like indexing in a list or array.
  • Token ID representation: In GPT-2, each piece of text (such as a word or part of a word) is represented by the ID of the corresponding token in the vocabulary. If a word is not in the vocabulary, it is broken down into smaller tokens that are in the vocabulary.
  • Special Tokens: Special tokens (such as tokens representing the beginning and end of text or unknown words) are also assigned unique IDs.
The key point is that token IDs are not assigned arbitrarily, but are based on the frequency and combination patterns of the language data the model is trained on. This allows GPT-2 and similar models to efficiently process and generate human language using a manageable, representative set of tokens.
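To make the merge step concrete, here is a toy sketch of the BPE training loop (a simplified illustration over a made-up mini corpus, not the actual GPT-2 implementation):

from collections import Counter

# A tiny made-up corpus: each "word" is a tuple of symbols with a frequency
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def most_common_pair(corpus):
    # Count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    # Replace every occurrence of the pair with a single merged symbol
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

# Each merge adds one new token to the vocabulary
for step in range(3):
    pair = most_common_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"Merge {step + 1}: {pair} -> {''.join(pair)}")

A real tokenizer repeats this until the vocabulary reaches its size limit, then assigns each resulting token an ID, which is simply its index in the vocabulary.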

Here, vocabulary refers to the set of all unique tokens that the model can recognize and process. Essentially, it is the set of tokens created using a given tokenization method with the help of training data.

Most current LLMs use some variation of BPE. For example, the Mistral model uses the byte fallback BPE tokenizer.
In addition to BPE, there are other subword tokenization methods, such as WordPiece, SentencePiece, and Unigram tokenization.
Don't worry about those for now.
What we need to know is that creating Tokens is the first step in any NLP or LLM pipeline, that there are different ways to create Tokens, and that each Token is assigned a Token ID, which represents the index of that Token in the vocabulary.




06


What is Embedding?

We come across this word quite often. Before discussing this word, let's clear up some confusion.
  • Token IDs are direct numeric representations of tokens. By themselves, they do not capture any deeper relationships or patterns between tokens.
  • Standard vectorization techniques (like TF-IDF) involve creating more complex representations of the numbers based on some logic.
  • Embeddings are high-level vector representations of tokens. They attempt to capture the most subtle differences, connections, and semantics between tokens. Each embedding is usually a series of real numbers in a vector space calculated by a neural network.
In short, the input text is converted into Tokens, each Token is assigned a Token ID, and these Token IDs are then used to create Embeddings, which give complex models a richer numerical representation.

Why do this?

Because computers can understand numbers and perform operations on numbers. Embeddings are the "real input" of LLM.
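Here is a minimal sketch of that idea in PyTorch: an embedding layer is essentially a lookup table that maps each Token ID to a learnable vector. The dimensions below are toy values chosen for illustration; real models use much larger embedding sizes.

import torch
import torch.nn as nn

vocab_size, embedding_dim = 50257, 8          # toy embedding size; GPT-2's vocabulary has 50,257 tokens
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([1026, 338, 625])    # token IDs, e.g. from the GPT-2 example above
vectors = embedding_layer(token_ids)          # one vector per token ID

print(vectors.shape)                          # torch.Size([3, 8])

In an untrained layer these vectors are random; training is what gradually turns them into meaningful representations.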







07


Conversion from Token to Embedding


Just like different tokenization methods, we also have multiple methods for converting tokens to embeddings. Here are some commonly used methods:

  • Word2Vec: A shallow neural network model that learns word vectors from how words co-occur in a corpus

  • GloVe: An unsupervised learning algorithm that builds global word vectors from co-occurrence statistics

  • FastText: An extension of Word2Vec that works on character n-grams, so it can handle rare and unseen words

  • BERT: Bidirectional Encoder Representations from Transformers

  • ELMo: Deep contextualized representations from a bidirectional LSTM model

We don’t need to worry about the inner workings of each method for now. All we need to know is that we can use them to create numerical representations of text that computers can understand.
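As a small taste of one of these methods, here is a minimal Word2Vec sketch using the gensim library (assuming gensim is installed; the tiny corpus below is made up, so the resulting vectors are only illustrative):

from gensim.models import Word2Vec

# A tiny made-up corpus of tokenized sentences
sentences = [["it", "is", "over", "9000"],
             ["the", "power", "level", "is", "over", "9000"],
             ["it", "is", "not", "over", "yet"]]

# Train a small Word2Vec model: each token gets a 50-dimensional vector
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["over"][:5])                   # first 5 dimensions of the embedding for "over"
print(model.wv.most_similar("over", topn=2))  # nearest tokens in the vector space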

Let's take BERT as an example for creating Embeddings:

from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load pre-trained model
model = BertModel.from_pretrained('bert-base-uncased')

# Text to be tokenized
text = "It's over 9000!"

# Encode text
input_ids = tokenizer.encode(text, add_special_tokens=True)

# Output the token IDs
print("Token IDs:", input_ids)

# Convert token IDs back to raw tokens and output them
raw_tokens = [tokenizer.decode([token_id]) for token_id in input_ids]
print("Raw tokens:", raw_tokens)

# Convert the list of IDs to a tensor
input_ids_tensor = torch.tensor([input_ids])

# Pass the input through the model
with torch.no_grad():
    outputs = model(input_ids_tensor)

# Extract the embeddings
embeddings = outputs.last_hidden_state

# Print the embeddings
print("Embeddings: ", embeddings)

The output is as follows:

Token IDs: [101, 2009, 1005, 1055, 2058, 7706, 2692, 999, 102]
Raw tokens: ['[CLS]', 'it', "'", 's', 'over', '900', '##0', '!', '[SEP]']
Embeddings: tensor([[[ 0.1116,  0.0722,  0.3173, ..., -0.0635,  0.2166,  0.3236],
         [-0.4159, -0.5147,  0.5690, ..., -0.2577,  0.5710,  0.4439],
         [-0.4893, -0.8719,  0.6479, ...,  0.2702,  0.1755, -0.3939],
         ...,
         [ 0.0846, -0.3420,  0.0216, ...,  0.6648,  0.3375, -0.2893],
         [ 0.6566,  0.2011,  0.0142, ...,  0.0786, -0.5767, -0.4356]]])

Observe the above code and you can see that:

  • As in the previous example using GPT-2, we first tokenize the text. The BERT model uses the WordPiece strategy, which breaks words down into smaller units based on specific criteria.

  • We get the token IDs and then print the raw tokens. Note how the output differs from GPT-2's.

  • We create a tensor from the token ID and pass it as input to the pre-trained BERT model.

  • We extract the final output from the last hidden state.


As you can see, Embeddings are basically arrays of numbers.
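A quick way to see this is to inspect the tensor's shape, continuing from the code above. With bert-base-uncased and this input, the shape should be batch x tokens x hidden size, i.e. [1, 9, 768]:

# Continuing from the BERT example above
print(embeddings.shape)   # expected: torch.Size([1, 9, 768]) - 9 tokens, 768 numbers each

In other words, every one of the 9 tokens is represented by a vector of 768 numbers.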









08


The role of embeddings


The embedding output is large and complex. What does it actually give us?

  • The embedding of each token is a high-dimensional vector. This enables the model to capture a wide range of language features and nuances, such as the meaning of a word, semantic information, and its relationship to other words in the sentence.

  • Unlike simple word embeddings (like Word2Vec), BERT’s embeddings are contextual. This means that the same word can have different embeddings depending on its context (the words around it). Capturing this contextual nuance requires rich and complex embeddings.

  • In more complex models like BERT, we have access not only to the final embedding, but also to the embeddings produced by each layer of the neural network (see the sketch after this list). Each layer captures a different aspect of language, increasing the complexity and size of the tensor.

  • These embeddings can be used as input for various NLP tasks such as sentiment analysis, question answering, and language translation. The richness of the embedding combinations enables the model to perform these tasks with a high degree of sophistication.

  • The complexity of these tensors reflects how the model "understands" language. Each dimension in the embedding can represent some abstract language feature that the model has learned during training.
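A minimal sketch of accessing those per-layer embeddings with Hugging Face transformers (this re-runs the earlier BERT example with output_hidden_states enabled; it assumes the model is downloadable or already cached locally):

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("It's over 9000!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states   # tuple: embedding layer output + one tensor per encoder layer
print(len(hidden_states))               # 13 for bert-base-uncased (embeddings + 12 layers)
print(hidden_states[-1].shape)          # same shape as last_hidden_state, e.g. [1, 9, 768]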

In short, embeddings are the secret sauce that makes LLMs work well. If you can find better embeddings, you’ll have the potential to create better models.

When these numbers are processed by a trained AI model architecture, it computes new values in the same format that represent the answer to the task the model was trained on. In the case of an LLM, this is the prediction of the next token.

When training an LLM, we are essentially trying to optimize all the mathematical calculations in the model related to the input embeddings to create the desired output.

All such calculations are contained in some parameters called model weights. They determine how the model processes input data to produce output.


Embeddings are actually a subset of the LLM's model weights: they are the weights of the Embedding layer, which in Transformer-style models is usually the first layer.
Model weights and embeddings can be initialized with random values or taken from a pre-trained model; these values are then updated during the training phase.
Our goal is to find values for the model weights such that, given an input, the model's calculations produce the most accurate output for the given context.



09


Conclusion


When processing text input, a large model follows the pipeline Text -> Tokens -> Token IDs -> Embeddings, and Embeddings are what allow an LLM to understand the semantics of the input context. There are many different techniques for creating Tokens and Embeddings, and the choice of technique has a great impact on how the model works.


I hope this can deepen your understanding of large language models.