LLM: Introducing the Transformer Architecture

Get an in-depth understanding of the Transformer architecture and master the cornerstone of modern large language models.
Core content:
1. The origin and basic composition of the Transformer architecture
2. The functions of the encoder and decoder and their self-attention mechanism
3. Model variants based on the Transformer architecture and their application scenarios
Most modern large language models (LLMs) rely on the Transformer architecture, a deep neural network architecture introduced in the 2017 paper "Attention Is All You Need" (https://arxiv.org/abs/1706.03762). To understand LLMs, it is necessary to first understand the original Transformer, which was developed for machine translation, translating English text into German and French. A simplified version of the Transformer architecture is shown in Figure 1.4.
Figure 1.4 A simplified version of the original Transformer architecture, a deep learning model for language translation. The Transformer consists of two parts: (a) an encoder, which processes the input text and generates an embedding representation of the text (a numerical representation that captures many different factors in different dimensions), and (b) a decoder, which can use these embeddings to generate the translated text word by word. This figure shows the final stage of the translation process, where the decoder needs to generate only the final word ("Beispiel") to complete the entire translation given the original input text ("This is an example") and the partially translated sentence ("Das ist ein").
The Transformer architecture consists of two submodules: the encoder and the decoder. The encoder module processes the input text and encodes it into a series of numerical representations or vectors that capture the contextual information of the input. The decoder module then takes these encoded vectors and generates the output text. For example, in a translation task, the encoder encodes the text in the source language into vectors, and the decoder decodes these vectors to generate text in the target language. Both the encoder and the decoder contain many layers that are connected by a so-called self-attention mechanism. You may have many questions about how the input is preprocessed and encoded. These questions will be answered in the step-by-step implementation in the subsequent chapters.
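To make the encoder-decoder interaction concrete, here is a minimal sketch, assuming PyTorch is available, that runs a toy source sequence and a partially generated target sequence through PyTorch's built-in nn.Transformer module. The random tensors stand in for the embedded tokens of Figure 1.4, and the dimensions are illustrative rather than the original paper's exact configuration.

```python
import torch
import torch.nn as nn

# Illustrative encoder-decoder sketch; the dimensions are assumptions,
# not the original paper's exact configuration.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

# Stand-ins for embedded tokens, shaped (sequence_length, batch_size, d_model):
# a 4-token source ("This is an example") and a 3-token partial target
# ("Das ist ein").
src = torch.rand(4, 1, 512)   # encoder input
tgt = torch.rand(3, 1, 512)   # decoder input generated so far

out = model(src, tgt)         # one output vector per target position
print(out.shape)              # torch.Size([3, 1, 512])
```

In a full model, the last of these output vectors would be projected onto the vocabulary to choose the next word ("Beispiel" in Figure 1.4).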
A key component of the Transformer and of LLMs is the self-attention mechanism (not shown in Figure 1.4), which allows the model to weigh the relative importance of different words or tokens in a sequence. This mechanism enables the model to capture long-range dependencies and contextual relationships in the input data, enhancing its ability to generate coherent and contextually relevant output. Because of its complexity, we will discuss it in more detail and implement it step by step in the chapters that follow.
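As a first intuition before that full implementation, the sketch below shows the core computation of a single-head, projection-free form of self-attention: each token's vector is compared with every other token's vector, the scores are turned into weights with a softmax, and each output is a weighted mix of all token vectors. The learned query/key/value projections and the multi-head structure are deliberately omitted here.

```python
import torch

def simple_self_attention(x):
    # x: (num_tokens, embedding_dim) token embeddings.
    # A real Transformer derives queries, keys, and values from learned
    # linear projections of x; this sketch uses x directly for brevity.
    d = x.shape[-1]
    scores = x @ x.T / d ** 0.5              # pairwise similarity scores
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ x                       # context-enriched token vectors

tokens = torch.rand(4, 512)                  # 4 toy token embeddings
print(simple_self_attention(tokens).shape)   # torch.Size([4, 512])
```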
Later variants of the Transformer architecture, such as BERT (short for Bidirectional Encoder Representations from Transformers) and the various GPT models (short for Generative Pre-trained Transformers), build on this concept and adapt it to different tasks.
BERT, built on the encoder submodule of the original Transformer, differs from GPT in its training approach. While GPT is designed for generation tasks, BERT and its variants focus on masked word prediction, where the model predicts masked or hidden words in a given sentence, as shown in Figure 1.5. This unique training strategy enables BERT to excel in text classification tasks, including sentiment prediction and document classification. As a practical example of this strength, as of this writing, X (formerly Twitter) uses BERT to detect harmful content.
Figure 1.5 Visual representation of the Transformer's encoder and decoder submodules. On the left is the encoder part, exemplified by BERT-like large language models (LLMs), which focus on masked word prediction and are mainly used for tasks such as text classification. On the right is the decoder part, exemplified by GPT-like large language models, which are designed for generative tasks and produce coherent text sequences.
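As a quick, hedged illustration of masked word prediction, the following sketch assumes the Hugging Face transformers library is installed and that the public bert-base-uncased checkpoint can be downloaded; a fill-mask pipeline is asked to recover a hidden token.

```python
from transformers import pipeline

# Assumes the `transformers` library is installed and the public
# "bert-base-uncased" checkpoint is available for download.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT is trained to recover the token hidden behind [MASK].
for pred in fill_mask("The movie was absolutely [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```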
GPT, on the other hand, focuses on the decoder part of the original Transformer architecture and is designed for tasks that require generating text. This includes machine translation, text summarization, fiction writing, writing computer code, and more.
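To see this decoder-style, token-by-token generation in practice, here is a short sketch that again assumes the Hugging Face transformers library and the public gpt2 checkpoint; the prompt and the generation length are arbitrary choices for illustration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes the `transformers` library is installed and the public "gpt2"
# checkpoint can be downloaded.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The Transformer architecture is", return_tensors="pt")
# Autoregressive generation: the model appends one new token at a time.
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0]))
```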
GPT models, which are primarily designed and trained for text completion, also show remarkable versatility in their capabilities. These models excel at both zero-shot and few-shot learning tasks. Zero-shot learning refers to the ability to generalize to completely unseen tasks without any prior concrete examples. Few-shot learning, in contrast, involves learning from a minimal number of examples provided by the user as input, as shown in Figure 1.6.
Figure 1.6 In addition to text completion, GPT-like large language models can solve a variety of tasks based on their input without retraining, fine-tuning, or task-specific changes to the model architecture. Sometimes it is helpful to provide examples of the target task in the input, which is called the few-shot setting. However, GPT-like large language models are also able to perform tasks without any specific examples, which is called the zero-shot setting.
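The difference between the two settings shows up purely in how the input prompt is written, as in the sketch below; the task and the in-context examples are made-up illustrations mirroring Figure 1.6, and no particular model or API is implied.

```python
task = "Translate English to German:"

# Zero-shot: the task is described, but no solved examples are given.
zero_shot_prompt = f"{task}\nEnglish: This is an example\nGerman:"

# Few-shot: a handful of solved examples precede the actual query.
few_shot_prompt = (
    f"{task}\n"
    "English: Good morning\nGerman: Guten Morgen\n"
    "English: Thank you\nGerman: Danke\n"
    "English: This is an example\nGerman:"
)

print(zero_shot_prompt)
print(few_shot_prompt)
```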
★ Transformers and LLMs
Today’s large language models (LLMs) are based on transformer architectures. As a result, the terms transformers and LLMs are often used interchangeably in the literature. However, note that not all transformers are LLMs, as transformers can also be used in the field of computer vision. Similarly, not all LLMs are transformer-based, as there are LLMs based on recurrent and convolutional architectures. The main motivation for these alternative approaches is to improve the computational efficiency of LLMs. It remains to be seen whether these alternative LLM architectures will be able to compete with the capabilities of transformer-based LLMs and whether they will be adopted in practice. For simplicity, this article uses the term “LLM” to refer to transformer-based LLMs similar to GPT.