A Beginner's Guide to Tuning Large Language Model Hyperparameters

Large Language Models (LLMs) are reshaping what machines can do with language. Getting the most out of them in a specific application, however, depends largely on how skillfully their hyperparameters are tuned.
This article introduces common LLM hyperparameters, explains how they affect a model's performance, and offers guidance on tuning them so that the model's output better meets your expectations.
What Are Hyperparameters?
Hyperparameters are settings chosen before a large model is trained or fine-tuned, rather than values learned through training. In other words, they are decisions we make before training starts, and they affect both the learning process (i.e., how the model is trained) and the model's performance (e.g., its accuracy).
Hyperparameters are configuration options that influence or control how the LLM is trained. Unlike model parameters or weights, hyperparameters do not change as training data are passed through the model; they are external to it and are set before training begins. Although they control the training process, they do not become part of the final base model, and we cannot recover from the model itself which hyperparameters were used to train it.
Hyperparameters matter because they provide a controlled way to adjust the model's behavior so that it produces the results a particular use case requires. By adjusting hyperparameters, a base model can be reconfigured to deliver the desired performance without the extensive effort and cost of developing a fully customized model.
The Value of Hyperparameters
There are a number of factors to consider when choosing the best large language model for a task. There is a strong correlation between the number of parameters and the size of a model, so checking the size of the LLM is a sensible first step. It is also useful to see how the model performs on common benchmarks and inference tests against the state of the art (SOTA): these provide not only quantitative measures of performance but also a way to compare LLMs against each other.
However, after selecting the LLM that seems best suited to the requirements, there is another way to further shape the language model to fit specific needs: hyperparameters. In fact, the choice of hyperparameters and how they are configured can be the difference between good and poor LLM performance.
Categories of Hyperparameters
Model Size
The first hyperparameter to consider is the size of the LLM you want to use. In general, larger models perform better and can handle more complex tasks because they have more layers in the neural network, and therefore more weights with which to learn from the training data and capture the linguistic and logical relationships between tokens.
However, larger LLMs mean that they are more expensive, require larger datasets for training and more computational resources to run, and typically run slower than smaller models. In addition, the larger the model the more likely it is to overfit, i.e., the model is too familiar with its training data to generalize consistently to data it has not seen before.
In contrast, a smaller base LLM can perform as well as its larger counterpart on simple tasks while requiring fewer resources for training and inference. This is especially true if the model is quantized (compressed to reduce the size of its weights) or fine-tuned (further trained on additional data). The smaller the LLM, the easier it is to deploy, and the more feasible it is to run on devices with less powerful GPUs.
Ultimately, the optimal size of an LLM depends on the use case it is intended for: the more complex the task, and the more computational resources and training data available, the larger the model can reasonably be.
Number of Epochs
An epoch is one complete pass of the model over the entire training dataset. As a hyperparameter, the number of epochs helps determine the capabilities of the model and therefore affects its output.
More epochs can help the model deepen its understanding of a language and its semantic relationships. However, too many epochs can lead to overfitting, i.e., the model becomes too specialized to the training data and struggles to generalize. Conversely, too few epochs can lead to underfitting, where the model does not learn enough from its training data to properly configure its weights and biases.
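To make the idea concrete, here is a minimal sketch in Python, far from LLM scale, that trains a toy regression model with plain gradient descent; the dataset, learning rate, and epoch count are all invented for illustration. Watching the loss across epochs shows what too few passes (underfitting) or far too many passes over a small dataset (overfitting) would look like in practice.

```python
# A minimal sketch (not LLM-scale): gradient descent on a toy regression
# problem, showing how the number of epochs controls how many full passes
# the model makes over the training data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # toy training set
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(3)                                # model "weights"
learning_rate = 0.1
num_epochs = 20                                # the hyperparameter in question

for epoch in range(num_epochs):                # one epoch = one full pass over X
    preds = X @ w
    grad = 2 * X.T @ (preds - y) / len(X)      # gradient of the mean squared error
    w -= learning_rate * grad
    loss = np.mean((preds - y) ** 2)
    print(f"epoch {epoch + 1:2d}  loss {loss:.4f}")

# Too few epochs: the loss is still high (underfitting). Far too many epochs
# on a small dataset risk memorizing noise instead of structure (overfitting).
```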
Learning Rate
The learning rate is a fundamental LLM hyperparameter that controls how strongly the model's weights are updated in response to the computed loss function, i.e., the measure of how wrong its predictions are during training. On the one hand, a higher learning rate speeds up training but may lead to instability and overfitting. On the other hand, a lower learning rate increases stability and improves generalization at inference time, but extends training time.
Furthermore, it is often beneficial to decrease the learning rate as training progresses by using a learning rate schedule. Time-based decay, step decay, and exponential decay are the three most common schedules (a small sketch follows the list):
- Time-based decay: reduces the learning rate over time according to a preset decay value.
- Step decay: reduces the learning rate by a fixed decay factor every few epochs.
- Exponential decay: decreases the learning rate by a constant proportion each epoch.
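Here is a small sketch of the three schedules as plain Python functions; the exact formulas vary between frameworks, and `initial_lr`, `decay_rate`, `drop`, and `step_size` are illustrative values rather than defaults from any particular library.

```python
import math

initial_lr = 0.01   # illustrative starting learning rate
decay_rate = 0.1    # how aggressively the rate is reduced
step_size = 10      # epochs between reductions for step decay
drop = 0.5          # multiplicative drop used by step decay

def time_based_decay(epoch):
    # learning rate shrinks as a function of elapsed epochs
    return initial_lr / (1 + decay_rate * epoch)

def step_decay(epoch):
    # learning rate is cut by a fixed factor every `step_size` epochs
    return initial_lr * (drop ** (epoch // step_size))

def exponential_decay(epoch):
    # learning rate decreases by a constant proportion each epoch
    return initial_lr * math.exp(-decay_rate * epoch)

for epoch in (0, 10, 20, 30):
    print(epoch,
          round(time_based_decay(epoch), 5),
          round(step_decay(epoch), 5),
          round(exponential_decay(epoch), 5))
```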
Batch Size
The batch size of an LLM determines how much data the model processes before each weight update. Training in batches means splitting the dataset into parts, and a larger batch size speeds up the training process compared to a smaller one. However, smaller batches require less memory and computational power and can help the model process each data point of the corpus more thoroughly. Given these computational requirements, the batch size is usually limited by the available hardware.
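The sketch below, using invented numbers, shows the basic trade-off: with a fixed dataset, a larger batch size means fewer (but heavier) weight updates per epoch, while a smaller batch size means more updates that each touch less data.

```python
# A minimal sketch of how batch size splits the training set.
import numpy as np

dataset = np.arange(1000)        # stand-in for 1,000 training examples

def iterate_batches(data, batch_size):
    # yield consecutive slices of the dataset, one batch at a time
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

for batch_size in (32, 128, 512):
    num_updates = sum(1 for _ in iterate_batches(dataset, batch_size))
    print(f"batch_size={batch_size:4d} -> {num_updates} weight updates per epoch")
```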
Max Output Tokens
Maximum output tokens, often referred to as maximum sequence length, is the maximum number of tokens the model is allowed to generate as output. The absolute upper bound is determined by the model's architecture, but within that limit the value can be configured as a hyperparameter to shape the response.
In general, the higher the maximum output token setting, the more coherent and contextually relevant the model's response can be. Allowing the LLM to use more output tokens when formulating a response gives it more room to express itself and fully address the ideas in the input prompt. However, this comes at a cost: the longer the output, the more inference the model performs, increasing compute and memory requirements.
Conversely, a lower maximum token limit requires less processing power and memory, but may not give the model enough room to formulate an optimal response, which can lead to incoherence and errors. Still, there are cases where a lower maximum sequence length is beneficial: to better control inference costs, to constrain the generated text to a specific format, or to improve throughput and latency by shortening inference time.
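As an illustration, here is how a maximum output token cap might be set with the Hugging Face `transformers` library; this assumes `transformers` (and a backend such as PyTorch) is installed, and `gpt2` is used only because it is small, not as a recommendation.

```python
# A hedged sketch: cap the number of newly generated tokens at inference time.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Hyperparameters matter because"
short = generator(prompt, max_new_tokens=20)    # tight cap: cheaper, may cut ideas off
longer = generator(prompt, max_new_tokens=120)  # looser cap: more room, more compute

print(short[0]["generated_text"])
print(longer[0]["generated_text"])
```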
Decoding Type
In the Transformer architecture underlying most LLMs, inference has two stages: encoding and decoding. Encoding converts the user's input prompt into vector embeddings, i.e., a numeric representation of the text that the model can work with. Decoding is the process by which the model selects output tokens from its internal representations and converts them back into the text presented to the user as the answer.
There are two main types of decoding: greedy and sampling. In greedy decoding, the model simply selects the token with the highest probability at each step of inference. Sampling decoding instead chooses randomly from a set of candidate tokens, weighted by their probabilities, and adds the selected token to the output text. This introduces randomness, a desirable quality in creative applications of language models, though sampling also increases the risk of errors and nonsensical responses.
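A toy sketch of the two styles, using a made-up next-token distribution rather than real model output: greedy decoding always takes the highest-probability token, while sampling draws a token at random in proportion to its probability.

```python
import numpy as np

rng = np.random.default_rng(42)
tokens = np.array(["cat", "dog", "bird", "fish"])
probs = np.array([0.50, 0.30, 0.15, 0.05])    # hypothetical next-token probabilities

greedy_choice = tokens[np.argmax(probs)]      # always the single most likely token
sampled_choice = rng.choice(tokens, p=probs)  # randomness proportional to probability

print("greedy:  ", greedy_choice)
print("sampling:", sampled_choice)
```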
Top-k and Top-p Sampling
When sampling decoding is chosen over greedy decoding, two additional hyperparameters affect the model's output: top-k and top-p.
The top-k sampling value is an integer, typically ranging from 1 to 100 (with 50 as a common default), which restricts sampling to the k tokens with the highest probabilities at each step.
The top-p sampling value is a decimal between 0.0 and 1.0 that restricts sampling to the smallest set of highest-probability tokens whose cumulative probability reaches the set value.
If both are set, top-k is applied first; tokens that fall outside the filtered set have their probabilities set to 0 before sampling.
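Here is a rough numpy sketch of both filters applied to an invented next-token distribution; production implementations differ in details such as tie-breaking and how the remaining probabilities are renormalized.

```python
import numpy as np

probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])  # hypothetical distribution

def top_k_filter(p, k):
    keep = np.zeros_like(p)
    idx = np.argsort(p)[::-1][:k]        # indices of the k most likely tokens
    keep[idx] = p[idx]
    return keep / keep.sum()             # renormalize over the kept tokens

def top_p_filter(p, top_p):
    order = np.argsort(p)[::-1]
    cumulative = np.cumsum(p[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest set reaching top_p
    keep = np.zeros_like(p)
    keep[order[:cutoff]] = p[order[:cutoff]]
    return keep / keep.sum()

print("top-k (k=3): ", np.round(top_k_filter(probs, 3), 3))
print("top-p (p=0.8):", np.round(top_p_filter(probs, 0.8), 3))
```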
Temperature
Temperature, like the top-k and top-p sampling values described above, provides a way to vary the range of possible output tokens and influence the "creativity" of the model. It is represented by a decimal number, typically between 0.0 and 2.0, where 0.0 is effectively the same as greedy decoding (the highest-probability token is always added to the output) and 2.0 allows maximum creativity.
Temperature affects the output by changing the shape of the token probability distribution. For low settings, the difference between probabilities is amplified, making tokens with higher probabilities more likely to be output relative to those with lower probabilities. Therefore, the temperature value should be set low when the model should generate more predictable or reliable responses. In contrast, a high setting tends to make token probabilities closer together, so that unlikely or unusual tokens have a greater chance of being output. For this reason, a high temperature should be set if you want to increase the randomness and creativity of responses.
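A small sketch of that reshaping: the (made-up) logits are divided by the temperature before the softmax, so a low temperature sharpens the distribution and a high temperature flattens it.

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5, -1.0])    # hypothetical raw scores for 4 tokens

def softmax_with_temperature(logits, temperature):
    scaled = logits / temperature           # low T sharpens, high T flattens
    exp = np.exp(scaled - scaled.max())     # subtract max for numerical stability
    return exp / exp.sum()

for t in (0.2, 1.0, 1.8):
    print(f"T={t}:", np.round(softmax_with_temperature(logits, t), 3))
```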
Stop Sequences
Another way to influence the length of an LLM's response is to specify stop sequences that automatically end the model's output. A stop sequence is a string of one or more characters; when the model generates it, generation stops. A common example of a stop sequence is a period (. or 。).
Alternatively, you can limit generation by setting a stop token limit, which is an integer rather than a string. For example, if the stop token limit is set to 1, the generated output will end at a sentence; if set to 2, the response will be confined to a paragraph. Setting stop sequences or a stop token limit provides tighter control over the inference process, which is useful when inference costs need to be kept in check.
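A minimal sketch of applying a stop sequence: the output is cut at the first occurrence of the stop string. The "generated" text here is hard-coded purely for illustration; in a real system the check would run on the model's streaming output.

```python
generated = "Hyperparameters shape model behavior. They are set before training. More text..."
stop_sequence = ". "

def apply_stop_sequence(text, stop):
    # keep everything up to (and including) the first stop sequence
    index = text.find(stop)
    return text if index == -1 else text[: index + len(stop.rstrip())]

print(apply_stop_sequence(generated, stop_sequence))
# -> "Hyperparameters shape model behavior."
```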
Frequency and Presence Penalties
The frequency penalty, also known as the repetition penalty, is a value between -2.0 and 2.0 that tells the model to avoid using the same tokens too often. It works by reducing the probability of tokens that have already appeared in the response, making them less likely to be reused and producing more varied output.
Presence penalties are similar, but are only applied to tokens that have been used at least once, whereas frequency penalties are applied proportionally to how often a particular token is used. In other words, frequency penalties affect output by preventing duplication, while presence penalties encourage a wider range of tokens.
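As a sketch of the commonly documented formulation (for example, in OpenAI's API reference), the frequency penalty is subtracted from a token's logit in proportion to how many times that token has already appeared, while the presence penalty is a flat deduction applied once a token has appeared at all; the logits and counts below are invented for illustration.

```python
import numpy as np

logits = np.array([3.0, 2.5, 1.0, 0.5])       # hypothetical scores for 4 tokens
counts = np.array([4, 1, 0, 0])               # how often each token already appeared

frequency_penalty = 0.8
presence_penalty = 0.6

adjusted = (
    logits
    - frequency_penalty * counts              # penalty grows with repeated use
    - presence_penalty * (counts > 0)         # flat penalty once a token has appeared
)
print(adjusted)   # heavily repeated tokens are now less likely to be picked again
```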
Hyperparameter Tuning
Hyperparameter tuning is the process of adjusting different hyperparameters over repeated training runs with the aim of finding the combination that produces the best output. This inevitably involves a great deal of trial and error, carefully tracking each hyperparameter setting and recording the corresponding results. Performing this process manually is therefore time-consuming, and automated hyperparameter tuning methods have emerged to simplify it greatly.
The three most common approaches to automated hyperparameter tuning are random search, grid search, and Bayesian optimization.
Random Search
Random search randomly selects and evaluates hyperparameter combinations from a specified range, making it a simple and efficient method capable of exploring a large parameter space. However, in exchange for that simplicity it may miss the optimal combination of hyperparameters, and it can still consume substantial computational resources.
Grid Search
In contrast to random search, grid search exhaustively evaluates every possible combination of hyperparameters from the specified ranges of values. Although at least as resource-intensive as random search, it offers a systematic approach that guarantees finding the best combination of hyperparameters within the grid.
Bayesian Optimization
Unlike the two methods above, Bayesian optimization uses a probabilistic model to predict how different hyperparameters will perform and then selects the most promising candidates to evaluate next. This makes it an efficient tuning method that handles large parameter spaces well and requires fewer resources than grid search. The downside is that it is more complex to set up and, unlike grid search, does not guarantee finding the optimal set of hyperparameters.
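To make the contrast between grid search and random search concrete, here is a small sketch over an invented two-dimensional search space; `evaluate` is a stand-in for an actual train-and-validate run, not a real scoring function.

```python
import itertools
import random

learning_rates = [1e-5, 3e-5, 1e-4]
batch_sizes = [16, 32, 64]

def evaluate(lr, batch_size):
    # placeholder for "train the model with these hyperparameters and
    # return a validation score"; here just an arbitrary toy function
    return -(abs(lr - 3e-5) * 1e4 + abs(batch_size - 32) / 32)

# Grid search: exhaustively try every combination
grid_best = max(itertools.product(learning_rates, batch_sizes),
                key=lambda combo: evaluate(*combo))

# Random search: try a fixed budget of randomly chosen combinations
random.seed(0)
trials = [(random.choice(learning_rates), random.choice(batch_sizes)) for _ in range(4)]
random_best = max(trials, key=lambda combo: evaluate(*combo))

print("grid search best:  ", grid_best)
print("random search best:", random_best)
```

A Bayesian optimization version would replace the random trials with a probabilistic model that proposes the next combination to evaluate; libraries such as Optuna or scikit-optimize are commonly used for this.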
Another advantage of automated hyperparameter tuning is that multiple versions of a model can be developed, each with a different combination of hyperparameters. By training them on the same dataset and comparing their outputs, you can identify which configuration best fits a given use case; likewise, models tuned over different hyperparameter ranges and values may each be better suited to a different use case.
Summary
Through in-depth analysis, we have learned that hyperparameter tuning is not just a technical activity, but an art. It requires us to have a deep understanding of the model, a keen insight into the data, and a clear understanding of the goal. Each hyperparameter tuning is like having a carefully crafted conversation with the model, designed to guide it to better serve our vision. Remember, there is no optimal configuration that is set in stone, only optimal solutions that are constantly explored and adapted. Let's use this article as a starting point to continue our journey in AI, searching for those hyperparameter combinations that can illuminate the light of wisdom.