Training Set, Validation Set, Test Set (With Splitting Methods + Cross-Validation)

Data is very important in AI technology!
In this post, we will introduce the 3 types of datasets: training set, validation set, and test set.
We will also cover how to split your data into these 3 sets in a reasonable way. Finally, we will introduce a way to make the best use of limited data: cross-validation.
Let's use a rough analogy to illustrate the relationship between the 3 types of datasets:
- The training set is like the knowledge learned in class.
- The validation set is like the practice questions at the end of class, which are used to correct and reinforce the knowledge learned.
- The test set is like the final exam, which is used to assess the learning outcome.
What is a Training Dataset?
A Training Dataset is the data used to train the model: the model learns (fits) its parameters from this data.
What is a Validation Set?
Once our model has been trained, we don't yet know how well it performs. This is when we use the Validation Dataset to check how the model performs on new data (the validation set and test set are different data). At the same time, we tune the hyperparameters to keep the model in the best possible state.
The validation set has 2 main roles:
- Evaluating the model's performance, in the service of tuning the hyperparameters
- Tuning the hyperparameters so that the model performs at its best on the validation set
Notes:
- Unlike the training and test sets, the validation set is not essential. If you don't need to tune hyperparameters, you can skip the validation set and use only the test set to evaluate the model.
- The performance measured on the validation set is not the model's final performance; it is mainly used to tune the hyperparameters. The model's final performance is based on the evaluation result of the test set.
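To make this concrete, here is a minimal sketch of tuning one hyperparameter with a validation set, assuming scikit-learn, a synthetic dataset, and a decision tree whose max_depth is the hyperparameter being tuned; all of these choices are illustrative assumptions, not the only way to do it.

```python
# Minimal sketch: pick a hyperparameter on the validation set,
# report the final result on the untouched test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)  # placeholder data

# Hold out a test set first, then carve a validation set out of the rest.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

best_depth, best_score = None, -1.0
for depth in [2, 4, 8, 16]:                      # assumed candidate values
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)            # accuracy on the validation set
    if score > best_score:
        best_depth, best_score = depth, score

# Final evaluation uses the test set, which played no part in tuning.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("best max_depth:", best_depth, "test accuracy:", final_model.score(X_test, y_test))
```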
What is a Test Set?
Once we have tuned the hyperparameters, it's time to start the 'final exam'. We use the Test Dataset to do the final evaluation.
Through the evaluation of the Test Dataset, we will get some final evaluation metrics, such as Accuracy, Precision, Recall, F1, and so on.
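For example, with scikit-learn these metrics can be computed from the test-set labels and the model's predictions; the label and prediction arrays below are made-up placeholders.

```python
# Sketch of computing the final metrics on test-set predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_test = [1, 0, 1, 1, 0, 1, 0, 0]   # example ground-truth labels (assumed)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # example model predictions (assumed)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
```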
How to Split the Dataset Reasonably?
The splitting approach described below is based on the 'holdout' validation method; for other cross-validation methods, see the Cross-Validation section further down.
There is no hard rule for how to split the data, but you can follow these 3 guidelines:
- For small-scale sample sets (tens of thousands of samples), a common split ratio is 60% training set, 20% validation set, 20% test set.
- For large-scale sample sets (more than a million samples), the validation and test sets only need to be large enough in absolute terms. For example, with 1 million samples you might keep 10,000 for validation and 10,000 for testing; with 10 million samples, you can still keep 10,000 for validation and 10,000 for testing.
- The fewer the hyperparameters, or the easier they are to tune, the smaller the proportion of the validation set can be, with more data allocated to the training set.
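As a sketch of the first guideline, a 60% / 20% / 20% split can be produced with two calls to scikit-learn's train_test_split; the data below is a random placeholder standing in for a real dataset.

```python
# Sketch: 60% train / 20% validation / 20% test via two splits.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(10000, 10)        # placeholder features
y = np.random.randint(0, 2, 10000)   # placeholder labels

# First split off 20% as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Then split the remaining 80% into 60% train / 20% validation (0.25 * 0.8 = 0.2).
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 6000 2000 2000
```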
Cross-Validation
Why use cross-validation?
Suppose we teach children to learn addition:
1 apple + 1 apple = 2 apples.
When we retest, we ask:
1 banana + 1 banana = how many bananas?
If the child answers "2 bananas" and has no trouble when we switch to other objects, then we consider the child to have learned the point "1+1=2".
If the child only knows "1 apple + 1 apple = 2 apples" and gets stuck when we switch to other objects, then we cannot say that the child has learned the point "1+1=2".
Evaluating whether a model has learned a skill likewise requires evaluating it on new data, not on data from the training set. This kind of validation method, in which the 'training set' and 'test set' are completely different, is called cross-validation.
3 Main Cross-Validation Methods
Holdout cross-validation
As mentioned above, the dataset is statically split into a training set, a validation set, and a test set according to a fixed ratio. The splitting method described above is the holdout method.
Leave one out cross-validation
Each time, the test set contains only a single sample, and the model is trained and evaluated m times, where m is the number of samples. The data used for training in this method is only one sample short of the full dataset, and therefore comes closest to the distribution of the original samples. However, the training cost increases, because the number of models trained equals the number of samples in the original data. It is generally used when data is scarce.
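A small sketch of leave-one-out cross-validation, assuming scikit-learn, its built-in iris dataset, and a k-nearest-neighbors model chosen purely for illustration:

```python
# Sketch: leave-one-out cross-validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)

# One model is trained per sample: 150 fits for the 150 iris samples.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("mean accuracy over", len(scores), "folds:", scores.mean())
```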
k-fold cross-validation
The static holdout method is sensitive to how the data is split, and different splits may produce different models. "k-fold cross-validation" is a dynamic validation method that reduces the impact of the data split. The specific steps are as follows:
- Split the dataset into a training set and a test set, and set the test set aside.
- Split the training set into k parts.
- Each time, use 1 of the k parts as the validation set and the other k-1 parts as the training set.
- After k rounds of training, we get k different models.
- Evaluate the performance of the k models and select the best hyperparameters from them.
- Using the optimal hyperparameters, retrain the model on all k parts combined as the training set to get the final model.
k is usually set to 10. When the amount of data is small, k can be set larger so that the training set makes up a larger proportion of the overall data, although the number of models to train also increases. When the amount of data is large, k can be set smaller.
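Putting the steps above together, here is a sketch of the k-fold procedure with scikit-learn: hold out a test set, choose a hyperparameter by 10-fold cross-validation on the training set, then retrain on the full training set. The model and candidate max_depth values are illustrative assumptions.

```python
# Sketch: hyperparameter selection via 10-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

best_depth, best_score = None, -1.0
for depth in [2, 4, 8]:                                      # assumed candidate values
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(model, X_train, y_train, cv=10)  # k = 10 folds
    if scores.mean() > best_score:
        best_depth, best_score = depth, scores.mean()

# Retrain on the whole training set with the chosen hyperparameter,
# then report the final result on the untouched test set.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("best max_depth:", best_depth, "test accuracy:", final_model.score(X_test, y_test))
```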