Categorical Feature: A Comprehensive Guide

Categorical features are an important class of features. They are discrete, not continuous.
In this article, we will introduce 5 mainstream encoding methods for small and large categorical features, along with the advantages and disadvantages of each.
What are Categorical Features?
Categorical features are used to represent categories. Unlike numerical features, which are continuous, categorical features are discrete.
For example:
- Gender
- City
- Color
- IP address
- The user's account ID
Some categorical features are also numeric values, such as account IDs and IP addresses, but these values are not continuous.
Continuous numbers are numerical features; discrete numbers are categorical features.
For an explanation of continuous and discrete, take a look at this article: 'Understanding about continuous and discrete'
Coding of Small Categorical Features
Natural Number Encoding / Sequential Encoding - Ordinal Encoding
Some categories are inherently ordered; in that case, simple natural-number encoding can be used.
An example is Degree:
Bachelor - 0
Masters - 1
Doctorate - 2
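The degree example above can be sketched in plain Python. This is a minimal illustration; the mapping dictionary and function name are made up for this example:

```python
# Illustrative ordered mapping for the degree example
degree_order = {"Bachelor": 0, "Masters": 1, "Doctorate": 2}

def ordinal_encode(values, order):
    """Replace each category with its position in the given order."""
    return [order[v] for v in values]

encoded = ordinal_encode(["Masters", "Bachelor", "Doctorate"], degree_order)
# encoded is [1, 0, 2]
```

The key assumption is that the order of the numbers is meaningful: a model may treat Doctorate (2) as "greater than" Bachelor (0), which is exactly what we want here but would be wrong for unordered categories like color.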
One-Hot Encoding
Features such as city, color, brand, and material are not suitable for natural-number encoding because they have no inherent order.
One-hot encoding puts the different categories on "equal footing", so no category is affected by the magnitude of its encoded value.
An example is the color classification (assuming there are only 3 colors):
Red - 100
Yellow - 010
Blue - 001
Similar to one-hot encoding, there are also "Dummy Encoding" and "Effect Encoding".
The implementations are similar, but there are slight differences, and they are suited to different scenarios.
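The color example above can be sketched in plain Python; the function name and the two-row input are made up for illustration:

```python
def one_hot_encode(values, categories):
    """Return one 0/1 vector per value, with a single 1 in the slot
    corresponding to the value's category."""
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1  # every other slot stays 0
        vectors.append(vec)
    return vectors

# Illustrative categories from the color example above
colors = ["Red", "Yellow", "Blue"]
vectors = one_hot_encode(["Red", "Blue"], colors)
# Red -> [1, 0, 0], Blue -> [0, 0, 1]
```

Note that the vector length equals the number of categories, which is why this scheme becomes impractical for features with thousands of distinct values.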
Encoding of Large Categorical Features
Target Encoding
Target encoding, also known as mean encoding, is a very efficient way to represent a categorical column while occupying only one feature's worth of space. Each value in the column is replaced by the mean target value for that category, which directly captures the relationship between the categorical variable and the target.
Extended reading on target encoding: An Introduction to Target Coding
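A minimal sketch of target encoding in plain Python, assuming a binary target; the city names and click labels are made up for illustration:

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Replace each category with the mean target value for that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories]

# Illustrative data: which city a user is from, and whether they clicked
cities = ["NY", "LA", "NY", "LA"]
clicked = [1, 0, 0, 0]
encoded = target_encode(cities, clicked)
# NY -> mean 0.5, LA -> mean 0.0
```

In practice the mean should be computed on training data only (often with smoothing), because computing it over the rows being encoded leaks the target into the feature.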
Hash Encoding
A hash function is a deterministic function that maps a potentially unbounded input to a finite range of integers [1, m].
If a categorical feature has 10,000 distinct values, its one-hot encoding would be very long. With hash encoding, no matter how many distinct values the feature has, it is converted to a fixed-length code.
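A minimal sketch using Python's standard `hashlib`. For convenience this maps into the range [0, m) rather than [1, m], and the function name and bucket count m = 8 are illustrative:

```python
import hashlib

def hash_encode(value, m=8):
    """Deterministically map any category string into one of m buckets."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % m

# However many distinct values appear, each gets a code in [0, 8)
codes = [hash_encode(c) for c in ["Red", "Yellow", "Blue", "Red"]]
```

Because the output range is fixed, distinct categories can collide into the same bucket; this is the price paid for the fixed-length representation.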
Bin-Counting
Bin-counting takes a slightly more sophisticated view: instead of using the value of the categorical variable itself as a feature, it uses the conditional probability of the target variable given that value.
In other words, rather than encoding the category value directly, we compute correlation statistics between the categorical value and the target variable we want to predict.
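The statistics described above can be sketched in plain Python for a binary click target; the ad IDs and click labels are made up for illustration:

```python
from collections import defaultdict

def bin_counting(categories, clicks):
    """For each category, compute its count and the conditional click
    rate P(click | category); these statistics become the features,
    replacing the raw category value."""
    n_clicks = defaultdict(int)
    n_total = defaultdict(int)
    for c, y in zip(categories, clicks):
        n_total[c] += 1
        n_clicks[c] += y
    return {c: {"count": n_total[c], "click_rate": n_clicks[c] / n_total[c]}
            for c in n_total}

stats = bin_counting(["ad1", "ad1", "ad2", "ad1"], [1, 0, 0, 1])
# ad1: count 3, click_rate 2/3; ad2: count 1, click_rate 0.0
```

As the disadvantages below note, these counts must come from historical data collected before the rows being encoded, otherwise the target leaks into the feature.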
Summary of the Advantages and Disadvantages of Different Coding
One-Hot Encoding
Advantages
- Easy to implement
- Potentially very accurate
- Can be used for online learning

Disadvantages
- Computationally inefficient
- Does not adapt to growing categories
- Only suitable for linear models
- Requires large-scale distributed optimization for large datasets
Hash Encoding
Advantages
- Easy to implement
- Lower model training cost
- Easy to adapt to new categories
- Easy to handle rare categories
- Can be used for online learning

Disadvantages
- Only suitable for linear models or kernel methods
- Features cannot be interpreted after hashing
- Accuracy is hard to guarantee
Bin-Counting
Advantages
- Smallest computational burden at training time
- Can be used with tree-based models
- Easy to adapt to new categories
- Rare categories can be handled with back-off methods or the count-min sketch
- Interpretable

Disadvantages
- Requires historical data
- Requires delayed updates, so not entirely suited to online learning
- Likely to lead to data leakage
The above is excerpted from: Mastering Feature Engineering
Summary
Categorical features are discrete; numerical features are continuous.
For small categorical features, the common encoding methods are:
- Natural Number Encoding / Sequential Encoding - Ordinal Encoding
- One-Hot Encoding
- Dummy Encoding
- Effect Encoding
For large categorical features, the commonly used encoding methods are:
- Target Encoding
- Hash Encoding
- Bin-Counting