Categorical Feature: A Comprehensive Guide

Categorical features are an important class of features. They are discrete, not continuous.
In this article, we will introduce 5 mainstream encoding methods for small and large categorical features, along with the advantages and disadvantages of each.
What are Categorical Features?
Categorical features are used to represent categories. Unlike numerical features, which are continuous, categorical features are discrete.
For example:
- Gender
- City
- Color
- IP address
- The user's account ID
Some categorical features are also numeric values, such as account IDs and IP addresses, but these values are not continuous.
Continuous numbers are numerical features; discrete numbers are categorical features.
For an explanation of continuous and discrete, take a look at this article: 'Understanding about continuous and discrete'
Coding of Small Categorical Features
Natural Number Encoding / Sequential Encoding - Ordinal Encoding
Some categories are inherently ordered; in that case, simple natural-number encoding can be used.
An example is Degree:
Bachelor - 0
Masters - 1
Doctorate - 2
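The degree example above can be sketched in plain Python. This is a minimal illustration; the mapping dictionary and function name are made up for this example:

```python
# Illustrative ordered mapping for the degree example
degree_order = {"Bachelor": 0, "Masters": 1, "Doctorate": 2}

def ordinal_encode(values, order):
    """Replace each category with its position in the given order."""
    return [order[v] for v in values]

encoded = ordinal_encode(["Masters", "Bachelor", "Doctorate"], degree_order)
# encoded is [1, 0, 2]
```

The key assumption is that the order of the numbers is meaningful: a model may treat Doctorate (2) as "greater than" Bachelor (0), which is exactly what we want here but would be wrong for unordered categories like color.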
One-Hot Encoding
Features such as city, color, brand, and material are not suitable for natural-number encoding because they have no inherent order.
One-hot encoding puts the different categories on "equal footing", so no category is affected by the magnitude of its encoded value.
An example is the color classification (assuming there are only 3 colors):
Red - 100
Yellow - 010
Blue - 001
Similar to one-hot encoding, there are also "Dummy Encoding" and "Effect Encoding".
The implementations are similar, but there are slight differences, and they are suited to different scenarios.
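The color example above can be sketched in plain Python; the function name and the two-row input are made up for illustration:

```python
def one_hot_encode(values, categories):
    """Return one 0/1 vector per value, with a single 1 in the slot
    corresponding to the value's category."""
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1  # every other slot stays 0
        vectors.append(vec)
    return vectors

# Illustrative categories from the color example above
colors = ["Red", "Yellow", "Blue"]
vectors = one_hot_encode(["Red", "Blue"], colors)
# Red -> [1, 0, 0], Blue -> [0, 0, 1]
```

Note that the vector length equals the number of categories, which is why this scheme becomes impractical for features with thousands of distinct values.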
Encoding of Large Categorical Features
Target Encoding
Target encoding, also known as mean encoding, is a very efficient way to represent a categorical column while occupying only one feature's worth of space. Each value in the column is replaced by the mean target value for that category, which directly captures the relationship between the categorical variable and the target.
Extended reading on target encoding: An Introduction to Target Coding
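A minimal sketch of target encoding in plain Python, assuming a binary target; the city names and click labels are made up for illustration:

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Replace each category with the mean target value for that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories]

# Illustrative data: which city a user is from, and whether they clicked
cities = ["NY", "LA", "NY", "LA"]
clicked = [1, 0, 0, 0]
encoded = target_encode(cities, clicked)
# NY -> mean 0.5, LA -> mean 0.0
```

In practice the mean should be computed on training data only (often with smoothing), because computing it over the rows being encoded leaks the target into the feature.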
Hash Encoding
A hash function is a deterministic function that maps a potentially unbounded input to a finite range of integers [1, m].
If a categorical feature has 10,000 distinct values, its one-hot encoding would be very long. With hash encoding, no matter how many distinct values the feature has, it is converted to a fixed-length code.
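A minimal sketch using Python's standard `hashlib`. For convenience this maps into the range [0, m) rather than [1, m], and the function name and bucket count m = 8 are illustrative:

```python
import hashlib

def hash_encode(value, m=8):
    """Deterministically map any category string into one of m buckets."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % m

# However many distinct values appear, each gets a code in [0, 8)
codes = [hash_encode(c) for c in ["Red", "Yellow", "Blue", "Red"]]
```

Because the output range is fixed, distinct categories can collide into the same bucket; this is the price paid for the fixed-length representation.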
Bin-Counting
Bin-counting takes a slightly more sophisticated view: instead of using the value of the categorical variable itself as a feature, it uses the conditional probability of the target variable given that value.
In other words, rather than encoding the category value directly, we compute correlation statistics between the categorical value and the target variable we want to predict.
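The statistics described above can be sketched in plain Python for a binary click target; the ad IDs and click labels are made up for illustration:

```python
from collections import defaultdict

def bin_counting(categories, clicks):
    """For each category, compute its count and the conditional click
    rate P(click | category); these statistics become the features,
    replacing the raw category value."""
    n_clicks = defaultdict(int)
    n_total = defaultdict(int)
    for c, y in zip(categories, clicks):
        n_total[c] += 1
        n_clicks[c] += y
    return {c: {"count": n_total[c], "click_rate": n_clicks[c] / n_total[c]}
            for c in n_total}

stats = bin_counting(["ad1", "ad1", "ad2", "ad1"], [1, 0, 0, 1])
# ad1: count 3, click_rate 2/3; ad2: count 1, click_rate 0.0
```

As the disadvantages below note, these counts must come from historical data collected before the rows being encoded, otherwise the target leaks into the feature.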
Summary of the Advantages and Disadvantages of Different Coding
One-Hot Encoding
Advantages
- Easy to implement
- Potentially very accurate
- Can be used for online learning

Disadvantages
- Computationally inefficient
- Does not adapt to growing categories
- Only suitable for linear models
- Requires large-scale distributed optimization for large datasets
Hash Encoding
Advantages
- Easy to implement
- Lower model training cost
- Easy to adapt to new categories
- Easy to handle rare categories
- Can be used for online learning

Disadvantages
- Only suitable for linear models or kernel methods
- Features cannot be interpreted after hashing
- Accuracy is hard to guarantee
Bin-Counting
Advantages
- Smallest computational burden at training time
- Can be used with tree-based models
- Easy to adapt to new categories
- Rare categories can be handled with back-off methods or the count-min sketch
- Interpretable

Disadvantages
- Requires historical data
- Requires delayed updates, so not entirely suited to online learning
- Likely to lead to data leakage
The above is excerpted from: Mastering Feature Engineering
Summary
Categorical features are discrete; numerical features are continuous.
For small categorical features, the common encoding methods are:
- Natural Number Encoding / Sequential Encoding - Ordinal Encoding
- One-Hot Encoding
- Dummy Encoding
- Effect Encoding
For large categorical features, the commonly used encoding methods are:
- Target Encoding
- Hash Encoding
- Bin-Counting