Categorical Feature: A Comprehensive Guide

Written by
Clara Bennett
Updated on: May 30th, 2025

 

Categorical features are an important class of features. Unlike numerical features, they are discrete rather than continuous.

 

In this article, we introduce five mainstream encoding methods for small and large categorical features, along with the advantages and disadvantages of each.

 

What are Categorical Features?

Categorical features represent categories. Unlike numerical features, which are continuous, categorical features are discrete.

For example:

  • Gender

  • City

  • Color

  • IP address

  • The user's account ID

 

Some categorical features take numeric values, such as account IDs and IP addresses, but those values are not continuous.

Continuous numbers are numerical features; discrete numbers are categorical features.

For an explanation of continuous and discrete values, take a look at this article: 'Understanding about continuous and discrete'

 

Coding of Small Categorical Features

 

Natural Number Encoding / Sequential Encoding - Ordinal Encoding

Certain categories are inherently ordered; in that case, simple natural number encoding can be used.

An example is Degree:

Bachelor - 0

Masters - 1

Doctorate - 2
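The degree example above can be sketched in plain Python. This is a minimal illustration, not a library implementation; the mapping is written by hand to reflect the natural order of the categories.

```python
# Hand-written ordering for the degree example; the ranks encode
# the inherent order Bachelor < Masters < Doctorate.
DEGREE_ORDER = {"Bachelor": 0, "Masters": 1, "Doctorate": 2}

def ordinal_encode(values, order=DEGREE_ORDER):
    """Replace each category with its rank in the given ordering."""
    return [order[v] for v in values]

print(ordinal_encode(["Masters", "Bachelor", "Doctorate"]))  # [1, 0, 2]
```

Libraries such as scikit-learn offer an equivalent `OrdinalEncoder`, but the idea is exactly this table lookup.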

 

One-Hot Encoding

Features such as city, color, brand, and material are not suitable for natural number encoding because they have no inherent order.

One-hot encoding puts the different categories on "equal footing," so the model is not influenced by the magnitude of an arbitrary numeric code.

An example is the color classification (assuming there are only 3 colors):

Red - 100

Yellow - 010

Blue - 001

Closely related to one-hot encoding are "dummy encoding" and "effect encoding."

The implementations are similar, but there are some slight differences, and they are applicable in different scenarios.
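The three-color example above can be sketched as follows. This is a bare-bones illustration in plain Python; the function name and the fixed category list are chosen here for demonstration.

```python
def one_hot_encode(value, categories):
    """Return a 0/1 vector with a single 1 at the category's position."""
    return [1 if c == value else 0 for c in categories]

colors = ["Red", "Yellow", "Blue"]
print(one_hot_encode("Red", colors))     # [1, 0, 0]
print(one_hot_encode("Yellow", colors))  # [0, 1, 0]
```

Dummy encoding would drop one of the three columns (using it as a reference level), and effect encoding would represent the reference level as all -1s instead of all 0s.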

 

Encoding of Large Categorical Features

 

Target Encoding

Target encoding, also known as mean encoding, is a very efficient way to represent a categorical column: it occupies only a single feature's space. Each value in the column is replaced by the mean target value for that category, giving a more direct representation of the relationship between the categorical variable and the target.

Further reading on target encoding: An Introduction to Target Coding
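The replace-by-category-mean step can be sketched with stdlib Python only. The city names and click labels below are made-up sample data, and the function name is chosen for illustration.

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Replace each category with the mean target value observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories]

cities = ["NY", "LA", "NY", "LA"]  # hypothetical data
clicked = [1, 0, 0, 0]             # binary target
print(target_encode(cities, clicked))  # [0.5, 0.0, 0.5, 0.0]
```

In practice the means must be computed on training data only (often with smoothing or cross-fitting), otherwise the encoding leaks the target into the features.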

 

Hash Encoding

A hash function is a deterministic function that maps a potentially unbounded input to a finite range of integers [1, m].

If a categorical feature has 10,000 distinct values, one-hot encoding would produce very long codes. With hash encoding, no matter how many distinct values the feature has, each one is converted to a fixed-length code.
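A minimal sketch of the idea, assuming MD5 as the hash (any deterministic hash works) and m = 16 buckets; note the code below maps into [0, m) rather than [1, m], which is the more common convention in code.

```python
import hashlib

def hash_encode(value, m=16):
    """Deterministically map any category string to a bucket in [0, m)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % m

# However many distinct values appear, the code stays within 16 buckets.
for v in ["user_48213", "user_99999", "Tokyo"]:
    print(v, "->", hash_encode(v))
```

The trade-off is that unrelated categories can collide in the same bucket, which is why hashed features lose interpretability.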

 

Bin-Counting

Bin-counting takes a slightly different view: instead of using the value of the categorical variable directly as a feature, it uses the conditional probability of the target variable given that value.

In other words, rather than encoding the value itself, we compute correlation statistics between the value of the categorical variable and the target variable to be predicted.
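For a binary target, the statistics above boil down to counting positives and negatives per category. This sketch uses made-up ad IDs and click labels; the function name is chosen for illustration.

```python
from collections import Counter

def bin_counting(categories, clicks):
    """For each category, tabulate positive/negative counts and the
    conditional click rate P(click | category)."""
    pos, total = Counter(), Counter()
    for c, y in zip(categories, clicks):
        total[c] += 1
        pos[c] += y
    return {c: {"n_pos": pos[c],
                "n_neg": total[c] - pos[c],
                "rate": pos[c] / total[c]} for c in total}

ads = ["A", "A", "B", "A", "B"]  # hypothetical ad IDs
clicks = [1, 0, 0, 1, 0]         # binary target
stats = bin_counting(ads, clicks)
print(stats["A"])  # ad A was clicked 2 times out of 3
```

The counts must come from historical data, which is exactly why bin-counting requires delayed updates and is prone to leakage if computed on the same data the model trains on.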

 

Summary of the Advantages and Disadvantages of Different Coding

One-Hot Encoding

Advantages

  • Easy to implement

  • Potentially very accurate

  • Can be used for online learning

Disadvantages

  • Not very computationally efficient

  • Does not adapt to growing categories

  • Only works with linear models

  • For large datasets, it requires large-scale distributed optimization

 

Hash Encoding

Advantages

  • Easy to implement

  • Lower model training costs

  • Easy to adapt to new classes

  • Easy to handle rare classes

  • Can be used for online learning

Disadvantages

  • Only suitable for linear models or kernel methods

  • Features cannot be interpreted after hashing

  • Accuracy is hard to guarantee

 

Bin-Counting

Advantages

  • Minimizes computational burden in the training phase

  • Can be used in tree-based models

  • Easy to adapt to new classes

  • Rare classes can be handled using back-off methods or count-min sketches

  • Interpretable

Disadvantages

  • Requires historical data

  • Requires delayed updates, not entirely suitable for online learning

  • Likely to lead to data leakage

The above is excerpted from: Mastering Feature Engineering

 

Summary

Categorical features are discrete; numerical features are continuous.

For small classifications, the common coding methods are:

  • Natural Number Encoding / Sequential Encoding (Ordinal Encoding)

  • One-Hot Encoding

  • Dummy Encoding

  • Effect Encoding

For large classifications, the commonly used encoding methods are:

  • Target Encoding

  • Hash Encoding

  • Bin-Counting