When working with machine learning models, raw data often isn’t enough. Features in your dataset may need transformation to make them understandable for algorithms, especially when dealing with categorical data. That’s where encoding techniques like One-Hot Encoding and Ordinal Encoding come into play.
What Are One-Hot Encoding and Ordinal Encoding?
One-Hot Encoding
One-hot encoding is a technique used to convert categorical variables into a binary matrix. Each unique category is represented as a new column, and a row gets a 1 in the column of its category and 0 in others.
For example, consider the categorical variable Color with values Red, Blue, and Green:
| Color | Red | Blue | Green |
|---|---|---|---|
| Red | 1 | 0 | 0 |
| Blue | 0 | 1 | 0 |
| Green | 0 | 0 | 1 |
This method is handy for non-ordinal categories with no inherent order.
Let’s look at how we can use one-hot encoding in Python:
```python
# Import necessary libraries
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Sample clothing data
data = pd.DataFrame({
    'Item': ['Shirt', 'Pants', 'Shoes', 'Shirt', 'Pants'],
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['Small', 'Large', 'Medium', 'Medium', 'Small']
})

# One-Hot Encoding for 'Item' and 'Color'
one_hot_encoder = OneHotEncoder(sparse_output=False)
encoded_features = one_hot_encoder.fit_transform(data[['Item', 'Color']])
encoded_features_df = pd.DataFrame(
    encoded_features,
    columns=one_hot_encoder.get_feature_names_out(['Item', 'Color'])
)

# Combine the original columns with the encoded ones
one_hot_table = pd.concat([data[['Item', 'Color']], encoded_features_df], axis=1)
print(one_hot_table)
```
Producing the following output:
| Item | Color | Item_Pants | Item_Shirt | Item_Shoes | Color_Blue | Color_Green | Color_Red |
|---|---|---|---|---|---|---|---|
| Shirt | Red | 0 | 1 | 0 | 0 | 0 | 1 |
| Pants | Blue | 1 | 0 | 0 | 1 | 0 | 0 |
| Shoes | Green | 0 | 0 | 1 | 0 | 1 | 0 |
| Shirt | Blue | 0 | 1 | 0 | 1 | 0 | 0 |
| Pants | Red | 1 | 0 | 0 | 0 | 0 | 1 |
The Item and Color columns are converted into binary columns, each representing one unique category.
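As a side note, pandas ships a built-in shortcut that produces the same kind of binary columns without instantiating an encoder object. The sketch below uses the same sample data; `pd.get_dummies` sorts the categories alphabetically, which is why the column order matches the scikit-learn output above.

```python
import pandas as pd

# Same sample clothing data as above
data = pd.DataFrame({
    'Item': ['Shirt', 'Pants', 'Shoes', 'Shirt', 'Pants'],
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']
})

# get_dummies one-hot encodes the listed columns in a single call
dummies = pd.get_dummies(data, columns=['Item', 'Color'])
print(dummies.columns.tolist())
# ['Item_Pants', 'Item_Shirt', 'Item_Shoes', 'Color_Blue', 'Color_Green', 'Color_Red']
```

The scikit-learn encoder is still preferable inside a modeling pipeline, because a fitted encoder remembers the categories it saw and can apply the same mapping to new data.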
Ordinal Encoding
On the other hand, Ordinal Encoding assigns a unique integer to each category, preserving the order if it exists. For example, using zero-based codes as scikit-learn does:
| Size | Encoded Value |
|---|---|
| Small | 0 |
| Medium | 1 |
| Large | 2 |
This technique is suitable when the categories have a natural rank or progression.
Let’s look at how we can apply ordinal encoding in Python:
```python
# Import necessary libraries
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Sample clothing data
data = pd.DataFrame({
    'Item': ['Shirt', 'Pants', 'Shoes', 'Shirt', 'Pants'],
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['Small', 'Large', 'Medium', 'Medium', 'Small']
})

# Ordinal Encoding for 'Size', with the category order made explicit
ordinal_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
size_encoded = ordinal_encoder.fit_transform(data[['Size']])
size_encoded_df = pd.DataFrame(size_encoded, columns=['Size_Encoded'])

# Combine the original column with the encoded one
table_ordinal = pd.concat([data[['Size']], size_encoded_df], axis=1)
print(table_ordinal)
```
Resulting in the following table:
| Size | Size_Encoded |
|---|---|
| Small | 0.0 |
| Large | 2.0 |
| Medium | 1.0 |
| Medium | 1.0 |
| Small | 0.0 |
Where the Size column is converted into a single numeric column, preserving the order of sizes (Small = 0.0, Medium = 1.0, Large = 2.0).
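One convenience of fitting an encoder object rather than mapping values by hand is that the transformation is reversible. A minimal sketch with the same `Size` data:

```python
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Same 'Size' column as in the example above
data = pd.DataFrame({'Size': ['Small', 'Large', 'Medium', 'Medium', 'Small']})

encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
encoded = encoder.fit_transform(data[['Size']])

# inverse_transform maps the numeric codes back to the original labels
recovered = encoder.inverse_transform(encoded)
print(recovered.ravel().tolist())
# ['Small', 'Large', 'Medium', 'Medium', 'Small']
```

This round trip is handy when you need to report model outputs in the original category names.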
Why Are They Important in Data Science?
- Compatibility with Machine Learning Algorithms
Many machine learning models work with numerical data, so categorical features must be transformed. Both encoders ensure that algorithms can process these variables without introducing bias.
- Handling Different Types of Categorical Data
  - One-Hot Encoding is ideal for nominal data, where the categories are independent (e.g., City, Color).
  - Ordinal Encoding works best for ordinal data, where the categories have a defined sequence (e.g., Size, Education Level).
- Interpretability and Performance
Depending on the dataset and model, the choice of encoding can influence model performance and interpretability. For instance, tree-based models handle ordinal features naturally, while linear models may benefit more from one-hot encoded data.
In simple words: when to use which?
- Use One-Hot Encoding when:
- Categories are unordered.
- The number of categories is relatively small.
- You’re working with algorithms like Logistic Regression or Neural Networks.
- Use Ordinal Encoding when:
- Categories have a clear, meaningful order.
- You want to reduce the number of features.
- Algorithms like gradient-boosted trees or Random Forests are in use.
Wrapping It Up
Transforming your data might not sound glamorous, but it’s one of the most important steps in building a solid machine-learning model. Choosing the right encoding technique—One-Hot Encoding for unordered categories or Ordinal Encoding for ordered ones—helps your model make accurate predictions while keeping biases and noise out of the picture.
So, next time you’re wrangling categorical data, you’ll know exactly how to handle it!