Data Transformation

When working with machine learning models, raw data often isn’t enough. Features in your dataset may need transformation to make them understandable for algorithms, especially when dealing with categorical data. That’s where encoding techniques like One-Hot Encoding and Ordinal Encoding come into play.

What Are One-Hot Encoding and Ordinal Encoding?

One-Hot Encoding

One-hot encoding is a technique used to convert categorical variables into a binary matrix. Each unique category is represented as a new column, and a row gets a 1 in the column of its category and 0 in others.

For example, consider the categorical variable Color with values Red, Blue, and Green:

Color   Red   Blue   Green
Red     1     0      0
Blue    0     1      0
Green   0     0      1

This method is handy for nominal categories, which have no inherent order.

Let’s look at how we can use one-hot encoding in Python:

# Import necessary libraries
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
import pandas as pd

# Sample clothing data
data = pd.DataFrame({
    'Item': ['Shirt', 'Pants', 'Shoes', 'Shirt', 'Pants'],
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['Small', 'Large', 'Medium', 'Medium', 'Small']
})

# One-Hot Encoding for 'Item' and 'Color'
# sparse_output=False returns a dense array; dtype=int keeps the values as 0/1 integers
one_hot_encoder = OneHotEncoder(sparse_output=False, dtype=int)
encoded_features = one_hot_encoder.fit_transform(data[['Item', 'Color']])
encoded_features_df = pd.DataFrame(
    encoded_features,
    columns=one_hot_encoder.get_feature_names_out(['Item', 'Color'])
)
# Place the original columns next to their encoded versions and display the table
one_hot_table = pd.concat([data[['Item', 'Color']], encoded_features_df], axis=1)
print(one_hot_table)

Producing the following output:

Item    Color   Item_Pants   Item_Shirt   Item_Shoes   Color_Blue   Color_Green   Color_Red
Shirt   Red     0            1            0            0            0             1
Pants   Blue    1            0            0            1            0             0
Shoes   Green   0            0            1            0            1             0
Shirt   Blue    0            1            0            1            0             0
Pants   Red     1            0            0            0            0             1

The Item and Color columns are converted into binary columns, each representing one unique category.
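To make the mechanics concrete, here is a rough sketch of the same idea done by hand with NumPy for the Color values only. The helper name one_hot_by_hand is made up for this illustration, and sorting the categories alphabetically is an assumption chosen so the column order matches the table above; it is not how you would normally do this in practice.

```python
import numpy as np

def one_hot_by_hand(values):
    """Return (categories, matrix) for a list of category labels."""
    # Sort the unique labels so each category gets a stable column position
    categories = sorted(set(values))
    column_of = {category: i for i, category in enumerate(categories)}

    # Start with all zeros, then set a single 1 per row
    matrix = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        matrix[row, column_of[value]] = 1
    return categories, matrix

colors = ['Red', 'Blue', 'Green', 'Blue', 'Red']
categories, matrix = one_hot_by_hand(colors)
print(categories)
print(matrix)
```

Each row ends up with exactly one 1, in the column of its own category, which is the defining property of a one-hot encoding.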

Ordinal Encoding

On the other hand, Ordinal Encoding assigns a unique integer to each category, preserving the order if it exists. For example:

Size     Encoded Value
Small    1
Medium   2
Large    3

This technique is suitable when the categories have a natural rank or progression.

Let’s see how we can implement ordinal encoding in Python:

# Import necessary libraries
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
import pandas as pd

# Sample clothing data
data = pd.DataFrame({
    'Item': ['Shirt', 'Pants', 'Shoes', 'Shirt', 'Pants'],
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['Small', 'Large', 'Medium', 'Medium', 'Small']
})
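
# Ordinal Encoding for 'Size'
# The categories list fixes the order explicitly: Small < Medium < Large
ordinal_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
size_encoded = ordinal_encoder.fit_transform(data[['Size']])
size_encoded_df = pd.DataFrame(size_encoded, columns=['Size_Encoded'])
# Place the original column next to its encoded version and display the table
table_ordinal = pd.concat([data[['Size']], size_encoded_df], axis=1)
print(table_ordinal)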

Resulting in the following table:

Size     Size_Encoded
Small    0.0
Large    2.0
Medium   1.0
Medium   1.0
Small    0.0

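Here the Size column is converted into a single numeric column that preserves the order of sizes (Small = 0.0, Medium = 1.0, Large = 2.0). Note that scikit-learn numbers the categories starting from 0, so the codes are shifted by one compared with the illustrative 1, 2, 3 table above; the order is what matters.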
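The explicit categories list passed to OrdinalEncoder is what keeps the sizes in their natural order. As a small sketch of why it matters, the snippet below compares it with the encoder’s default behaviour, which sorts the labels it finds (alphabetically, for strings). The variable names are just for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'Size': ['Small', 'Large', 'Medium', 'Medium', 'Small']})

# No categories argument: the encoder sorts the labels it sees
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(sizes[['Size']]).ravel()

# Explicit order: Small < Medium < Large
ordered_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
ordered_codes = ordered_encoder.fit_transform(sizes[['Size']]).ravel()

print(default_encoder.categories_[0])  # learned order without the explicit list
print(default_codes)
print(ordered_codes)
```

With the default, Large ends up with the smallest code and Small with the largest, so the numbers no longer reflect the real ranking. Whenever the order carries meaning, it is safer to spell it out.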

Why Are They Important in Data Science?

  • Compatibility with Machine Learning Algorithms
    Many machine learning models work only with numerical data, so categorical features must be transformed first. Both encoders make these variables usable by such algorithms, and matching the encoder to the type of category avoids implying an order that isn’t really there.
  • Handling Different Types of Categorical Data
    • One-Hot Encoding is ideal for nominal data, where the categories are independent (e.g., City, Color).
    • Ordinal Encoding works best for ordinal data, where the categories have a defined sequence (e.g., Size, Education Level).
  • Interpretability and Performance
    Depending on the dataset and model, the choice of encoding can influence model performance and interpretability. For instance, tree-based models handle ordinal features naturally, while linear models may benefit more from one-hot encoded data.

In simple terms: when to use which?

  • Use One-Hot Encoding when:
    • Categories are unordered.
    • The number of categories is relatively small.
    • You’re working with algorithms like Logistic Regression or Neural Networks.
  • Use Ordinal Encoding when:
    • Categories have a clear, meaningful order.
    • You want to reduce the number of features.
    • Algorithms like gradient-boosted trees or Random Forests are in use.

Wrapping It Up

Transforming your data might not sound glamorous, but it’s one of the most important steps in building a solid machine-learning model. Choosing the right encoding technique (One-Hot Encoding for unordered categories, Ordinal Encoding for ordered ones) helps your model make accurate predictions without picking up a spurious order or a pile of unnecessary features.

So, next time you’re wrangling categorical data, you’ll know exactly how to handle it!
