Encoding Categorical Variables
An important preprocessing step in machine learning is converting categorical variables into a numerical format through encoding. This template will cover how to handle binary and ordered categorical variables with label encoding, as well as one-hot encoding for unordered categorical data.
To swap in your dataset in this template, the following is required:
- There must be at least one column with a categorical variable that you want to encode.
- There must be no NaN/NA values. You can use this template to impute missing values if needed.
The placeholder dataset in this template is bank marketing data with details such as job, education, and marital status. Each row represents a different customer. You can find more information on this dataset's source and dictionary here.
# Import packages
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# Load the dataset into a DataFrame
df = pd.read_csv("bank.csv") # Replace with the file you want to use
# Preview the DataFrame
df
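The requirement that the data contain no NaN/NA values can be checked before encoding. The following is a minimal sketch on a hypothetical toy DataFrame (the values are made up for illustration and are not taken from bank.csv); the same calls should work on df.

```python
import pandas as pd

# Hypothetical toy data standing in for the loaded dataset
toy = pd.DataFrame({
    "job": ["admin.", "technician", None],
    "marital": ["married", "single", "single"],
})

# Count the missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# True if any value anywhere in the DataFrame is missing
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

If has_missing is True, impute or drop the affected rows before moving on to the encoding steps.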
Label Encoding
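Label encoding replaces categorical values with integers (e.g., 0, 1, 2, ...). It is appropriate for both binary data and ordinal data (i.e., categorical data that has an inherent order). To label encode categorical data, you can use the LabelEncoder() class from sklearn. Be aware that LabelEncoder() assigns integers in sorted (alphabetical) order of the category names, which may not match the inherent order of an ordinal variable, and that the scikit-learn documentation describes it as intended for target labels rather than input features.
Note: You can also use OrdinalEncoder() to perform a similar operation on multiple features. Its categories argument lets you set the order of the categories explicitly.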
# Create a copy of the original DataFrame
df_encoded = df.copy()
# Specify the column you wish to label encode
label_column = "education"
# Initialize the LabelEncoder
le = LabelEncoder()
# Create a new column using the fit_transform method of the LabelEncoder
df_encoded[label_column + "_enc"] = le.fit_transform(df_encoded[label_column])
# Preview the original and encoded column
df_encoded[[label_column, label_column + "_enc"]]
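Because LabelEncoder() numbers categories in sorted order of their names, the integers may not follow the order you have in mind for an ordinal variable. The sketch below compares it with OrdinalEncoder() and an explicit category order. The priority column and its low/medium/high values are hypothetical and chosen only because their alphabetical order differs from their natural order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal column (not part of the bank marketing data)
toy = pd.DataFrame({"priority": ["medium", "low", "high", "low"]})

# LabelEncoder sorts the category names, so "high" becomes 0
le = LabelEncoder()
label_codes = le.fit_transform(toy["priority"])
print(le.classes_.tolist())  # ['high', 'low', 'medium']

# OrdinalEncoder accepts the intended order, one list per encoded column
oe = OrdinalEncoder(categories=[["low", "medium", "high"]])
toy["priority_enc"] = oe.fit_transform(toy[["priority"]]).ravel()
print(toy)
```

With the explicit order, low, medium, and high map to 0, 1, and 2, so the encoded values preserve the ranking.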
One-Hot Encoding Using pandas
One-hot encoding converts each distinct value in a categorical column into its own column containing 0s and 1s. The simplest way to one-hot encode columns in a DataFrame is to use pandas' get_dummies() function, which lets you choose which columns to encode.
Pass the DataFrame that you wish to encode as the first argument. In this example, there are two key arguments:
- columns: allows you to choose which columns you wish to be encoded. All columns with an object or category data type will be encoded if this is not specified. You may sometimes want to avoid this if some categorical columns contain many different values.
- drop_first: allows you to return k-1 dummy variables if there are k categories (thus reducing the number of features you create).
# Specify the columns you wish to one-hot encode
categorical_columns = [
    "job",
    "marital",
]
# Perform the one-hot encoding
# dtype=int returns 0s and 1s (pandas 2.0 and later return True/False by default)
df_encoded = pd.get_dummies(df, columns=categorical_columns, drop_first=True, dtype=int)
# View the resulting DataFrame
df_encoded
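To see what drop_first changes, it can help to run get_dummies() on a small example. This is an illustrative sketch with a hypothetical three-category column; the values are made up and are not taken from bank.csv.

```python
import pandas as pd

# Hypothetical column with k = 3 categories
toy = pd.DataFrame({"marital": ["married", "single", "divorced", "single"]})

# Without drop_first: one dummy column per category (k columns)
full = pd.get_dummies(toy, columns=["marital"], dtype=int)
print(list(full.columns))
# ['marital_divorced', 'marital_married', 'marital_single']

# With drop_first: the first category is dropped (k - 1 columns)
reduced = pd.get_dummies(toy, columns=["marital"], drop_first=True, dtype=int)
print(list(reduced.columns))
# ['marital_married', 'marital_single']
```

The dropped category ("divorced" here) is still recoverable: it is the row where every remaining dummy column is 0.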
One-Hot Encoding Using sklearn
You can also use sklearn's OneHotEncoder to one-hot encode categorical columns. While the process is not as simple as it is with pandas, there are key advantages for machine learning. Most importantly, a fitted OneHotEncoder() remembers the categories it saw, so it produces the same columns when you later transform new data. In this example, the encoder is initialized and fit to a subset of the columns. The data is then transformed, the column names are retrieved, and the result is joined with the original data.
While initializing the encoder, the following two arguments are used:
- handle_unknown: tells the encoder how to treat categories that it did not see during fitting. If set to "error", the encoder raises an error when it encounters an unknown category during the transform. If set to "ignore", all of the one-hot columns for that feature are set to zero for the affected row.
- sparse_output (named sparse in scikit-learn versions before 1.2): specifies whether a sparse matrix or an array is returned. The code below only works with an array, so it is set to False.
# Specify the columns you wish to one-hot encode
categorical_columns = ["job", "marital"]
# Filter the DataFrame for the categorical features
cat_features = df[categorical_columns]
# Initialize the OneHotEncoder and fit it to the categorical features
# sparse_output requires scikit-learn 1.2 or later (older versions use sparse=False)
enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
enc.fit(cat_features)
# Use the transform method to one-hot encode the categorical data and then convert it to a DataFrame
# Reusing the original index keeps the rows aligned when joining below
enc_data = pd.DataFrame(
    enc.transform(cat_features),
    columns=enc.get_feature_names_out(categorical_columns),
    index=cat_features.index,
)
# Join with the rest of the data and preview the DataFrame
df_encoded = df.join(enc_data)
df_encoded
Once you have encoded all the categorical variables you want to use, you can remove the original columns and feed the data into a model. If you would like to learn more about preprocessing techniques, be sure to check out the DataCamp course Preprocessing Data for Machine Learning in Python.
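Removing the original columns after encoding can be sketched as follows. The toy DataFrame is hypothetical and simply mimics the shape of a DataFrame that holds both an original categorical column and its one-hot columns.

```python
import pandas as pd

# Hypothetical DataFrame containing an original column and its encoded columns
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["married", "single", "married"],
    "marital_married": [1, 0, 1],
    "marital_single": [0, 1, 0],
})

# Drop the original categorical column so that only numeric features remain
model_ready = toy.drop(columns=["marital"])

# An empty result confirms that no non-numeric columns are left
non_numeric = model_ready.select_dtypes(exclude="number")
print(non_numeric.empty)
```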