Binning for Feature Engineering in Machine Learning

 


Using binning as a technique to quickly and easily create new features for use in machine learning.

Photo by Tim Mossholder on Unsplash

If you have trained your model and still think the accuracy can be improved, it may be time for feature engineering. Feature engineering is the practice of using existing data to create new features. This post will focus on a feature engineering technique called “binning”.

This post will assume a basic understanding of Python, Pandas, NumPy, and matplotlib. Most of the time links are provided for a deeper understanding of what is being used. If something doesn’t make sense, please leave a comment and I will try my best to elaborate.

What is Binning?

Binning does exactly what it sounds like. It takes a column of continuous numbers and places each value into a “bin” based on ranges that we determine. This gives us a new categorical feature.

For instance, let’s say we have a DataFrame of cars.

Sample DataFrame of cars

We’ll focus on the MPG column and make 3 bins: fuel-efficient, average fuel efficiency, and gas guzzlers.

Making the Bins


Our first step will be to determine the ranges for these bins. This can be tricky as it can be done in a few different ways.

One way to do this is to split the full range of values into bins of equal width. Basically, we would go through the same process as if we were creating a histogram.

Since Python makes generating histograms easy with matplotlib, let’s visualize this data in a histogram and set the bins parameter to 3.

import matplotlib.pyplot as plt
mpgs = cars['MPG']
plt.hist(mpgs, bins=3)
plt.show()
histogram of mpgs

It looks like our 3 bins are around 10–25, 25–40, and 40–56.

Let’s confirm we are correct by returning the array of bin edges, which is the second item that plt.hist returns.

plt.hist(mpgs, bins=3)[1]
array([10. , 25.33333333, 40.66666667, 56. ])
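If you only need the edges and not the plot, the same numbers can be computed directly with NumPy. The following is a minimal sketch, assuming a stand-in Series of MPG values: the individual values are made up, and only the minimum (10) and maximum (56) match the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical MPG values; only the min (10) and max (56) match the article.
mpgs = pd.Series([10, 14, 18, 22, 27, 31, 36, 44, 56], name="MPG")

# Equal-width edges: 3 bins need 4 edges, spaced evenly from min to max.
edges = np.histogram_bin_edges(mpgs, bins=3)
print(edges)

# The same edges computed by hand with evenly spaced points.
manual_edges = np.linspace(mpgs.min(), mpgs.max(), num=4)
print(manual_edges)
```

Because equal-width edges depend only on the smallest and largest values, any column that ranges from 10 to 56 would produce the same cut points.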

We could also do some research on fuel efficiency and see if there is a better way to create our bin ranges. A quick Google search shows that fuel efficiency is relative: it depends on many factors, usually related to the age of the car and its type or class. To keep this example simple, we’ll stick with the histogram method. But keep in mind that understanding the data is an important part of binning, and a little research can go a long way.

Putting the MPG in the Right Bin


Now that we know the ranges for our bins, we can create a simple function that uses Pandas’ super cool cut function. cut will split our column up using the label names and ranges we provide. Note that by default each bin excludes its left edge and includes its right edge, so for 10 to be included we’ll have to start our first cut at 9.

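import pandas as pd

def make_bins(df):
    # Labels are listed from the lowest MPG range to the highest.
    label_names = ["Gas Guzzler", "Average Fuel Efficiency", "Fuel Efficient"]
    # Start at 9 so the minimum value of 10 lands inside the first bin.
    cut_points = [9, 25.333, 40.667, 56]
    df["MPG_effeciency"] = pd.cut(df["MPG"], cut_points, labels=label_names)
    return df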

All we have to do now is pass our original DataFrame as an argument to the function.

new_cars_df = make_bins(cars)

This will give us... drum roll, please... a new DataFrame with an MPG category column!
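To see why the first cut point matters, here is a small sketch of how cut treats values that sit on a bin edge. The MPG values are hypothetical and chosen to land on or near the edges. It also shows cut's include_lowest parameter, which is one alternative to starting the first cut at 9.

```python
import pandas as pd

# Hypothetical values chosen to sit on or near the bin edges.
mpg = pd.Series([10, 25, 26, 56])
labels = ["Gas Guzzler", "Average Fuel Efficiency", "Fuel Efficient"]
edges = [10, 25.333, 40.667, 56]

# Default: each bin excludes its left edge, so 10 falls outside and becomes NaN.
default_bins = pd.cut(mpg, edges, labels=labels)

# include_lowest=True makes the first bin include its left edge, so 10 is kept.
inclusive_bins = pd.cut(mpg, edges, labels=labels, include_lowest=True)

print(default_bins.tolist())
print(inclusive_bins.tolist())
```

Either approach works. Starting at 9 keeps the call shorter, while include_lowest=True lets the cut points match the histogram edges exactly.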

Dummy it up

The last step is to get some dummies, as our model will most likely only be able to take numeric input, not text labels.
dummies = pd.get_dummies(new_cars_df["MPG_effeciency"])
new_cars_df = pd.concat([new_cars_df, dummies], axis=1)

And voilà, we have quickly generated new features to train with and hopefully improve the accuracy of our model.

Before you train with these newly engineered features, be sure to drop the MPG column, as the duplicated information could cause some undesired results.
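Putting the dummy step and the cleanup together might look like the sketch below. The three-row DataFrame is a hypothetical stand-in for new_cars_df, and only the column names follow this post. Passing dtype=int and dropping the text category column along with MPG are my own assumptions: the first because newer pandas versions return True/False columns by default, the second because the text labels are already captured by the dummies.

```python
import pandas as pd

# Hypothetical stand-in for new_cars_df; only the column names follow the article.
new_cars_df = pd.DataFrame({
    "MPG": [12, 30, 48],
    "MPG_effeciency": pd.Categorical(
        ["Gas Guzzler", "Average Fuel Efficiency", "Fuel Efficient"]
    ),
})

# dtype=int asks for 0/1 integer columns instead of booleans.
dummies = pd.get_dummies(new_cars_df["MPG_effeciency"], dtype=int)

# Attach the dummy columns, then drop the raw MPG values and the text category
# they were derived from, so the model does not see the same information twice.
model_df = pd.concat([new_cars_df, dummies], axis=1).drop(
    columns=["MPG", "MPG_effeciency"]
)
print(model_df)
```

Each row ends up with a 1 in exactly one of the three new columns and a 0 in the others.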

And that’s it. Within a few minutes, we have engineered new features!

