Balance of Classes
Class imbalance is such an important factor in modeling! I might be saying a lot of things are important, but they are! Every little thing that goes into cleaning data plays a role in getting a model that is perfect, or at least close to it.
Now, what is class imbalance? It is exactly what its name says. In certain datasets, specifically classification datasets, some classes have many more data points than others. You may think, so what? Well, it matters a lot!! Take, for instance, a classification dataset with three different classes:
Class 1: 1,456 datapoints
Class 2: 786 datapoints
Class 3: 467 datapoints
We can clearly see class 1 has far more data points than class 3 and nearly double the data points of class 2. When we train a model on this data, it will mostly learn from class 1, since that class dominates what it sees, making class 1's features the easiest to pick up. You may think, okay cool, that's what we want. Yes… technically. But we want the model to learn all the classes and identify each one equally well. If we build a model on the data as it stands, it will get very good at predicting class 1 and struggle with the minority classes.
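To make this concrete, here is a quick Python sketch (using pandas, purely as an illustration) that recreates the counts above and shows just how lopsided the proportions are:

```python
import numpy as np
import pandas as pd

# Recreate the example distribution above (labels only, no features).
labels = np.repeat(["Class 1", "Class 2", "Class 3"], [1456, 786, 467])
counts = pd.Series(labels).value_counts()

print(counts)                  # raw counts per class
print(counts / counts.sum())  # proportions: roughly 54% / 29% / 17%
```

Class 1 alone makes up roughly 54% of the data, while class 3 gets only about 17%. No wonder the model plays favorites!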
You may ask… how can we fix this?
Imagine a balance scale leaning heavily toward one side; it is our duty to even out the weight.
This is where techniques for handling class imbalance come in. They include SMOTE, resampling (over-sampling), resampling (under-sampling), cluster-based over-sampling, bagging, and more. A very detailed description can be found here.
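To give a little preview of how approachable some of these are, here is a minimal sketch using the imbalanced-learn (imblearn) library; the toy dataset from scikit-learn's make_classification and its parameters are just stand-ins for your own data, not part of the techniques themselves:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Build a toy 3-class dataset with roughly the imbalance from the example.
X, y = make_classification(
    n_samples=2709,
    n_classes=3,
    n_informative=4,
    weights=[0.54, 0.29, 0.17],
    random_state=42,
)
print("original:     ", Counter(y))

# Random over-sampling: duplicate minority-class points until classes match.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print("over-sampled: ", Counter(y_over))

# Random under-sampling: drop majority-class points instead.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("under-sampled:", Counter(y_under))

# SMOTE: synthesize new minority-class points by interpolating between neighbors.
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
print("SMOTE:        ", Counter(y_sm))
```

Each resampler follows the same fit_resample pattern, so swapping techniques is easy once your data is ready.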
Over the next few weeks, I will go into detail about each technique. I will begin with my favorite, SMOTE!
As always any feedback is appreciated! :)