OverSampling

Emilia Orellana
Dec 17, 2020

This is a continuation of my blog series on class imbalance. Last week's blog can be found here. This week I will be explaining another technique called oversampling.

Random oversampling draws samples from the class with the fewest data points and adds copies of them to the training set. There is a very similar technique called random undersampling that does the opposite: it randomly selects a subgroup from the class with the most data points, and only that subgroup is taken into consideration for the training set.

Picture two classes, a light blue one with far more data points and a red one with far fewer. In oversampling we focus only on the smaller group and make copies of its values to duplicate them in the training set before we begin building models.
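To make the duplication concrete, here is a minimal sketch in plain NumPy, assuming a made-up 90/10 class split: minority rows are drawn with replacement and appended until both classes are the same size.

```python
import numpy as np

# Made-up toy data: 90 majority (class 0) rows and 10 minority (class 1) rows.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

# Draw minority indices with replacement until the two classes are even.
extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx), replace=True)

X_balanced = np.vstack([X, X[extra]])
y_balanced = np.concatenate([y, y[extra]])
print(np.bincount(y_balanced))  # [90 90]
```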

Undersampling works the other way around: we gather a random subgroup from the majority class, and that subgroup is the only part of the majority that is taken into consideration for the training set. The point of random resampling, whether it is over- or undersampling, is to even out the group sizes so our model is not thrown off by uneven groups.
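As a sketch of undersampling, the imbalanced-learn library provides RandomUnderSampler; the dataset below is synthetic and only stands in for real data.

```python
from collections import Counter

from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Synthetic imbalanced data: roughly 90% class 0 and 10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))  # roughly 900 majority vs 100 minority

# Randomly discard majority rows until both classes match the minority count.
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y_res))  # both classes now equal the minority count
```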

This website has a great explanation of both techniques.

Below is a sketch of code for randomly oversampling a dataset, using RandomOverSampler from the imbalanced-learn library; as above, the dataset is synthetic and stands in for real data.
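```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

# Synthetic imbalanced data: roughly 90% class 0 and 10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))  # roughly 900 majority vs 100 minority

# Duplicate random minority rows until both classes match the majority count.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print(Counter(y_res))  # both classes now equal the majority count
```

One caveat worth keeping in mind: resample only the training set, never the test set, so your evaluation still reflects the real class balance.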

I hope this blog has been helpful to whoever comes across it!

Any feedback is always appreciated!
