This is a continuation of my previous blog series on class imbalance. Last week's blog can be found here. This week I will be explaining another technique called oversampling.

Random oversampling takes samples from the class with the fewest data points and duplicates them in the training set. There is another, very similar technique called random undersampling, and it is the opposite: it randomly discards samples from the class with the most data points, so only a subset of that class goes into the training set.
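To make this concrete, here is a minimal sketch of both techniques using the imbalanced-learn library; the dataset is synthetic and just for illustration.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Toy binary dataset with a 9:1 class imbalance
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Original:", Counter(y))

# Random oversampling: duplicate minority-class points until the classes match
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print("Oversampled:", Counter(y_over))

# Random undersampling: discard majority-class points until the classes match
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("Undersampled:", Counter(y_under))
```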

From the visual, we can see…


SMOTE is another technique to overcome class imbalance. SMOTE is my favorite technique because, in my opinion, it is the easiest; I touched on it in my first blog about class imbalance, which can be found here. SMOTE is used when a class has very few data points. As I have mentioned before, it is important to fix class imbalance in a dataset because, when we create models, every class should have a fair chance of being predicted.

Synthetic Minority Oversampling Technique (SMOTE) can be understood from the visual above. We…
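For anyone who wants to try it, here is a minimal sketch using the SMOTE implementation from imbalanced-learn, again with a synthetic dataset standing in for real data.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy binary dataset where the minority class is roughly 10% of the points
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))

# SMOTE interpolates between a minority point and its nearest minority
# neighbors to create new synthetic points instead of duplicating old ones
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After:", Counter(y_res))
```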


Class imbalance is such an important factor in modeling! I might be saying a lot of things are important, but they are! Every little thing that goes into cleaning data plays an important role in getting a perfect, or close to perfect, model.

Now, what is class imbalance? It is exactly what its name says. In certain datasets, specifically classification datasets, some classes have more data points than others. You may think, so what? Well, it is very important!! Let's take, for instance, a classification dataset with three different classes.

Class 1: 1,456 data points

Class 2: 786 data points

Class 3…
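A quick way to spot this kind of imbalance in your own data is to count the labels; the file and column names below are hypothetical placeholders.

```python
import pandas as pd

# "my_dataset.csv" and the "label" column are stand-ins for your own data
df = pd.read_csv("my_dataset.csv")
print(df["label"].value_counts())
```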


This is the last part of my blog series. So sad. For the last three weeks, I have been explaining machine learning and the essence of it. Part one, the intro, can be found here, and the second part can be found here. In part two I began explaining the machine learning models used for supervised learning. In this final part, I will continue by explaining unsupervised learning.

I began by explaining machine learning and comparing it to a car and its engine. I then compared supervised learning to automatic cars because of the need to not change the…


My simplistic blog about machine learning can be found here. I gave a brief explanation of what machine learning is, and I compared it to a car and its engine.

The great thing about machine learning is the multiple models that can be used. Of course, each model serves a different purpose, just like each car is designed for something specific. Across these models there are two broad designs, supervised and unsupervised. Let's compare supervised learning to automatic cars, and unsupervised learning to stick shift cars. Automatic cars don't need you to change gears; meanwhile, stick shift cars give control on…


Machine learning is the coolest thing ever, once you actually understand it all. We can compare the beauty of machine learning to a car and its engine. We only see the car driving and its outer qualities, but to truly appreciate a car you have to understand what is under the hood; that is where the deeper bond and appreciation come from. When the car does not work, or when it breaks down, what does someone do? Take it to the mechanic.

Now, comparing this to machine learning, we must first define it. Machine learning is responsible for a majority of the…


“Not nice” square error is a little silly title for the commonly used mean square error. For those who haven't used mean square error (MSE), it is a commonly used estimator in statistics. The simple definition is that the MSE tells you how close a set of points is to the regression line. It is a preferred estimator because it gives more weight to larger errors, and because it takes the average of the set of squared errors. This means the lower the MSE you get from a model or a regression line, the better it performs.

Formula
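In its standard form, the MSE is the average of the squared differences between the observed values and the values the regression line predicts:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

Here $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $n$ is the number of points.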

From the…
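As a quick sketch, the same average-of-squared-errors can be computed by hand or with scikit-learn; the numbers below are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # observed values (toy numbers)
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # values predicted by the model

# Average of the squared errors, exactly as in the formula above
print(np.mean((y_true - y_pred) ** 2))     # 0.875
print(mean_squared_error(y_true, y_pred))  # same result
```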


Feature importance is a beautiful technique in machine learning. Its job is basically its own name, ‘feature importance’: identifying which features are the most important. The technique ranks the features in your dataset by how important they are. This is essential for machine learning because it would be difficult to teach a model anything when one of the features basically gives away the answer.

For example, let's say we want to build a model to help predict your ideal vacation places based on your interests. What if by accident we also include your dream vacation? The model will know your dream locations and…
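As a rough sketch of how you might rank features in practice, here is the built-in importance score of a random forest in scikit-learn; the iris dataset is just a stand-in for your own data.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

model = RandomForestClassifier(random_state=42).fit(X, y)

# Rank the features from most to least important
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```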


Imagine finding the perfect shoe! Grid search is an amazing technique! It is doing everything you can to maximize your chances of finding the perfect shoe. That is basically what grid search is.

To begin, we must establish a baseline. We create a random forest classifier and fit it with X_train and Y_train. Once we fit it and check the accuracy score, we can consider that the baseline we want to improve on. This is the first shoe we find, but we are still searching for a better pair.
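Here is a minimal sketch of that baseline step; the breast-cancer dataset stands in for whatever data you are actually working with.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# This baseline accuracy is the "first shoe": the score we now try to beat
print("Baseline accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```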

Once we fit the classifier…
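Continuing the sketch above, GridSearchCV tries every combination in a parameter grid and keeps the best one; the grid values here are illustrative choices, not the only sensible ones.

```python
from sklearn.model_selection import GridSearchCV

# Every combination of these values gets cross-validated
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 5, 10],
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)  # X_train, y_train from the baseline sketch

print("Best params:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```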


What is a confusion matrix, you ask? It's a 2x2 table summarizing the predictions of a binary classifier against the actual labels. Various measurements are obtained from the matrix, such as Recall, Precision, Specificity, Accuracy, and the AUC-ROC Curve.

Technically there are only two classes that arise from the matrix, Positives and Negatives. Now, according to the visual to the left, we have True Positives, False Positives, False Negatives, and True Negatives. To understand the four options I will explain with a visual.
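The matrix itself is easy to produce with scikit-learn; the labels below are toy values, just to show the layout.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels (toy data)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions (toy data)

# scikit-learn lays the matrix out as:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```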
