Siobhan McMillan/

Supervised Learning with scikit-learn




We can define an error function, also called a loss or cost function.

We want the line to be as close as possible to the points, so we minimise the vertical distance between each point and the line. These distances are the residuals.

We square the residuals so that points below the line don't cancel out points above it. This is called Ordinary Least Squares (OLS), and it minimises the residual sum of squares (RSS).
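A small sketch of residuals and RSS, using made-up numbers and NumPy's polyfit as a stand-in least-squares fitter (the data and the comparison line are assumptions for illustration only):

```python
import numpy as np

# Made-up data for illustration: y is roughly 2*x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

# Fit a straight line y = a*x + b by least squares
a, b = np.polyfit(x, y, deg=1)

residuals = y - (a * x + b)    # vertical distance from each point to the line
rss = np.sum(residuals ** 2)   # squaring stops positives and negatives cancelling

# A different, hand-picked line gives a larger RSS than the least-squares line
rss_other = np.sum((y - (2.5 * x + 0.0)) ** 2)
print(rss, rss_other)
```

The raw residuals of the fitted line sum to roughly zero, which is why they have to be squared before being added up.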

With two features and one target: y = a1x1 + a2x2 + b

a1, a2 = coefficients; b = intercept

This is called multiple linear regression.

scikit-learn takes one variable for the target (y) and one for the features (X), so X needs to be a single 2D array containing all the features.
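As a sketch of what that looks like in practice, here is a fit on synthetic data where the true values a1 = 3, a2 = -2 and b = 5 are chosen by me, so the recovered coefficients can be checked:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))              # 2D array: 100 rows, 2 features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0    # y = a1*x1 + a2*x2 + b, no noise

reg = LinearRegression()
reg.fit(X, y)

print(reg.coef_)       # estimates of a1 and a2
print(reg.intercept_)  # estimate of b
```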

R Squared

R-squared (R²) quantifies the proportion of variance in the target values that is explained by the features. It usually ranges from 0 to 1: a score of 1 means the features completely explain the target's variance, and 0 means they explain none of it. We want the features to explain as much of the variance as possible.


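Mean squared error (MSE) - to assess the model's performance, take the mean of the squared residuals (RSS divided by the number of observations).

Root mean squared error (RMSE) - the square root of MSE. MSE is measured in squared target units, so taking the square root puts the error back into the same units as the target.

RMSE can be read as roughly the typical size of a prediction error.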
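A quick worked example of the three metrics on four made-up predictions (the numbers are purely illustrative):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

r2 = r2_score(y_true, y_pred)              # share of variance explained
mse = mean_squared_error(y_true, y_pred)   # mean of the squared residuals
rmse = np.sqrt(mse)                        # back in the units of the target

print(r2, mse, rmse)
```

Here the squared errors are 0.25, 0.25, 0 and 1, so MSE is 1.5 / 4 = 0.375 and RMSE is about 0.61 in target units.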

Cross Validation


The R-squared returned depends on how we split the data. If the test set happens to contain unusual points, the score won't reflect how well the model generalises to unseen data.

To deal with this, split the data into several sections (folds), train on all but one fold, test on the held-out fold, and rotate so each fold is the test set once. From the resulting scores you can work out statistics of interest such as the mean, standard deviation and confidence intervals.

This is known as k-fold cross-validation.

In Python this is done using cross_val_score and KFold from sklearn.model_selection.

KFold lets you set the number of splits, whether to shuffle, and a seed (random_state).
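A minimal sketch of k-fold cross-validation on synthetic regression data; the data, the 6 splits and the seed are my own choices for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=120)

kf = KFold(n_splits=6, shuffle=True, random_state=42)

# One score per fold; for a regressor the default score is R-squared
cv_scores = cross_val_score(LinearRegression(), X, y, cv=kf)

print(cv_scores)
print(cv_scores.mean(), cv_scores.std())
print(np.quantile(cv_scores, [0.025, 0.975]))  # rough 95% interval of the fold scores
```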


Large coefficients can lead to overfitting - we want to edit the loss function to punish large coefficients. This is regularisation.

Ridge Regression

Ridge uses the Ordinary Least Squares loss plus the squared value of each coefficient multiplied by a constant, alpha.

Models are penalised for large positive and negative coefficients. We need to pick the alpha value that suits the model best, similar to picking k in k-nearest neighbours. Be careful: a high alpha shrinks the coefficients heavily and can lead to underfitting, whereas a low alpha can lead to overfitting (alpha = 0 is just OLS).

Import Ridge from sklearn.linear_model. Use a loop over a range of alphas, fit and score a model for each one, and then see which gives the best performance.
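The alpha loop could look something like this. The synthetic dataset and the list of alphas are assumptions of mine, not values from the course:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    scores[alpha] = ridge.score(X_test, y_test)  # R-squared on the test set

best_alpha = max(scores, key=scores.get)
print(scores)
print(best_alpha)
```

On this data the very large alpha scores much worse than the small one, which is the underfitting effect described above.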

Lasso Regression

Lasso uses the Ordinary Least Squares loss plus the absolute value of each coefficient multiplied by a constant, alpha.

Import Lasso from sklearn.linear_model and use it the same way as ridge regression.

Lasso can be used to look at feature importance, as it shrinks the coefficients of unimportant features to zero.

Using .coef_ we can see the coefficients.
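To see the shrink-to-zero behaviour, here is a sketch on synthetic data where only the first two of four features actually drive the target (the data and alpha=0.5 are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
# Only features 0 and 1 matter; features 2 and 3 are pure noise
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5)
lasso.fit(X, y)

print(lasso.coef_)  # the noise features end up with a coefficient of zero
```

The two useful features keep non-zero (slightly shrunken) coefficients, which is what makes a bar plot of .coef_ readable as feature importance.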

How Good is Your Model

Classification metrics

We can use accuracy to measure model performance, but it can mislead. If you're looking for fraud and only 1% of transactions are fraudulent, a model that predicts that no transactions are fraud is 99% accurate but terrible at detecting fraud. This is due to class imbalance, where one class is more frequent than the others.
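The fraud example can be reproduced with made-up labels, assuming 1 fraudulent transaction in 100 and a model that always predicts "not fraud":

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 99 + [1])   # 99 normal transactions, 1 fraud
y_pred = np.zeros_like(y_true)      # the model never predicts fraud

acc = accuracy_score(y_true, y_pred)  # looks great
rec = recall_score(y_true, y_pred)    # share of real frauds caught

print(acc, rec)
```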

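Confusion matrix

For a binary classifier it holds four counts: true negatives, false positives, false negatives and true positives.

We can get accuracy from this matrix: (true positives + true negatives) / all predictions.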


Precision

Of everything the model predicts as positive, how much really is positive. Use it when false positives are costly, e.g. we don't want many legitimate transactions flagged as fraud. (If we want to detect cancer as reliably as possible and a few false positives are not the worst thing, that is a case for favouring recall instead.)

true positives / (true positives + false positives). High precision means a lower false positive rate (performs well).


Recall

Use when you mostly care about catching the real 1s rather than the real 0s, i.e. we don't care about normal bank transactions but want to flag the fraudulent ones as often as we can. Or if someone is ill and we falsely say they are fine (a false negative), they risk spreading the illness to others.

true positives / (true positives + false negatives). High recall means a lower false negative rate (performs well).

F1 score

This metric takes both kinds of error into account and favours models with similar precision and recall.

2 * (precision * recall) / (precision + recall)

We can see all these metrics using classification_report from sklearn.metrics.
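A worked example of the classification metrics, using hand-made labels so the counts are easy to follow (90 real negatives, 10 real positives; the predictions are hypothetical):

```python
import numpy as np
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

y_true = np.array([0] * 90 + [1] * 10)
# Hypothetical predictions: 85 negatives right, 5 wrong; 6 positives right, 4 wrong
y_pred = np.array([0] * 85 + [1] * 5 + [1] * 6 + [0] * 4)

# scikit-learn orders the binary matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # 2 * p * r / (p + r)

report = classification_report(y_true, y_pred)
print(report)
```

Accuracy here is (85 + 6) / 100 = 0.91, while precision is 6 / 11 and recall is 6 / 10, so the report shows how much weaker the model is on the rare class.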

Logistic Regression


Logistic regression is used for classification. The model calculates the probability p that an observation belongs to the positive class of a binary target. If p > 0.5 it predicts 1; if p < 0.5 it predicts 0.

Import LogisticRegression from sklearn.linear_model.

Prediction probabilities

predict_proba returns a 2D array with the probability of each class for every observation, i.e. will they churn or not.

The default threshold is 0.5. We might want to change this threshold.
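A sketch of predict_proba and a custom threshold on a synthetic binary dataset. The data and the 0.3 threshold are assumptions for illustration; scikit-learn's predict itself always uses 0.5:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

proba = logreg.predict_proba(X_test)  # columns: P(class 0), P(class 1)
y_pred_prob = proba[:, 1]             # probability of the positive class

y_pred_default = logreg.predict(X_test)            # threshold of 0.5
y_pred_custom = (y_pred_prob >= 0.3).astype(int)   # hypothetical lower threshold

print(proba[:5])
print(y_pred_default.sum(), y_pred_custom.sum())
```

Lowering the threshold can only add positive predictions, which trades precision for recall.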

ROC curve

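The receiver operating characteristic (ROC) curve is used to visualise how the true positive rate and false positive rate change as the threshold changes.

When the threshold is 0 the model predicts 1 for everything, so it correctly predicts all positive values but gets every negative wrong. When the threshold is 1 it predicts 0 for everything, so it correctly predicts all negative values but gets every positive wrong.

To plot this, import roc_curve from sklearn.metrics. The arguments are the test labels and the predicted probabilities.

Also plot a dashed diagonal line from 0 to 1 on both axes, which is what a model guessing at random would give.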


Area under the curve (AUC) - we want this to be as close to 1 as possible.

roc_auc_score is imported from sklearn.metrics.
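A sketch of computing the ROC points and the AUC, again on synthetic data of my own. Plotting is left out; passing fpr and tpr to a matplotlib line plot would draw the curve:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

logreg = LogisticRegression().fit(X_train, y_train)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

# Arguments are the true test labels and the predicted probabilities
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc = roc_auc_score(y_test, y_pred_prob)

print(auc)
```

The curve always runs from (0, 0) to (1, 1); a random-guess model sits on the diagonal with an AUC of about 0.5.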

Hyperparameter tuning

How we can optimise the model by changing hyperparameters, the parameters we specify before we fit a model (e.g. k or alpha).

We use cross-validation when doing this to avoid overfitting to one particular split. We don't use the test set for tuning; it is held back for the final evaluation.

Grid search

Choose a grid of hyperparameter values to try. For KNN we might tune two hyperparameters: the distance metric and the number of neighbours.

E.g. n_neighbors from 2 to 11 in increments of 3 (2, 5, 8, 11) and two metrics, euclidean and manhattan. We then choose the hyperparameters that perform best.

Import GridSearchCV from sklearn.model_selection.

  • We put our grid search values in a dictionary.
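The KNN grid described above could be set up roughly like this. The synthetic data and the 5-fold split are my assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The grid is a dictionary: hyperparameter name -> list of values to try
param_grid = {
    "n_neighbors": [2, 5, 8, 11],
    "metric": ["euclidean", "manhattan"],
}

kf = KFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=kf)
grid.fit(X_train, y_train)  # tuning only ever sees the training set

print(grid.best_params_, grid.best_score_)
test_score = grid.score(X_test, y_test)  # final check on held-out data
```

With 4 x 2 = 8 combinations and 5 folds this runs 40 fits, plus one refit of the best model on the whole training set.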


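This doesn't scale well. The number of fits is the number of folds multiplied by the number of hyperparameter combinations: 10-fold cross-validation on a full grid of 3 hyperparameters with 10 values each is 10 x 10^3 = 10,000 fits.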


Randomised search

This picks random hyperparameter combinations rather than searching all values, which makes it more scalable. Import RandomizedSearchCV from sklearn.model_selection.

n_iter sets the number of hyperparameter combinations that are sampled and tested.

We can then evaluate the best model on the test set.
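A sketch of a randomised search over Ridge hyperparameters. The data, the candidate values and n_iter=5 are illustrative choices of mine:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, RandomizedSearchCV, train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 20 x 2 = 40 possible combinations, but only n_iter of them are tried
param_distributions = {
    "alpha": np.linspace(0.0001, 1, 20),
    "fit_intercept": [True, False],
}

kf = KFold(n_splits=5, shuffle=True, random_state=0)
search = RandomizedSearchCV(
    Ridge(), param_distributions, n_iter=5, cv=kf, random_state=0
)
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)
test_score = search.score(X_test, y_test)  # evaluate the best model on the test set
```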

Preprocessing Data

scikit-learn requires numeric data with no null or missing values. Therefore we might need to do some cleaning first.

scikit-learn won't accept strings such as colours, so we need to turn these into numerical data. We do this with dummy variables: one binary (0/1) column per category.

To create dummy variables we can use scikit-learn's OneHotEncoder or pandas get_dummies.

Call pd.get_dummies and pass the column you want to encode.

We can then concat the dummy columns onto the original DataFrame and drop the column they were made from.
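The encode, concat and drop steps on a tiny made-up DataFrame. The column names are hypothetical, and drop_first=True is my addition: it drops one category so the remaining dummy columns are not redundant:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "colour": ["red", "blue", "green", "blue"],
        "price": [10.0, 12.5, 9.0, 11.0],
    }
)

# One column per category; drop_first removes the first (alphabetical) category
dummies = pd.get_dummies(df["colour"], drop_first=True)

# Attach the dummy columns, then drop the original string column
df_encoded = pd.concat([df, dummies], axis=1)
df_encoded = df_encoded.drop("colour", axis=1)

print(df_encoded)
```

A row with both dummy columns equal to 0 is the dropped category, here "blue".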

Once we have numerical values we can use linear regression.

scikit-learn's cross-validation metrics presume a higher score is better, so for MSE we use scoring="neg_mean_squared_error". We can then calculate RMSE by flipping the sign and taking the square root: np.sqrt(-scores).

If the average RMSE is lower than the standard deviation of the target values, it suggests the model is somewhat accurate.
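A sketch of the negative-MSE-to-RMSE step and the comparison with the target's standard deviation, on synthetic data of my own:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=150)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LinearRegression(), X, y, cv=kf, scoring="neg_mean_squared_error"
)

rmse = np.sqrt(-scores)  # flip the sign first, then take the square root

print(rmse.mean())  # average RMSE across folds
print(np.std(y))    # spread of the target, as a baseline to compare against
```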

missing data

A common approach is to remove observations with missing values in columns where they account for less than 5% of all data. If a row has a missing value in one of those columns, the entire row is dropped.


Imputing values

Making an educated guess as to what the missing values could be: use the mean or median for numerical data; for categorical data we might use the most frequent value. We must split into training and test sets before imputing to avoid data leakage.

Split the features into categorical and numerical sets and do a train/test split on each. For the categorical data use SimpleImputer with strategy="most_frequent": call fit_transform on the training data and transform on the test data. For numerical data we might use the mean (the default). We then combine the training data using np.append and repeat for the test data.

Imputers are known as transformers.
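A sketch of the imputation steps on tiny hand-made arrays that are assumed to be already split into train and test. All values and variable names are hypothetical:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Categorical feature (object dtype) and numerical feature, each with gaps
X_train_cat = np.array([["rock"], [np.nan], ["jazz"], ["rock"]], dtype=object)
X_test_cat = np.array([[np.nan], ["jazz"]], dtype=object)
X_train_num = np.array([[1.0], [np.nan], [3.0], [5.0]])
X_test_num = np.array([[np.nan], [4.0]])

# Categorical: fill with the most frequent training value
imp_cat = SimpleImputer(strategy="most_frequent")
X_train_cat_imp = imp_cat.fit_transform(X_train_cat)  # learn from train only
X_test_cat_imp = imp_cat.transform(X_test_cat)        # reuse on test

# Numerical: the default strategy is the mean
imp_num = SimpleImputer()
X_train_num_imp = imp_num.fit_transform(X_train_num)
X_test_num_imp = imp_num.transform(X_test_num)

# Put the numerical and categorical columns back side by side
X_train = np.append(X_train_num_imp, X_train_cat_imp, axis=1)
X_test = np.append(X_test_num_imp, X_test_cat_imp, axis=1)

print(X_train)
print(X_test)
```

The test set gaps are filled with values learned from the training set (mean 3.0 and "rock"), which is the point of splitting first.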


Pipelines

Import Pipeline from sklearn.pipeline.

To build a pipeline we define the steps as a list of (name, transformer or model) tuples and pass the list to Pipeline. Then split the data and fit the pipeline on the training set.
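A minimal pipeline sketch: impute, then classify. The synthetic data, the artificially removed values and the step names are my own choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X[::10, 0] = np.nan  # knock out some values so the imputer has work to do

# Each step is a (name, object) tuple; every step but the last must be a transformer
steps = [
    ("imputation", SimpleImputer()),
    ("logistic_regression", LogisticRegression()),
]
pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)  # accuracy on the test set

print(score)
```

Because the imputer lives inside the pipeline, it is fitted on the training data only and then reused on the test data.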

Centering and Scaling

Use .describe() to look at the ranges of the data in our DataFrame.

Why scale our data

Many models use distance to inform them, so features on wide ranges will have a disproportionate influence on the model (e.g. KNN).

We want to standardise or normalise our data. This is known as scaling.

To standardise, subtract the mean and divide by the standard deviation, so each feature is centred on 0 with a variance of 1. We can also normalise so the data ranges from -1 to 1.

Import StandardScaler from sklearn.preprocessing.

We can also put a scaler in a pipeline.

We can also optimise our hyperparameters in a pipeline.
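A sketch of scaling on its own, then a scaler plus KNN in a pipeline tuned with a grid search. The data, the deliberately stretched feature and the range of k are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X[:, 0] *= 1000.0  # put one feature on a much wider range than the rest

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# On its own: every feature ends up with mean 0 and standard deviation 1
X_train_scaled = StandardScaler().fit_transform(X_train)
print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))

# In a pipeline; hyperparameters are addressed as <step name>__<parameter>
steps = [("scaler", StandardScaler()), ("knn", KNeighborsClassifier())]
pipeline = Pipeline(steps)
parameters = {"knn__n_neighbors": list(range(1, 10))}

cv = GridSearchCV(pipeline, param_grid=parameters, cv=5)
cv.fit(X_train, y_train)

print(cv.best_params_, cv.best_score_)
test_score = cv.score(X_test, y_test)
```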

Choosing a model

Fewer features = simpler models. More data = larger, more complex models become feasible.

Do we need to be able to explain our models?

KNN doesn't assume a linear relationship, so it is a more flexible model.

We can pick several models, evaluate them, and then pick the best one.

  • Scale before you evaluate!!!

We can use a loop over different models stored in a dictionary to test them on accuracy and then plot the results.
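The model comparison loop might look like this. The three models, the synthetic data and the 6-fold split are my own picks for illustration, and plotting is left out (a boxplot of results.values() would show the spread per model):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Scale first: fit the scaler on the training data, reuse it on the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    kf = KFold(n_splits=6, shuffle=True, random_state=42)
    # Default score for classifiers is accuracy; one value per fold
    results[name] = cross_val_score(model, X_train_scaled, y_train, cv=kf)

for name, scores in results.items():
    print(name, scores.mean())
```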
