James Mwangi/

Supervised Learning with scikit-learn



# Importing modules
import pandas as pd
import numpy as np 

# Importing the course datasets 
diabetes_df = pd.read_csv('datasets/diabetes_clean.csv') # 3
music_df = pd.read_csv('datasets/music_clean.csv') # 4
sales_df = pd.read_csv('datasets/advertising_and_sales_clean.csv') # 2
churn_df = pd.read_csv("datasets/telecom_churn_clean.csv") # 1

1. Classification

1.1 Binary classification

  • In the video, you saw that there are two types of supervised learning — classification and regression.

  • Recall that binary classification is used to predict a target variable that has only two labels, typically represented numerically with a zero or a one.

  • A dataset, churn_df, has been preloaded for you in the console.

Your task is to examine the data and choose which column could be the target variable for binary classification.


Correct! churn has values of 0 or 1, so it can be predicted using a binary classification model.
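
One way to examine a DataFrame for binary-target candidates is to count the distinct values in each column. The following is a minimal sketch on a small invented DataFrame; toy_df and its values are hypothetical stand-ins, not the contents of churn_df.

```python
import pandas as pd

# Hypothetical stand-in for churn_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "account_length": [101, 73, 86, 59, 129],
    "customer_service_calls": [1, 3, 0, 4, 2],
    "churn": [0, 1, 0, 1, 0],
})

# Count distinct values per column; a binary target has exactly two.
unique_counts = toy_df.nunique()
binary_candidates = unique_counts[unique_counts == 2].index.tolist()
print(binary_candidates)  # ['churn']
```

A column with two distinct values is only a candidate; whether it makes sense as a target still depends on what the column means.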

1.1.2 The supervised learning workflow

  • Recall that scikit-learn offers a repeatable workflow for using supervised learning models to predict the target variable values when presented with new data.

  • Reorder the pseudo-code provided so it accurately represents the workflow of building a supervised learning model and making predictions.

1 => from sklearn import Model

2 => model = Model()

3 => model.fit(X, y)

4 => model.predict(X_new)

Great work! You can see how scikit-learn enables predictions to be made in only a few lines of code!
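
The four workflow steps can be sketched with a concrete model. This example assumes KNeighborsClassifier as the model and uses a tiny invented dataset with two well-separated groups; neither the data nor the choice of 3 neighbors comes from the course.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier  # 1. import a model class

# Toy data invented for illustration: two well-separated groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = KNeighborsClassifier(n_neighbors=3)  # 2. instantiate the model
model.fit(X, y)                              # 3. fit it to features and target

X_new = np.array([[0.5, 0.5], [10.5, 10.5]])
predictions = model.predict(X_new)           # 4. predict labels for new data
print(predictions)  # [0 1]
```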

1.2 The classification challenge

1.2.1 k-Nearest Neighbors: Fit

  • In this exercise, you will build your first classification model using the churn_df dataset, which has been preloaded for the remainder of the chapter.

  • The features to use will be "account_length" and "customer_service_calls".

  • The target, "churn", needs to be a single column with the same number of observations as the feature data.

  • You will convert the features and the target variable into NumPy arrays, create an instance of a KNN classifier, and then fit it to the data.

# Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier 

# Create arrays for the features and the target variable
y = churn_df["churn"].values
X = churn_df[["account_length","customer_service_calls"]].values

# Create a KNN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the data
knn.fit(X, y)

Excellent! Now that your KNN classifier has been fit to the data, it can be used to predict the labels of new data points.
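
The requirement that the target be a single column with as many observations as the feature data can be checked from the array shapes. This is a minimal sketch on a small invented DataFrame; toy_df is hypothetical and only mimics the column names of churn_df.

```python
import pandas as pd

# Hypothetical stand-in for churn_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "account_length": [101, 73, 86, 59, 129],
    "customer_service_calls": [1, 3, 0, 4, 2],
    "churn": [0, 1, 0, 1, 0],
})

# Double brackets select a 2D block of features; single brackets give a 1D target.
X = toy_df[["account_length", "customer_service_calls"]].values
y = toy_df["churn"].values

print(X.shape, y.shape)  # (5, 2) (5,)

# The number of rows in X must match the number of labels in y.
same_length = X.shape[0] == y.shape[0]
print(same_length)  # True
```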

1.2.2 k-Nearest Neighbors: Predict

  • Now that you have fit a KNN classifier, you can use it to predict the labels of new data points.

  • All available data was used for training. Fortunately, new observations are available.

  • These have been preloaded for you as X_new.

  • The model knn, which you created and fit to the data in the last exercise, has been preloaded for you.

  • You will use your classifier to predict the labels of a set of new data points:

X_new = np.array([[30.0, 17.5],
                  [107.0, 24.1],
                  [213.0, 10.9]])
# Predict the labels for the X_new
y_pred = knn.predict(X_new)

# Print the predictions for X_new
print("Predictions: {}".format(y_pred)) 

Great work! The model has predicted that the first and third customers in the new array will not churn. But how do we know how accurate these predictions are? Let's explore how to measure a model's performance.
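
Before measuring accuracy, it can help to see where a KNN label comes from: with the default uniform weights, the prediction is a majority vote among the nearest neighbors, and predict_proba reports each class's share of that vote. The sketch below uses an invented one-feature dataset and 3 neighbors, both chosen only for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented one-feature data: class 0 on the left, class 1 on the right.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

knn_toy = KNeighborsClassifier(n_neighbors=3)
knn_toy.fit(X_toy, y_toy)

# The three nearest neighbors of 5.5 are 2.0, 1.0 (class 0) and 10.0 (class 1).
X_query = np.array([[5.5]])
label = knn_toy.predict(X_query)
proba = knn_toy.predict_proba(X_query)

print(label)  # [0]
print(proba)  # two of three votes for class 0, one of three for class 1
```

A vote share says how the neighbors split for one point; it does not say how often the model is right, which is what a held-out test set measures.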

1.3 Measuring model performance

1.3.1 Train/test split + computing accuracy

  • Now that you have learned about the importance of splitting your data into training and test sets, it's time to practice doing this on the churn_df dataset!

  • NumPy arrays have been created for you containing the features as X and the target variable as y.

  • You will split them into training and test sets, fit a KNN classifier to the training data, and then compute its accuracy on the test data using the .score() method.

# Import the module
from sklearn.model_selection import train_test_split

X = churn_df.drop("churn", axis=1).values
y = churn_df["churn"].values

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))

Excellent! In a few lines of code you split a dataset, fit a KNN model, and found its accuracy to be 87%!
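
The stratify argument used in the split keeps the class proportions the same in the training and test sets. The sketch below illustrates this on invented, imbalanced labels (80 zeros and 20 ones); the data is hypothetical and unrelated to churn_df.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented imbalanced labels: 80 zeros and 20 ones, with a dummy feature.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratification, both splits keep the 20% share of the positive class.
print(len(y_train), y_train.mean())
print(len(y_test), y_test.mean())
```

Without stratify, a random split of imbalanced data can leave the test set with noticeably more or fewer positive cases than the training set, which makes the accuracy harder to interpret.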

1.3.2 Overfitting and underfitting

  • Interpreting model complexity is a great way to evaluate performance when utilizing supervised learning.

  • Your aim is to produce a model that can interpret the relationship between features and the target variable, as well as generalize well when exposed to new observations.

  • You will generate accuracy scores for the training and test sets using a KNN classifier with different n_neighbors values, which you will plot in the next exercise.

  • The training and test sets have been created from the churn_df dataset and preloaded as X_train, X_test, y_train, and y_test.

In addition, KNeighborsClassifier has been imported for you along with numpy as np.

# Create neighbors
neighbors = np.arange(1, 13)
train_accuracies = {}
test_accuracies = {}

for neighbor in neighbors:
    # Set up a KNN classifier
    knn = KNeighborsClassifier(n_neighbors=neighbor)
    # Fit the model
    knn.fit(X_train, y_train)
    # Compute accuracy on the training and test sets
    train_accuracies[neighbor] = knn.score(X_train, y_train)
    test_accuracies[neighbor] = knn.score(X_test, y_test)
print(neighbors, '\n', train_accuracies, '\n', test_accuracies)
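
Once the two dictionaries are filled, they can be compared directly to spot overfitting. The sketch below uses hypothetical dictionaries shaped like train_accuracies and test_accuracies; the numbers are invented for illustration and are not results from churn_df.

```python
# Hypothetical accuracies keyed by n_neighbors; invented values, not course results.
train_acc = {1: 1.00, 3: 0.90, 5: 0.86, 7: 0.84}
test_acc = {1: 0.70, 3: 0.78, 5: 0.82, 7: 0.81}

# The k with the highest test accuracy generalizes best among those tried.
best_k = max(test_acc, key=test_acc.get)

# A large gap between training and test accuracy suggests overfitting.
gaps = {k: round(train_acc[k] - test_acc[k], 2) for k in train_acc}

print(best_k)  # 5
print(gaps)    # {1: 0.3, 3: 0.12, 5: 0.04, 7: 0.03}
```

In this invented example, k = 1 scores perfectly on the training data but worst on the test data, the typical signature of an overly complex model.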