Unsupervised Learning in Python - Datacamp
Notes by César Muro
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
# Import the course datasets
grains = pd.read_csv('datasets/grains.csv')
fish = pd.read_csv('datasets/fish.csv', header=None)
wine = pd.read_csv('datasets/wine.csv')
eurovision = pd.read_csv('datasets/eurovision-2016.csv')
stocks = pd.read_csv('datasets/company-stock-movements-2010-2015-incl.csv', index_col=0)
digits = pd.read_csv('datasets/lcd-digits.csv', header=None)
Unsupervised learning is a class of machine learning techniques for discovering patterns in data without a specific prediction task in mind.
Supervised learning finds patterns for a prediction task.
k-means clustering
- Finds clusters of samples
- Number of clusters must be specified
syntax:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)
labels = model.predict(samples)
- New samples can be assigned to existing clusters
- k-means remembers the mean of each cluster (the "centroids")
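As a minimal sketch of how new samples are assigned, the example below fits k-means on synthetic 2D blobs (the data, the n_init and random_state values, and the names blob_centers and new_samples are assumptions made up for illustration, not course data) and checks that .predict() agrees with a manual nearest-centroid lookup.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2D data: three well-separated blobs (made up for illustration)
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
samples = np.vstack(
    [c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers]
)

model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(samples)

# The fitted model stores one centroid per cluster
centroids = model.cluster_centers_  # shape (3, 2)

# New samples are assigned to the cluster with the nearest centroid
new_samples = np.array([[0.2, -0.1], [9.5, 10.3]])
new_labels = model.predict(new_samples)

# Manual check: Euclidean distance from each new sample to each centroid
distances = np.linalg.norm(
    new_samples[:, None, :] - centroids[None, :, :], axis=2
)
manual_labels = distances.argmin(axis=1)

print(new_labels, manual_labels)
```

The label numbers themselves are arbitrary; only the grouping they induce is meaningful.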
Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- You work for an agricultural research center. Your manager wants you to group seed varieties based on different measurements contained in the grains DataFrame. They also want to know how your clustering solution compares to the seed types listed in the dataset (the variety_number and variety columns). Try to use all of the relevant techniques you learned in Unsupervised Learning in Python!
- In the fish DataFrame, each row represents an individual fish. Standardize the features and cluster the fish by their measurements. You can then compare your cluster labels with the actual fish species (first column).
- In the wine DataFrame, there are three class_labels in this dataset. Transform the features to get the most accurate clustering.
- In the eurovision DataFrame, perform hierarchical clustering of the voting countries using complete linkage and plot the resulting dendrogram.
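The "standardize, then cluster" task above could be approached roughly as follows. This is only a sketch on synthetic data, not a solution for the fish dataset: the two features, the group array, and the parameter values are all hypothetical, and chaining StandardScaler and KMeans with make_pipeline is one possible approach among several.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical measurements: the first feature has a huge scale but no
# group structure; the second has a small scale and separates two groups.
group = np.repeat([0, 1], 50)
feature_large = rng.normal(loc=0.0, scale=1000.0, size=100)
feature_small = group * 5.0 + rng.normal(scale=0.3, size=100)
samples = np.column_stack([feature_large, feature_small])

# Standardize each feature to mean 0 and variance 1, then run k-means
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(samples)

# For comparison: k-means on the raw, unscaled features
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(samples)

print(labels[:5], labels[-5:])
```

Without scaling, the large-variance feature dominates the distance computation, so the raw clustering tends to ignore the feature that actually separates the groups.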
Clustering 2D points
Create a KMeans model to find 3 clusters, and fit it to the data points from the previous exercise. After the model has been fit, you'll obtain the cluster labels for some new points using the .predict() method.
You are given the array points from the previous exercise, and also an array new_points.
# Import KMeans
from sklearn.cluster import KMeans
# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters=3)
# Fit model to points
model.fit(points)
# Determine the cluster labels of new_points: labels
labels = model.predict(new_points)
# Print cluster labels of new_points
print(labels)
# Import pyplot
import matplotlib.pyplot as plt
# Assign the columns of new_points: xs and ys
xs = new_points[:,0]
ys = new_points[:,1]
# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs,ys,alpha=0.5,c=labels)
# Assign the cluster centers: centroids
centroids = model.cluster_centers_
# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]
# Make a scatter plot of centroids_x and centroids_y
plt.scatter(centroids_x,centroids_y,marker='D',s=50)
plt.show()
Evaluating a clustering
- Measure the quality of a clustering.
When the true species are known, we can compare them with the cluster labels using a "cross-tabulation":
ct = pd.crosstab(df['labels'], df['species'])
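To make the cross-tabulation concrete, here is a small sketch with a hand-made DataFrame. The label values and the species names species_a, species_b and species_c are hypothetical placeholders, not taken from any course dataset.

```python
import pandas as pd

# Hypothetical cluster labels and known species, made up for illustration
df = pd.DataFrame({
    'labels': [0, 0, 0, 1, 1, 1, 2, 2],
    'species': [
        'species_a', 'species_a', 'species_a',
        'species_b', 'species_b', 'species_c',
        'species_c', 'species_c',
    ],
})

# Rows are cluster labels, columns are species, cells are sample counts
ct = pd.crosstab(df['labels'], df['species'])
print(ct)
```

A clustering matches the species well when each row has most of its counts in a single column. Here cluster 1 mixes two species, so it is the one to look at more closely.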
- We need a way to measure the quality of a clustering that uses only the clusters and the samples themselves.
- A good clustering has tight clusters
- Samples in each cluster bunched together
Inertia measures clustering quality
- Measures how spread out the clusters are (lower is better)
- Sum of squared distances from each sample to the centroid of its cluster
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)
print(model.inertia_)
In fact, k-means aims to place the clusters in a way that minimizes the inertia.
- A good clustering has tight clusters (so low inertia)
- ... but not too many clusters!
- Choose an "elbow" in the inertia plot: a point where the inertia begins to decrease more slowly
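Inertia is the sum of squared distances from each sample to the centroid of its assigned cluster. The sketch below recomputes it by hand on synthetic data and compares the result with model.inertia_. The data and the n_init and random_state settings are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2D data: three blobs (made up for illustration)
rng = np.random.default_rng(0)
samples = np.vstack([
    np.array(center) + rng.normal(scale=0.5, size=(40, 2))
    for center in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])
])

model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(samples)

# Look up the centroid assigned to each sample, then sum squared distances
assigned_centroids = model.cluster_centers_[labels]
manual_inertia = np.sum((samples - assigned_centroids) ** 2)

print(manual_inertia, model.inertia_)
```

Because inertia can only go down as clusters are added, the raw value cannot be used alone to pick the number of clusters, which is why the elbow heuristic is used.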
How many clusters of grain?
In the video, you learned how to choose a good number of clusters for a dataset using the k-means inertia graph. You are given an array samples containing the measurements (such as area, perimeter, length, and several others) of samples of grain. What's a good number of clusters in this case?
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
samples = grains.loc[:,~grains.columns.isin(['variety_number','variety'])].values
ks = range(1, 6)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    # Fit model to samples
    model.fit(samples)
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
Evaluating the grain clustering
In the previous exercise, you observed from the inertia plot that 3 is a good number of clusters for the grain data. In fact, the grain samples come from a mix of 3 different grain varieties: "Kama", "Rosa" and "Canadian". In this exercise, cluster the grain samples into three clusters, and compare the clusters to the grain varieties using a cross-tabulation.
varieties = grains['variety'].values
# Create a KMeans model with 3 clusters: model
model = KMeans(n_clusters=3)
# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(samples)
# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})
# Create crosstab: ct
ct = pd.crosstab(df['labels'],df['varieties'])
# Display ct
print(ct)