Unsupervised Learning in Python

Unsupervised Learning in Python - Datacamp

Notes by César Muro

# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats 

# Import the course datasets 
grains = pd.read_csv('datasets/grains.csv')
fish = pd.read_csv('datasets/fish.csv', header=None)
wine = pd.read_csv('datasets/wine.csv')
eurovision = pd.read_csv('datasets/eurovision-2016.csv')
stocks = pd.read_csv('datasets/company-stock-movements-2010-2015-incl.csv', index_col=0)
digits = pd.read_csv('datasets/lcd-digits.csv', header=None)

Unsupervised learning is a class of machine learning techniques for discovering patterns in data without a specific prediction task in mind.
Supervised learning finds patterns for a prediction task.

k-means clustering

Finds clusters of samples
Number of clusters must be specified

syntax:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)
labels = model.predict(samples)

New samples can be assigned to existing clusters
k-means remembers the mean of each cluster (the "centroids")
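
As a minimal sketch of how new samples get assigned to existing clusters, the example below fits KMeans on synthetic data (the blobs and the new points are made up for illustration, not taken from the course datasets) and then reproduces .predict() by hand as a nearest-centroid lookup.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
samples = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit once; the model stores one centroid per cluster
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(samples)

# New points are assigned to the cluster whose stored centroid is nearest
new_samples = np.array([[0.2, -0.1], [9.8, 10.3], [-10.1, 9.7]])
new_labels = model.predict(new_samples)

# Reproduce predict() by hand: Euclidean distance to every centroid, take the closest
distances = np.linalg.norm(
    new_samples[:, None, :] - model.cluster_centers_[None, :, :], axis=2
)
manual_labels = distances.argmin(axis=1)

print(new_labels)
print(manual_labels)
```

The two printed arrays should match, since prediction only needs the remembered centroids and not the original training samples.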

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

You work for an agricultural research center. Your manager wants you to group seed varieties based on different measurements contained in the grains DataFrame. They also want to know how your clustering solution compares to the seed types listed in the dataset (the variety_number and variety columns). Try to use all of the relevant techniques you learned in Unsupervised Learning in Python!
In the fish DataFrame, each row represents an individual fish. Standardize the features and cluster the fish by their measurements. You can then compare your cluster labels with the actual fish species (first column).
In the wine DataFrame, there are three class_labels in this dataset. Transform the features to get the most accurate clustering.
In the eurovision DataFrame, perform hierarchical clustering of the voting countries using complete linkage and plot the resulting dendrogram.
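
The "standardize, then cluster" step mentioned above can be sketched as a scikit-learn pipeline. This is only an illustration on synthetic stand-in data (the two columns below are invented, not the fish measurements), assuming StandardScaler is the intended way to standardize.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the fish dataset):
# column 0 separates two groups but has a small scale,
# column 1 is pure noise with a very large scale.
rng = np.random.default_rng(0)
true_groups = np.repeat([0, 1], 100)
informative = true_groups + rng.normal(scale=0.05, size=200)
noisy = rng.normal(scale=1000.0, size=200)
samples = np.column_stack([informative, noisy])

# Standardize each feature to mean 0 and variance 1, then cluster
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(samples)

# Cluster numbering is arbitrary, so check agreement up to a label swap
agreement = max(np.mean(labels == true_groups), np.mean(labels != true_groups))
print(agreement)
```

Without the scaler, the large-variance column would dominate the distances that k-means uses, which is why standardizing first tends to help when features are measured on very different scales.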

Clustering 2D points

Create a KMeans model to find 3 clusters, and fit it to the data points from the previous exercise. After the model has been fit, you'll obtain the cluster labels for some new points using the .predict() method.

You are given the array points from the previous exercise, and also an array new_points.

# Import KMeans
from sklearn.cluster import KMeans


# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters=3)

# Fit model to points
model.fit(points)
# Determine the cluster labels of new_points: labels
labels = model.predict(new_points)

# Print cluster labels of new_points
print(labels)

# Import pyplot
import matplotlib.pyplot as plt
# Assign the columns of new_points: xs and ys
xs = new_points[:,0]
ys = new_points[:,1]

# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs,ys,alpha=0.5,c=labels)

# Assign the cluster centers: centroids
centroids = model.cluster_centers_

# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]

# Make a scatter plot of centroids_x and centroids_y
plt.scatter(centroids_x,centroids_y,marker='D',s=50)
plt.show()

Evaluating a clustering

Measure the quality of a clustering.
When the true labels are known (for example, the species), compare them with the cluster labels using a "cross-tabulation":

ct = pd.crosstab(df['labels'], df['species'])
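
A small self-contained illustration of a cross-tabulation follows. The labels and species values here are hypothetical, chosen only to show how the table is read.

```python
import pandas as pd

# Hypothetical cluster labels and true species for seven samples
df = pd.DataFrame({
    'labels': [0, 0, 1, 1, 2, 2, 2],
    'species': ['setosa', 'setosa', 'versicolor', 'virginica',
                'virginica', 'virginica', 'versicolor'],
})

# Rows are cluster labels, columns are species, cells are sample counts
ct = pd.crosstab(df['labels'], df['species'])
print(ct)
```

A row with a single non-zero cell (cluster 0 here) means that cluster corresponds cleanly to one species; rows with counts spread over several columns indicate a cluster that mixes species.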

We need a way to measure the quality of a clustering that uses only the clusters and the samples themselves.
A good clustering has tight clusters
Samples in each cluster bunched together

Inertia measures clustering quality

Measures how spread out the clusters are (lower is better)
Sum of squared distances from each sample to the centroid of its cluster
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)
print(model.inertia_)

In fact, k-means aims to place the clusters in a way that minimizes the inertia.
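
To make the definition concrete, the sketch below recomputes the inertia by hand as the sum of squared distances from each sample to its assigned centroid and compares it with model.inertia_. The data is synthetic and only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): two Gaussian blobs
rng = np.random.default_rng(0)
samples = np.vstack([
    rng.normal(loc=(0, 0), scale=1.0, size=(40, 2)),
    rng.normal(loc=(6, 6), scale=1.0, size=(40, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(samples)

# Look up the centroid assigned to each sample, then sum the squared distances
assigned_centroids = model.cluster_centers_[labels]
manual_inertia = ((samples - assigned_centroids) ** 2).sum()

print(manual_inertia, model.inertia_)
```

The two printed numbers should agree up to floating-point error.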

A good clustering has tight clusters (so low inertia)
... but not too many clusters!
Choose an "elbow" in the inertia plot

How many clusters of grain?

In the video, you learned how to choose a good number of clusters for a dataset using the k-means inertia graph. You are given an array samples containing the measurements (such as area, perimeter, length, and several others) of samples of grain. What's a good number of clusters in this case?

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
samples = grains.loc[:,~grains.columns.isin(['variety_number','variety'])].values
ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    
    # Fit model to samples
    model.fit(samples)
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

Evaluating the grain clustering

In the previous exercise, you observed from the inertia plot that 3 is a good number of clusters for the grain data. In fact, the grain samples come from a mix of 3 different grain varieties: "Kama", "Rosa" and "Canadian". In this exercise, cluster the grain samples into three clusters, and compare the clusters to the grain varieties using a cross-tabulation.

varieties = grains['variety'].values
# Create a KMeans model with 3 clusters: model
model = KMeans(n_clusters=3)

# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(samples)

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'],df['varieties'])

# Display ct
print(ct)