Matheus Cerqueira/

Unsupervised Learning in Python


Run the hidden code cell below to import the data used in this course.

# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import scipy.stats 

# Import the course datasets 
grains = pd.read_csv('datasets/grains.csv')
fish = pd.read_csv('datasets/fish.csv', header=None)
wine = pd.read_csv('datasets/wine.csv')
eurovision = pd.read_csv('datasets/eurovision-2016.csv')
stocks = pd.read_csv('datasets/company-stock-movements-2010-2015-incl.csv', index_col=0)
digits = pd.read_csv('datasets/lcd-digits.csv', header=None)

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

  • You work for an agricultural research center. Your manager wants you to group seed varieties based on different measurements contained in the grains DataFrame. They also want to know how your clustering solution compares to the seed types listed in the dataset (the variety_number and variety columns). Try to use all of the relevant techniques you learned in Unsupervised Learning in Python!
  • In the fish DataFrame, each row represents an individual fish. Standardize the features and cluster the fish by their measurements. You can then compare your cluster labels with the actual fish species (first column).
  • The wine DataFrame contains three class_labels. Transform the features to get the most accurate clustering.
  • In the eurovision DataFrame, perform hierarchical clustering of the voting countries using complete linkage and plot the resulting dendrogram.
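The hierarchical clustering task at the end of the list above could be approached roughly as follows. This is only a sketch: the `points` array and the `names` list are hypothetical stand-ins, not the eurovision data, and `no_plot=True` is used so the example runs without a display (drop it and call `plt.show()` to draw the dendrogram).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical stand-in data: 6 "countries" with 4 features each,
# drawn as two well-separated groups (NOT the eurovision dataset)
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(0.0, 0.3, size=(3, 4)),
    rng.normal(4.0, 0.3, size=(3, 4)),
])
names = ['A', 'B', 'C', 'D', 'E', 'F']

# Complete linkage: the distance between two clusters is the distance
# between their two farthest members
mergings = linkage(points, method='complete')

# Compute the dendrogram layout without drawing it
tree = dendrogram(mergings, labels=names, no_plot=True)
print(tree['ivl'])  # leaf labels in dendrogram order

# Cut the tree into two flat clusters
groups = fcluster(mergings, t=2, criterion='maxclust')
print(groups)
```

Cutting the tree with `fcluster` is optional; it is shown here only because flat labels are easier to compare with known categories than a plot.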

Clustering grains

# Print the number of classes in grains
print(grains['variety'].nunique())

K-Means finds clusters of samples; the number of clusters must be specified in advance.

import pandas as pd
from sklearn.cluster import KMeans

samples = grains.iloc[:, :7]
model = KMeans(n_clusters=3, random_state=42)

# Fit the model, then assign each sample to a cluster
# (predict() alone fails on a model that has not been fitted)
labels = model.fit_predict(samples)

KMeans can assign cluster labels to new, unseen samples because it remembers the mean of each cluster (its centroid): each new sample is assigned to the cluster whose centroid is nearest.
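The nearest-centroid rule can be checked by hand. The sketch below uses synthetic two-dimensional blobs (an assumption made so the example is self-contained; it is not the grains data) and compares `model.predict` with a manual nearest-centroid lookup.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic training data: three well-separated blobs (hypothetical, not grains.csv)
rng = np.random.default_rng(0)
train = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[0.0, 5.0], scale=0.5, size=(50, 2)),
])

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(train)

# New, unseen samples drawn near the blob at (5, 5)
new_points = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(5, 2))
predicted = model.predict(new_points)

# Manual version: Euclidean distance from every new point to every centroid,
# then pick the index of the closest centroid
distances = np.linalg.norm(
    new_points[:, None, :] - model.cluster_centers_[None, :, :], axis=2
)
nearest = distances.argmin(axis=1)

print(np.array_equal(predicted, nearest))
```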

Quality of Clusters

import matplotlib.pyplot as plt

# Assign the first two columns of samples to xs and ys
xs = samples.iloc[:, 0]
ys = samples.iloc[:, 1]

# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs, ys, c=labels, alpha=0.5)

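# Overlay the centroids (first two coordinates) as diamond markers
centroids = model.cluster_centers_
x = centroids[:, 0]
y = centroids[:, 1]
plt.scatter(x, y, marker='D', s=50)
plt.title("KMeans clustering")
plt.show()

# Cross-tabulate the cluster labels against the known varieties
df = pd.DataFrame({'labels': labels, 'variety': grains['variety']})
ct = pd.crosstab(df['labels'], df['variety'])
print(ct)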

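Inertia measures how far samples are from their centroids: it is the sum of squared distances from each sample to its nearest centroid. Lower values mean tighter clusters, but inertia generally keeps falling as clusters are added, so it should be weighed against the number of clusters.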
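A common way to use inertia is to fit KMeans for several values of k and look for the point where the decrease levels off (the "elbow"). The sketch below assumes synthetic three-blob data named `samples_demo` (hypothetical, not the grains data) and also recomputes the inertia by hand for k = 3.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three clear groups (hypothetical stand-in for real samples)
rng = np.random.default_rng(0)
samples_demo = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[0.0, 5.0], scale=0.5, size=(50, 2)),
])

# Fit a model for each candidate number of clusters and record its inertia
ks = list(range(1, 7))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(samples_demo)
    inertias.append(km.inertia_)

for k, value in zip(ks, inertias):
    print(k, round(value, 2))

# Recompute the inertia by hand for k=3: the sum of squared distances
# from each sample to the centroid of its assigned cluster
km3 = KMeans(n_clusters=3, n_init=10, random_state=42).fit(samples_demo)
assigned_centroids = km3.cluster_centers_[km3.labels_]
manual_inertia = ((samples_demo - assigned_centroids) ** 2).sum()
print(np.isclose(manual_inertia, km3.inertia_))
```

Plotting `ks` against `inertias` with `plt.plot(ks, inertias, '-o')` gives the usual elbow plot; on data like this the curve should flatten after k = 3.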

