## Unsupervised Learning in Python

Run the hidden code cell below to import the data used in this course.

```
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import scipy.stats
# Import the course datasets
grains = pd.read_csv('datasets/grains.csv')
fish = pd.read_csv('datasets/fish.csv', header=None)
wine = pd.read_csv('datasets/wine.csv')
eurovision = pd.read_csv('datasets/eurovision-2016.csv')
stocks = pd.read_csv('datasets/company-stock-movements-2010-2015-incl.csv', index_col=0)
digits = pd.read_csv('datasets/lcd-digits.csv', header=None)
```

### Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

- You work for an agricultural research center. Your manager wants you to group seed varieties based on different measurements contained in the `grains` DataFrame. They also want to know how your clustering solution compares to the seed types listed in the dataset (the `variety_number` and `variety` columns). Try to use all of the relevant techniques you learned in Unsupervised Learning in Python!
- In the `fish` DataFrame, each row represents an individual fish. Standardize the features and cluster the fish by their measurements. You can then compare your cluster labels with the actual fish species (first column).
- In the `wine` DataFrame, there are three `class_labels`. Transform the features to get the most accurate clustering.
- In the `eurovision` DataFrame, perform hierarchical clustering of the voting countries using `complete` linkage and plot the resulting dendrogram.
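The last task can be sketched with SciPy's hierarchical clustering tools. This is a minimal example, not a solution: it uses a small synthetic voting matrix and made-up country labels in place of the actual `eurovision` DataFrame, which would first need to be pivoted into a countries-by-scores matrix.

```python
# Sketch of complete-linkage hierarchical clustering on a synthetic
# stand-in for the eurovision voting matrix (6 countries, 10 score columns)
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(42)
votes = rng.integers(0, 13, size=(6, 10))      # hypothetical voting data
countries = ['AT', 'BE', 'HR', 'DK', 'FI', 'FR']

# linkage() computes the merge history; method='complete' measures the
# distance between two clusters as the largest pairwise sample distance
mergings = linkage(votes, method='complete')

# Each row is one merge: [cluster_i, cluster_j, distance, n_samples]
print(mergings.shape)                           # (5, 4): n - 1 merges

dendrogram(mergings, labels=countries, leaf_rotation=90)
plt.show()
```

The dendrogram's vertical axis shows the distance at which clusters merged, so cutting it at a chosen height yields a flat clustering.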

### Clustering grains

```
# Print the number of unique values in each column of grains
print(grains.nunique())
```

`grains.head(3)`

K-Means finds clusters of samples; the number of clusters must be specified in advance.

```
import pandas as pd
from sklearn.cluster import KMeans
samples = grains.iloc[:, :7]
model = KMeans(n_clusters=3, random_state=42)
model.fit(samples)
# Predict which cluster each sample belongs to
labels = model.predict(samples)
print(labels)
```

KMeans can predict the cluster label of new, unseen samples because it remembers the mean of each cluster (the centroids): each new sample is assigned to its nearest centroid.

#### Quality of Clusters

```
import matplotlib.pyplot as plt
# Assign the first two columns of samples: xs and ys
xs = samples.iloc[:, 0]
ys = samples.iloc[:, 1]
# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs, ys, c=labels, alpha=0.5)
centroids = model.cluster_centers_
x = centroids[:, 0]
y = centroids[:, 1]
plt.scatter(x, y)
plt.title("KMeans clustering")
plt.show()
```

```
df = pd.DataFrame({'labels': labels, 'variety': grains['variety']})
print(df)
```

```
ct = pd.crosstab(df['labels'], df['variety'])
print(ct)
```

Inertia measures how far samples are from their centroids. The lower this value, the better the quality of the clustering.

`print(model.inertia_)`
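Because inertia always decreases as the number of clusters grows, it can also guide the choice of `n_clusters`: plot inertia for a range of k values and look for the "elbow" where the curve levels off. A minimal sketch on synthetic 2-D data (not the grains dataset):

```python
# Sketch of the elbow heuristic: inertia vs. number of clusters
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Synthetic data: three well-separated 2-D blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (0, 4, 8)])

ks = range(1, 7)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    inertias.append(km.inertia_)

# Inertia keeps falling as k grows; the "elbow" (here around k=3)
# marks the point where extra clusters stop helping much
plt.plot(list(ks), inertias, marker='o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.show()
```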