Unsupervised Learning in Python

    Unsupervised Learning in Python - Datacamp

    Notes by César Muro

    # Import the course packages
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats 
    
    # Import the course datasets 
    grains = pd.read_csv('datasets/grains.csv')
    fish = pd.read_csv('datasets/fish.csv', header=None)
    wine = pd.read_csv('datasets/wine.csv')
    eurovision = pd.read_csv('datasets/eurovision-2016.csv')
    stocks = pd.read_csv('datasets/company-stock-movements-2010-2015-incl.csv', index_col=0)
    digits = pd.read_csv('datasets/lcd-digits.csv', header=None)

    Unsupervised learning is a class of machine learning techniques for discovering patterns in data without a specific prediction task in mind.
    Supervised learning finds patterns for a prediction task.

    k-means clustering

    • Finds clusters of samples
    • Number of clusters must be specified

    syntax:
    from sklearn.cluster import KMeans
    model=KMeans(n_clusters=3)
    model.fit(samples)
    labels = model.predict(samples)

    • New samples can be assigned to existing clusters
    • k-means remembers the mean of each cluster (the "centroids") and assigns a new sample to the nearest one
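
    A minimal sketch of the two points above, using synthetic blobs rather than the course data (the blob centers, noise level, n_init and random_state are assumptions chosen for illustration). It checks that .predict() gives the same labels as assigning each new sample to the closest stored centroid by Euclidean distance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2D data: three well-separated blobs (illustrative, not course data)
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
samples = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in blob_centers])

# Fit k-means; the learned centroids are stored in model.cluster_centers_
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(samples)

# New samples are assigned to existing clusters without refitting
new_samples = np.array([[0.2, -0.1], [4.8, 5.1], [0.1, 4.9]])
labels = model.predict(new_samples)

# Reproduce predict() by hand: index of the nearest centroid for each new sample
distances = np.linalg.norm(
    new_samples[:, None, :] - model.cluster_centers_[None, :, :], axis=2
)
manual_labels = distances.argmin(axis=1)

print(labels)
print(manual_labels)
```

    The cluster numbers themselves are arbitrary, so the printed labels may differ between runs with other seeds; only the agreement between the two arrays matters.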

    Explore Datasets

    Use the DataFrames imported in the first cell to explore the data and practice your skills!

    • You work for an agricultural research center. Your manager wants you to group seed varieties based on different measurements contained in the grains DataFrame. They also want to know how your clustering solution compares to the seed types listed in the dataset (the variety_number and variety columns). Try to use all of the relevant techniques you learned in Unsupervised Learning in Python!
    • In the fish DataFrame, each row represents an individual fish. Standardize the features and cluster the fish by their measurements. You can then compare your cluster labels with the actual fish species (first column).
    • In the wine DataFrame, there are three class_labels in this dataset. Transform the features to get the most accurate clustering.
    • In the eurovision DataFrame, perform hierarchical clustering of the voting countries using complete linkage and plot the resulting dendrogram.

    Clustering 2D points

    Create a KMeans model to find 3 clusters, and fit it to the data points from the previous exercise. After the model has been fit, you'll obtain the cluster labels for some new points using the .predict() method.

    You are given the array points from the previous exercise, and also an array new_points.

    # Import KMeans
    from sklearn.cluster import KMeans
    
    
    # Create a KMeans instance with 3 clusters: model
    model = KMeans(n_clusters=3)
    
    # Fit model to points
    model.fit(points)
    # Determine the cluster labels of new_points: labels
    labels = model.predict(new_points)
    
    # Print cluster labels of new_points
    print(labels)
    # Import pyplot
    import matplotlib.pyplot as plt
    # Assign the columns of new_points: xs and ys
    xs = new_points[:,0]
    ys = new_points[:,1]
    
    # Make a scatter plot of xs and ys, using labels to define the colors
    plt.scatter(xs,ys,alpha=0.5,c=labels)
    
    # Assign the cluster centers: centroids
    centroids = model.cluster_centers_
    
    # Assign the columns of centroids: centroids_x, centroids_y
    centroids_x = centroids[:,0]
    centroids_y = centroids[:,1]
    
    # Make a scatter plot of centroids_x and centroids_y
    plt.scatter(centroids_x,centroids_y,marker='D',s=50)
    plt.show()

    Evaluating a clustering

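    • Measure the quality of a clustering.
      When the true classes are known (for example, the species), we can compare them with the cluster labels using a "cross-tabulation":

    ct = pd.crosstab(df['labels'],df['species'])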
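
    To make the cross-tabulation concrete, here is a small sketch with made-up data. The six labels and species names below are hypothetical and only illustrate the shape of the output: one row per cluster label, one column per species, and each cell counting the samples with that combination.

```python
import pandas as pd

# Hypothetical cluster labels and known species for six samples
df = pd.DataFrame({
    'labels': [0, 0, 1, 1, 2, 2],
    'species': ['setosa', 'setosa', 'versicolor',
                'virginica', 'virginica', 'virginica'],
})

# Rows are cluster labels, columns are species, cells are sample counts
ct = pd.crosstab(df['labels'], df['species'])
print(ct)
```

    A clustering that matches the species well puts most of each column's count in a single row; here cluster 1 mixes two species.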

    • We need a way to measure the quality of a clustering that uses only the clusters and the samples themselves.
    • A good clustering has tight clusters
    • Samples in each cluster bunched together

    Inertia measures clustering quality

    • Measures how spread out the samples are within their clusters (lower is better)
    • Sum of squared distances from each sample to the centroid of its cluster
      from sklearn.cluster import KMeans
      model = KMeans(n_clusters=3)
      model.fit(samples)
      print(model.inertia_)

    In fact, k-means aims to place the centroids in a way that minimizes the inertia.
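
    As a sanity check on that definition, the sketch below recomputes the inertia by hand and compares it with model.inertia_. The two-blob synthetic data, n_init and random_state are assumptions made for illustration; they are not part of the course material.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two blobs of 40 points each (illustrative only)
rng = np.random.default_rng(1)
samples = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(40, 2)),
    rng.normal(loc=4.0, scale=0.5, size=(40, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(samples)

# Look up the centroid assigned to each sample, then sum the squared distances
assigned_centroids = model.cluster_centers_[labels]
manual_inertia = ((samples - assigned_centroids) ** 2).sum()

print(manual_inertia, model.inertia_)
```

    The two printed values should agree up to floating-point rounding.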

    • A good clustering has tight clusters (so low inertia)
    • ... but not too many clusters!
    • Choose an "elbow" in the inertia plot

    How many clusters of grain?

    In the video, you learned how to choose a good number of clusters for a dataset using the k-means inertia graph. You are given an array samples containing the measurements (such as area, perimeter, length, and several others) of samples of grain. What's a good number of clusters in this case?

    import numpy as np
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt
    samples = grains.loc[:,~grains.columns.isin(['variety_number','variety'])].values
    ks = range(1, 6)
    inertias = []
    
    for k in ks:
        # Create a KMeans instance with k clusters: model
        model = KMeans(n_clusters=k)
        
        # Fit model to samples
        model.fit(samples)
        
        # Append the inertia to the list of inertias
        inertias.append(model.inertia_)
        
    # Plot ks vs inertias
    plt.plot(ks, inertias, '-o')
    plt.xlabel('number of clusters, k')
    plt.ylabel('inertia')
    plt.xticks(ks)
    plt.show()
    
    

    Evaluating the grain clustering

    In the previous exercise, you observed from the inertia plot that 3 is a good number of clusters for the grain data. In fact, the grain samples come from a mix of 3 different grain varieties: "Kama", "Rosa" and "Canadian". In this exercise, cluster the grain samples into three clusters, and compare the clusters to the grain varieties using a cross-tabulation.

    varieties = grains['variety'].values
    # Create a KMeans model with 3 clusters: model
    model = KMeans(n_clusters=3)
    
    # Use fit_predict to fit model and obtain cluster labels: labels
    labels = model.fit_predict(samples)
    
    # Create a DataFrame with labels and varieties as columns: df
    df = pd.DataFrame({'labels': labels, 'varieties': varieties})
    
    # Create crosstab: ct
    ct = pd.crosstab(df['labels'],df['varieties'])
    
    # Display ct
    print(ct)