How Old Is This Abalone?

    1. Background

    Farming of abalone has become more popular over the years and there have been increasingly successful endeavors to commercially farm abalone for the purpose of consumption. The principal abalone farming region is the continent of Asia.

    For operational and environmental reasons, it is important to estimate the age of the abalone when they go to market.

    Determining an abalone's age involves counting the number of rings in a cross-section of the shell through a microscope. Since this method is somewhat cumbersome and complex, I am interested in helping the farmers estimate the age of the abalone using its physical characteristics.

    I have access to the following historical data (source):

    Abalone characteristics:

    • "sex" - M, F, and I (infant).
    • "length" - longest shell measurement.
    • "diameter" - perpendicular to the length.
    • "height" - measured with meat in the shell.
    • "whole_wt" - whole abalone weight.
    • "shucked_wt" - the weight of abalone meat.
    • "viscera_wt" - gut-weight.
    • "shell_wt" - the weight of the dried shell.
    • "rings" - number of rings in a shell cross-section.
    • "age" - the age of the abalone: the number of rings + 1.5.

    Acknowledgments: Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn, and Wes B Ford (1994) "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288).

    # Import needed modules
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    abalone = pd.read_csv('./data/abalone.csv')
    
    # Print the first few abalone samples
    abalone.head()
    # Print DataFrame information
    abalone.info()

    From our first glance at the data, we observe that all the columns in the dataset are numerical features except the 'sex' column, which is a categorical feature. It is also a great advantage that there are no missing values in the dataset, i.e. all values for the 10 features of all 4,177 abalone samples are available.
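The two observations above (one categorical column, no missing values) can also be checked programmatically instead of reading them off the info() output. The sketch below uses a small made-up frame named sample, not the real dataset, and deliberately includes one missing value so the check has something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the abalone data; the values are illustrative only.
sample = pd.DataFrame({
    "sex": ["M", "F", "I", "M"],
    "length": [0.455, 0.350, np.nan, 0.440],
    "rings": [15, 7, 9, 10],
})

# Count missing values per column; all zeros would mean the data is complete.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Split the columns into numerical and non-numerical (categorical) ones.
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```

On the real data, the same calls would be applied to abalone instead of sample.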

    # Print summary statistics
    abalone.describe()
    # Print the dimension of the dataset
    dimension = abalone.shape
    print(f"The dataset has:\n\t{dimension[0]} rows (Abalone samples)\n\t{dimension[1]} columns (Abalone features)")

    2. Investigating the abalone dataset

    We have a pretty good starting point. Next, we'll inspect these features to uncover more subtle details that will help the performance of our model.

    First, we will figure out the most important features for predicting the age of an abalone. It is useful to establish the correlation between the response variable (in our case, the age of an abalone) and the predictor variables so that we can remove features that will not be helpful to our model. There are many ways to discover correlations between the target variable and the rest of the features. Pair plots, scatter plots, heat maps, and correlation matrices are the most common ones. Below, we will use the corr() function to rank the features by their Pearson correlation coefficient with age (a measure of how strongly two sequences of numbers are linearly related, ranging from -1 to 1).
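To make the Pearson coefficient less of a black box, here is a minimal sketch that computes it by hand and compares the result with NumPy's np.corrcoef. The helper name pearson_r and the five sample values are assumptions chosen for illustration; they are not taken from the analysis itself.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y scaled by both spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()  # deviations from the mean of x
    y_dev = y - y.mean()  # deviations from the mean of y
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# Illustrative values only (a shell weight and an age for five abalone).
shell_wt = [0.15, 0.07, 0.21, 0.155, 0.055]
age = [16.5, 8.5, 10.5, 11.5, 8.5]

r_manual = pearson_r(shell_wt, age)
r_numpy = np.corrcoef(shell_wt, age)[0, 1]
print(r_manual, r_numpy)
```

Both numbers should agree up to floating-point rounding; pandas' corr() uses the same Pearson definition by default.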

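    # Calculate the correlation of every numeric feature with the age.
    # numeric_only=True skips the categorical 'sex' column (pandas >= 2.0 raises an error otherwise);
    # [:-1] drops the last entry, which is the correlation of 'age' with itself.
    abalone_corr = abalone.corr(numeric_only=True)['age'][:-1]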
    
    # Print the correlations of each feature to the age of an abalone
    abalone_corr.sort_values(ascending=False)

    All the features except 'shucked_wt' (weight of abalone meat) have coefficients greater than 0.5, which means that they are fairly strongly correlated with the age of an abalone. The "rings" feature has a coefficient of exactly 1, which is expected because age is defined as the number of rings + 1.5; it is therefore the feature most closely tied to the age.
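The coefficient of exactly 1 for "rings" follows from the definition given in the data description (age = rings + 1.5): adding a constant to a variable does not change its Pearson correlation. A minimal sketch with made-up ring counts shows this.

```python
import pandas as pd

# Illustrative ring counts; any non-constant values would behave the same way.
rings = pd.Series([15, 7, 9, 10, 7, 8, 20], name="rings")

# Age as defined in the data description: number of rings + 1.5.
age = rings + 1.5

# Pearson correlation is unaffected by a constant shift, so this is 1.
r_rings_age = rings.corr(age)
print(round(r_rings_age, 6))
```

This is worth keeping in mind when reading the ranking: "rings" leads the list because age is computed from it, not because it is an independent physical measurement.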

    Next, we'll generate scatter plots of each feature against age to inspect these correlations visually. Plotting is also one way to spot outliers that might be present in the data.

    # Plot each numeric feature against the age
    plt.style.use('ggplot')
    fig, axes = plt.subplots(2, 4, figsize=(12, 6), sharey=True)
    
    for feature, ax in zip(abalone_corr.index, axes.ravel()):
        ax.plot(abalone[feature], abalone.age, marker=".", linestyle="none", markersize=8, color="b", alpha=0.2)
        ax.set_title(f"{feature} by age", fontsize=10)
    
    # Keep the subplot titles from overlapping
    fig.tight_layout()

    We can confirm from the plots that virtually all of the features are positively correlated with the age of the abalone, i.e. age tends to increase as the feature value increases. We also see that the number of rings has the strongest linear relationship with age (a good predictive model can be built using just this feature).

    Also, we spot 2 outliers in the 'height' column with values greater than 0.5. These are not feasible values for the height feature (measured with meat in the shell), and these very large heights don't correspond to high abalone ages. Therefore, we will remove these samples to reduce noise in our data.

    # Check out the outliers
    outlier_h = abalone.height > 0.5
    outlier_data = abalone[outlier_h]
    
    # Print the outliers' data
    outlier_data
    # Remove the outliers from the data
    n_abalone = abalone[~outlier_h]
    
    # Print out the new shape of the dataset
    print(f"The new dimension of the data: {n_abalone.shape}\n\tWith {n_abalone.shape[0]} abalone samples")
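The removal step above relies on a boolean mask and its negation (~). The sketch below replays the same idea on a small made-up frame so the effect is easy to verify by eye; the name toy, its values, and the optional reset_index call are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical mini-dataset; two rows exceed the 0.5 height threshold.
toy = pd.DataFrame({
    "height": [0.095, 0.135, 1.130, 0.125, 0.515],
    "age": [16.5, 11.5, 9.5, 8.5, 11.5],
})

# True for every row whose height is above the threshold.
is_outlier = toy["height"] > 0.5

# ~ flips the mask, keeping only the non-outlier rows;
# reset_index(drop=True) renumbers the remaining rows from 0.
kept = toy[~is_outlier].reset_index(drop=True)

print(f"Removed {int(is_outlier.sum())} rows, kept {kept.shape[0]}")
print(kept)
```

Resetting the index is optional; without it, the filtered frame keeps the original row labels, which can matter later if rows are accessed by position.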