Rafael Monteiro/

Predicting abalone age and dealing with multicollinearity.


Can you estimate the age of an abalone?

📖 Background

You are working as an intern for an abalone farming operation in Japan. For operational and environmental reasons, it is an important consideration to estimate the age of the abalones when they go to market.

Determining an abalone's age involves counting the number of rings in a cross-section of the shell through a microscope. Since this method is somewhat cumbersome and complex, you are interested in helping the farmers estimate the age of the abalone using its physical characteristics.

💾 The data

You have access to the following historical data (source):

Abalone characteristics:
  • "sex" - M, F, and I (infant).
  • "length" - longest shell measurement.
  • "diameter" - perpendicular to the length.
  • "height" - measured with meat in the shell.
  • "whole_wt" - whole abalone weight.
  • "shucked_wt" - the weight of abalone meat.
  • "viscera_wt" - gut-weight.
  • "shell_wt" - the weight of the dried shell.
  • "rings" - number of rings in a shell cross-section.
  • "age" - the age of the abalone: the number of rings + 1.5.

Acknowledgments: Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn, and Wes B Ford (1994) "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288).

Imports and Exploratory data analysis

!pip install pingouin &> /dev/null
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import pingouin as pg
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

abalone = pd.read_csv('./data/abalone.csv')
sns.pairplot(abalone, hue = 'sex');

1. How does weight change with age for each of the three sex categories?

  • On average, for males, the ages with the highest weight are 25.5 (1.61 grams) and 28.5 (2.18 grams).
  • For females, 25.5 years (1.97 grams) and 30.5(1.8 grams).
  • For Infants, 18.5(1.08 grams) and 19.5(0.99 grams).
# Violin Plot of sex distributed by age
sns.violinplot(x='sex', y='age', data=abalone)
sexlista = ['Males','Females','Infants']

for i,k in zip(sexlista,['M','F','I']):
    dfinn = abalone[]
    print(f'Distribution of weight by age for {i}')
    df_age = dfinn[['sex','age','whole_wt']].groupby('age').mean()
    display(df_age.sort_values(by='whole_wt', ascending = False).head(3))
    df_age.plot(kind = 'bar',title = f'Distribution of weight by age for {i}')

2. Can you estimate an abalone's age using its physical characteristics?

# Print regression result

def regression_metrics(y_true, y_pred):
    # Regression metrics
    ev = metrics.explained_variance_score(y_true, y_pred)
    #msle = metrics.mean_squared_log_error(y_true, y_pred) 
    #penalizes an under-predicted estimate greater than an over-predicted estimate.
    mae = metrics.mean_absolute_error(y_true, y_pred)
    mdae = metrics.median_absolute_error(y_true, y_pred)
    mse = metrics.mean_squared_error(y_true, y_pred) 
    r2 = metrics.r2_score(y_true, y_pred)
    rmse = np.sqrt(mse)
    acc = (rmse/mae-1) #accuracy: For 0 (better) to sqrt(n)-1
    #acd = results.rsquared_adj # adjusted coefficient of determination
    #cod = results.rsquared  # coefficient of determination
    return ev, mae, mdae, mse, r2, rmse, acc

def regression_results(y_true, y_pred):
    ev, mae, mdae, mse, r2, rmse, acc = regression_metrics(y_true, y_pred)
    print('Explained Variance: ', round(ev, 6))    
    #print('Mean Square Log Error: ', round(msle, 6))
    # print('Mean Absolute Error: ', round(mae, 6))
    print('Median Absolute Error: ', round(mdae, 6))
    # print('Mean Squared Error: ', round(mse, 8))
    # print('Root Mean Squared Error: ', round(rmse, 6))
    print('Accuracy Acc(RMSE/MAE - 1): ', round(acc, 6))
    #print('Adjusted Coefficient of Determination (adjusted R2): ', round(acd, 6))
    #print('Coefficient of Determination: ', round(cod, 6))
# # Create dataframe with comparative results (actual vs predicted)    
def creates_df_compare(y_test,y_pred):
    df_compare = pd.DataFrame({'actual': y_test, 'pred': y_pred})
    df_compare['error'] = df_compare['actual'] - df_compare['pred']
    df_compare['residue'] = np.abs(df_compare['actual'] - df_compare['pred'])
    df_compare['residue_sqr'] = df_compare['residue']**2
    avg_actual = df_compare['actual'].mean()
    df_compare['var'] = (df_compare['actual'] - avg_actual)**2
    return df_compare

# Average Relative Error Chart - Real vs Predicted
def real_vs_predicted(yhat,y):
    ev, mae, mdae, mse, r2, rmse, acc = regression_metrics(y, yhat)
    x = list(range(len(y)))
    print("Mean Absolut Error:",mae)
    print("Mean Squared Error:", mse)
    print("Root Mean Squared Error:", rmse)
    print("R-Squared:", r2)
    fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(16,8))
    fig.suptitle('Original vs Predicted')

    ax1.plot(x, y, color="blue", label="original")
    ax1.set_ylabel('Original vs Predicted')
    ax1.plot(x, yhat, color="red", label="original")

    ax2.plot(x, y, color="blue", label="original")

    ax3.plot(x, yhat, color="red", label="predicted",)
cols = ['sex', 'length', 'diameter', 'height', 'whole_wt', 'shucked_wt','viscera_wt', 'shell_wt']
numcols = cols[1:]

df_norm = abalone[cols].copy()

# To eliminate possible problems of scale difference between the variables, we will apply the standardization.
scaler = preprocessing.StandardScaler()
# Apply standardization to numeric columns
df_norm[numcols] = scaler.fit_transform(df_norm[numcols])
# drop_first=True to prevent the linear regression matrix from being invertible (non-singular)
# we must have one column less than the categories
x_scaled = pd.get_dummies(df_norm, drop_first=True)

y = abalone['age']

regressor = LinearRegression()
results =,y)
y_pred = results.predict(x_scaled)

regression_results(y, y_pred)

display(creates_df_compare(y, y_pred).round(4))

real_vs_predicted(y_pred, y)

3. Investigate which variables are better predictors of age for abalones.

Correlation between numeric variables.

fig, ax = plt.subplots(figsize=(10,5))
dfcorr = abalone.corr()
sns.heatmap(dfcorr, annot = True, cmap = 'PuBu');

Partial Correlation between numeric variables.

fig, ax = plt.subplots(figsize=(10,5))
sns.heatmap(abalone.pcorr(), annot = True, cmap = 'PuBu');

Analyzing the importance of each variable for the prediction of age through elasticity, indicating the positive or negative influence by the sign.

  • In this analysis, it is observed that the male gender has a positive influence, even if small, in determining the age of Abalone.
coeff_df = pd.DataFrame(results.coef_, x_scaled.columns, columns=['coefficient'])


coeff_df = coeff_df.rename(columns = {'index':'feature'})


ax = sns.barplot(x = 'coefficient', y =  'feature', 
                 data = coeff_df.sort_values(by='coefficient', 
                                             ascending = False))

Analyzing the importance of each variable for age prediction by the absolute value of elasticity.

coeff_df['coeff_abs'] = np.abs(coeff_df['coefficient'])

coeff_df['coeff_perc'] = coeff_df['coeff_abs'] / np.sum(coeff_df['coeff_abs'])


ax = sns.barplot(x = 'coeff_perc', y =  'feature', 
                 data = coeff_df.sort_values(by='coeff_abs', 
                                             ascending = False))

Applying PCA to remove multicollinearity from variables.

As we have a high correlation between the dependent variables, we may have the multicollinearity problem. This can generate an increase in the standard errors of the coefficients. The increase in standard errors, in turn, means that the coefficients for some independent variables may not be significantly different from 0.

# split train and test
X_train, X_test, y_train, y_test = train_test_split( x_scaled, y, test_size=0.25, random_state=42)

pca = PCA(n_components = x_scaled.shape[1])
pca_data = pca.fit_transform(X_train)

percent_var_explained = pca.explained_variance_/(np.sum(pca.explained_variance_))
cumm_var_explained = np.cumsum(percent_var_explained)

plt.ylabel("% variance explained")

Since 97.7% of the total variance is captured by the 1st 5 PCA components itself, we take only 5 components of PCA and compute a correlation heatmap to overserve the multicollinearity.

Correlation after PCA: