Lawrence Bowles/

Competition - Abalone Seafood Farming


Can you estimate the age of an abalone?

📖 Background

You are working as an intern for an abalone farming operation in Japan. For operational and environmental reasons, it is important to estimate the age of the abalones when they go to market.

Determining an abalone's age involves counting the number of rings in a cross-section of the shell through a microscope. Since this method is somewhat cumbersome and complex, you are interested in helping the farmers estimate the age of the abalone using its physical characteristics.

💾 The data

You have access to the following historical data (source):

Abalone characteristics:
  • "sex" - M, F, and I (infant).
  • "length" - longest shell measurement.
  • "diameter" - perpendicular to the length.
  • "height" - measured with meat in the shell.
  • "whole_wt" - whole abalone weight.
  • "shucked_wt" - the weight of abalone meat.
  • "viscera_wt" - gut-weight.
  • "shell_wt" - the weight of the dried shell.
  • "rings" - number of rings in a shell cross-section.
  • "age" - the age of the abalone: the number of rings + 1.5.
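As a quick illustration of the last item, here is a minimal sketch of how "age" follows from "rings". The ring counts are made up for the example and are not taken from abalone.csv.

```python
import pandas as pd

# Made-up ring counts, for illustration only (not from abalone.csv).
sample = pd.DataFrame({"rings": [7, 10, 15]})

# age is defined as the number of rings plus 1.5.
sample["age"] = sample["rings"] + 1.5
print(sample)
```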

Acknowledgments: Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn, and Wes B Ford (1994) "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288).

import pandas as pd
abalone = pd.read_csv('./data/abalone.csv')

💪 Competition challenge

Create a report that covers the following:

  1. How does weight change with age for each of the three sex categories?
  2. Can you estimate an abalone's age using its physical characteristics?
  3. Investigate which variables are better predictors of age for abalones.

🧑‍⚖️ Judging criteria

This is a community-based competition. The top 5 most upvoted entries will win.

The winners will receive DataCamp merchandise.

✅ Checklist before publishing

  • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
  • Remove redundant cells like the judging criteria, so the workbook is focused on your story.
  • Check that all the cells run without error.

⌛️ Time is ticking. Good luck!

Understanding the data

Before doing anything else, we need to understand what data we're working with. Start by calculating descriptive statistics for the data:
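In the notebook this step is presumably a single `describe()` call on `abalone`. The sketch below uses a tiny made-up frame (the name `demo` and its values are assumptions) so that it runs without the CSV.

```python
import pandas as pd

# Tiny made-up stand-in for the abalone DataFrame (illustrative values only).
demo = pd.DataFrame({
    "height": [0.0, 0.10, 0.14, 0.16, 1.13],
    "whole_wt": [0.2, 0.5, 0.8, 1.1, 0.6],
})

# describe() reports count, mean, std, min, quartiles and max for each numeric column.
summary = demo.describe()
print(summary)
```

Comparing the min and max rows against the quartiles is a quick way to spot columns with suspicious extremes.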


We can already see that the data has some issues we'll need to fix before modelling. Height clearly contains outliers, for example: the minimum and maximum values lie well outside the rest of the data.

Let's see what this looks like using Seaborn's pairplot.

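import matplotlib.pyplot as plt
import seaborn as sns
sns.pairplot(abalone)  # assumed call: the original plotting line is missing from this cell
plt.show()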


The outliers in height are plain to see. We should remove those before performing any analysis.

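import numpy as np

# Treat heights more than three standard deviations from the mean as outliers.
height_mean = np.mean(abalone['height'])
height_std = np.std(abalone['height'])

height_upper = height_mean + 3 * height_std
height_lower = height_mean - 3 * height_std

# .copy() returns an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning.
trimmed_abalone = abalone[(abalone['height'] > height_lower) & (abalone['height'] < height_upper)].copy()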

That filter removed five records.
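One way to check a count like this is to compare row counts before and after the filter. The sketch below wraps the three-sigma rule in a helper and runs it on made-up heights; `trim_three_sigma` and `demo` are hypothetical names, not part of the notebook.

```python
import pandas as pd

def trim_three_sigma(df, column):
    """Keep rows whose `column` value lies strictly within mean +/- 3 std."""
    mean = df[column].mean()
    std = df[column].std(ddof=0)  # population std, matching np.std's default
    lower, upper = mean - 3 * std, mean + 3 * std
    return df[(df[column] > lower) & (df[column] < upper)].copy()

# Made-up heights: 49 ordinary values plus one extreme value.
demo = pd.DataFrame({"height": [0.14] * 49 + [1.13]})
trimmed = trim_three_sigma(demo, "height")

# The number of removed rows is the difference in length.
removed = len(demo) - len(trimmed)
print(f"Removed {removed} of {len(demo)} rows")
```

On the real data the equivalent check would be `len(abalone) - len(trimmed_abalone)`.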

Now let's look at the pairplot again to see if there are any other outliers.


Although the values all have similar orders of magnitude, it's still prudent to scale before we start analysis.

Transform the data using StandardScaler.
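To make concrete what the scaler does: StandardScaler subtracts each column's mean and divides by its population standard deviation. Here is a minimal sketch of the same arithmetic in plain pandas, using made-up values (`demo` is a hypothetical frame, not the abalone data).

```python
import pandas as pd

# Made-up measurements, for illustration only.
demo = pd.DataFrame({
    "length": [0.35, 0.45, 0.55, 0.65],
    "whole_wt": [0.2, 0.5, 0.8, 1.1],
})

# z = (x - mean) / std, using the population std (ddof=0) as StandardScaler does.
scaled = (demo - demo.mean()) / demo.std(ddof=0)
print(scaled)
```

After scaling, every column has mean 0 and standard deviation 1, so the coefficients of a linear model become comparable across features.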

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the numeric feature columns, then store the scaled values in new *_SS columns.
SScaler = StandardScaler().fit(trimmed_abalone[['length', 'diameter', 'height', 'whole_wt', 'shucked_wt', 'viscera_wt', 'shell_wt']])
trimmed_abalone[['length_SS', 'diameter_SS', 'height_SS', 'whole_wt_SS', 'shucked_wt_SS', 'viscera_wt_SS', 'shell_wt_SS']] = \
    SScaler.transform(trimmed_abalone[['length', 'diameter', 'height', 'whole_wt', 'shucked_wt', 'viscera_wt', 'shell_wt']])

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = trimmed_abalone[['length_SS', 'diameter_SS', 'height_SS', 'whole_wt_SS', 'shucked_wt_SS', 'viscera_wt_SS', 'shell_wt_SS']]
y = trimmed_abalone['age']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

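# Fit ordinary least squares on the training split, then predict ages for the held-out rows.
lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)
print("Training data R^2: {}".format(lr.score(X_train, y_train)))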
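The score printed here is the coefficient of determination. As a rough sketch of what that number measures, the helper below computes 1 - SS_res / SS_tot with NumPy. The name `r_squared` and the ages are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up true and predicted ages.
y_true_demo = [8.5, 10.5, 12.5, 14.5]
y_pred_demo = [9.0, 10.0, 13.0, 14.0]
demo_r2 = r_squared(y_true_demo, y_pred_demo)
print(demo_r2)
```

Applying the same calculation to `y_test` and `y_pred`, or calling `lr.score(X_test, y_test)`, would give the R^2 on held-out data, which is usually the more informative figure.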