Duplicate of Competition - Abalone Seafood Farming


Can you estimate abalone age?

Image(filename='Diseño sin título(1).png')

1.Introduction

Abalone is a shellfish considered a delicacy in many parts of the world. It is an excellent source of iron and pantothenic acid, and it is farmed as a nutritious food resource in Australia, America and East Asia. 100 grams of abalone yields more than 20% of the recommended daily intake of these nutrients. The economic value of abalone is positively correlated with its age, so determining age accurately matters to both farmers and customers when setting a price. However, the current method is costly and inefficient: the shell is cut through the cone, stained, and the rings are counted under a microscope, which is a laborious task. Other measurements that are easier to obtain are therefore used to predict age. Further information, such as weather patterns and location (hence food availability), may be required to solve the problem fully. For this analysis, however, we assume that the abalone's physical measurements are sufficient to provide an accurate age prediction.

Paper objectives:

How does weight change with age for each of the three sex categories?
Can you estimate an abalone's age using its physical characteristics?
Investigate which variables are better predictors of age for abalones.

from warnings import filterwarnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.lines as lines
from scipy import stats
from scipy.stats import iqr, kurtosis, skew, zscore
from skimage import io
from sklearn.neighbors import LocalOutlierFactor
from IPython.display import Image

# Suppress library warnings to keep the notebook output readable
filterwarnings('ignore')

# Show every column and row when displaying DataFrames
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# Global plot style
sns.set_style('white')
plt.rcParams['font.family'] = 'monospace'

blues = ['#193f6e','#3b6ba5','#72a5d3','#b1d3e3','#e1ebec']
reds = ['#e61010','#e65010','#e68d10','#e6df10','#c2e610']
cmap_blues = sns.color_palette(blues)
cmap_reds = sns.color_palette(reds)
sns.set_palette(cmap_blues)

print('These are the color palettes I will use:')
sns.palplot(cmap_blues)
sns.palplot(cmap_reds)

2.Data preparation

2.1 Features of data

The dataset has 4177 entries and 10 columns:

Feature	Data Type	Measurement	Description
`sex`	categorical		M, F, and I (Infant)
`length`	continuous	mm	longest shell measurement
`diameter`	continuous	mm	perpendicular to the length
`height`	continuous	mm	measured with meat in the shell
`whole_wt`	continuous	grams	whole abalone weight
`shucked_wt`	continuous	grams	the weight of abalone meat
`viscera_wt`	continuous	grams	gut-weight
`shell_wt`	continuous	grams	the weight of the dried shell
`rings`	continuous		number of rings in a shell cross-section
`age`	continuous		the age of the abalone: the number of rings + 1.5
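
As a small illustration of the last row of the table, `age` can be derived from `rings` by adding 1.5. The sketch below uses a tiny hand-made DataFrame named `toy` (a hypothetical name, with made-up ring counts) rather than the real dataset.

```python
import pandas as pd

# Hypothetical toy rows; the ring counts are made up for illustration.
toy = pd.DataFrame({"sex": ["M", "F", "I"], "rings": [15, 9, 7]})

# The table defines age as the number of rings + 1.5
toy["age"] = toy["rings"] + 1.5
print(toy)
```

Because `age` is just `rings` shifted by a constant, the two columns carry the same information for modelling purposes.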

2.2 General information

Next, we look at the general information about the dataset. We start with its first 5 rows, then review the data type of each column, and finally confirm that there are no duplicate rows and no missing values.
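
These checks can be sketched with standard pandas calls. The example below is an assumption about how such an inspection might look, not the notebook's own code: it runs on a small stand-in DataFrame called `abalone_demo` (hypothetical name and made-up values) instead of the real `abalone` data.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; all values are made up.
abalone_demo = pd.DataFrame({
    "sex": ["M", "F", "I", "M"],
    "length": [90.0, 105.5, 66.0, 88.0],
    "rings": [15, 9, 7, 10],
})

print(abalone_demo.head())    # first rows (up to 5)
print(abalone_demo.dtypes)    # data type of each column

# Number of rows that are exact duplicates of an earlier row
n_duplicates = int(abalone_demo.duplicated().sum())

# Number of missing values in each column
n_missing = abalone_demo.isna().sum()

print("duplicates:", n_duplicates)
print("missing per column:", n_missing.to_dict())
```

On the real data, the same calls would be applied to `abalone` directly.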


print('💠 Are there missing values?\n')
bg_color = '#fbfbfb'
txt_color = '#5c5c5c'
# check for missing values
fig, ax = plt.subplots(tight_layout=True, figsize=(12,6))

fig.patch.set_facecolor(bg_color)
ax.set_facecolor(bg_color)

# Boolean mask: True wherever a value is missing
mv = abalone.isna()
ax = sns.heatmap(data=mv, cmap=cmap_reds, cbar=False, ax=ax)

ax.set_ylabel('')
ax.set_yticks([])
ax.set_xticklabels(labels=mv.columns, size=12,rotation=45)
ax.tick_params(length=0)

fig.text(
    s=':Missing Values',
    x=0, y=1.1,
    fontsize=17, fontweight='bold',
    color=txt_color,
    va='top', ha='left'
)

fig.text(
    s='''
    we can't see any ...
    ''',
    x=0, y=1.075,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

plt.show()


2.3 Data preprocessing

2.3.1 Data typology and single visualization

2.3.1.1 Categorical data

The only categorical feature is `sex`. It has three categories: male, female and infant. As can be seen, the distribution across the three categories is homogeneous. The noteworthy fact is that the female category has a lower mean than the other two.
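
One way to check statements like these is to compute the share of each category and a per-category mean. The sketch below is only an illustration on made-up data: the DataFrame `demo`, its values, and the choice of `whole_wt` as the numeric column are all assumptions, not results from the dataset.

```python
import pandas as pd

# Hypothetical toy data; the weights are made up for illustration.
demo = pd.DataFrame({
    "sex": ["M", "F", "I", "M", "F", "I"],
    "whole_wt": [200.0, 180.0, 60.0, 220.0, 160.0, 80.0],
})

# Share of rows in each sex category; similar shares suggest a balanced feature
shares = demo["sex"].value_counts(normalize=True)

# Mean of a numeric column within each category
means = demo.groupby("sex")["whole_wt"].mean()

print(shares)
print(means)
```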
