Can you estimate the age of an abalone?
📖 Background
You are working as an intern for an abalone farming operation in Japan. For operational and environmental reasons, it is an important consideration to estimate the age of the abalones when they go to market.
Determining an abalone's age involves counting the number of rings in a cross-section of the shell through a microscope. Since this method is somewhat cumbersome and complex, you are interested in helping the farmers estimate the age of the abalone using its physical characteristics.
💾 The data
You have access to the following historical data (source):
Abalone characteristics:
- "sex" - M, F, and I (infant).
- "length" - longest shell measurement.
- "diameter" - perpendicular to the length.
- "height" - measured with meat in the shell.
- "whole_wt" - whole abalone weight.
- "shucked_wt" - the weight of abalone meat.
- "viscera_wt" - gut-weight.
- "shell_wt" - the weight of the dried shell.
- "rings" - number of rings in a shell cross-section.
- "age" - the age of the abalone: the number of rings + 1.5.
Acknowledgments: Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn, and Wes B Ford (1994) "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288).
💪 Competition challenge
Create a report that covers the following:
- How does weight change with age for each of the three sex categories?
- Can you estimate an abalone's age using its physical characteristics?
- Investigate which variables are better predictors of age for abalones.
✅ Checklist before publishing
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- Remove redundant cells like the judging criteria, so the workbook is focused on your story.
- Check that all the cells run without error.
suppressPackageStartupMessages(library(tidyverse))
abalone <- readr::read_csv('data/abalone.csv', show_col_types = FALSE)
install.packages("ggridges")
install.packages("ggcorrplot")
install.packages("grid")
install.packages("gridExtra")
install.packages("rstanarm")
install.packages("ggfortify")
library(ggridges)
library(ggcorrplot)
library(grid)
library(gridExtra)
library(rstanarm)
library(ggfortify)
I. Introduction
I.1 Context
For operational and environmental reasons, it is an important consideration for Japanese farming business entities to estimate the age of the abalones when they go to market. Determining an abalone's age involves counting the number of rings in a cross-section of the shell through a microscope. This method however is cumbersome and complex.
I.2 Analysis purpose
The purpose of this analysis is to find less complex and less time consuming way of estimating the age of abalones using their physical characteristics instead of cumbersome rings counting.
I.3 Questions to analyze
- How does weight change with age for each of the three sex categories?
- Can you estimate an abalone's age using its physical characteristics?
- Which variables are better predictors of age for abalones.
II. Dataset
II.1 Dataset overview
Dataset provided contains observations with abalones physical characteristics.
Below table presents a glimpse at the dataset.
Abalone characteristics - variables used in the analysis::
- "sex" - M, F, and I (infant).
- "length" - longest shell measurement
- "diameter" - perpendicular to the length
- "height" - measured with meat in the shell
- "whole_wt" - whole abalone weight
- "shucked_wt" - the weight of abalone meat
- "viscera_wt" - gut-weight
- "shell_wt" - the weight of the dried shell
- "rings" - number of rings in a shell cross-section
- "age" - the age of the abalone: the number of rings + 1.5
Terminology note:
This analysis will focus on applying regression methods in order to describe abalones age using physical characteristics provided in the dataset. Therefore, for the purpose of the analysis age is to be considered a response / dependent variable and all other features to be explanatory / independent variables.
Dataset content:
- 4,177 rows - observations of distinct abalones
- 10 columns - variables describing each observation
- numeric types of variables dominate
'Table.1 Rows and columns (before cleaning)'
glimpse(abalone)
II.2 Summary statistics
Below table present summary statistics for each variable.
General dataset characteristics:
- no missing data
- maximum values for different characteristics significantly higher than 3rd quartile estimates indicate potential problem with outliers (to be investigated)
- zero values in height variable are to be removed
Table.1 Summary statistics
abaloneSummary <- as.data.frame(summary(abalone))
abaloneSummary$Var2 <- gsub("\\s+", "", abaloneSummary$Var2)
abaloneSummary <- abaloneSummary %>%
select(Var2, Freq) %>%
replace(is.na(.), '') %>%
mutate(time = rep(1:6,10)) %>%
spread(Var2,Freq) %>%
select(-time) %>%
relocate('sex')
abaloneSummary
II.3 Scaling
When values of abalone characteristics would be on different scales, it would be a good idea to scale the data. Nevertheless, looking at distributions of independent variables, they seem to share similar scale, so scaling will not be performed for the purpose of this analysis.
### DATASET MODIFICATIONS ###
abaloneModified <- abalone %>%
filter(length > 0,
diameter > 0,
height > 0,
whole_wt > 0,
shucked_wt > 0,
viscera_wt > 0,
shell_wt > 0,
age > 0) %>%
mutate(lengthLog = log(length),
diameterLog = log(diameter),
heightLog = log(height),
whole_wtLog = log(whole_wt),
shucked_wtLog = log(shucked_wt),
viscera_wtLog = log(viscera_wt),
shell_wtLog = log(shell_wt),
ageLog = log(age),
lengthSqrt = sqrt(length),
diameterSqrt = sqrt(diameter),
heightSqrt = sqrt(height),
whole_wtSqrt = sqrt(whole_wt),
shucked_wtSqrt = sqrt(shucked_wt),
viscera_wtSqrt = sqrt(viscera_wt),
shell_wtSqrt = sqrt(shell_wt),
ageSqrt = sqrt(age),
lengthExp = exp(length),
diameterExp = exp(diameter),
heightExp = exp(height),
whole_wtExp = exp(whole_wt),
shucked_wtExp = exp(shucked_wt),
viscera_wtExp = exp(viscera_wt),
shell_wtExp = exp(shell_wt),
ageExp = exp(age))
### BOX PLOT FUNCTION ###
funBox <- function(column,
color,
sexVector = c('I','F','M'),
annotateFlag = FALSE,
sexName, xAxisMax, xLabelsFlag) {
abaloneModifiedFun <- abaloneModified %>%
filter(sex %in% sexVector)
abaloneModifiedVector <- abaloneModifiedFun %>%
select(!!column) %>%
as.matrix()
IQR = 1.5 * IQR(abaloneModifiedVector)
Q1 <- quantile(abaloneModifiedVector, probs=c(.25, .75))[1]
Q3 <- quantile(abaloneModifiedVector, probs=c(.25, .75))[2]
outlierDown = Q1 - IQR
outlierUp = IQR + Q3
outliersNo <- sum(abaloneModifiedVector > outlierUp) + sum(abaloneModifiedVector < outlierDown)
if (sum(abaloneModifiedVector > outlierUp) > sum(abaloneModifiedVector < outlierDown)) {
outlier <- max(outlierUp * 1.15, na.rm = TRUE)
} else {
outlier <- min(outlierDown * 0.9, na.rm = TRUE)
}
abaloneBoxPlot <- abaloneModifiedFun %>%
ggplot(aes_string(x = column)) +
geom_boxplot(color="gray", fill=color, alpha= 0.4, outlier.size = 2, outlier.color = color) +
theme_minimal() +
scale_x_continuous(limits=c(0,xAxisMax)) +
labs(subtitle = sexName) +
theme(
plot.subtitle=element_text(size = 10, face = 'italic', hjust=0.5, color = 'black'),
text = element_text(size = 8),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank())
if(annotateFlag) {
abaloneBoxPlot <- abaloneBoxPlot +
annotate("text", x = outlierUp, y = -0.175,
label = paste0(outliersNo, " outliers"),
vjust = 1, size = 3, color = "black") +
annotate("text", x = outlierUp, y = +0.35,
label = paste0 ('observations: ',nrow(abaloneModifiedFun)),
vjust = 1, size = 3, color = "black") +
annotate("curve", x = outlierUp, y = -0.15,
xend = outlier, yend = -0.02,
arrow = arrow(length = unit(0.2, "cm"), type = "closed"),
color = "grey")}
if(xLabelsFlag == FALSE) {
abaloneBoxPlot <- abaloneBoxPlot +
theme(axis.title.x = element_blank())
}
abaloneBoxPlot
}
### HISTOGRAM FUNCTION ###
funHist <- function(column, color, sexVector, scale, min=0, max=0) {
xVariable <- abaloneModified %>%
filter(sex %in% sexVector) %>%
select(!!column) %>%
as.vector()
abaloneHist <- abaloneModified %>%
filter(sex %in% sexVector) %>%
ggplot(aes_string(column)) +
geom_density(color='white', fill=color, alpha = 0.4) +
theme_minimal() +
labs(subtitle = scale) +
theme(
plot.subtitle=element_text(size = 10, face = 'italic', hjust=0.5, color = 'black'),
text = element_text(size = 8),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank()) +
stat_function(fun = dnorm, col = 'black', lwd = 0.5,
args = list(mean = mean(unlist(xVariable)),
sd = sd(unlist(xVariable))))
if (min != 0 || max != 0) {
abaloneHist <- abaloneHist +
scale_x_continuous(limits=c(min,max))
}
abaloneHist
}
### FILTERED DATA ###
abaloneModifiedFiltered <- abaloneModified %>%
mutate(outlier = 'outlier') %>%
mutate(outlier = case_when(sex == 'I' & age < 13 & age > 6 ~ 'not_outlier',
sex == 'M' & age < 18 & age > 7 ~ 'not_outlier',
sex == 'F' & age < 18 ~ 'not_outlier',
#sex == 'M' & rings == 6 & age == 7.5 & round(viscera_wt,3) == 0.116 ~ 'outlier',
FALSE ~ outlier
)) %>%
filter(outlier == 'not_outlier') %>%
filter(!(sex == 'M' & rings == 6 & age == 7.5 & round(viscera_wt,3) == 0.116))
### REGRESSIONS FUNCTION ###
funBayesPlot <- function (sexVector, model, title = FALSE) {
abaloneModifiedFiltered <- abaloneModifiedFiltered %>%
filter(sex %in% sexVector)
stan_model <- stan_glm(model, data = abaloneModifiedFiltered)
bayesPlot <- pp_check(stan_model, "dens_overlay") +
theme_minimal() +
ggtitle('') +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "bottom")
if (title) {
bayesPlot <- bayesPlot +
ggtitle('Bayesian regression - model quality')
}
bayesPlot
}
funQQPlot <- function (sexVector, model, title = FALSE) {
abaloneModifiedFiltered <- abaloneModifiedFiltered %>%
filter(sex %in% sexVector)
lmReg <- lm(model, data = abaloneModifiedFiltered)
qqPlot <- autoplot(lmReg, which =2)
qqPlotFin <- qqPlot@plots[[1]] +
theme_minimal() +
ggtitle('') +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank())
if (title) {
qqPlotFin <- qqPlotFin +
ggtitle('frequentist regression - model quality')
}
qqPlotFin
}
funRegression <- function(xAxisVar, yAxisVar, color, model, sexVector) {
abaloneRegression <- abaloneModifiedFiltered %>%
filter(sex %in% sexVector)
lmReg <- lm(model, data = abaloneRegression)
rSquaredFrequentist <- format(summary(lmReg)$r.squared, digits = 3)
beta <- round(coef(lmReg)[1],2)
stan_model <- stan_glm(model, data = abaloneRegression)
ss_res <- var(residuals(stan_model))
ss_fit <- var(fitted(stan_model))
rSquaredBayes <- round(1 - (ss_res / (ss_res + ss_fit)),3)
#rSquaredBayes <- NULL
abaloneRegressionPlot <- abaloneRegression %>%
ggplot(aes_string(x = xAxisVar, y = yAxisVar)) +
geom_jitter(color = color, alpha = 0.2, size = 2) +
#scale_y_continuous(limits=c(0,)) +
theme_minimal() +
labs(caption = paste0(#'slope:', beta,
'r_squared (frequentist):', rSquaredFrequentist,
'\nr_squared (Bayes):', rSquaredBayes, '\n')) +
theme(text = element_text(size = 8),
plot.caption = element_text(size = 12, face = 'italic', hjust=0.0, color = 'black'))
abaloneRegressionPlot
}
funRegressionSimple <- function(xAxisVar, yAxisVar, color, model, sexVector) {
abaloneRegression <- abaloneModified %>%
filter(sex %in% sexVector)
lmReg <- lm(model, data = abaloneRegression)
rSquaredFrequentist <- format(summary(lmReg)$r.squared, digits = 3)
beta <- round(coef(lmReg)[1],2)
abaloneRegressionPlot <- abaloneRegression %>%
ggplot(aes_string(x = xAxisVar, y = yAxisVar)) +
geom_jitter(color = color, alpha = 0.2, size = 2) +
#scale_y_continuous(limits=c(0,)) +
theme_minimal() +
labs(caption = paste0(#'slope:', beta,
'r_squared (frequentist):', rSquaredFrequentist,'\n'
)) +
theme(text = element_text(size = 8),
plot.caption = element_text(size = 12, face = 'italic', hjust=0.0, color = 'black'))
abaloneRegressionPlot
}
III. Explanatory data analysis
III.1 Correlations
The below plot shows strong (above 80%) correlation between independent variables and moderate correlation (40%-60%) between dependent variable (age) and explanatory variables.
In practice strong correlation between explanatory variables often indicates lower and lower incremental increases of regression model predictive power after increasing number of variables describing response variable (age).
options(repr.plot.width=7, repr.plot.height=4)
abaloneCorrMatrix <- abalone %>%
select(age,whole_wt, viscera_wt, shucked_wt, shell_wt, height, length, diameter) %>%
cor()
abaloneCorrPlot <- ggcorrplot(abaloneCorrMatrix, hc.order =TRUE, type ="lower", lab = TRUE, tl.cex = 9, tl.srt = 45) +
labs(title = 'Fig2. Correlations') +
theme(plot.title=element_text(size = 12, face = 'italic', hjust=0.5, color = 'black'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.title = element_blank(),
text = element_text(size = 12))
abaloneCorrPlot
III.2 Age
Age variable seems not to be normally distributed. In practice it often means that there are some nonlinearities in the data, which make it more difficult to construct model that fits the data well. For the purpose of this analysis, in case initial regression model needs improvements, age would be modeled as log-normally distributed.
There are also some outliers in age variable. Small number of outliers means that they may be safely removed if this is to improve our model.
It can be observed that females and males share similar age characteristic. Unsurprisingly infants age is considerably smaller, even though age distribution between high-age-infants and low-age-adults overlap.