Competition - Abalone Seafood Farming
  • AI Chat
  • Code
  • Report
  • Beta
    Spinner

    Can you estimate the age of an abalone?

    📖 Background

    You are working as an intern for an abalone farming operation in Japan. For operational and environmental reasons, it is an important consideration to estimate the age of the abalones when they go to market.

    Determining an abalone's age involves counting the number of rings in a cross-section of the shell through a microscope. Since this method is somewhat cumbersome and complex, you are interested in helping the farmers estimate the age of the abalone using its physical characteristics.

    💾 The data

    You have access to the following historical data (source):

    Abalone characteristics:
    • "sex" - M, F, and I (infant).
    • "length" - longest shell measurement.
    • "diameter" - perpendicular to the length.
    • "height" - measured with meat in the shell.
    • "whole_wt" - whole abalone weight.
    • "shucked_wt" - the weight of abalone meat.
    • "viscera_wt" - gut-weight.
    • "shell_wt" - the weight of the dried shell.
    • "rings" - number of rings in a shell cross-section.
    • "age" - the age of the abalone: the number of rings + 1.5.

    Acknowledgments: Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn, and Wes B Ford (1994) "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288).

    💪 Competition challenge

    Create a report that covers the following:

    1. How does weight change with age for each of the three sex categories?
    2. Can you estimate an abalone's age using its physical characteristics?
    3. Investigate which variables are better predictors of age for abalones.

    ✅ Checklist before publishing

    • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
    • Remove redundant cells like the judging criteria, so the workbook is focused on your story.
    • Check that all the cells run without error.
    suppressPackageStartupMessages(library(tidyverse))
    abalone <- readr::read_csv('data/abalone.csv', show_col_types = FALSE)
    install.packages("ggridges")
    install.packages("ggcorrplot")
    install.packages("grid")
    install.packages("gridExtra")
    install.packages("rstanarm")
    install.packages("ggfortify")
    library(ggridges)
    library(ggcorrplot)
    library(grid)
    library(gridExtra)
    library(rstanarm)
    library(ggfortify)

    I. Introduction

    I.1 Context

    For operational and environmental reasons, it is an important consideration for Japanese farming business entities to estimate the age of the abalones when they go to market. Determining an abalone's age involves counting the number of rings in a cross-section of the shell through a microscope. This method however is cumbersome and complex.

    I.2 Analysis purpose

    The purpose of this analysis is to find less complex and less time consuming way of estimating the age of abalones using their physical characteristics instead of cumbersome rings counting.

    I.3 Questions to analyze

    • How does weight change with age for each of the three sex categories?
    • Can you estimate an abalone's age using its physical characteristics?
    • Which variables are better predictors of age for abalones.

    II. Dataset

    II.1 Dataset overview

    Dataset provided contains observations with abalones physical characteristics.

    Below table presents a glimpse at the dataset.

    Abalone characteristics - variables used in the analysis::

    • "sex" - M, F, and I (infant).
    • "length" - longest shell measurement
    • "diameter" - perpendicular to the length
    • "height" - measured with meat in the shell
    • "whole_wt" - whole abalone weight
    • "shucked_wt" - the weight of abalone meat
    • "viscera_wt" - gut-weight
    • "shell_wt" - the weight of the dried shell
    • "rings" - number of rings in a shell cross-section
    • "age" - the age of the abalone: the number of rings + 1.5

    Terminology note:

    This analysis will focus on applying regression methods in order to describe abalones age using physical characteristics provided in the dataset. Therefore, for the purpose of the analysis age is to be considered a response / dependent variable and all other features to be explanatory / independent variables.

    Dataset content:

    • 4,177 rows - observations of distinct abalones
    • 10 columns - variables describing each observation
    • numeric types of variables dominate
    'Table.1 Rows and columns (before cleaning)'
    glimpse(abalone)

    II.2 Summary statistics

    Below table present summary statistics for each variable.

    General dataset characteristics:

    • no missing data
    • maximum values for different characteristics significantly higher than 3rd quartile estimates indicate potential problem with outliers (to be investigated)
    • zero values in height variable are to be removed

    Table.1 Summary statistics

    abaloneSummary <- as.data.frame(summary(abalone))
    abaloneSummary$Var2 <- gsub("\\s+", "", abaloneSummary$Var2)
    abaloneSummary <- abaloneSummary %>%
      select(Var2, Freq) %>%
      replace(is.na(.), '') %>%
      mutate(time = rep(1:6,10)) %>%
      spread(Var2,Freq) %>%
      select(-time)  %>%
      relocate('sex')
    
    abaloneSummary
    

    II.3 Scaling

    When values of abalone characteristics would be on different scales, it would be a good idea to scale the data. Nevertheless, looking at distributions of independent variables, they seem to share similar scale, so scaling will not be performed for the purpose of this analysis.

    Hidden code
    ### DATASET MODIFICATIONS ###
    
    abaloneModified <- abalone %>%
      filter(length > 0,
             diameter > 0,
             height > 0, 
             whole_wt > 0,
             shucked_wt > 0,
             viscera_wt > 0,
             shell_wt > 0,
             age > 0) %>%
      mutate(lengthLog = log(length),
             diameterLog = log(diameter),
             heightLog = log(height),
             whole_wtLog = log(whole_wt),
             shucked_wtLog = log(shucked_wt),
             viscera_wtLog = log(viscera_wt),
             shell_wtLog = log(shell_wt),
             ageLog = log(age),
             lengthSqrt = sqrt(length),
             diameterSqrt = sqrt(diameter),
             heightSqrt = sqrt(height),
             whole_wtSqrt = sqrt(whole_wt),
             shucked_wtSqrt = sqrt(shucked_wt),
             viscera_wtSqrt = sqrt(viscera_wt),
             shell_wtSqrt = sqrt(shell_wt),
             ageSqrt = sqrt(age),
             lengthExp = exp(length),
             diameterExp = exp(diameter),
             heightExp = exp(height),
             whole_wtExp = exp(whole_wt),
             shucked_wtExp = exp(shucked_wt),
             viscera_wtExp = exp(viscera_wt),
             shell_wtExp = exp(shell_wt),
             ageExp = exp(age))
    
    ### BOX PLOT FUNCTION ###
    
    funBox <- function(column, 
                       color, 
                       sexVector = c('I','F','M'), 
                       annotateFlag = FALSE, 
                       sexName, xAxisMax, xLabelsFlag) {
      
      abaloneModifiedFun <- abaloneModified %>%
        filter(sex %in% sexVector)
      
      abaloneModifiedVector <- abaloneModifiedFun %>%
        select(!!column) %>%
        as.matrix()
      
      IQR = 1.5 * IQR(abaloneModifiedVector)
      Q1 <- quantile(abaloneModifiedVector, probs=c(.25, .75))[1]
      Q3 <- quantile(abaloneModifiedVector, probs=c(.25, .75))[2]
      outlierDown =  Q1 - IQR
      outlierUp = IQR + Q3
      outliersNo <- sum(abaloneModifiedVector > outlierUp) + sum(abaloneModifiedVector < outlierDown)
      
      if (sum(abaloneModifiedVector > outlierUp) > sum(abaloneModifiedVector < outlierDown)) {
        outlier <- max(outlierUp * 1.15, na.rm = TRUE)
      } else {
        outlier <- min(outlierDown * 0.9, na.rm = TRUE)
      }
      
      abaloneBoxPlot <- abaloneModifiedFun %>%
        ggplot(aes_string(x = column)) +
        geom_boxplot(color="gray", fill=color, alpha= 0.4, outlier.size = 2, outlier.color = color) +
        theme_minimal() +
        scale_x_continuous(limits=c(0,xAxisMax)) +
        labs(subtitle = sexName) +
        theme(
          plot.subtitle=element_text(size = 10, face = 'italic', hjust=0.5, color = 'black'),
          text = element_text(size = 8),
          axis.text.y = element_blank(),
          axis.ticks.y = element_blank(),
          axis.title.y = element_blank()) 
      
      if(annotateFlag) {
        abaloneBoxPlot <- abaloneBoxPlot +
          annotate("text", x = outlierUp, y = -0.175,
                   label = paste0(outliersNo, " outliers"),
                   vjust = 1, size = 3, color = "black") +
          annotate("text", x = outlierUp, y = +0.35,
                   label = paste0 ('observations: ',nrow(abaloneModifiedFun)),
                   vjust = 1, size = 3, color = "black") +
          annotate("curve", x = outlierUp, y = -0.15,
                   xend = outlier, yend = -0.02,
                   arrow = arrow(length = unit(0.2, "cm"), type = "closed"),
                   color = "grey")} 
      
      if(xLabelsFlag == FALSE) {
        abaloneBoxPlot <- abaloneBoxPlot +
          theme(axis.title.x = element_blank())
      }
      
      abaloneBoxPlot
      
    }
    
    ### HISTOGRAM FUNCTION ###
    
    funHist <- function(column, color,  sexVector, scale, min=0, max=0) {
      xVariable <- abaloneModified %>%
        filter(sex %in% sexVector) %>%
        select(!!column) %>%
        as.vector()
      
      abaloneHist <- abaloneModified %>%
        filter(sex %in% sexVector) %>%
        ggplot(aes_string(column)) +
        geom_density(color='white', fill=color, alpha = 0.4) +
        theme_minimal() +
        labs(subtitle = scale) +
        theme(
          plot.subtitle=element_text(size = 10, face = 'italic', hjust=0.5, color = 'black'),
          text = element_text(size = 8),
          axis.text.y = element_blank(),
          axis.ticks.y = element_blank(),
          axis.title.y = element_blank())  +
        stat_function(fun = dnorm,  col = 'black', lwd = 0.5,
                      args = list(mean = mean(unlist(xVariable)),
                                  sd = sd(unlist(xVariable))))
      if (min != 0 || max != 0) {
        abaloneHist <- abaloneHist +
          scale_x_continuous(limits=c(min,max))
      }
      
      abaloneHist
    }
    
    ### FILTERED DATA ###
    
    abaloneModifiedFiltered <- abaloneModified %>%
      mutate(outlier = 'outlier') %>%
      mutate(outlier = case_when(sex == 'I' & age < 13 & age > 6 ~ 'not_outlier',
                                 sex == 'M' & age < 18 & age > 7 ~ 'not_outlier',
                                 sex == 'F' & age < 18 ~ 'not_outlier',
                                 #sex == 'M' & rings == 6 & age == 7.5 & round(viscera_wt,3) == 0.116 ~ 'outlier',
                                 FALSE ~ outlier
                                 )) %>%
      filter(outlier == 'not_outlier') %>%
      filter(!(sex == 'M' & rings == 6 & age == 7.5 & round(viscera_wt,3) == 0.116))
    
    ### REGRESSIONS FUNCTION ###
    
    funBayesPlot <- function (sexVector, model, title = FALSE) {
      abaloneModifiedFiltered <- abaloneModifiedFiltered %>%
        filter(sex %in% sexVector)
      
      stan_model <- stan_glm(model, data = abaloneModifiedFiltered)
      bayesPlot <- pp_check(stan_model, "dens_overlay") + 
        theme_minimal()  +
        ggtitle('') +
        theme(plot.title = element_text(hjust = 0.5),
              legend.position = "bottom")
      
      if (title) {
        bayesPlot <- bayesPlot +
          ggtitle('Bayesian regression - model quality')
      }
      
      bayesPlot
    }
    
    funQQPlot <- function (sexVector, model, title = FALSE) {
      abaloneModifiedFiltered <- abaloneModifiedFiltered %>%
        filter(sex %in% sexVector)
      lmReg <- lm(model, data = abaloneModifiedFiltered)
      qqPlot <- autoplot(lmReg, which =2)
      qqPlotFin <- qqPlot@plots[[1]] +
        theme_minimal()   +
        ggtitle('') +
        theme(plot.title = element_text(hjust = 0.5),
              axis.text.x = element_blank(),
              axis.ticks.x = element_blank(),
              axis.title.x = element_blank())
      
      if (title) {
        qqPlotFin <- qqPlotFin +
          ggtitle('frequentist regression - model quality')
      }
      
      qqPlotFin
    }
    
    funRegression <- function(xAxisVar, yAxisVar, color, model, sexVector) {
      
      abaloneRegression <- abaloneModifiedFiltered %>% 
        filter(sex %in% sexVector)
      
      lmReg <- lm(model, data = abaloneRegression)
      rSquaredFrequentist <- format(summary(lmReg)$r.squared, digits = 3) 
      beta <- round(coef(lmReg)[1],2)
      
      stan_model <- stan_glm(model, data = abaloneRegression)
      ss_res <- var(residuals(stan_model))
      ss_fit <- var(fitted(stan_model))
      rSquaredBayes <- round(1 - (ss_res / (ss_res + ss_fit)),3)
      #rSquaredBayes <- NULL
      
      abaloneRegressionPlot <- abaloneRegression %>%
        ggplot(aes_string(x = xAxisVar, y = yAxisVar)) +
        geom_jitter(color = color, alpha = 0.2, size = 2) +
        #scale_y_continuous(limits=c(0,)) +
        theme_minimal() +
        labs(caption = paste0(#'slope:', beta,
                              'r_squared (frequentist):', rSquaredFrequentist,
                              '\nr_squared (Bayes):', rSquaredBayes, '\n')) +
        theme(text = element_text(size = 8),
              plot.caption = element_text(size = 12, face = 'italic', hjust=0.0, color = 'black')) 
      
      abaloneRegressionPlot
    }
    
    funRegressionSimple <- function(xAxisVar, yAxisVar, color, model, sexVector) {
      
      abaloneRegression <- abaloneModified %>% 
        filter(sex %in% sexVector)
      
      lmReg <- lm(model, data = abaloneRegression)
      rSquaredFrequentist <- format(summary(lmReg)$r.squared, digits = 3) 
      beta <- round(coef(lmReg)[1],2)  
      
      abaloneRegressionPlot <- abaloneRegression %>%
        ggplot(aes_string(x = xAxisVar, y = yAxisVar)) +
        geom_jitter(color = color, alpha = 0.2, size = 2) +
        #scale_y_continuous(limits=c(0,)) +
        theme_minimal() +
        labs(caption = paste0(#'slope:', beta,
                              'r_squared (frequentist):', rSquaredFrequentist,'\n'                        
        )) +
        theme(text = element_text(size = 8),
              plot.caption = element_text(size = 12, face = 'italic', hjust=0.0, color = 'black')) 
      
      abaloneRegressionPlot
    }

    III. Explanatory data analysis

    III.1 Correlations

    The below plot shows strong (above 80%) correlation between independent variables and moderate correlation (40%-60%) between dependent variable (age) and explanatory variables.

    In practice strong correlation between explanatory variables often indicates lower and lower incremental increases of regression model predictive power after increasing number of variables describing response variable (age).

    options(repr.plot.width=7, repr.plot.height=4)
    
    abaloneCorrMatrix <- abalone %>%
      select(age,whole_wt, viscera_wt, shucked_wt, shell_wt, height, length, diameter) %>%
      cor()
    
    abaloneCorrPlot <- ggcorrplot(abaloneCorrMatrix, hc.order =TRUE, type ="lower", lab = TRUE, tl.cex = 9, tl.srt = 45) +
      labs(title = 'Fig2. Correlations') +
      theme(plot.title=element_text(size = 12, face = 'italic', hjust=0.5, color = 'black'),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
            axis.title = element_blank(),
            text = element_text(size = 12))
    
    abaloneCorrPlot

    III.2 Age

    Age variable seems not to be normally distributed. In practice it often means that there are some nonlinearities in the data, which make it more difficult to construct model that fits the data well. For the purpose of this analysis, in case initial regression model needs improvements, age would be modeled as log-normally distributed.

    There are also some outliers in age variable. Small number of outliers means that they may be safely removed if this is to improve our model.

    It can be observed that females and males share similar age characteristic. Unsurprisingly infants age is considerably smaller, even though age distribution between high-age-infants and low-age-adults overlap.