Generalising concrete composition and strength

Can you predict the strength of concrete?

📖 Background

You work in the civil engineering department of a major university. You are part of a project testing the strength of concrete samples.

Concrete is the most widely used building material in the world. It is a mix of cement and water with gravel and sand. It can also include other materials like fly ash, blast furnace slag, and additives.

The compressive strength of concrete is a function of components and age, so your team is testing different combinations of ingredients at different time intervals.

The project leader asked you to find a simple way to estimate strength so that students can predict how a particular sample is expected to perform.

💾 The data

The team has already tested more than a thousand samples (source):

Compressive strength data:

"cement" - Portland cement in kg/m3
"slag" - Blast furnace slag in kg/m3
"fly_ash" - Fly ash in kg/m3
"water" - Water in liters/m3
"superplasticizer" - Superplasticizer additive in kg/m3
"coarse_aggregate" - Coarse aggregate (gravel) in kg/m3
"fine_aggregate" - Fine aggregate (sand) in kg/m3
"age" - Age of the sample in days
"strength" - Concrete compressive strength in megapascals (MPa)

Acknowledgments: I-Cheng Yeh, "Modeling of strength of high-performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).

Executive summary

We were asked to analyze samples tested at specific ages (1, 7, 14, and 28 days) and to find a simple way to estimate strength from the other features.

# libraries (startup messages suppressed)
suppressPackageStartupMessages({
    library(tidyverse)
    library(broom)
})

Load data

# load data 
df <- readr::read_csv('data/concrete_data.csv', show_col_types = FALSE)
head(df,3)
tail(df,3)

1D EDA

Dimensions

# dimensions of df and overall proportion of missing values
list(
    dim = dim(df),
    prop_na = mean(is.na(df))
)

There are no missing values in the data.
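The overall proportion can hide a single problematic column, so a per-column count is a useful cross-check. Below is a minimal sketch in base R. The data frame `toy` and its values are hypothetical, invented only to show the calls, and are not taken from the concrete data.

```r
# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  cement   = c(540, NA, 332.5),
  water    = c(162, 162, 228),
  strength = c(79.99, 61.89, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Overall proportion of missing cells (same calculation as used for df)
prop_na <- mean(is.na(toy))
prop_na
```

Applied to the real data, `colSums(is.na(df))` should show zero for every column if the data are complete.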

# keep only samples tested at 1, 7, 14, or 28 days, then group by age
df <- df %>%
    filter(age %in% c(1, 7, 14, 28)) %>%
    group_by(age)

How many samples were recorded at each age?

# count recorded samples on different selected days
df %>% 
    count(age)

The number of samples differs considerably across ages:

2 samples on day 1
126 samples on day 7
62 samples on day 14
425 samples on day 28
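To make the imbalance concrete, the counts can be turned into shares of the retained samples. This is a small sketch in base R that re-enters the four counts reported above by hand. The names `age_counts` and `age_share` are illustrative and do not appear in the original analysis.

```r
# Sample counts per age, as reported above
age_counts <- data.frame(
  age = c(1, 7, 14, 28),
  n   = c(2, 126, 62, 425)
)

# Share of all retained samples contributed by each age
age_share <- transform(age_counts, share = n / sum(n))
age_share
```

Roughly 69% of the 615 retained samples are 28-day tests, while day 1 contributes only 2, so any estimate for day 1 rests on very little data.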

What is the mean strength at 1, 7, 14, and 28 days?

# mean strength at each age (df is already grouped by age)
mean_strength_df <- df %>%
    summarize(mean_strength = mean(strength, na.rm = TRUE))
mean_strength_df
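Because the groups differ so much in size, it may help to report the sample count and spread next to each mean. Below is a minimal sketch with dplyr. The data frame `toy` and its strength values are hypothetical, chosen only so the arithmetic is easy to follow, and `strength_summary` is an illustrative name.

```r
library(dplyr)

# Hypothetical toy data; the strength values are invented for illustration only.
toy <- data.frame(
  age      = c(1, 1, 7, 7, 7, 28, 28, 28),
  strength = c(10, 14, 20, 25, 30, 30, 40, 50)
)

# Count, mean, and standard deviation of strength at each age
strength_summary <- toy %>%
  group_by(age) %>%
  summarize(
    n             = n(),
    mean_strength = mean(strength, na.rm = TRUE),
    sd_strength   = sd(strength, na.rm = TRUE)
  )
strength_summary
```

The same `summarize()` call could be applied to the grouped `df`, where the `n` column would flag that the day 1 mean is based on only two samples.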
