Can you predict the strength of concrete?
📖 Background
You work in the civil engineering department of a major university. You are part of a project testing the strength of concrete samples.
Concrete is the most widely used building material in the world. It is a mix of cement and water with gravel and sand. It can also include other materials like fly ash, blast furnace slag, and additives.
The compressive strength of concrete is a function of components and age, so your team is testing different combinations of ingredients at different time intervals.
The project leader asked you to find a simple way to estimate strength so that students can predict how a particular sample is expected to perform.
💾 The data
The team has already tested more than a thousand samples (source):
Compressive strength data:
- "cement" - Portland cement in kg/m3
- "slag" - Blast furnace slag in kg/m3
- "fly_ash" - Fly ash in kg/m3
- "water" - Water in liters/m3
- "superplasticizer" - Superplasticizer additive in kg/m3
- "coarse_aggregate" - Coarse aggregate (gravel) in kg/m3
- "fine_aggregate" - Fine aggregate (sand) in kg/m3
- "age" - Age of the sample in days
- "strength" - Concrete compressive strength in megapascals (MPa)
Acknowledgments: I-Cheng Yeh, "Modeling of strength of high-performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).
Executive summary
We have been asked to analyze samples at specific ages (1, 7, 14, and 28 days) and to find a simple way to estimate strength from the other features.
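One candidate for a "simple way" to estimate strength is an ordinary linear model. The source does not say which method the team settles on, so the following is only a sketch under that assumption. The data frame `toy`, its values, and the rule used to generate `strength` are all made up for illustration and are not the project data.

```r
# Hypothetical toy mixes (kg/m3 for cement, liters/m3 for water); values are made up.
toy <- data.frame(
  cement = c(200, 250, 300, 350, 400, 450),
  water  = c(190, 180, 200, 170, 185, 175)
)

# Made-up linear rule, used only to generate illustrative strengths in MPa.
toy$strength <- 5 + 0.1 * toy$cement - 0.05 * toy$water

# Fit a linear model of strength on the two components.
fit <- lm(strength ~ cement + water, data = toy)

# Estimate the strength of a new, hypothetical mix.
new_sample <- data.frame(cement = 320, water = 180)
predicted <- unname(predict(fit, newdata = new_sample))
print(predicted)
```

With real data the fit would not be exact, and `broom::tidy(fit)` could be used to inspect the coefficients as a table.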
# Load libraries, suppressing their startup messages
suppressPackageStartupMessages({
  library(tidyverse)
  library(broom)
})
Load data
# Load the data and preview the first and last three rows
df <- readr::read_csv('data/concrete_data.csv', show_col_types = FALSE)
head(df, 3)
tail(df, 3)
1D EDA
Dimensions
# Dimensions of df and overall proportion of missing values
list(
  dim = dim(df),
  prop_na = mean(is.na(df))
)
There are no missing values (NA) in the data.
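An overall proportion does not show which column a gap sits in. A minimal base-R sketch of a per-column check follows; the data frame `toy` and its values are hypothetical and deliberately contain gaps, unlike the project data.

```r
# Hypothetical toy data with two deliberate gaps (values are made up).
toy <- data.frame(
  cement   = c(300, 250, NA, 400),
  water    = c(180, 190, 175, 185),
  age      = c(7, 14, 28, 28),
  strength = c(25, 30, 42, NA)
)

# Overall proportion of missing cells, the same measure used in the check above.
prop_na <- mean(is.na(toy))

# Number of missing values in each column, which shows where the gaps sit.
na_per_column <- colSums(is.na(toy))

print(prop_na)
print(na_per_column)
```

Running `colSums(is.na(df))` on the real data should return zero for every column if the overall proportion is zero.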
# Keep only samples tested at 1, 7, 14, or 28 days, then group by age
df <- df %>%
  filter(age %in% c(1, 7, 14, 28)) %>%
  group_by(age)
How many samples were recorded at each age?
# Count the samples recorded at each selected age
df %>%
  count(age)
The number of samples differs considerably by age:
- 2 samples on day 1
- 126 samples on day 7
- 62 samples on day 14
- 425 samples on day 28
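To make the imbalance easier to read, the counts listed above can be converted to shares of the filtered data. This is a small base-R sketch; the names `n_by_age` and `share_pct` are introduced here for illustration.

```r
# Sample counts per age, copied from the list above.
n_by_age <- c(`1` = 2, `7` = 126, `14` = 62, `28` = 425)

# Total number of samples kept after filtering to the four ages.
n_total <- sum(n_by_age)

# Share of the filtered samples in each age group, in percent.
share_pct <- round(100 * n_by_age / n_total, 1)

print(n_total)
print(share_pct)
```

Day 28 accounts for roughly two thirds of the filtered samples, while day 1 contributes only two, so any day-1 summary rests on very little data.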
What is the mean strength at 1, 7, 14, and 28 days?
# Mean strength at each age; df is still grouped by age, so this returns one row per age
mean_strength_df <- df %>%
  summarize(mean_strength = mean(strength, na.rm = TRUE))
mean_strength_df
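Because the groups differ so much in size, it can help to report the sample count and spread next to each mean. The sketch below does this in base R on a hypothetical data frame `toy` with made-up values; it is not the project data.

```r
# Hypothetical toy data (values are made up).
toy <- data.frame(
  age      = c(1, 1, 7, 7, 7, 28, 28, 28, 28),
  strength = c(10, 14, 20, 24, 28, 30, 35, 40, 45)
)

# For each age group, compute the sample count, mean, and standard deviation.
summary_by_age <- do.call(rbind, lapply(split(toy, toy$age), function(g) {
  data.frame(
    age           = g$age[1],
    n             = nrow(g),
    mean_strength = mean(g$strength),
    sd_strength   = sd(g$strength)
  )
}))

print(summary_by_age)
```

With dplyr, a comparable result could be obtained by adding `n = n()` and `sd_strength = sd(strength)` inside `summarize()` on the grouped data.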