Workspace
@AlexJonesPhD Jones/

Bayes Rules - a probabilistic perspective on concrete strength!

0
Beta
Spinner

Can you predict the strength of concrete?

📖 Background

You work in the civil engineering department of a major university. You are part of a project testing the strength of concrete samples.

Concrete is the most widely used building material in the world. It is a mix of cement and water with gravel and sand. It can also include other materials like fly ash, blast furnace slag, and additives.

The compressive strength of concrete is a function of components and age, so your team is testing different combinations of ingredients at different time intervals.

The project leader asked you to find a simple way to estimate strength so that students can predict how a particular sample is expected to perform.

💾 The data

The team has already tested more than a thousand samples (source):

Compressive strength data:
  • "cement" - Portland cement in kg/m3
  • "slag" - Blast furnace slag in kg/m3
  • "fly_ash" - Fly ash in kg/m3
  • "water" - Water in liters/m3
  • "superplasticizer" - Superplasticizer additive in kg/m3
  • "coarse_aggregate" - Coarse aggregate (gravel) in kg/m3
  • "fine_aggregate" - Fine aggregate (sand) in kg/m3
  • "age" - Age of the sample in days
  • "strength" - Concrete compressive strength in megapascals (MPa)

Acknowledgments: I-Cheng Yeh, "Modeling of strength of high-performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).

💪 Challenge

Provide your project leader with a formula that estimates the compressive strength. Include:

  1. The average strength of the concrete samples at 1, 7, 14, and 28 days of age.
  2. The coefficients , ... , to use in the following formula:

Starting Note

The bambi package is needed here to help build a Bayesian regression model that will give an informative, probabilistic estimate of compressive strength. !pip install bambi was used to acquire it, PyMC3, and other Bayesian workflow packages for model building and exploration.

%%capture
%pip install bambi

Step 0 - Imports

import re                       # Regular expressions
import arviz as az              # Bayesian plotting
import bambi as bmb             # Bayesian model building
import numpy as np              # numeric python
import pandas as pd             # dataframes
import seaborn as sns           # plots
import scipy.stats as st        # statistical functions
import matplotlib.pyplot as plt # Plotting 

# Set seed for reproducibility
np.random.seed(34)

Step 1 - Exploratory Data Analysis

First we check for missing values, examine the datatypes of the dataframe, and examine the distribution of the target variable - strength.

# Read in the data
df = pd.read_csv('data/concrete_data.csv')

# Any missing values?
display(df.isna().any())

# What are the data types?
display(df.dtypes)

# Plot the strength variable
sns.displot(df['strength'])

There are no missing values, which is really helpful. Moreover, the data is all numeric, and the strength variable looks approximately normally distributed. We can more or less proceed straight to building our Bayesian model!

Step 2 - Building a Bayesian model

The brief clearly lays out the foundations of a linear model for the relationship between strength and the other given variables. There are many approaches to linear models (and many other kinds of modelling strategies that could be used here), but the aim seems to be to estimate the β coefficients in the model and use those to predict the strengths of different ages of concrete.

Bayesian inference is a natural method here because it allows us to estimate these coefficients with uncertainty, and best of all - we get uncertainty in our estimates for free, which would be absolutely vital in civil engineering (in fact check out episode number 59 of the Learning Bayesian Statistics podcast!)

Bayesian inference does require a prior for each of the coefficients. There isn't any clear indication of these effects in the brief, so we will use very wide, general priors. First though, let us set up the model using bambi.

# Models are built in bambi using a formula string, much like statsmodels or R
# We lay out this model according to the brief!
modelspec = 'strength ~ cement + slag + fly_ash + water + superplasticizer + coarse_aggregate + fine_aggregate + age'

However, unlike frequentist estimators, Bayesian inference is very picky about the scale of the data. The above model would cause the MCMC sampler to struggle given the hugely different scales. The solution is to scale the predictors. A z-score standardisation works particularly well here. In fact, bambi is clever enough to allow this in the model specification - we simply wrap each term in a scale function. We do this below, and build a model instance

# Set a scaled model
modelspec = """strength
               ~ scale(cement) + scale(slag) 
               + scale(fly_ash) + scale(water)
               + scale(superplasticizer) + scale(coarse_aggregate)
               + scale(fine_aggregate) + scale(age)
               """

# Set up a model - the priors are by default very weak, so there is no need to specify them explicitly here, though we will examine them
model = bmb.Model(modelspec, df)
model.build() # creates the model graph using theano

Step 3 - Examining the prior beliefs

With the model built (but not yet fitted), it is time to check the priors. First, let us examine a plot of the prior distribution of each of the coefficients - a good way to think of this is what the model thinks the coefficients could be before it sees the data.