
Bayes Rules - a probabilistic perspective on concrete strength!

Can you predict the strength of concrete?

📖 Background

You work in the civil engineering department of a major university. You are part of a project testing the strength of concrete samples.

Concrete is the most widely used building material in the world. It is a mix of cement and water with gravel and sand. It can also include other materials like fly ash, blast furnace slag, and additives.

The compressive strength of concrete is a function of components and age, so your team is testing different combinations of ingredients at different time intervals.

The project leader asked you to find a simple way to estimate strength so that students can predict how a particular sample is expected to perform.

💾 The data

The team has already tested more than a thousand samples (source):

Compressive strength data:
  • "cement" - Portland cement in kg/m3
  • "slag" - Blast furnace slag in kg/m3
  • "fly_ash" - Fly ash in kg/m3
  • "water" - Water in liters/m3
  • "superplasticizer" - Superplasticizer additive in kg/m3
  • "coarse_aggregate" - Coarse aggregate (gravel) in kg/m3
  • "fine_aggregate" - Fine aggregate (sand) in kg/m3
  • "age" - Age of the sample in days
  • "strength" - Concrete compressive strength in megapascals (MPa)

Acknowledgments: I-Cheng Yeh, "Modeling of strength of high-performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).

💪 Challenge

Provide your project leader with a formula that estimates the compressive strength. Include:

  1. The average strength of the concrete samples at 1, 7, 14, and 28 days of age.
  2. The coefficients β₁, β₂, ..., β₈ to use in the following formula:

     Concrete Strength = β₁ · cement + β₂ · slag + β₃ · fly_ash + β₄ · water + β₅ · superplasticizer + β₆ · coarse_aggregate + β₇ · fine_aggregate + β₈ · age

Starting Note

The bambi package is used here to build a Bayesian regression model that will give an informative, probabilistic estimate of compressive strength. Installing bambi also pulls in PyMC3 and the other Bayesian workflow packages needed for model building and exploration.

%%capture
%pip install bambi

Step 0 - Imports

import re                       # Regular expressions
import arviz as az              # Bayesian plotting
import bambi as bmb             # Bayesian model building
import numpy as np              # numeric python
import pandas as pd             # dataframes
import seaborn as sns           # plots
import scipy.stats as st        # statistical functions
import matplotlib.pyplot as plt # Plotting 

# Set seed for reproducibility
np.random.seed(34)

Step 1 - Exploratory Data Analysis

First we check for missing values, examine the datatypes of the dataframe, and examine the distribution of the target variable - strength.

# Read in the data
df = pd.read_csv('data/concrete_data.csv')

# Any missing values?
display(df.isna().any())

# What are the data types?
display(df.dtypes)

# Plot the strength variable
sns.displot(df['strength'])

There are no missing values, which is really helpful. Moreover, all the columns are numeric, and the strength variable looks approximately normally distributed. We can proceed more or less straight to building our Bayesian model!
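As a quick aside, the brief's first challenge item - the average strength at 1, 7, 14, and 28 days - can be answered with a simple groupby. The sketch below uses a tiny hypothetical dataframe in place of the real one, so the numbers are illustrative only:

```python
import pandas as pd

# Tiny hypothetical stand-in for the real concrete dataframe
toy = pd.DataFrame({
    'age': [1, 1, 7, 7, 14, 28, 28],
    'strength': [10.0, 12.0, 25.0, 27.0, 35.0, 40.0, 42.0],
})

# Mean compressive strength at the ages requested in the brief
mean_by_age = (toy[toy['age'].isin([1, 7, 14, 28])]
               .groupby('age')['strength']
               .mean())
print(mean_by_age)
```

On the real data, the same pattern applies with df in place of toy.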

Step 2 - Building a Bayesian model

The brief clearly lays out the foundations of a linear model for the relationship between strength and the other given variables. There are many approaches to linear models (and many other kinds of modelling strategies that could be used here), but the aim seems to be to estimate the β coefficients in the model and use those to predict the strengths of different ages of concrete.

Bayesian inference is a natural method here because it estimates these coefficients together with their uncertainty - and that uncertainty comes for free, which is absolutely vital in civil engineering (in fact, check out episode 59 of the Learning Bayesian Statistics podcast!).

Bayesian inference does require a prior for each of the coefficients. There isn't any clear indication of these effects in the brief, so we will use very wide, general priors. First though, let us set up the model using bambi.

# Models are built in bambi using a formula string, much like statsmodels or R
# We lay out this model according to the brief!
modelspec = 'strength ~ cement + slag + fly_ash + water + superplasticizer + coarse_aggregate + fine_aggregate + age'

However, unlike frequentist estimators, Bayesian inference is sensitive to the scale of the data. The model above would cause the MCMC sampler to struggle, because the predictors sit on hugely different scales. The solution is to standardise the predictors, and a z-score standardisation works particularly well here. In fact, bambi is clever enough to allow this in the model specification - we simply wrap each term in a scale() function. We do this below and build a model instance.

# Set a scaled model
modelspec = """strength
               ~ scale(cement) + scale(slag) 
               + scale(fly_ash) + scale(water)
               + scale(superplasticizer) + scale(coarse_aggregate)
               + scale(fine_aggregate) + scale(age)
               """

# Set up a model - the priors are by default very weak, so there is no need to specify them explicitly here, though we will examine them
model = bmb.Model(modelspec, df)
model.build() # creates the model graph using theano
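To see what those scale() terms are doing, here is a minimal sketch of a z-score, assuming scale() standardises each predictor to mean 0 and standard deviation 1 (the values below are illustrative, not from the dataset):

```python
import numpy as np

def zscore(x):
    # Centre to mean 0 and rescale to standard deviation 1 -
    # the standardisation a scale() term applies to a predictor
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Illustrative predictor values on a large scale, e.g. cement in kg/m3
cement = np.array([150.0, 250.0, 350.0, 450.0])
z = zscore(cement)
print(z.mean(), z.std())
```

After this transformation every predictor lives on a comparable scale, which is much friendlier to the sampler.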

Step 3 - Examining the prior beliefs

With the model built (but not yet fitted), it is time to check the priors. First, let us examine a plot of the prior distribution of each of the coefficients - a good way to think of this is what the model thinks the coefficients could be before it sees the data.

# Show the priors
model.plot_priors(draws=10_000, hdi_prob=.95, kind='hist', bins=100);

This is great. It shows that the most likely effects for the main predictors hover around zero, but that each effect could plausibly be as large as about ±80 strength units in either direction. Priors are sometimes accused of unduly influencing the results, but that is clearly not a concern here - these priors are very broad. Next, we check a 'prior predictive' distribution, which gives us an idea of what the model thinks the strength variable itself looks like, based on

  • The observed data for our predictors, and
  • Our prior distributions for the coefficients

In simple terms - we can ask the model to make some predictions before it has seen the data, just to get a sense of what it thinks.

# Build prior predictive
prpc = model.prior_predictive(draws=10_000)

# And pass this to the plot_ppc function in Arviz
az.plot_ppc(prpc, group='prior', num_pp_samples=1000, backend_kwargs={'figsize': (20, 10)})

OK, it's pretty clear at this point that our model gets the shape of the distribution correct, but its guesses are extremely spread out - it thinks concrete strengths anywhere from -250 to 250 MPa are plausible. Obviously that is not true, and we can sharpen things up by letting the model observe the strength data.
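That huge spread can be reproduced by hand: drawing coefficients from wide Normal priors and pushing a single observation through the linear formula yields predictions spanning hundreds of units. This is only a sketch - the prior scales and the observation below are hypothetical, not bambi's actual defaults:

```python
import numpy as np

rng = np.random.default_rng(34)

# Hypothetical wide Normal priors, mimicking very weak default priors
n_draws = 10_000
intercept = rng.normal(0, 40, n_draws)
beta_cement = rng.normal(0, 40, n_draws)
sigma = 16.0  # illustrative residual scale

# Prior predictive draws for one (z-scored) cement observation
cement_z = 1.2
prior_pred = intercept + beta_cement * cement_z + rng.normal(0, sigma, n_draws)

# The implied strengths are wildly spread out, just like the plot above
print(prior_pred.std())
```

With priors this broad, even a single predictor produces predictions far outside any physically plausible range of concrete strengths - exactly the pattern the prior predictive check revealed.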

Step 4 - Fitting the model

Inspired by sklearn, we can fit a Bayesian model in bambi using MCMC with the .fit() method. We do this below and sample from the posterior distribution (which we can then use to predict on new data).

# Fit the model using the NUTS sampler in PyMC3
posterior = model.fit(draws=2000, tune=1000)

# Once fitted, let us examine the posterior distributions of the coefficients
az.plot_posterior(posterior, kind='hist',
                  color='black',
                  bins=100,
                  var_names='~sigma', 
                  filter_vars='like', 
                  hdi_prob=.95);

This contains all the coefficient information we may need, but bear in mind that because the predictors were scaled, the coefficients are now in standard-deviation units. For example, a 1 standard deviation increase in the cement variable is associated with a 13 point increase in strength, and we are 95% certain the effect is between 11 and 14 units. This matters for age in particular, because we need to predict at specific ages, yet the age effect is now on the scale of its standard deviation. First, let us examine the predictions of the model and see whether they map closely to the data. We do that using a posterior predictive check:

# Conduct a posterior predictive check
model.predict(posterior, kind='pps')

# And plot
az.plot_ppc(posterior, num_pp_samples=500, backend_kwargs={'figsize': (20, 10)})
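Looking ahead to the brief's formula, the posterior draws also let us summarise each coefficient and convert it back to original units. A hedged sketch with hypothetical numbers - the real draws would come from the fitted posterior object, and the real standard deviation from the dataframe:

```python
import numpy as np

rng = np.random.default_rng(34)

# Hypothetical posterior draws for the scale(cement) coefficient
beta_scaled_draws = rng.normal(13.0, 0.8, size=8000)

# Point estimate and 95% percentile interval from the draws,
# mirroring what az.summary / az.plot_posterior report
beta_scaled = beta_scaled_draws.mean()
lower, upper = np.percentile(beta_scaled_draws, [2.5, 97.5])

# The predictor was z-scored, so dividing by its standard deviation
# recovers the effect per original unit (kg/m3); the SD here is illustrative
cement_sd = 104.0
beta_per_unit = beta_scaled / cement_sd
print(beta_scaled, (lower, upper), beta_per_unit)
```

Repeating this conversion for every predictor gives the β coefficients on their original scales, which is exactly what the project leader's formula needs.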