What is Concrete?
Concrete is the most widely used building material in the world. It is a mix of cement and water with gravel and sand, and it can also include other materials such as fly ash, blast furnace slag, and additives. Concrete has been used since the time of the ancient Romans and has gone through several modifications over time. Such modifications come about from statistical analysis of the mix ratio and the resulting concrete strength. In this notebook, we analyze the experimental results of just over a thousand concrete samples, with the aim of developing a model that can predict the strength of concrete from the fitted coefficients.
Let's start by taking a look at our dataframe.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
conc = pd.read_csv('data/concrete_data.csv')
display(conc)
The dataframe records the amounts of cement, slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate in each mix. The age column records the age in days at which each sample was tested, and the strength column records the strength obtained.
Checking for Dataframe Issues
Let us check the dataframe for issues such as missing or duplicated values, and clean the data if necessary.
conc.info()
conc.describe()
conc[conc.duplicated()]
conc.drop_duplicates(inplace=True)
conc.shape
The dataframe had some duplicated rows. Removing them reduced the number of rows from 1030 to 1005, all of them unique.
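The duplicate check above can be illustrated on a small frame. This is a minimal sketch: the `toy` frame and its values are hypothetical stand-ins, not rows taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are the same mix with the same result.
toy = pd.DataFrame({
    "cement": [540.0, 332.5, 540.0, 198.6],
    "water": [162.0, 228.0, 162.0, 192.0],
    "age": [28, 270, 28, 360],
    "strength": [79.99, 40.27, 79.99, 44.30],
})

# duplicated() flags each repeat of an earlier row; the first occurrence is not flagged.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() returns a new frame; reset_index tidies the row labels.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"{n_duplicates} duplicate row(s) removed: {len(toy)} -> {len(deduped)}")
```

Returning a new frame instead of using `inplace=True` keeps the original available for comparison, which is handy when reporting how many rows were dropped.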
What gives Concrete its strength?
A number of different combinations of variables contribute to the strength of concrete. There is an old saying, "age like fine wine", which means that the older you get, the better you become. Let's plot concrete strength against age to put the saying to the test.
conc_age_group=conc.groupby('age')['strength'].mean().round(2).reset_index()
display(conc_age_group)
figure, ax= plt.subplots()
sns.regplot(data=conc_age_group,x='age',y='strength',order=2,ci=0)
plt.show()
On average, the strength of the concrete increased from day 1 to day 365, but the strength gain flattened from day 56 and in some samples decreased. The reduction in strength might be due to other factors, which we will look into later.
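One caveat when reading a second-order fit: `regplot` with `order=2` fits a parabola via `numpy.polyfit`, and a downward-opening parabola must eventually turn down. The sketch below uses hypothetical, strictly increasing strength values (a logarithmic curve, chosen only for illustration) to show that the fitted curve can still peak inside the age range, so part of an apparent late decline can come from the curve shape and not from the data.

```python
import numpy as np

# Hypothetical test ages (days) and a strictly increasing strength curve.
ages_demo = np.array([1, 3, 7, 14, 28, 56, 90, 180, 365], dtype=float)
strength_demo = 5.0 + 10.0 * np.log(ages_demo)

# Degree-2 least-squares fit; coefficients come back highest power first.
a, b, c = np.polyfit(ages_demo, strength_demo, deg=2)

# A parabola a*x^2 + b*x + c with a < 0 peaks at x = -b / (2a).
peak_age = -b / (2 * a)

print(f"quadratic coefficient: {a:.6f}")
print(f"fitted curve peaks at about {peak_age:.0f} days")
```

Checking the raw group means next to the fitted curve is a simple way to tell a real decline from one introduced by the polynomial.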
Age Distribution of the Concrete Samples
The concrete samples were tested at ages ranging from 1 to 365 days. Let's visualize the age distribution of the concrete samples.
count=(conc.groupby('age')['strength']\
    .agg(['mean','count'])).round(2).reset_index()\
    .rename(columns={'mean':'Average Strength'})
display(count.sort_values(by='count',ascending=False))
figure, ax = plt.subplots()
sns.countplot(x='age',data=conc);
plt.title(label="Distribution of concrete age groups")
plt.xlabel("Age (Days)")
plt.ylabel("Count of samples")
plt.show()
From the table and graph above, we can see that most of the samples fall in the 1-100 day age range, with the strength increasing as the concrete gets older.
Base Regression Model
Let us define our target (y) and features (X) and train our first regression model.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X=conc.iloc[:,0:-1]
y=pd.DataFrame(conc.iloc[:,-1])
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42,stratify=X.age)
reg=LinearRegression()
reg.fit(X_train,y_train)
y_pred=reg.predict(X_test)
# Built-in round() works whether score() returns a Python float or a NumPy float
print("The R^2 of the model is : ",round(reg.score(X_test,y_test),2))
coef=pd.DataFrame({'Materials':list(X.columns),'Coef':reg.coef_.flatten()})
display(coef.style.background_gradient(cmap="PRGn"))
fig,ax=plt.subplots()
# plt.scatter(y_test,y_pred,alpha=0.7, edgecolors="k")
sns.barplot(data=coef,x='Materials',y='Coef')
plt.xticks(rotation=45)
plt.show()
The model has an R² value of 0.62, meaning it can only explain about 62% of the variability in strength. Superplasticizer has the highest coefficient of the features and water has the lowest.
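To make the "share of variability explained" reading concrete, R² can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper `r_squared` and the demo values below are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical strengths and predictions, for illustration only.
y_true_demo = [20.0, 30.0, 40.0, 50.0]
y_pred_demo = [22.0, 28.0, 43.0, 47.0]

r2_demo = r_squared(y_true_demo, y_pred_demo)
print(round(r2_demo, 3))
```

A model that always predicts the mean of the target scores 0, and a perfect model scores 1, which is why 0.62 leaves a sizeable share of the variation unexplained.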
# Import the necessary modules
from sklearn.model_selection import cross_val_score, KFold
# Create a KFold object
kf = KFold(n_splits=6, shuffle=True, random_state=5)
reg = LinearRegression()
# Compute 6-fold cross-validation scores
cv_scores = cross_val_score(reg, X, y, cv=kf)
# Print scores
print(cv_scores.mean().round(2))
# Import Ridge
from sklearn.linear_model import Ridge
alphas = [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
ridge_scores = []
for alpha in alphas:
    # Create a Ridge regression model
    ridge = Ridge(alpha=alpha)
    # Fit the data
    ridge.fit(X_train, y_train)
    # Obtain R-squared
    score = ridge.score(X_test, y_test)
    ridge_scores.append(score)
print(ridge_scores)
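A list of bare scores is hard to read without the alpha each one belongs to. The sketch below pairs every alpha with its score and also tracks the size of the coefficient vector, which shrinks as the penalty grows. It runs on synthetic data (an assumption made so the example is self-contained), and names such as `scores_by_alpha` are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in data (hypothetical): 3 features, linear signal plus noise.
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(120, 3))
y_syn = X_syn @ np.array([4.0, -2.0, 1.0]) + rng.normal(scale=1.0, size=120)
X_fit, X_val = X_syn[:90], X_syn[90:]
y_fit, y_val = y_syn[:90], y_syn[90:]

alphas_demo = [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
scores_by_alpha = {}
coef_norms = []
for alpha in alphas_demo:
    model = Ridge(alpha=alpha).fit(X_fit, y_fit)
    # Keep each validation R^2 next to the alpha that produced it.
    scores_by_alpha[alpha] = model.score(X_val, y_val)
    # A larger alpha penalises large coefficients, so the norm shrinks.
    coef_norms.append(float(np.linalg.norm(model.coef_)))

best_alpha = max(scores_by_alpha, key=scores_by_alpha.get)
print("R^2 by alpha:", {a: round(s, 3) for a, s in scores_by_alpha.items()})
print("best alpha on this validation split:", best_alpha)
```

Choosing alpha on a separate validation split, or with cross-validation, keeps the test set untouched for the final evaluation.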
Regression Model with scaled features
Let us scale our features to level the playing field for all of them.
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
reg=LinearRegression()
reg.fit(X_train_scaled,y_train)
y_pred=reg.predict(X_test_scaled)
# Built-in round() works whether score() returns a Python float or a NumPy float
print("The R^2 of the model is : ",round(reg.score(X_test_scaled,y_test),2))
coef=pd.DataFrame({'Materials':list(X.columns),'Coef':reg.coef_.flatten()})
display(coef.style.background_gradient(cmap="PRGn"))
fig,ax=plt.subplots()
# plt.scatter(y_test,y_pred,alpha=0.7, edgecolors="k")
sns.barplot(data=coef,x='Materials',y='Coef')
plt.xticks(rotation=45)
plt.show()
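There is no difference between the R² of the model on the scaled features and on the unscaled features. This is expected: an ordinary least squares fit makes the same predictions after the features are standardized, so only the coefficients change, and they become comparable across features.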
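The result above can be checked on synthetic data. This is a minimal sketch under assumed feature ranges (the three columns are hypothetical, loosely mimicking quantities on very different scales); it fits the same linear model on raw and standardized features and compares the scores and coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on very different scales.
X_demo = np.column_stack([
    rng.uniform(100, 500, size=200),
    rng.uniform(0, 30, size=200),
    rng.uniform(1, 365, size=200),
])
y_demo = (0.1 * X_demo[:, 0] + 0.5 * X_demo[:, 1]
          + 0.05 * X_demo[:, 2] + rng.normal(0, 5, size=200))

X_tr, X_te = X_demo[:160], X_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

# Fit on the raw features.
raw_model = LinearRegression().fit(X_tr, y_tr)
r2_raw = raw_model.score(X_te, y_te)

# Fit on standardized features; the scaler is fitted on the training part only.
scaler_demo = StandardScaler().fit(X_tr)
scaled_model = LinearRegression().fit(scaler_demo.transform(X_tr), y_tr)
r2_scaled = scaled_model.score(scaler_demo.transform(X_te), y_te)

print(f"R^2 raw: {r2_raw:.6f}, R^2 scaled: {r2_scaled:.6f}")
print("raw coefficients:   ", raw_model.coef_)
print("scaled coefficients:", scaled_model.coef_)
```

Each scaled coefficient equals the raw coefficient multiplied by that feature's standard deviation, so it measures the change in the target per standard deviation of the feature. Scaling does change the scores of penalized models such as Ridge, where the penalty depends on coefficient size.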
Regression Model with Encoded Age Features
The age of the samples can be represented as a categorical feature using one-hot encoding, which represents each age value with a column of ones and zeroes.
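As a small illustration of what one-hot encoding produces, the sketch below encodes a hypothetical series of ages. The `prefix="age"` argument is an assumption added here so that the new columns get string names such as `age_7`.

```python
import pandas as pd

# Hypothetical test ages in days.
ages = pd.Series([3, 7, 28, 28, 90], name="age")

# One column per distinct age; drop_first=True drops the first category (3 days),
# which then acts as the baseline encoded by all zeroes.
encoded = pd.get_dummies(ages, prefix="age", drop_first=True)

print(encoded.astype(int))
```

Dropping one category avoids a redundant column: with an intercept in the model, the full set of indicator columns would always sum to one.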
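# prefix gives string column names (age_3, age_7, ...); recent scikit-learn versions reject mixed str/int column names
x_dummies=pd.get_dummies(X['age'],prefix='age',drop_first=True)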
X_dummies=X.drop(columns='age',axis=1)
X_dummies=pd.concat([X_dummies,x_dummies],axis=1)
X_train,X_test,y_train,y_test=train_test_split(X_dummies,y,test_size=0.2,random_state=42)
reg.fit(X_train,y_train)
y_pred=reg.predict(X_test)
# Built-in round() works whether score() returns a Python float or a NumPy float
print("The R^2 of the model is : ",round(reg.score(X_test,y_test),2))
coef=pd.DataFrame({'Materials':list(X_train.columns),'Coef':reg.coef_.flatten()})
display(coef.style.background_gradient(cmap="PRGn"))
fig,ax=plt.subplots()
plt.scatter(y_test,y_pred,alpha=0.7, color="black")
plt.plot(y_pred,y_pred,color='red')
# sns.barplot(data=coef,x='Materials',y='Coef')
# plt.xticks(rotation=45)
plt.show()
The R² of the model has improved significantly to 0.82. From our earlier analysis, we deduced that most of the samples have ages between 1 and 150 days. Samples outside this range are sparsely represented in the dataset and might have a negative impact on our model.
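One way to act on this observation is to restrict the data to the well-represented age range before modelling. The sketch below shows the filtering step on a hypothetical frame; the values are made up, and `MAX_AGE_DAYS` simply mirrors the 150-day bound mentioned above.

```python
import pandas as pd

# Hypothetical samples, for illustration only.
samples = pd.DataFrame({
    "age": [1, 7, 28, 90, 180, 365],
    "strength": [8.5, 25.1, 36.4, 41.0, 42.3, 41.7],
})

MAX_AGE_DAYS = 150  # upper bound of the well-represented age range

# between() is inclusive at both ends by default.
in_range = samples[samples["age"].between(1, MAX_AGE_DAYS)].reset_index(drop=True)
dropped = len(samples) - len(in_range)

print(f"kept {len(in_range)} samples, dropped {dropped}")
```

Whether dropping these rows actually helps would need to be checked by refitting the model and comparing scores on held-out data.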