Analyzing CO2 Emissions & Bicycle Market with Python and SQL [Competition]

This notebook focuses on solving problems with both Python and SQL skills in two seperate challenges. CO2 emissions will be analyzed using Python, and Bicycle Market with be analyzed using SQL Server.

This challenge was released by DataCamp for the 'Everyone Can Learn Python' Competition.

Used coding languages

Python is open source, interpreted, high level language and provides great approach for object-oriented programming. It is one of the best language used by data scientist for various data science projects/application. Python provide great functionality to deal with mathematics, statistics and scientific function.

SQL (Structured Query Language) is used for performing various operations on the data stored in the databases like updating records, deleting records, creating and modifying tables, views, etc. SQL is also the standard for the current big data platforms that use SQL as their key API for their relational databases.

1️⃣ CO2 Emissions Analysis

The CO2 emissions analysis will be performed with Python.

📖 Background

You volunteer for a public policy advocacy organization in Canada, and your colleague asked you to help her draft recommendations for guidelines on CO2 emissions rules.

After researching emissions data for a wide range of Canadian vehicles, she would like you to investigate which vehicles produce lower emissions.

Research Questions

Help your colleague gain insights on the type of vehicles that have lower CO2 emissions. Include:

What is the median engine size in liters?
What is the average fuel consumption for regular gasoline (Fuel Type = X), premium gasoline (Z), ethanol (E), and diesel (D)?
What is the correlation between fuel consumption and CO2 emissions?
Which vehicle class has lower average CO2 emissions, 'SUV - SMALL' or 'MID-SIZE'?
What are the average CO2 emissions for all vehicles? For vehicles with an engine size of 2.0 liters or smaller?
Any other insights you found during your analysis?

💾 Cars data

You have access to seven years of CO2 emissions data for Canadian vehicles (source):

Car characteristics

"Make" - The company that manufactures the vehicle.
"Model" - The vehicle's model.
"Vehicle Class" - Vehicle class by utility, capacity, and weight.
"Engine Size(L)" - The engine's displacement in liters.
"Cylinders" - The number of cylinders.
"Transmission" - The transmission type: A = Automatic, AM = Automatic Manual, AS = Automatic with select shift, AV = Continuously variable, M = Manual, 3 - 10 = the number of gears.
"Fuel Type" - The fuel type: X = Regular gasoline, Z = Premium gasoline, D = Diesel, E = Ethanol (E85), N = natural gas.
"Fuel Consumption Comb (L/100 km)" - Combined city/highway (55%/45%) fuel consumption in liters per 100 km (L/100 km).
"CO2 Emissions(g/km)" - The tailpipe carbon dioxide emissions in grams per kilometer for combined city and highway driving.

Acknowledgements

The data comes from the Government of Canada's open data website.

📈 Inspecting the data

Inspect data usability

We should inspect the data to make sure we can use it in its current form. We can learn more about the dataframe by getting a few stats, like a quick overview of the dataset and inspecting if there are any missing values.

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Load the data
cars = pd.read_csv('data/co2_emissions_canada.csv')

# Preview the dataframe
cars

# Inspect data usability
cars.info()

# Summarizing the categorical data
cars.select_dtypes(include=object).describe()

# Summarizing the numerical data of type integer
cars.select_dtypes(include=int).describe()

#boxplots for Cylinders and CO2 Emissions
f, axs = plt.subplots(1, 2, figsize=(8, 4), gridspec_kw=dict(width_ratios=[4, 3]))
sns.boxplot(data=cars, x="Cylinders", ax=axs[0])
sns.boxplot(data=cars, x="CO2 Emissions(g/km)", ax=axs[1])
f.tight_layout()

# Summarizing the categorical data of type float
cars.select_dtypes(include=float).describe()

#boxplots for Engine Size and Fuel Consumption
f, axs = plt.subplots(1, 2, figsize=(8, 4), gridspec_kw=dict(width_ratios=[4, 3]))
sns.boxplot(data=cars, x="Engine Size(L)", ax=axs[0])
sns.boxplot(data=cars, x="Fuel Consumption Comb (L/100 km)", ax=axs[1])
f.tight_layout()

Conclusion

We can see that :

there is no missing data (7385 non-null entries in a total of 7385 entries),
we have 9 different categories of information (the column titles),
the dataframe contains both categorical and numerical data,
the dataframe contains 3 datatypes : object, float64 and int64,
all of the numerical data have some outliers, however they will not be discarded in the course of calculations.

We can further investigate each data type by applying the 'describe' function, as below.

🔎 Analysis

1/ What is the median engine size in liters?

We will calculate the median engine size by subsetting cars for engine size and calculating the median on this specific subset.

#calculate median engine size
median_engine_size = cars['Engine Size(L)'].median()
median_engine_size

‌
‌
‌