Tama Francisquez/

Duplicate of Sample Data Scientist Professional Solution


Data Scientist Professional

Example Practical Exam Solution

You can find the project information that accompanies this example solution in the resource center, Practical Exam Resources.

Data Validation

This data set has 6738 rows, 9 columns. I have validated all variables and I have not made any changes after validation. All the columns are just as described in the data dictionary:

  • model: 18 models without missing values, same as the description. No cleaning is needed.
  • year: 23 unique values without missing values, from 1998 to 2020, same as the description. No cleaning is needed.
  • price: numeric values without missing values, same as the description. No cleaning is needed.
  • transmission: 4 categories without missing values, same as the description. No cleaning is needed.
  • mileage: numeric values, same as the description. No cleaning is needed.
  • fuelType: 4 categories without missing values, same as the description. No cleaning is needed.
  • mpg: numeric values without missing values, same as the description. No cleaning is needed.
  • engineSize: 16 possible values without missing values, same as the description. No cleaning is needed.
# Use this cell to begin, and add as many cells as you need to complete your analysis!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import as style
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import r2_score,mean_squared_error'ggplot')
df = pd.read_csv('data/toyota.csv')
#validate possible 18 values
#validate year of manufacture from 1998 to 2020
#validate four types of transmission
#validate four fuel Types
#validate 16 possible values in engineSize
#validate any negative values in numeric variables

Exploratory Analysis

I have investigated the target variable and features of the car, and the relationship between target variable and features. After the analysis,I decided to apply the following changes to enable modeling:

  • Price: use log transformation
  • Create a new ordinal variable from tax variable

Target Variable - Price

Since we need to predict the price, the price variable would be our target variable. From the histogram on the left below, we can see there is a longer right tail. Therefore, we apply log transforamtion of the price variable, the distribution on the right below is close to normal distribution.

fig, axes = plt.subplots(1,2,figsize=(15,5))
sns.histplot(df['price'],ax=axes[0]).set(title='The Distribution of Target Variable - Price')
sns.histplot(df['price'],log_scale=True,ax=axes[1]).set(title='The Distribution of Target Variable - Price (Log Scale)');
df['price'] = np.log(df['price'])

Numeric Variables - Mileage, Tax, mpg

From the heatmap below, we can conclude that there is a moderate linear negative relationship in two pairs of variables - price log transformation and mileage, tax and mpg.

numeric = df[['price','mileage','tax','mpg']]
sns.heatmap(numeric.corr(),annot=True).set(title='The Correlation Heatmap between Numeric Variables');
Relationship between mpg, tax, mileage and price

To spot the non-linear relationship, I decided to make scatterplot to further investigate the relationship between mpg, tax, mileage and our target varible - price. From the scatterplots below, there is linear relationship between mileage and price. No relationship between price and mpg. I found out there is clusters in the scatterplot between price and tax, so I decided to create a new ordinal variable from the tax variable.

fig, axes = plt.subplots(1,3,figsize=(15,5))
sns.scatterplot(y=df['price'],x=df['mpg'],ax=axes[0]).set(title='Price vs mpg')
sns.scatterplot(y=df['price'],x=df['tax'],ax=axes[1]).set(title='Price vs tax')
sns.scatterplot(y=df['price'],x=df['mileage'],ax=axes[2]).set(title='Price vs Mileage');
# Convert tax variable into an ordinal variable  
df.loc[(df['tax'] <= 100,'tax')] = 1
df.loc[((df['tax'] <= 200) & (df['tax'] > 100) ,'tax')] = 2
df.loc[((df['tax'] <= 300) & (df['tax'] > 200) ,'tax')] = 3
df.loc[(df['tax'] > 300 ,'tax')] = 4

Categorical Variables - Year, Engine Size, Model, Transmission, fuelType

Characteristics about Year, Engine size, Model, Transmission, and fuelType

Since year and engine size is most related to price, I checked their characteristics. From the bar chart below, the most common manufacture year is 2016. Also, the most common engine size in is 1.

fig, axes = plt.subplots(1,2,figsize=(15,5))
sns.countplot(x=df['year'], color='gray',ax=axes[0]).set(title='Count of Cars Sold in Manufacture Year')
sns.countplot(x=df['engineSize'],color='gray',ax=axes[1]).set(title='Count of Cars Sold in EngineSize')
axes[0].tick_params(axis='x', labelrotation=45)
axes[1].tick_params(axis='x', labelrotation=45);

From the bar charts below, we can see the most frequent categories in model, transmission and fuelType variables - Yaris, Manual, Petrol in the dataset.

fig, axes = plt.subplots(1,3,figsize=(15,5))
sns.countplot(x=df['model'],color='gray',ax=axes[0]).set(title='Count of Cars Sold in Model')
sns.countplot(x=df['transmission'],color='gray',ax=axes[1]).set(title='Count of Cars Sold in Transmission')
sns.countplot(x=df['fuelType'],color='gray',ax=axes[2]).set(title='Count of Cars Sold in Fuel Type')
for ax in fig.axes: