Running Machine Learning Experiments in Python

    Code-along 2023-11-14 Running Machine Learning Experiments in Python

    As the climate changes, predicting the weather becomes ever more important for businesses. Because the weather depends on many different factors, we want to run many experiments to determine the best approach to predicting it.

    • In this project, we will use London Weather data sourced from Kaggle to try and predict the temperature.
    • The focus of this code-along is running machine learning experiments. We will first do some brief exploratory data analysis, and then use MLflow to run experiments on which models and which hyperparameters to use.
    • This is aimed at those of you who have already trained a machine learning model and want to see how to speed up the process of finding the best model.

    Import libraries

    First, we'll import necessary libraries, including MLflow.

    MLflow is an open-source platform designed to help manage the end-to-end machine learning lifecycle. It provides a comprehensive set of tools and features to streamline the process of building, training, and deploying machine learning models. Today, we'll be using MLflow for tracking experiments, hyperparameter tuning, model performance evaluation, and comparison and analysis of multiple models.

    To use MLflow, we first need to install the package, since it's not included in the workspace by default. Prefixing a line with ! runs it as a shell command, so we can install the package directly from the notebook.

    !pip install mlflow

    After the installation, we can import all the libraries, including MLflow.

    • pandas is used to read and edit the data
    • numpy is used for calculations
    • MLflow is used to structure our machine learning experiments
    • seaborn is used for visualizations
    • scikit-learn (imported as sklearn) is used for machine learning: data preprocessing, model training, and prediction
    import pandas as pd
    import numpy as np
    import mlflow
    import mlflow.sklearn
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.metrics import r2_score
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor

    Load data

    We will be working with data stored in london_weather.csv, which contains the following columns:

    • date - recorded date of measurement - (int)
    • cloud_cover - cloud cover measurement in oktas - (float)
    • sunshine - sunshine measurement in hours (hrs) - (float)
    • global_radiation - irradiance measurement in watts per square meter (W/m²) - (float)
    • max_temp - maximum temperature recorded in degrees Celsius (°C) - (float)
    • mean_temp - mean temperature in degrees Celsius (°C) - (float)
    • min_temp - minimum temperature recorded in degrees Celsius (°C) - (float)
    • precipitation - precipitation measurement in millimeters (mm) - (float)
    • pressure - pressure measurement in Pascals (Pa) - (float)
    • snow_depth - snow depth measurement in centimeters (cm) - (float)

    We'll load the dataset using the pandas read_csv function.

    # load dataset
    # ...
    # show first 5 rows
    # ...
    # show info
    # ...
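
    One possible way to fill in the cell above is sketched below. So that the sketch runs without the real file, it reads two illustrative rows from an in-memory buffer; the values are made up and only mirror the column layout listed earlier. In the workspace you would pass the path "london_weather.csv" to read_csv instead.

```python
import io

import pandas as pd

# Made-up rows in the documented column layout (not taken from the real file).
sample_csv = io.StringIO(
    "date,cloud_cover,sunshine,global_radiation,max_temp,mean_temp,min_temp,"
    "precipitation,pressure,snow_depth\n"
    "20200101,4.0,2.5,30.0,8.0,6.0,4.0,0.2,101500.0,0.0\n"
    "20200102,7.0,0.0,12.0,9.5,,5.5,1.4,100900.0,0.0\n"
)

# load dataset (replace sample_csv with "london_weather.csv" for the real data)
weather = pd.read_csv(sample_csv)

# show first 5 rows
print(weather.head())

# show info: column names, dtypes, and non-null counts
weather.info()
```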

    Exploratory Data Analysis

    Now that we have loaded the dataset, let's perform some exploratory data analysis to understand the data better. This includes handling missing values, feature engineering, and visualizing the data.

    • Use the pandas pd.to_datetime function to convert the date column to a datetime type
    • Add year and month columns derived from the date
    • Count the missing values per column using pandas isna()
    • Select the relevant weather columns
    • Group by year and month and calculate the mean of the relevant metrics
    • Draw a line plot of the mean temperature per month using seaborn
    • Draw a bar plot of the mean sunshine per month
    • Visualize the correlation between features as a heatmap, using seaborn's heatmap() function and the pandas .corr() method
    # Converting 'date' column to datetime format
    # ...
    
    # Check missing values
    # ...
    # Grouping data by year and month, calculating mean of weather metrics
    # ...
    
    # Visualizing mean temperature
    # ...
    # Visualizing mean sunshine
    # ...
    # Visualizing heatmap of correlation
    # ...
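
    The data-wrangling steps above could be sketched as follows. The small dataframe stands in for the loaded dataset and its values are made up; the sketch also assumes the integer dates are stored in YYYYMMDD form, and the choice of metric columns is only an example.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the loaded dataframe (values are made up).
weather = pd.DataFrame({
    "date": [20200115, 20200116, 20200210, 20210105],
    "mean_temp": [5.0, 7.0, np.nan, 4.0],
    "sunshine": [1.0, 3.0, 2.5, 0.5],
    "cloud_cover": [6.0, 4.0, 5.0, 7.0],
})

# Assumption: dates are integers such as 20200115, so parse them with an explicit format.
weather["date"] = pd.to_datetime(weather["date"].astype(str), format="%Y%m%d")
weather["year"] = weather["date"].dt.year
weather["month"] = weather["date"].dt.month

# Count missing values per column.
missing = weather.isna().sum()

# Mean of the selected metrics per year and month (NaN values are skipped).
metrics = ["mean_temp", "sunshine", "cloud_cover"]
monthly = weather.groupby(["year", "month"], as_index=False)[metrics].mean()

# Pairwise correlation of the metrics, which is the input for a heatmap.
corr = weather[metrics].corr()

print(missing)
print(monthly)
```

    From here, the plots can be drawn with seaborn, for example sns.lineplot(data=monthly, x="month", y="mean_temp"), sns.barplot(data=monthly, x="month", y="sunshine"), and sns.heatmap(corr, annot=True).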

    Process data into train and test sets

    Next, we define a function that preprocesses the dataframe into train and test samples.

    • In this function we impute and scale the features. Imputing fills in missing values, in this case with the column mean, and scaling puts all features on the same scale, which often improves model performance.
    • Note that we split the data into train and test sets before imputing and scaling. The imputer and scaler are fitted on the training set only, so no information leaks from the test set into training.
    • Before running the function, we also drop rows in which the temperature variable is unknown, since every row used for training or testing needs a known target value.
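
    A minimal sketch of such a function is shown below. The name preprocess_df, its parameters, and the default test size are hypothetical choices for illustration, and the demo dataframe is synthetic. The key point is the order of operations: split first, then fit the imputer and scaler on the training set only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess_df(df, feature_cols, target_col, test_size=0.2, random_state=42):
    """Hypothetical helper: drop unknown targets, split, then impute and scale."""
    # Rows without a target value cannot be used for training or evaluation.
    df = df.dropna(subset=[target_col])
    X = df[feature_cols]
    y = df[target_col]

    # Split before fitting any transformer to avoid data leakage.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    # Fit the imputer on the training set only, then apply it to both sets.
    imputer = SimpleImputer(strategy="mean")
    X_train = imputer.fit_transform(X_train)
    X_test = imputer.transform(X_test)

    # Same pattern for scaling: statistics come from the training set only.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    return X_train, X_test, y_train, y_test


# Synthetic demo data (not the London weather dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "sunshine": rng.uniform(0, 12, 50),
    "cloud_cover": rng.integers(0, 9, 50).astype(float),
    "mean_temp": rng.normal(11, 5, 50),
})
demo.loc[::10, "sunshine"] = np.nan  # some missing feature values
demo.loc[:4, "mean_temp"] = np.nan   # some missing targets

X_train, X_test, y_train, y_test = preprocess_df(
    demo, ["sunshine", "cloud_cover"], "mean_temp"
)
print(X_train.shape, X_test.shape)
```

    Because the scaler is fitted on the training set, the training features end up with zero mean and unit variance, while the test features are only approximately standardized. That is expected and is what keeps the test set unseen.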