Analyze Your Runkeeper Fitness Data

    1. Obtain and review raw data

    One day, my old running friend and I were chatting about our running styles, training habits, and achievements, when I suddenly realized that I could take an in-depth analytical look at my training. I have been using a popular GPS fitness tracker called Runkeeper for years and decided it was time to analyze my running data to see how I was doing.

    Since 2012, I've been using the Runkeeper app, and it's great. One key feature: its excellent data export. Anyone who has a smartphone can download the app and analyze their data like we will in this notebook.

    After logging your run, the first step is to export the data from Runkeeper (which I've done already). Then import the data and start exploring to find potential problems. After that, create data cleaning strategies to fix the issues. Finally, analyze and visualize the clean time-series data.

    I exported seven years' worth of my training data, from 2012 through 2018. The data is a CSV file where each row is a single training activity. Let's load and inspect it.

    # Import pandas
    import pandas as pd
    
    # Define file containing dataset
    runkeeper_file = 'datasets/cardioActivities.csv'
    
    # Create DataFrame with parse_dates and index_col parameters 
    df_activities = pd.read_csv(runkeeper_file, parse_dates=['Date'], index_col='Date')
    
    # First look at exported data: select sample of 3 random rows 
    display(df_activities.sample(n=3))
    
    # Print DataFrame summary
    df_activities.info()
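    To see what parse_dates and index_col do, here is a minimal sketch on a tiny in-memory CSV. The three rows are hypothetical values made up for illustration, not rows from my export, and only three of the real columns are included.

```python
import io
import pandas as pd

# Hypothetical three-row sample; not real Runkeeper data
csv_text = """Date,Type,Distance (km)
2018-11-11 14:05:12,Running,10.44
2018-11-09 15:02:35,Running,12.84
2018-11-04 16:05:00,Cycling,13.01
"""

# parse_dates converts the Date strings to timestamps;
# index_col makes that column the row index
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], index_col='Date')

# The index is now a DatetimeIndex, so rows can be selected with date strings
print(type(sample.index).__name__)
print(sample.loc['2018-11-09'])
```

    Having dates in the index is what makes the date-based slicing and resampling in the later tasks possible.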

    2. Data preprocessing

    Lucky for us, the column names Runkeeper provides are informative, and we don't need to rename any columns.

    However, the info() output does reveal missing values. The reasons vary by column. Some heart rate information is missing because I didn't always use a cardio sensor. The Notes column is an optional field that I sometimes left blank. I used the Route Name column only once, and I never used the Friend's Tagged column.

    We'll fill in missing values in the heart rate column to avoid misleading results later, but right now, our first data preprocessing steps will be to:

    • Remove columns not useful for our analysis.
    • Rename the "Other" activity type to "Unicycling", because every "Other" activity was unicycling.
    • Count missing values.

    # Define list of columns to be deleted
    cols_to_drop = ["Friend's Tagged", 'Route Name', 'GPX File', 'Activity Id', 'Calories Burned', 'Notes']
    
    # Delete unnecessary columns
    df_activities = df_activities.drop(columns=cols_to_drop)
    
    # Count types of training activities
    display(df_activities['Type'].value_counts())
    
    # Rename 'Other' type to 'Unicycling'
    df_activities['Type'] = df_activities['Type'].str.replace('Other', 'Unicycling')
    
    # Count missing values for each column
    df_activities.isnull().sum()

    3. Dealing with missing values

    As we can see from the last output, there are 214 missing entries for my average heart rate.

    We can't go back in time to get those data, but we can fill in the missing values with an average value. This process is called mean imputation. When imputing the mean to fill in missing data, we need to consider that the average heart rate varies for different activities (e.g., walking vs. running). We'll filter the DataFrame by activity type (Type) and calculate each activity's mean heart rate, then fill in the missing values with those means.

    # Calculate sample means for heart rate for each training activity type 
    avg_hr_run = df_activities[df_activities['Type'] == 'Running']['Average Heart Rate (bpm)'].mean()
    avg_hr_cycle = df_activities[df_activities['Type'] == 'Cycling']['Average Heart Rate (bpm)'].mean()
    
    # Split whole DataFrame into several, specific for different activities
    df_run = df_activities[df_activities['Type'] == 'Running'].copy()
    df_walk = df_activities[df_activities['Type'] == 'Walking'].copy()
    df_cycle = df_activities[df_activities['Type'] == 'Cycling'].copy()
    
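    # Fill missing values with the calculated means (walking uses a fixed 110 bpm).
    # Assigning the result back avoids chained-assignment problems with inplace=True
    df_walk['Average Heart Rate (bpm)'] = df_walk['Average Heart Rate (bpm)'].fillna(110)
    df_run['Average Heart Rate (bpm)'] = df_run['Average Heart Rate (bpm)'].fillna(int(avg_hr_run))
    df_cycle['Average Heart Rate (bpm)'] = df_cycle['Average Heart Rate (bpm)'].fillna(int(avg_hr_cycle))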
    
    # Count missing values for each column in running data
    df_run.isnull().sum()
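    The same per-activity imputation can also be written without splitting the DataFrame first. The sketch below is an alternative, not the approach used above: it uses groupby() with transform('mean') on a hypothetical toy DataFrame whose values are made up for illustration. Unlike the code above, it does not truncate the means to integers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    'Type': ['Running', 'Running', 'Running', 'Cycling', 'Cycling'],
    'Average Heart Rate (bpm)': [150.0, np.nan, 140.0, 120.0, np.nan],
})

hr = 'Average Heart Rate (bpm)'

# For every row, the mean heart rate of that row's activity type
# (missing values are skipped when the mean is computed)
group_means = toy.groupby('Type')[hr].transform('mean')

# Fill each gap with the mean of the matching activity type
toy[hr] = toy[hr].fillna(group_means)
print(toy)
```

    Here the missing running value becomes 145.0 (the mean of 150 and 140) and the missing cycling value becomes 120.0.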

    4. Plot running data

    Now we can create our first plot! As we found earlier, most of the activities in my data were running (459 of them to be exact). There are only 29, 18, and two instances for cycling, walking, and unicycling, respectively. So for now, let's focus on plotting the different running metrics.

    An excellent first visualization is a figure with four subplots, one for each running metric (each numerical column). Each subplot has its own y-axis, identified in its legend. Every subplot uses Date on the x-axis.

    %matplotlib inline
    
    # Import matplotlib, set style and ignore warning
    import matplotlib.pyplot as plt
    import warnings
    plt.style.use('ggplot')
    warnings.filterwarnings(
        action='ignore', module='matplotlib.figure', category=UserWarning,
        message=('This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.')
    )
    
    # Prepare data subsetting period from 2013 till 2018
    # (the slice runs from 2018 back to 2013, which assumes the index is sorted newest first)
    runs_subset_2013_2018 = df_run.loc['2018':'2013']
    
    # Create, plot and customize in one step
    runs_subset_2013_2018.plot(subplots=True,
                               sharex=False,
                               figsize=(12,16),
                               linestyle='none',
                               marker='o',
                               markersize=3,
                              )
    
    # Show plot
    plt.show()

    5. Running statistics

    No doubt, running helps people stay mentally and physically healthy and productive at any age. And it is great fun! When runners talk to each other about their hobby, we not only discuss our results, but we also discuss different training strategies.

    You'll know you're with a group of runners if you commonly hear questions like:

    • What is your average distance?
    • How fast do you run?
    • Do you measure your heart rate?
    • How often do you train?

    Let's find the answers to these questions in my data. If you look back at the plots in Task 4, you can see the answer to "Do you measure your heart rate?" Before 2015: no. To look at the averages, let's only use the data from 2015 through 2018.

    In pandas, the resample() method is similar to the groupby() method: with resample() you group by a specific time span. We'll use resample() to group the time series data by a sampling period and apply several methods to each sampling period. In our case, we'll resample annually and weekly.

    # Prepare running data for the last 4 years
    runs_subset_2015_2018 = df_run.loc['2018':'2015']
    
    # Calculate annual statistics (numeric_only=True skips the text columns)
    print('How my average run looks in the last 4 years:')
    display(runs_subset_2015_2018.resample('A').mean(numeric_only=True))
    
    # Calculate weekly statistics
    print('Weekly averages of the last 4 years:')
    display(runs_subset_2015_2018.resample('W').mean(numeric_only=True).mean())
    
    # Mean weekly counts
    weekly_counts_average = runs_subset_2015_2018['Distance (km)'].resample('W').count().mean()
    print('How many trainings per week I had on average:', weekly_counts_average)
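
    To make the weekly resampling more concrete, here is a small sketch on hypothetical distances with made-up dates. It shows one detail that affects the averages above: a week without any run becomes NaN in the weekly means (and is skipped when those means are averaged), but counts as 0 in the weekly counts.

```python
import pandas as pd

# Hypothetical distances (km) on made-up dates; 2018-01-01 is a Monday
dates = pd.to_datetime(['2018-01-01', '2018-01-03', '2018-01-10', '2018-01-24'])
distance = pd.Series([5.0, 7.0, 10.0, 8.0], index=dates)

# 'W' groups the data into weeks ending on Sunday
weekly_mean = distance.resample('W').mean()    # weeks without runs become NaN
weekly_count = distance.resample('W').count()  # weeks without runs count as 0

print(weekly_mean)
print(weekly_count)

# Average number of runs per week, including the empty week
print(weekly_count.mean())
```

    The four dates fall into four weekly bins, one of which is empty, so the average is one run per week even though the first week had two.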

    6. Visualization with averages

    Let's plot the long-term averages of my distance run and my heart rate with their raw data to visually compare the averages to each training session. Again, we'll use the data from 2015 through 2018.

    In this task, we will use matplotlib functionality for plot creation and customization.

    # Prepare data
    runs_subset_2015_2018 = df_run.loc['2018':'2015']
    runs_distance = runs_subset_2015_2018['Distance (km)']
    runs_hr = runs_subset_2015_2018['Average Heart Rate (bpm)']
    
    # Create plot
    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,8))
    
    # Plot and customize first subplot
    runs_distance.plot(ax=ax1)
    ax1.set(ylabel='Distance (km)', title='Historical data with averages')
    ax1.axhline(runs_distance.mean(), color='blue', linewidth=1, linestyle='-.')
    
    # Plot and customize second subplot
    runs_hr.plot(ax=ax2, color='gray')
    ax2.set(xlabel='Date', ylabel='Average Heart Rate (bpm)')
    ax2.axhline(runs_hr.mean(), color='blue', linewidth=1, linestyle='-.')
    
    # Show plot
    plt.show()

    7. Did I reach my goals?

    To motivate myself to run regularly, I set a target goal of running 1000 km per year. Let's visualize my annual running distance (km) from 2013 through 2018 to see if I reached my goal each year. Only stars in the green region indicate success.

    # Prepare data
    df_run_dist_annual = df_run.loc['2018':'2013', 'Distance (km)'].resample('A').sum()
    
    # Create plot
    fig = plt.figure(figsize=(8,5))
    
    # Plot and customize
    ax = df_run_dist_annual.plot(marker='*', markersize=14, linewidth=0, color='blue')
    ax.set(ylim=[0, 1210], 
           xlim=['2012','2019'],
           ylabel='Distance (km)',
           xlabel='Years',
           title='Annual totals for distance')
    
    ax.axhspan(1000, 1210, color='green', alpha=0.4)
    ax.axhspan(800, 1000, color='yellow', alpha=0.3)
    ax.axhspan(0, 800, color='red', alpha=0.2)
    
    # Show plot
    plt.show()
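
    The plot answers the question visually. The same check can be done numerically, as sketched below. The annual totals here are hypothetical placeholders, not my real numbers; in the notebook, df_run_dist_annual would take the place of annual_km.

```python
import pandas as pd

# Hypothetical annual totals in km; not the real results
annual_km = pd.Series([650.0, 1020.0, 980.0], index=[2016, 2017, 2018])

goal_km = 1000

# Flag the years that reached the goal and compute the shortfall for the rest
reached = annual_km >= goal_km
shortfall = (goal_km - annual_km).clip(lower=0)

print(pd.DataFrame({'total_km': annual_km,
                    'reached': reached,
                    'shortfall_km': shortfall}))
```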

    8. Am I progressing?

    Let's dive a little deeper into the data to answer a tricky question: am I progressing in terms of my running skills?

    To answer this question, we'll decompose my weekly distance run and visually compare it to the raw data. A red trend line will represent the weekly distance run.

    We are going to use the statsmodels library to decompose the weekly trend.