Applying Learnings

    I'm using this workspace to condense and summarise what I'm learning on my way to my first Data Analyst certification. I will use some existing data from my personal life to connect the Python functions with the data analysis concepts I'm learning.

    To put the learnings into practice, I'm using health data extracted from my smartwatch.

    import statistics
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy.stats import uniform, binom, norm, poisson
    
    #load the data into a variable
    df = pd.read_csv('hr_data.csv')
    
    #collection of functions I've used on my way to getting my first Data Analyst certification
    #df.columns
    #df.sort_values()
    #np.median()
    #statistics.mode()
    #plt.hist()
    #plt.axvline()
    #plt.show()
    #np.mean()
    #np.quantile()
    #np.random.seed()
    #df.loc[]
    #df.groupby()[].agg()
    #df.groupby()[].sum()
    #df[].value_counts()
    #df[].hist()
    #df.sample()
    #df.reset_index()
    #df.shape
    #uniform.cdf()
    #binom.pmf()
    #norm.ppf()
    #norm.rvs() - takes the mean as 'loc', the standard deviation as 'scale' and the number of samples as 'size', and returns an array of random samples spread on both sides of the mean.
    #poisson.pmf()
    #poisson.cdf()
    #poisson.rvs()
    #sns.lmplot()
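
    As a quick check of how norm.rvs() and norm.ppf() from the list above behave, here is a minimal sketch on made-up numbers. The mean of 70 and standard deviation of 10 are assumptions chosen for illustration, not values taken from hr_data.csv.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parameters, chosen only for illustration
mu, sigma = 70, 10

# Draw 1000 random samples from a normal distribution centred on mu
samples = norm.rvs(loc=mu, scale=sigma, size=1000, random_state=42)
print(samples.mean(), samples.std())

# ppf is the inverse of cdf: the value below which 50% of the distribution lies
midpoint = norm.ppf(0.5, loc=mu, scale=sigma)
print(midpoint)  # 70.0
```

    The sample mean and standard deviation land close to, but not exactly on, the values passed in as loc and scale.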

    Upload data into the workspace and clean it up for use.

    Start by looking at the contents of 'hr_data.csv'. I'll use .columns to list the column names and .shape to get the dimensions of the data, that is, how many rows and columns there are in total. .describe() gives summary statistics for the numeric columns: the count, mean, standard deviation, minimum, maximum and quartiles. .info() shows the data types, and .head() displays the first five rows of the dataframe.

    Some of the column names contain an unnecessary 'heart_rate.' prefix, so I'll strip it and update the column names.

    Convert the datetime fields to the correct data type with .astype() so that the .dt accessor is available when using the dates in the analysis.

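    #organising and formatting the data
    #strip the 'heart_rate.' prefix from the column names
    df.columns = [column.replace('heart_rate.','') for column in df.columns]
    #recent pandas versions require an explicit unit such as 'datetime64[ns]'
    time_columns = ['create_time','start_time','end_time']
    df[time_columns] = df[time_columns].astype('datetime64[ns]')
    print(df.columns)
    print(df.shape)
    print(df.describe())
    df.info()  #prints its summary directly and returns None
    print(df.head())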
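
    An alternative to .astype() for the date conversion is pd.to_datetime(). The sketch below uses a small made-up dataframe (the rows are invented stand-ins, not the real file) to show that the .dt accessor becomes available once the column holds datetimes.

```python
import pandas as pd

# Made-up rows standing in for the real file
demo = pd.DataFrame({
    'create_time': ['2021-03-01 08:15:00', '2022-07-19 21:40:00'],
    'heart_rate': [64, 71],
})

# Parse the strings into a datetime64 dtype
demo['create_time'] = pd.to_datetime(demo['create_time'])

# The .dt accessor now works on the column
years = demo['create_time'].dt.year.tolist()
print(years)  # [2021, 2022]
```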

    On the content side of things, everything looks good. I did have to work on the file locally a bit before uploading it to this workspace.

    I'll work through the functions in the order I learned them, which means I'll have to come up with a reason to use each one.

    With numpy's np.median() function we can find the midpoint of a range of values. I'm going to store the 'heart_rate' column as a pandas.Series in hrs_ordered, sort the values from smallest to largest (ascending) and store the median in the variable hr_med_ordered. To see whether the results differ, I'll also take the median of the unordered series and store it in hr_med_unordered. I'll print both results.
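
    The ordered versus unordered comparison can be sketched on a handful of made-up readings (the five values below are invented for illustration). np.median() works on the sorted values internally, so the order of the input should not matter.

```python
import numpy as np
import pandas as pd

# Made-up heart rate readings, deliberately out of order
hrs_unordered = pd.Series([72, 65, 90, 68, 61])
hrs_ordered = hrs_unordered.sort_values()

hr_med_unordered = np.median(hrs_unordered)
hr_med_ordered = np.median(hrs_ordered)

print(hr_med_unordered, hr_med_ordered)  # 68.0 68.0
```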

    I also decided to plot the values and learn how to add the median line to the plot.

    Conclusion

    • What can I say about my heart rate data based on the analysis below? It's difficult to conclude anything qualitative from the median heart rate alone. The median, the midpoint of all values in the dataset, is 68. That means roughly half of the observations are at or below 68 and the other half are at or above 68.

    The distribution of the data is right skewed, or positively skewed.

    hr_median = np.median(df['heart_rate'])
    hr_mean = np.mean(df['heart_rate'])
    
    #histogram of all measurements, with the median and mean marked as dashed lines
    #labels are set on each artist so the legend does not depend on drawing order
    df['heart_rate'].hist(bins=10, label='heart rate')
    plt.axvline(hr_median, color='r', linestyle='dashed', label='median')
    plt.axvline(hr_mean, color='y', linestyle='dashed', label='mean')
    plt.legend()
    
    hr_mode = statistics.mode(df['heart_rate'])
    print(f'Most common heart rate measurement value: {hr_mode}')
    plt.show()
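
    One way to sanity-check a right skew without a plot is to compare the mean with the median: a few large values pull the mean upwards while the median barely moves. The sketch below uses invented readings, not my actual data, and scipy.stats.skew as an extra check.

```python
import numpy as np
from scipy.stats import skew

# Made-up readings: mostly resting values plus two high ones
hr = np.array([60, 62, 64, 66, 68, 70, 110, 150])

hr_median = np.median(hr)
hr_mean = np.mean(hr)
print(hr_median, hr_mean)  # 67.0 81.25

# A positive skewness value indicates a longer tail on the right
print(skew(hr) > 0)  # True
```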

    Next I'll build on the previous results by grouping the data by year and returning the median and mean for each year.

    For this I'll need several functions. df.groupby() makes the groupings, but first I need to remind myself how to group by dates. For the date I'll use the 'create_time' column. Then I can call .agg() and pass it 'median' and 'mean' in a list.

    For the datetime groupings to work, I first had to make sure the columns were in datetime64 format. Then I was able to access the year of the 'create_time' series with .dt.year.

    After displaying the results I'll plot them as a grouped bar chart with ax.bar() and plt.show().

    So far I've only been talking about functions. It's more important for me to discuss the concepts and how these functions relate to them than to keep repeating what the documentation already explains.

    #median and mean heart rate per year, rounded to whole numbers
    df_grouped_byyear = df.groupby(df['create_time'].dt.year)['heart_rate'].agg(['median','mean']).round(0)
    print(df_grouped_byyear)
    
    labels = df_grouped_byyear.index
    x = np.arange(len(labels))  #one bar position per year
    width = 0.35
    
    #draw the two bars for each year side by side
    fig, ax = plt.subplots()
    bars_median = ax.bar(x - width/2, df_grouped_byyear['median'], width, label='median')
    bars_mean = ax.bar(x + width/2, df_grouped_byyear['mean'], width, label='mean')
    
    #label the axis with the years and show which bar is which
    ax.set_xticks(x)
    ax.set_xticklabels(labels)
    ax.legend()
    
    ax.bar_label(bars_median, padding=3)
    ax.bar_label(bars_mean, padding=3)
    
    fig.tight_layout()
    plt.show()

    Now I should have something to say about np.quantile(), so I need a way to use it on my data that extracts something at least remotely meaningful.

    To do this I could create boxplots for each year and look for outliers in the heart_rate values of each year.

    First, use the complete heart rate dataset to display the quartiles of the heart rates, with np.linspace() generating the quantile positions.

    print(f'Minimum heart rate: {df["heart_rate"].min()}')
    print(f'Maximum heart rate: {df["heart_rate"].max()}')
    
    #np.linspace(0,1,5) gives 0, 0.25, 0.5, 0.75 and 1: the minimum, the three quartiles and the maximum
    hr_quartiles = np.quantile(df['heart_rate'], np.linspace(0,1,5))
    print(f"Heart rate quartiles {hr_quartiles}")
    
    plt.boxplot(df['heart_rate'])
    plt.ylabel('heart rate')
    plt.title('Heart rate all time')
    plt.show()
    
    #one series of heart rates per year, in chronological order
    years = sorted(df['create_time'].dt.year.unique())
    list_of_years = [df.loc[df['create_time'].dt.year == year, 'heart_rate'] for year in years]
    
    fig, ax = plt.subplots()
    ax.boxplot(list_of_years)
    ax.set_xticklabels(years)
    plt.xlabel('year')
    plt.ylabel('heart rate')
    plt.title('Heart rate by years')
    plt.show()
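
    To connect np.quantile() to the outliers a boxplot draws: with matplotlib's default settings the whiskers reach at most 1.5 times the interquartile range beyond the first and third quartiles, and anything further out is drawn as an individual point. The sketch below applies that rule by hand to invented readings (not my actual data).

```python
import numpy as np

# Made-up heart rate readings with one unusually high value
hr = np.array([58, 62, 65, 66, 68, 70, 73, 75, 140])

# First and third quartiles, and the interquartile range between them
q1, q3 = np.quantile(hr, [0.25, 0.75])
iqr = q3 - q1

# Limits beyond which a value counts as an outlier under the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = hr[(hr < lower) | (hr > upper)]
print(outliers)  # [140]
```

    The same calculation could be run per year on the series collected in list_of_years to list the outlying measurements rather than only seeing them in the plot.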