2023-09-08 | Data Storytelling for Absolute Beginners: A Case Study with Green Businesses | Camilo Martinez


    Let's look at DataCamp's Workspaces.

    • They work like Jupyter notebooks, with AI assistance built in.
    • You can load datasets and interact with them either by writing Python code directly or by prompting the AI.

    Let's walk through a simple exploratory data analysis (EDA) exercise.

    Exercises

    Let's load a dataset with the help of AI:

    "Please read the green-businesses dataset using python"

    # Import the pandas library
    import pandas as pd
    
    # Read the dataset
    df = pd.read_csv('green-businesses.csv')
    
    # Display the first few rows of the dataset
    df.head()
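
    The later prompts refer to the year, region, and result variables, so it can help to confirm that those columns exist before asking for plots. The following is a minimal sketch under stated assumptions: the CSV text, business names, and values are invented stand-ins for green-businesses.csv, and only the three column names come from the prompts in this exercise.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for green-businesses.csv;
# the rows and the 'business' column are invented for illustration.
csv_text = """business,year,region,result
Cafe Verde,2019,North,Certified
Eco Print,2020,South,Pending
"""

df_demo = pd.read_csv(io.StringIO(csv_text))

# Columns that the later prompts in this exercise refer to
expected = ["year", "region", "result"]
missing = [col for col in expected if col not in df_demo.columns]

print(df_demo.shape)    # (rows, columns)
print(df_demo.dtypes)   # data type of each column
print("Missing columns:", missing)
```

    With the real file, replacing the io.StringIO(...) argument with the file path gives the same checks on the actual data.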

    Now let's count the number of green businesses:

    "Please count the number of entries in the green businesses dataset"

    # Count the number of entries in the dataset
    count = len(df)
    count
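
    Note that len(df) counts rows, which equals the number of businesses only if each business appears once and no rows are repeated. A small sketch of how one might check this, assuming a hypothetical toy table (the column names and values below are invented, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the green-businesses data; names and values
# are invented for illustration only.
toy = pd.DataFrame({
    "business": ["A", "B", "B", "C"],
    "region": ["North", "South", "South", None],
})

n_rows = len(toy)                              # every row, same as toy.shape[0]
n_unique = len(toy.drop_duplicates())          # rows left after dropping exact duplicates
n_missing_region = toy["region"].isna().sum()  # rows with no region recorded

print(n_rows, n_unique, n_missing_region)
```

    If the row count and the de-duplicated count differ, it is worth deciding which one the story should report.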
    

    Now we can try something more complex. It helps to know some of the concepts we reviewed in the presentation.

    "Please create a line plot that shows the evolution of the number of entries grouped by year"

    # Import the necessary libraries
    import matplotlib.pyplot as plt
    
    # Group the data by year and count the number of entries
    entries_by_year = df.groupby('year').size()
    
    # Create a line plot
    plt.plot(entries_by_year.index, entries_by_year.values)
    plt.xlabel('Year')
    plt.ylabel('Number of Entries')
    plt.title('Evolution of Number of Entries by Year')
    plt.show()
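
    One thing to watch with this kind of plot: groupby only returns years that actually occur in the data, so a year with no entries is skipped and the line is drawn straight across the gap. The sketch below shows one possible way to make such years explicit. It assumes that year holds integers and that a missing year really means zero entries, which may not be true of the real dataset. The values are invented.

```python
import pandas as pd

# Hypothetical data: the 'year' column name follows the generated code above,
# but these values are invented for illustration.
toy = pd.DataFrame({"year": [2019, 2019, 2020, 2022, 2022, 2022]})

entries_by_year = toy.groupby("year").size()

# Build the full range of years between the first and last year observed
first_year = int(entries_by_year.index.min())
last_year = int(entries_by_year.index.max())
all_years = list(range(first_year, last_year + 1))

# Reindex so that years with no entries appear with a count of 0
entries_by_year_filled = entries_by_year.reindex(all_years, fill_value=0)

print(entries_by_year_filled)
```

    In this toy example 2021 has no rows, so it is absent from entries_by_year and shows up with a count of 0 in entries_by_year_filled. The filled series can be passed to plt.plot in the same way as before.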
    

    Here is another example. It is often useful to think about which meaningful groups can be formed from the existing variables. In this case, we group by the region variable.

    "Please create a horizontal bar plot of the number of entries in the dataset grouped by the region variable"

    # Import the necessary libraries
    import matplotlib.pyplot as plt
    
    # Group the data by region and count the number of entries
    entries_by_region = df.groupby('region').size()
    
    # Sort the regions by number of entries, in descending order
    entries_by_region = entries_by_region.sort_values(ascending=False)
    
    # Create a horizontal bar plot
    plt.barh(entries_by_region.index, entries_by_region.values)
    plt.xlabel('Number of Entries')
    plt.ylabel('Region')
    plt.title('Number of Entries by Region')
    plt.show()
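
    As a side note, the group-then-sort step above can also be written with value_counts(), which counts and sorts in descending order in one call. The sketch below compares the two on invented data (only the region column name comes from the exercise). It also shows an ascending sort, since matplotlib's barh places the first item at the bottom of the axis, so ascending order puts the largest bar at the top.

```python
import pandas as pd

# Hypothetical data: the 'region' column name follows the exercise,
# the values are invented for illustration.
toy = pd.DataFrame(
    {"region": ["North", "South", "South", "East", "South", "North"]}
)

# Approach used in the generated code: group, count, then sort
by_groupby = toy.groupby("region").size().sort_values(ascending=False)

# Shortcut: value_counts() counts and sorts in descending order by default
by_value_counts = toy["region"].value_counts()

# Ascending order, so that barh would draw the largest group at the top
for_barh = by_value_counts.sort_values(ascending=True)

print(by_value_counts)
print(for_barh)
```

    Which ordering reads better depends on the story being told, so treat this as one option rather than a rule.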

    Finally, don't hesitate to ask questions in more natural language. This is where language models shine:

    "Please tell me the most common type of green business according to the result variable."

    To find the most common type of green business according to the result variable, we can use the value_counts() method in pandas together with idxmax(). Here's the code:

    # Find the most common type of green business
    most_common_type = df['result'].value_counts().idxmax()
    most_common_type
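
    It is worth double-checking an answer like this, because idxmax() returns a single label even when two categories share the highest count. A minimal sketch of how to inspect the counts and list every category tied for first place, using invented category names (the real values of result are not shown in this exercise):

```python
import pandas as pd

# Hypothetical data: the 'result' column name follows the exercise,
# the category names are invented for illustration.
toy = pd.DataFrame(
    {"result": ["Certified", "Pending", "Certified", "Pending", "Expired"]}
)

# Count how often each category occurs
counts = toy["result"].value_counts()

# Highest count, and every category that reaches it
top_count = counts.max()
most_common = sorted(counts[counts == top_count].index)

print(counts)
print(most_common)
```

    In this toy data two categories are tied, so reporting only one of them as "the most common type" would be misleading.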