2023-09-08 | Data Storytelling for Absolute Beginners: A Case Study with Green Businesses | Camilo Martinez
Let's take a look at DataCamp's Workspaces.
- They are like Jupyter notebooks... but better
- You can load datasets and work with them either by writing Python code directly or by prompting the built-in AI.
Let's walk through a simple exploratory data analysis (EDA) exercise.
Exercises
Let's load a dataset with the help of AI:
"Please read the green-businesses dataset using python"
# Import the pandas library
import pandas as pd
# Read the dataset
df = pd.read_csv('green-businesses.csv')
# Display the first few rows of the dataset
df.head()
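Beyond looking at the first rows, it can help to check the size, column types, and missing values right after loading. The following is a minimal sketch, not part of the original exercise: it reads a small in-memory stand-in instead of green-businesses.csv, and while the column names (year, region, result) are the ones used later in these exercises, the values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for green-businesses.csv. The column names match the
# ones used in the later exercises; the values are made up.
csv_text = """year,region,result
2019,North,Certified
2020,South,Certified
2020,North,In progress
2021,,Certified
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Number of rows and columns
n_rows, n_cols = sample.shape

# Data type of each column
print(sample.dtypes)

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

With the real file, the same calls can be run on df in place of sample.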
Now let's count the number of green businesses:
"Please count the number of entries in the green businesses dataset"
# Count the number of entries in the dataset
count = len(df)
count
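Note that len(df) counts rows, which equals the number of businesses only if each business appears exactly once. A rough sketch of how one might check this, assuming a hypothetical name column that identifies each business (the real dataset may use a different identifier, and the values below are invented):

```python
import pandas as pd

# Made-up sample; "name" is a hypothetical identifier column.
sample = pd.DataFrame({
    "name": ["Acme Solar", "Acme Solar", "Bike Hub", "Leaf Cafe"],
    "year": [2020, 2020, 2021, 2021],
})

# Total number of rows, as in the exercise above
row_count = len(sample)

# Number of rows left after dropping exact duplicate rows
unique_rows = len(sample.drop_duplicates())

# Number of distinct values in the identifier column
unique_names = sample["name"].nunique()

print(row_count, unique_rows, unique_names)
```

If the three numbers agree on the real data, the row count can be read as the number of businesses.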
Now we can try something more complex. It helps to know some of the concepts we reviewed in the presentation.
"Please create a line plot that shows the evolution of the number of entries grouped by year"
# Import the necessary libraries
import matplotlib.pyplot as plt
# Group the data by year and count the number of entries
entries_by_year = df.groupby('year').size()
# Create a line plot
plt.plot(entries_by_year.index, entries_by_year.values)
plt.xlabel('Year')
plt.ylabel('Number of Entries')
plt.title('Evolution of Number of Entries by Year')
plt.show()
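One thing to watch with a line plot of yearly counts: a year with no entries does not appear in the groupby result, so the line connects the neighbouring years as if nothing were missing. A small sketch of one way to make such gaps explicit, using invented data and assuming the year column holds integers:

```python
import pandas as pd

# Made-up sample with no entries for 2020
sample = pd.DataFrame({"year": [2018, 2018, 2019, 2021, 2021, 2021]})

# Count entries per year, as in the exercise above
entries_by_year = sample.groupby("year").size()

# Build the full range of years between the first and last observed year
first_year = int(entries_by_year.index.min())
last_year = int(entries_by_year.index.max())
all_years = list(range(first_year, last_year + 1))

# Years without entries get an explicit count of 0
entries_by_year_filled = entries_by_year.reindex(all_years, fill_value=0)
print(entries_by_year_filled)
```

The filled series can be passed to plt.plot in the same way as entries_by_year.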
Here is another example. It helps to think about which meaningful groups the existing variables can form; in this case, we group by the region variable.
"Please create a horizontal bar plot of the number of entries in the dataset grouped by the region variable"
# Import the necessary libraries
import matplotlib.pyplot as plt
# Group the data by region and count the number of entries
entries_by_region = df.groupby('region').size()
# Sort the regions by number of entries, in descending order
entries_by_region = entries_by_region.sort_values(ascending=False)
# Create a horizontal bar plot
plt.barh(entries_by_region.index, entries_by_region.values)
plt.xlabel('Number of Entries')
plt.ylabel('Region')
plt.title('Number of Entries by Region')
plt.show()
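A detail worth knowing about horizontal bar plots: matplotlib's barh draws the first item at the bottom of the axis, so a series sorted in descending order puts the largest region at the bottom. The sketch below, on invented data, shows an equivalent way to get the counts with value_counts() and how to order them so the largest bar ends up on top:

```python
import pandas as pd

# Made-up sample of regions
sample = pd.DataFrame(
    {"region": ["North", "South", "North", "East", "North", "South"]}
)

# value_counts() counts entries per region and sorts them in descending order
entries_by_region = sample["region"].value_counts()

# Sort ascending so that barh, which plots from the bottom up,
# places the largest region at the top of the chart
entries_for_barh = entries_by_region.sort_values(ascending=True)
print(entries_for_barh)
```

Passing entries_for_barh.index and entries_for_barh.values to plt.barh then gives a chart that reads from largest to smallest, top to bottom.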
Finally, don't hesitate to ask questions in more natural language. This is where the power of language models shines:
"Please tell me the most common type of green business according to the result variable."
To find the most common type of green business according to the result variable, we can use the value_counts() method in pandas. Here's the code:
# Find the most common type of green business
most_common_type = df['result'].value_counts().idxmax()
most_common_type
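Keep in mind that idxmax() returns a single label, so it reports neither how often that value occurs nor whether another value is tied with it. A short sketch of how to see the full picture, using invented values for the result column:

```python
import pandas as pd

# Made-up sample with a tie for the most common value
sample = pd.DataFrame(
    {"result": ["Certified", "In progress", "Certified", "In progress", "Expired"]}
)

# Count how often each value occurs
counts = sample["result"].value_counts()

# Highest count, and every value that reaches it
top_count = counts.max()
tied_for_top = sorted(counts[counts == top_count].index)

# Share of each value as a fraction of all entries
shares = sample["result"].value_counts(normalize=True)

print(top_count, tied_for_top)
print(shares)
```

On the real data, printing df['result'].value_counts() before taking idxmax() is a quick way to check the answer the AI gives.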