💪 Challenge
Create a report to summarize your research. Include:
- What is the average rating by country of origin?
- How many bars were reviewed for each of those countries?
- Create plots to visualize findings for questions 1 and 2.
- Is the cacao bean's origin an indicator of quality?
- [Optional] How does cocoa content relate to rating? What is the average cocoa content for bars with higher ratings (above 3.5)?
- [Optional 2] Your research indicates that some consumers want to avoid bars with lecithin. Compare the average rating of bars with and without lecithin (L in the ingredients).
- Summarize your findings.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
Image(filename='pics/chocopoop.jpg', width=600, height=600)
💾 The data
Your team created a file with the following information (source):
- "id" - id number of the review
- "manufacturer" - Name of the bar manufacturer
- "company_location" - Location of the manufacturer
- "year_reviewed" - From 2006 to 2021
- "bean_origin" - Country of origin of the cacao beans
- "bar_name" - Name of the chocolate bar
- "cocoa_percent" - Cocoa content of the bar (%)
- "num_ingredients" - Number of ingredients
- "ingredients" - B (Beans), S (Sugar), S* (Sweetener other than sugar or beet sugar), C (Cocoa Butter), V (Vanilla), L (Lecithin), Sa (Salt)
- "review" - Summary of most memorable characteristics of the chocolate bar
- "rating" - 1.0-1.9 Unpleasant, 2.0-2.9 Disappointing, 3.0-3.49 Recommended, 3.5-3.9 Highly Recommended, 4.0-5.0 Outstanding
Acknowledgments: Brady Brelinski, Manhattan Chocolate Society
df = pd.read_csv('data/chocolate_bars.csv')
df.head()
# df.info() prints its summary and returns None, so call it on its own
df.info()
display(df.isnull().sum())
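The ingredient codes listed above can be turned into a boolean flag, which is one possible starting point for the optional lecithin question. The sketch below is illustrative only: the toy rows are invented, the helper `has_ingredient` is a hypothetical name, and it assumes the "ingredients" column holds strings such as "4- B,S,C,L" (a count, a dash, then comma-separated codes). Rows with missing ingredients are dropped rather than counted as lecithin-free.

```python
import pandas as pd

# Toy rows for illustration only; real values come from data/chocolate_bars.csv.
toy = pd.DataFrame({
    "ingredients": ["3- B,S,C", "4- B,S,C,L", "5- B,S,C,V,L", "2- B,S", None, "4- B,S*,C,Sa"],
    "rating": [3.5, 3.0, 2.75, 3.75, 3.25, 3.0],
})

def has_ingredient(value, code):
    """Return True if `code` is one of the comma-separated ingredient codes."""
    # Assumed format: "<count>- <code>,<code>,..."; drop the count prefix first.
    codes = str(value).split("-", 1)[-1].split(",")
    return code in [c.strip() for c in codes]

# Keep only rows where the ingredient list is known.
known = toy.dropna(subset=["ingredients"]).copy()
known["has_lecithin"] = known["ingredients"].apply(has_ingredient, code="L")

# Average rating for bars with and without lecithin.
lecithin_means = known.groupby("has_lecithin")["rating"].mean()
print(lecithin_means)
```

Matching whole codes instead of searching for the letter "L" anywhere in the string keeps the check robust if other codes containing that letter ever appear.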
1. What is the average rating by country of origin?
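# Note: 'company_location' is the manufacturer's country; the beans' country
# of origin is stored in 'bean_origin' (used in question 4).
df_avg = df[['company_location', 'rating']].groupby('company_location').mean()
df_avg.rename(columns = {'rating':'avg. rating'}, inplace = True)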
from IPython.display import display_html
df1_styler = df_avg[:17].style.set_table_attributes("style='display:inline'").set_caption('df1')
df2_styler = df_avg[17:34].style.set_table_attributes("style='display:inline'").set_caption('df2')
df3_styler = df_avg[34:51].style.set_table_attributes("style='display:inline'").set_caption('df3')
df4_styler = df_avg[51:].style.set_table_attributes("style='display:inline'").set_caption('df4')
display_html(df1_styler._repr_html_()+df2_styler._repr_html_()+df3_styler._repr_html_()+df4_styler._repr_html_(), raw=True)
2. How many bars were reviewed for each of those countries?
# count() tallies non-null values, so this is the number of bars with a review text
df_count = df[['company_location', 'review']].groupby('company_location').count()
df_count.rename(columns = {'review':'reviews'}, inplace = True)
df1_styler = df_count[:17].style.set_table_attributes("style='display:inline'").set_caption('df1')
df2_styler = df_count[17:34].style.set_table_attributes("style='display:inline'").set_caption('df2')
df3_styler = df_count[34:51].style.set_table_attributes("style='display:inline'").set_caption('df3')
df4_styler = df_count[51:].style.set_table_attributes("style='display:inline'").set_caption('df4')
display_html(df1_styler._repr_html_()+df2_styler._repr_html_()+df3_styler._repr_html_()+df4_styler._repr_html_(), raw=True)
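The two tables above can also be produced in one pass with named aggregation, which keeps each country's average next to its review count. This is a minimal sketch on invented toy rows, not the real dataset; the output column names `avg_rating` and `reviews` are chosen here for illustration.

```python
import pandas as pd

# Toy rows for illustration only; real values come from data/chocolate_bars.csv.
toy = pd.DataFrame({
    "company_location": ["U.S.A.", "U.S.A.", "France", "France", "France", "Peru"],
    "rating": [3.0, 3.5, 3.75, 3.25, 3.5, 2.75],
})

# One groupby, two aggregations: the mean rating and the number of rows per country.
summary = (
    toy.groupby("company_location")
       .agg(avg_rating=("rating", "mean"), reviews=("rating", "size"))
       .sort_values("avg_rating", ascending=False)
)
print(summary)
```

Seeing both columns side by side makes it easier to spot countries whose average rests on only a handful of bars.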
3. Create plots to visualize findings for questions 1 and 2.
df_avg = df_avg.reset_index()
sns.set(font_scale=1.2)
fig, ax = plt.subplots(figsize=(10, 15))  # subplots returns (figure, axes)
sns.barplot(data=df_avg, y='company_location', x='avg. rating', palette='flare', ax=ax)
plt.title('Average rating by country (company location)')
plt.xlabel('Average rating')
plt.ylabel('Country (company location)')
plt.xlim(2.5, 3.8)
plt.show()
df_count = df_count.reset_index()
sns.set(font_scale=1.0)
fig, ax = plt.subplots(figsize=(10, 12))  # subplots returns (figure, axes)
sns.barplot(data=df_count, y='company_location', x='reviews', palette='winter', ax=ax)
plt.title('Number of reviews by country (company location)')
plt.xlabel('Number of reviews')
plt.ylabel('Country (company location)')
plt.show()
4. Is the cacao bean's origin an indicator of quality?
df_or = df[['bean_origin', 'rating']].groupby('bean_origin').mean().reset_index().sort_values('rating', ascending=False)
sns.set(font_scale=1.2)
fig, ax = plt.subplots(figsize=(10, 15))  # subplots returns (figure, axes)
sns.barplot(data=df_or, y='bean_origin', x='rating', palette='flare', ax=ax)
plt.title('Average rating by bean origin')
plt.xlabel('Average rating')
plt.ylabel('Country of bean origin')
plt.xlim(2.6, 3.7)
plt.show()
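A bar chart of averages treats an origin with one reviewed bar the same as an origin with hundreds. One way to check whether the ranking is trustworthy is to look at the spread and the number of reviews per origin, and to rank only origins with enough data. The sketch below uses invented toy rows, and `MIN_REVIEWS` is a hypothetical threshold, not a value taken from the analysis above.

```python
import pandas as pd

# Toy rows for illustration only; real values come from data/chocolate_bars.csv.
toy = pd.DataFrame({
    "bean_origin": ["Peru"] * 4 + ["Ghana"] * 3 + ["Tonga"],
    "rating": [3.25, 3.5, 3.0, 3.25, 2.75, 3.0, 3.25, 4.0],
})

MIN_REVIEWS = 3  # hypothetical cut-off; tune it to the real data

# Mean, standard deviation and number of reviews for each origin.
by_origin = toy.groupby("bean_origin")["rating"].agg(["mean", "std", "count"])

# Rank only the origins that have at least MIN_REVIEWS bars.
reliable = (
    by_origin[by_origin["count"] >= MIN_REVIEWS]
    .sort_values("mean", ascending=False)
)
print(reliable)
```

If the differences between origin means are small compared with the standard deviations, origin alone is a weak indicator of quality.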
5. [Optional] How does cocoa content relate to rating? What is the average cocoa content for bars with higher ratings (above 3.5)?
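A possible approach to this question, sketched on invented toy rows: measure the linear relationship with a Pearson correlation, then average "cocoa_percent" over bars rated strictly above 3.5. It assumes "cocoa_percent" is already numeric; if it were stored as text such as "70%", it would need converting first.

```python
import pandas as pd

# Toy rows for illustration only; real values come from data/chocolate_bars.csv.
toy = pd.DataFrame({
    "cocoa_percent": [70.0, 72.0, 75.0, 85.0, 100.0, 68.0],
    "rating": [3.75, 4.0, 3.5, 3.0, 2.0, 3.75],
})

# Pearson correlation between cocoa content and rating (between -1 and 1).
corr = toy["cocoa_percent"].corr(toy["rating"])

# Bars rated above 3.5 (strictly), and their average cocoa content.
high = toy[toy["rating"] > 3.5]
avg_cocoa_high = high["cocoa_percent"].mean()

print(f"correlation: {corr:.2f}")
print(f"average cocoa content of bars rated above 3.5: {avg_cocoa_high:.1f}%")
```

A correlation only captures a straight-line trend, so a scatter plot of the two columns is a useful companion if ratings peak at a middle range of cocoa content.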