Viktor Ivanenko

💪 Challenge

Create a report to summarize your research. Include:

  1. What is the average rating by country of origin?
  2. How many bars were reviewed for each of those countries?
  3. Create plots to visualize findings for questions 1 and 2.
  4. Is the cacao bean's origin an indicator of quality?
  5. [Optional] How does cocoa content relate to rating? What is the average cocoa content for bars with higher ratings (above 3.5)?
  6. [Optional 2] Your research indicates that some consumers want to avoid bars with lecithin. Compare the average rating of bars with and without lecithin (L in the ingredients).
  7. Summarize your findings.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
Image(filename='pics/chocopoop.jpg', width=600, height=600)

💾 The data

Your team created a file with the following information (source):
  • "id" - id number of the review
  • "manufacturer" - Name of the bar manufacturer
  • "company_location" - Location of the manufacturer
  • "year_reviewed" - From 2006 to 2021
  • "bean_origin" - Country of origin of the cacao beans
  • "bar_name" - Name of the chocolate bar
  • "cocoa_percent" - Cocoa content of the bar (%)
  • "num_ingredients" - Number of ingredients
  • "ingredients" - B (Beans), S (Sugar), S* (Sweetener other than sugar or beet sugar), C (Cocoa Butter), (V) Vanilla, (L) Lecithin, (Sa) Salt
  • "review" - Summary of most memorable characteristics of the chocolate bar
  • "rating" - 1.0-1.9 Unpleasant, 2.0-2.9 Disappointing, 3.0-3.49 Recommended, 3.5-3.9 Highly Recommended, 4.0-5.0 Outstanding
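
The rating scale above can be turned into a categorical column, which makes later comparisons (such as "bars rated above 3.5") easier to read. The following is a minimal sketch, not part of the original analysis: the helper name `rating_category` is hypothetical, and it assumes the bands are half-open intervals, so a rating of exactly 3.5 counts as Highly Recommended and 4.0 as Outstanding.

```python
import pandas as pd

# Band edges taken from the rating description; the last edge is set slightly
# above 5.0 so that a perfect score still falls inside the final band.
RATING_BINS = [1.0, 2.0, 3.0, 3.5, 4.0, 5.01]
RATING_LABELS = [
    'Unpleasant',
    'Disappointing',
    'Recommended',
    'Highly Recommended',
    'Outstanding',
]

def rating_category(ratings: pd.Series) -> pd.Series:
    """Map numeric ratings to the named bands (hypothetical helper)."""
    # right=False makes each band include its lower edge and exclude its upper edge.
    return pd.cut(ratings, bins=RATING_BINS, labels=RATING_LABELS, right=False)

# Usage with the review data, once it has been loaded into df:
# df['rating_band'] = rating_category(df['rating'])
```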

Acknowledgments: Brady Brelinski, Manhattan Chocolate Society

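df = pd.read_csv('data/chocolate_bars.csv')
display(df.head())
df.info()  # prints its summary directly and returns None, so it is not passed to display()
display(df.isnull().sum())  # number of missing values per column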

1. What is the average rating by country of origin?

# Note: this groups by the manufacturer's location (company_location), not by bean_origin
df_avg = df[['company_location', 'rating']].groupby('company_location').mean()
df_avg.rename(columns = {'rating':'avg. rating'}, inplace = True)

from IPython.display import display_html 

df1_styler = df_avg[:17].style.set_table_attributes("style='display:inline'").set_caption('df1')
df2_styler = df_avg[17:34].style.set_table_attributes("style='display:inline'").set_caption('df2')
df3_styler = df_avg[34:51].style.set_table_attributes("style='display:inline'").set_caption('df3')
df4_styler = df_avg[51:].style.set_table_attributes("style='display:inline'").set_caption('df4')


display_html(df1_styler._repr_html_()+df2_styler._repr_html_()+df3_styler._repr_html_()+df4_styler._repr_html_(), raw=True)

2. How many bars were reviewed for each of those countries?

df_count = df[['company_location', 'review']].groupby('company_location').count()
df_count.rename(columns = {'review':'reviews'}, inplace = True)

df1_styler = df_count[:17].style.set_table_attributes("style='display:inline'").set_caption('df1')
df2_styler = df_count[17:34].style.set_table_attributes("style='display:inline'").set_caption('df2')
df3_styler = df_count[34:51].style.set_table_attributes("style='display:inline'").set_caption('df3')
df4_styler = df_count[51:].style.set_table_attributes("style='display:inline'").set_caption('df4')

display_html(df1_styler._repr_html_()+df2_styler._repr_html_()+df3_styler._repr_html_()+df4_styler._repr_html_(), raw=True)

3. Create plots to visualize findings for questions 1 and 2.

df_avg = df_avg.reset_index()
sns.set(font_scale=1.2)
fig, ax = plt.subplots(figsize=(10, 15))  # subplots() returns a (figure, axes) pair
sns.barplot(data=df_avg, y='company_location', x='avg. rating', palette='flare', ax=ax)
plt.title('Average rating by country of origin')
plt.xlabel('Rating')
plt.ylabel('Country of origin')
plt.xlim(2.5, 3.8)
plt.show()
df_count = df_count.reset_index()
sns.set(font_scale=1.0)
fig, ax = plt.subplots(figsize=(10, 12))  # subplots() returns a (figure, axes) pair
sns.barplot(data=df_count, y='company_location', x='reviews', palette='winter', ax=ax)
plt.title('Number of reviews by country of origin')
plt.xlabel('Number of reviews')
plt.ylabel('Country of origin')
plt.show()

4. Is the cacao bean's origin an indicator of quality?

df_or = df[['bean_origin', 'rating']].groupby('bean_origin').mean().reset_index().sort_values('rating', ascending=False)
sns.set(font_scale=1.2)
fig, ax = plt.subplots(figsize=(10, 15))  # subplots() returns a (figure, axes) pair
sns.barplot(data=df_or, y='bean_origin', x='rating', palette='flare', ax=ax)
plt.title('Average rating by bean origin')
plt.xlabel('Rating')
plt.ylabel('Country of bean origin')
plt.xlim(2.6, 3.7)
plt.show()
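
The chart above ranks origins by mean rating alone, so an origin with only one or two reviewed bars can land at the top or bottom by chance. One way to make the comparison more robust is to report the spread and the sample size next to the mean. The sketch below is an illustration rather than part of the original notebook: `origin_summary` and its `min_reviews` cutoff of 10 are assumptions chosen for the example.

```python
import pandas as pd

def origin_summary(df: pd.DataFrame, min_reviews: int = 10) -> pd.DataFrame:
    """Mean, standard deviation, and review count of ratings per bean origin."""
    summary = (
        df.groupby('bean_origin')['rating']
          .agg(avg_rating='mean', std_rating='std', reviews='count')
    )
    # Drop origins with too few reviews to give a stable average
    # (the cutoff is an arbitrary illustrative choice).
    summary = summary[summary['reviews'] >= min_reviews]
    return summary.sort_values('avg_rating', ascending=False)

# Usage with the DataFrame loaded earlier:
# display(origin_summary(df).head(10))
```

If the averages of well-sampled origins differ by less than their standard deviations, origin on its own is at best a weak indicator of quality.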

5. [Optional] How does cocoa content relate to rating? What is the average cocoa content for bars with higher ratings (above 3.5)?
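
No code for this question appears here, so the following is only a sketch of one possible approach, not the author's method. It assumes `cocoa_percent` is stored as a number (as the data description suggests) and uses a hypothetical helper, `cocoa_vs_rating`, to compute the Pearson correlation between cocoa content and rating, along with the average cocoa content for bars rated above 3.5 and for all other bars.

```python
import pandas as pd

def cocoa_vs_rating(df: pd.DataFrame, threshold: float = 3.5) -> dict:
    """Relate cocoa content to rating (hypothetical helper)."""
    # Pearson correlation between cocoa content and rating.
    corr = df['cocoa_percent'].corr(df['rating'])
    # Average cocoa content for bars rated strictly above the threshold...
    high = df.loc[df['rating'] > threshold, 'cocoa_percent'].mean()
    # ...and for the remaining bars, for comparison.
    rest = df.loc[df['rating'] <= threshold, 'cocoa_percent'].mean()
    return {'pearson_r': corr, 'avg_cocoa_high': high, 'avg_cocoa_rest': rest}

# Usage with the DataFrame loaded earlier:
# cocoa_vs_rating(df)
```

A correlation near zero would suggest that cocoa content by itself says little about rating; a scatter plot of the two columns can help spot a non-linear pattern that the coefficient misses.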
