Chocolate Bars Analysis

    1. Chocolate bars and ratings

    In this notebook, we will analyze the chocolate bar market by exploring the chocolate bars with the highest ratings. We'll also look for characteristics in the data that could help narrow our search for suppliers (e.g., cocoa percentage, bean country of origin, etc.).

    Let's take a look at the data. The file "chocolate_bars.csv" contains all the details of the chocolate bars. The features that describe a given bar are:

    • "id" - id number of the review
    • "manufacturer" - Name of the bar manufacturer
    • "company_location" - Location of the manufacturer
    • "year_reviewed" - From 2006 to 2021
    • "bean_origin" - Country of origin of the cacao beans
    • "bar_name" - Name of the chocolate bar
    • "cocoa_percent" - Cocoa content of the bar (%)
    • "num_ingredients" - Number of ingredients
    • "ingredients" - B (Beans), S (Sugar), S* (Sweetener other than sugar or beet sugar), C (Cocoa Butter), V (Vanilla), L (Lecithin), Sa (Salt)
    • "review" - Summary of most memorable characteristics of the chocolate bar
    • "rating" - 1.0-1.9 Unpleasant, 2.0-2.9 Disappointing, 3.0-3.49 Recommended, 3.5-3.9 Highly Recommended, 4.0-5.0 Outstanding
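
The rating scale above can be turned into a categorical column, which is handy for grouping later on. The following is a minimal sketch, not part of the original notebook: the `toy` frame, the `rating_band` column name, and the exact bin edges are assumptions made for illustration, and the edges assume ratings fall between 1.0 and 5.0.

```python
import pandas as pd

# Toy ratings for illustration only; these are not rows from chocolate_bars.csv.
toy = pd.DataFrame({"rating": [1.5, 2.75, 3.0, 3.25, 3.5, 3.75, 4.0]})

# Band edges follow the rating scale listed above. With right=False each band
# includes its lower edge, so 3.5 lands in "Highly Recommended".
# The last edge sits just above 5.0 so that a rating of exactly 5.0 is kept.
edges = [1.0, 2.0, 3.0, 3.5, 4.0, 5.0 + 1e-9]
labels = [
    "Unpleasant",
    "Disappointing",
    "Recommended",
    "Highly Recommended",
    "Outstanding",
]

# "rating_band" is a hypothetical column name chosen for this sketch.
toy["rating_band"] = pd.cut(toy["rating"], bins=edges, labels=labels, right=False)
print(toy)
```

The same `pd.cut` call should apply to `chocolates["rating"]` once the data is loaded, assuming the column holds numeric values in that range.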

    Acknowledgments: Brady Brelinski, Manhattan Chocolate Society

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from IPython.display import clear_output
    
    %matplotlib inline
    # Read in the data
    chocolates = pd.read_csv("data/chocolate_bars.csv")
    
    # Print the total number of chocolate bars
    print('Total number of chocolate bars =', chocolates.shape[0])
    
    # Have a look at the first 5 rows
    chocolates.head()
    # Check the information of the dataset, especially the columns we need
    chocolates.info()

    2. Exploring chocolate bean origin

    With more than 2000 chocolate bar ratings covering beans from countries around the world, we start by checking which countries are present. This brings us to the following questions:

    1. How many countries are present?
    2. Which country has the highest rating for chocolate bars?
    3. Which country has the fewest chocolate bar ratings?
    4. Is any specific country dominating the market?
    # Check which countries are present and how many there are
    display(f"Total number of countries = {len(chocolates.bean_origin.unique())}")
    
    print("The countries are as follows:\n", chocolates.bean_origin.unique())

    3. Distribution of chocolate bar ratings

    First, let's see how these chocolate bars perform on average, because ratings are a key quality indicator for any product. We'll plot a histogram to show the distribution of the ratings and then print the overall average.

    # Distribution of chocolate bars according to their ratings
    chocolates.rating.hist(figsize=(10, 6))
    
    plt.xlabel("Ratings", fontsize=12)
    plt.ylabel("Number of Chocolate bars", fontsize=12)
    plt.title("Distribution of Chocolate Bars According to Their Ratings", fontsize=14)
    clear_output()
    
    # Print the average rating for all chocolate bars
    display(f"Average rating of all bars: {chocolates.rating.mean():.02f}")
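
The shape of the histogram can also be summarized numerically. As a rough sketch (not part of the original notebook), `Series.skew()` returns a negative value when the tail of the distribution extends toward low values, and in that case the mean typically sits below the median. The `toy_ratings` series is made-up data used only to keep the example self-contained.

```python
import pandas as pd

# Toy ratings for illustration only: most values sit between 3 and 4,
# with two low outliers that create a tail on the left.
toy_ratings = pd.Series([1.0, 2.0, 3.0, 3.25, 3.25, 3.5, 3.5, 3.5, 3.75, 4.0])

skewness = toy_ratings.skew()        # sample skewness; negative means a left tail
mean_rating = toy_ratings.mean()
median_rating = toy_ratings.median()

print(f"Skewness: {skewness:.2f}")
print(f"Mean: {mean_rating:.2f}, Median: {median_rating:.2f}")
```

Running the same three calls on `chocolates.rating` would give a quick check on what the histogram suggests visually.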

    4. Relation between bean origin and chocolate bar rating

    From our analysis so far, we see that there are 62 unique countries in our dataset and that the average rating across all chocolate bars is 3.20. The histogram is also skewed to the left, indicating that most chocolate bars are rated toward the high end of the scale, with only a few in the low-rated region. Now we'll answer some of our questions about average rating and country of bean origin.

    # Create a dataframe of the countries and their average chocolate bar rating
    origin_rating = chocolates.groupby("bean_origin")[["rating"]].agg("mean").sort_values("rating", ascending=False)
    
    # Visualize country and average rating
    origin_rating.plot(kind="bar", figsize=(15, 8))
    plt.title("Average Chocolate Bar Rating for each Country", fontsize=14)
    clear_output()
    
    # Print the top 10 countries
    print(origin_rating.head(10))
    # Print the country with the highest average chocolate bar rating
    highest = origin_rating.head(1)
    print("Highest average rating =>\nCountry:", highest.index[0], "\nRating:", round(highest.rating.iloc[0], 2), "\n")
    
    # Print the country with the lowest average chocolate bar rating
    lowest = origin_rating.tail(1)
    print(f"Lowest average rating =>\nCountry: {lowest.index[0]}\nRating: {lowest.rating.iloc[0]:.2f}")

    5. How many bars in each country?

    Tobago is the country with the highest average rating, while Puerto Rico has the lowest. We also notice that the gap between the ratings is not wide enough to indicate an extreme difference in the quality of the chocolate bars. This brings us to the question "Which country is dominant in the chocolate bar market?", which we'll answer by checking how many chocolate bars originate from each country.

    # Count the number of chocolate bars that originate from each country 
    bars_per_origin = chocolates.bean_origin.value_counts()
    
    # Print the top 10 countries in the bean market
    print(bars_per_origin.nlargest(10), "\n")
    
    # Print the bottom 12 bean producing countries
    print(bars_per_origin.tail(12))
    # Visualize the bars per country
    bars_per_origin.plot(kind="bar", figsize=(15, 8))
    plt.title("Number of Chocolate Bars by Bean Origin Country", fontsize=14)
    
    clear_output()
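
Average rating and bar count are easier to compare when they sit in one table, since a high average based on very few bars says little. The following is a minimal sketch rather than the notebook's own code: the `toy` frame, the placeholder origin names, and the `avg_rating` and `num_bars` column names are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data for illustration only; the origins are placeholders, not real countries.
toy = pd.DataFrame({
    "bean_origin": ["A", "A", "A", "B", "B", "C"],
    "rating": [3.0, 3.5, 3.25, 2.5, 3.0, 4.0],
})

# Named aggregation computes the mean rating and the number of bars per origin
# in a single pass; the output column names are hypothetical.
summary = (
    toy.groupby("bean_origin")["rating"]
    .agg(avg_rating="mean", num_bars="count")
    .sort_values("avg_rating", ascending=False)
)
print(summary)
```

In this toy table the top-rated origin has only one bar, which is the kind of pattern worth checking for in the real data before drawing conclusions from the averages.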

    6. Comparing the ratings of top bean producing countries

    Venezuela, Peru, Dominican Republic and Ecuador have the highest market prevalence, and there are other countries near the top with 100 or more chocolate bars originating from them. Interestingly, Tobago, the country with the highest average rating, is not among the top 10 bean producing countries. It is in fact the 12th lowest by count, with just 2 bars originating from there.

    Now we'll focus on countries with 100 or more bars and with average ratings higher than or close to the overall average rating.
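
One way to express that filter is sketched below, under stated assumptions. The `toy` frame and its origin names are made up, `min_bars` is set to 3 only because the toy data is tiny (the text above uses 100), and `tolerance` is a hypothetical cutoff for what counts as "close to" the overall average.

```python
import pandas as pd

# Toy data for illustration only; the origins are placeholders, not real countries.
toy = pd.DataFrame({
    "bean_origin": (
        ["Origin A"] * 4 + ["Origin B"] * 3 + ["Origin C"] * 3 + ["Origin D"]
    ),
    "rating": [3.5, 3.5, 3.25, 3.75, 3.0, 3.25, 3.25, 2.5, 2.5, 2.75, 4.0],
})

min_bars = 3      # assumption for the toy data; the analysis above uses 100
tolerance = 0.1   # hypothetical margin for "close to the overall average"

overall_avg = toy["rating"].mean()

# Per-origin average rating and bar count.
summary = toy.groupby("bean_origin")["rating"].agg(avg_rating="mean", num_bars="count")

# Keep origins with enough bars whose average is above, or within
# `tolerance` below, the overall average.
shortlist = summary[
    (summary["num_bars"] >= min_bars)
    & (summary["avg_rating"] >= overall_avg - tolerance)
]
print(f"Overall average: {overall_avg:.2f}")
print(shortlist)
```

Here Origin D is dropped despite its high rating because it has a single bar, and Origin C is dropped because its average falls well below the overall mean. The right value for `tolerance` is a judgment call that depends on how strict the supplier shortlist should be.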