Competition - Chocolate Bars

    ℹ️ Introduction to Data Science Notebooks

    You can skip this section if you are already familiar with data science notebooks.

    Data science notebooks

    A data science notebook is a document that contains text cells (what you're reading right now) and code cells. What makes a notebook unique is that it is interactive: you can change or add code cells, then run a cell by selecting it and clicking the Run button above (or Run All), or by pressing Control + Enter.

    The result will be displayed directly in the notebook.

    Try running the cell below:

    # Run this cell to see the result
    # (df is created in the data-loading cell further down, so run that cell first)
    lab = df.manufacturer.value_counts()
    print(lab)

    Modify the code and rerun the cell.

    Data science notebooks & data analysis

    Notebooks are great for interactive data analysis. Let's create a pandas DataFrame using the read_csv() function.

    We will load the dataset "chocolate_bars.csv" containing different characteristics of over 2,500 chocolate bars and their reviews.

    By using the .head() method, we display the first five rows of data:

    # Importing pandas, NumPy, and the plotting libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    pd.set_option('display.max_columns',10)
    pd.set_option('display.max_rows',20)
    
    # Reading in the data
    df = pd.read_csv('data/chocolate_bars.csv')
    
    # Define the background style of the charts
    plt.style.use('ggplot')
    
    # Display the first five rows
    # (rows with missing ingredient data are dropped later, once missing values have been counted)
    df.head()
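
    The loading step above can be tried without the competition file. The sketch below is only an illustration: it reads a small, made-up CSV from memory instead of data/chocolate_bars.csv, and the seven rows are hypothetical. It shows that read_csv() accepts a file-like object as well as a path, and that .head() returns five rows unless a different count is passed.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/chocolate_bars.csv: seven made-up rows
# using two of the column names that appear in this notebook.
csv_text = """manufacturer,rating
A,3.0
B,3.5
C,2.75
D,4.0
E,3.25
F,3.5
G,2.5
"""

# read_csv() accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default; pass n for another count
first_five = sample.head()
first_two = sample.head(2)

print(first_five)
print(sample.shape)
```

    With the real dataset, only the argument to read_csv() changes; the .head() call is the same.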
    
    
    
    df.manufacturer.value_counts()
    df.company_location.value_counts()
    # Checking the shape of the dataframe
    print(df.shape)
    # Load a table of country codes and coordinates
    countries=pd.read_csv('https://gist.githubusercontent.com/tadast/8827699/raw/f5cac3d42d16b78348610fc4ec301e9234f82821/countries_codes_and_coordinates.csv')
    print(countries)
    # Generate an overview of the dataframe
    df.info()
    # Count the missing values in each column
    df.isnull().sum()  # to count duplicated rows instead, use: df.duplicated().sum()
    # Remove missing values
    df.dropna(subset=["num_ingredients", 'ingredients'], inplace=True)
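
    To see what the subset argument does, here is a minimal sketch on a made-up four-row table. Only the column names are taken from this notebook; the values are hypothetical. Rows are dropped only when one of the listed columns is missing, so a row that lacks just a rating is kept.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the notebook
toy = pd.DataFrame({
    "num_ingredients": [3.0, np.nan, 4.0, 2.0],
    "ingredients": ["B,S,C", None, "B,S,C,L", "B,S"],
    "rating": [3.5, 3.0, np.nan, 2.75],
})

# Count missing values per column
missing_per_column = toy.isna().sum()

# Drop rows only where one of the listed columns is missing;
# a missing rating alone does not remove a row
cleaned = toy.dropna(subset=["num_ingredients", "ingredients"])

print(missing_per_column)
print(len(toy), "->", len(cleaned))
```

    Without inplace=True, dropna() leaves the original DataFrame untouched and returns a new one, which is why the result is assigned to a new name here.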
    
    # Remove duplicated rows (drop_duplicates returns a new dataframe, so assign it back)
    df = df.drop_duplicates()
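
    Checking for duplicates and removing them are two separate calls. The sketch below uses a small made-up table (hypothetical values) to show the difference: duplicated() flags repeated rows, while drop_duplicates() returns a new DataFrame and leaves the original as it was.

```python
import pandas as pd

# Made-up rows: the second row repeats the first exactly
toy = pd.DataFrame({
    "manufacturer": ["A", "A", "B", "B"],
    "rating": [3.5, 3.5, 3.0, 2.75],
})

# duplicated() marks each row that repeats an earlier row exactly
n_duplicates = toy.duplicated().sum()

# drop_duplicates() returns a new DataFrame; toy itself is unchanged
deduplicated = toy.drop_duplicates()

print(n_duplicates)
print(len(toy), len(deduplicated))
```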
    # Show the plot of each column against the other in the dataframe
    sns.pairplot(df)
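    # Display the correlation coefficient for each pair of numeric columns
    # (numeric_only needs pandas 1.5+; without it, text columns raise an error in pandas 2.x)
    corr = df.corr(numeric_only=True)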
    corr.plot(kind='bar')
    plt.show()
    
    corr
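
    The correlation step only makes sense for numeric columns. As a rough sketch, assuming pandas 1.5 or later (where the numeric_only argument is available), the made-up table below mixes a text column with two numeric ones. The values are hypothetical and deliberately form a perfectly linear pair, so they say nothing about the real data.

```python
import pandas as pd

# Made-up rows; rating rises by 0.5 for every extra ingredient
toy = pd.DataFrame({
    "manufacturer": ["A", "B", "C", "D"],
    "num_ingredients": [2, 3, 4, 5],
    "rating": [2.5, 3.0, 3.5, 4.0],
})

# numeric_only=True skips the text column instead of failing on it
corr = toy.corr(numeric_only=True)

print(corr)
```

    The result is a square table with one row and one column per numeric column, and 1.0 on the diagonal.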
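    # What is the average rating by country of origin?
    
    # head() keeps only the first five origins (in alphabetical order)
    x = df.groupby("bean_origin")["rating"].mean().head()
    # sort_values returns a new Series, so assign it back before plotting
    x = x.sort_values(ascending=True)
    x.plot(kind='bar')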
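
    The order of the steps matters in the cell above: taking .head() before sorting keeps the first five origins alphabetically, not the five best-rated ones. A possible alternative, sketched here on made-up rows (the origins and ratings are hypothetical), is to sort the group means first and then take the top entries.

```python
import pandas as pd

# Made-up rows; only the column names come from the notebook
toy = pd.DataFrame({
    "bean_origin": ["Peru", "Peru", "Ghana", "Ghana", "Belize"],
    "rating": [3.0, 3.5, 2.5, 3.0, 3.75],
})

# Average rating per origin
mean_by_origin = toy.groupby("bean_origin")["rating"].mean()

# sort_values() returns a new Series, so the result has to be assigned
mean_by_origin = mean_by_origin.sort_values(ascending=False)

# After sorting, head() gives the highest-rated origins
top_two = mean_by_origin.head(2)

print(top_two)
```

    Calling .plot(kind='bar') on the sorted Series would then draw the bars in the same order.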