Certification - Bckp

    Data Analyst Professional Practical Exam Submission

    You can use any tool that you want to do your analysis and create visualizations. Use this template to write up your summary for submission.

    You can use any markdown formatting you wish. If you are not familiar with Markdown, read the Markdown Guide before you start.

    📝 Task List

    Your written report should include written text summaries and graphics of the following:

    • Data validation:
      • Describe validation and cleaning steps for every column in the data
    • Exploratory Analysis:
      • Include two different graphics showing single variables only to demonstrate the characteristics of data
      • Include at least one graphic showing two or more variables to represent the relationship between features
      • Describe your findings
    • Definition of a metric for the business to monitor
      • How should the business use the metric to monitor the business problem
      • Can you estimate initial value(s) for the metric based on the current data
    • Final summary including recommendations that the business should undertake


    import pandas as pd
    import seaborn as sns
    import numpy as np
    from datetime import date
    
    # seaborn layout
    sns.set_style('whitegrid')
    sns.set_context('notebook')
    sns.set_palette('colorblind')
    df = pd.read_csv('https://s3.amazonaws.com/talent-assets.datacamp.com/product_sales.csv')

    Data validation

    The data consists of 15000 observations (13924 after cleaning) and 9 variables:

    • week: weeks since product launch, ranging from 1 to 6. No cleaning was necessary.
    • sales_method: 3 different sales methods. Capitalisation was inconsistent and some terms were truncated. I unified the sales_method categories so that all of them are now written out in full and capitalised.
    • customer_id: All customer ids are unique. No cleaning was necessary.
    • nb_sold: There are gaps in the distribution of quantities (strikingly few sales for unit numbers of 14 and 16). I have not made any changes to this variable. Is there a known reason for this uneven distribution?
    • revenue: A relatively large number of entries were missing. Since this is the variable of interest and I found no connection between the missing entries and the other variables, I deleted all rows with missing revenue values.
    • years_as_customer: 2 entries exceeded the number of years the company has existed. I deleted these 2 rows.
    • nb_site_visits: No problems, no missing values.
    • state: 50 different states. No problems, no missing values.
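
The cleaning steps described above could be sketched roughly as follows. This is a minimal illustration, not the exact code used for the report: the function name `clean_sales`, the label mapping `SALES_METHOD_MAP`, and the toy data are assumptions, and the raw spellings in the real file should be checked with `df['sales_method'].unique()` before relying on the mapping. The year 1984 is taken from the tenure check used in this report.

```python
import pandas as pd
from datetime import date

# Founding year as used in this report's tenure check (date.today().year - 1984).
FOUNDING_YEAR = 1984

# Hypothetical mapping from lower-cased raw labels to the unified categories.
# The actual raw spellings in the data are an assumption here.
SALES_METHOD_MAP = {
    'email': 'Email',
    'call': 'Call',
    'email + call': 'Email + Call',
    'em + call': 'Email + Call',
}

def clean_sales(df, today=None):
    """Return a cleaned copy of the sales data (sketch of the steps in the list above)."""
    today = today or date.today()
    max_years = today.year - FOUNDING_YEAR
    out = df.copy()
    # Unify capitalisation and truncated labels; unknown labels are left unchanged.
    key = out['sales_method'].str.strip().str.lower()
    out['sales_method'] = key.map(SALES_METHOD_MAP).fillna(out['sales_method'])
    # Drop rows with missing revenue.
    out = out.dropna(subset=['revenue'])
    # Drop rows whose tenure exceeds the age of the company.
    out = out[out['years_as_customer'] <= max_years]
    return out.reset_index(drop=True)

# Toy data for illustration only (not taken from product_sales.csv).
sample = pd.DataFrame({
    'sales_method': ['Email', 'email', 'em + call', 'Call', 'Email + Call'],
    'revenue': [100.0, None, 50.0, 75.0, 120.0],
    'years_as_customer': [3, 1, 5, 63, 0],
})
cleaned = clean_sales(sample, today=date(2024, 1, 1))
print(cleaned)
```

Passing `today` explicitly keeps the tenure cut-off reproducible; without it the function falls back to the current date, as the check in this report does.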
    df.info()
    df.isna().sum()
    df.describe()
    df.head()
    # Flag rows with missing revenue (True where revenue is missing)
    df = df.assign(null_revenue = df['revenue'].isna())
    sns.countplot(x = 'week', hue = 'null_revenue', data = df)
    sns.countplot(x = 'week', hue = 'sales_method', data = df)
    sns.countplot(x = 'sales_method', data = df)
    df.columns
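    # Share of rows with a duplicated customer_id (0 means all ids are unique)
    np.mean(df.duplicated('customer_id'))
    # Maximum plausible tenure: years since 1984, the founding year used here
    maxYears = date.today().year - 1984
    print(maxYears)
    # Share of rows whose years_as_customer exceeds that maximum
    np.mean(df['years_as_customer'] > maxYears)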
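
One way to check whether missing revenue is related to another variable, complementing the count plots above, is to compare the share of missing values per group. The sketch below is an assumption about how such a check might look: the helper `missing_rate_by` and the toy data are hypothetical and not part of the original analysis.

```python
import pandas as pd

def missing_rate_by(df, target, by):
    """Share of missing values in `target` for each level of `by`."""
    return df[target].isna().groupby(df[by]).mean()

# Toy data for illustration only (not taken from product_sales.csv).
sample = pd.DataFrame({
    'week': [1, 1, 2, 2, 3, 3],
    'revenue': [10.0, None, 20.0, None, 30.0, 35.0],
})
rates = missing_rate_by(sample, 'revenue', 'week')
print(rates)
```

Roughly equal rates across groups would be consistent with the missing entries being unrelated to that variable; clearly different rates would suggest that dropping the rows could bias the results.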