import pandas as pd


def convert_csv_to_df(csv_name, source_type):
    """ Converts an NPS CSV into a DataFrame with a column for the source. 

    Args:
        csv_name (str): The name of the NPS CSV file.
        source_type (str): The source of the NPS responses.

    Returns:
        A DataFrame with the CSV data and a new column: source.
    """   
    df = pd.read_csv(csv_name)
    df['source'] = source_type
    return df


# Test the function on the mobile data: 
convert_csv_to_df("datasets/2020Q4_nps_mobile.csv", "mobile")

2. Verifying the files with the "with" keyword

Excellent, we have a function that reads and cleans Brenda's CSVs precisely the way she needs! She can call this function in the future for as many different sources as she wants. Before we combine the NPS DataFrames, we want to add a function that verifies that the input files are valid. Each of these NPS dataset files should have three columns: response_date, user_id, and nps_rating. Previously, Brenda would check this manually by opening each file.

Let's write a function that uses the context manager with open() so that we properly close the files we open, even if an exception is raised. If we didn't use the with keyword with open(), we would need to call close() after we're done with the file. Even then, it would be risky, because an error might be raised before the call to close() is reached.
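
To make the difference concrete, here's a minimal sketch of both approaches (the file name here is just a placeholder):

# Manual approach: if readline() raises an error before close() runs,
# the file handle is left open unless we wrap everything in try/finally:
f = open("datasets/example.csv")
try:
    first_line = f.readline()
finally:
    f.close()

# Context manager: the file is closed automatically, even if an error occurs:
with open("datasets/example.csv") as f:
    first_line = f.readline()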

The function will return True if the file contains the right columns. Otherwise, it will return False. To test the function, we'll use datasets/corrupted.csv to simulate a corrupted, invalid NPS file.

def check_csv(csv_name):
    """ Checks if a CSV has three columns: response_date, user_id, nps_rating

    Args:
        csv_name (str): The name of the CSV file.

    Returns:
        Boolean: True if the CSV is valid, False otherwise 
    """
    with open(csv_name) as f:
        first_line = f.readline()
        # Return True if the header matches the three expected columns exactly:
        if first_line == "response_date,user_id,nps_rating\n":
            return True

        return False


# Test the function on a corrupted NPS file:
print(check_csv('datasets/corrupted.csv'))
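
Note that check_csv() compares the raw header line, so it requires exactly those three columns, in that order, with a trailing newline. If we ever needed an order-insensitive check, one possible variation (a sketch, not the version used in the rest of this notebook) would be to parse the header with Python's csv module and compare column sets:

import csv

def check_csv_columns(csv_name):
    """ Checks a CSV's columns while ignoring their order. """
    with open(csv_name, newline="") as f:
        # Read just the header row; default to [] if the file is empty:
        header = next(csv.reader(f), [])
    return set(header) == {"response_date", "user_id", "nps_rating"}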

3. Putting it together with nested functions

Alright, we now have one function that verifies that the CSVs are valid and another that converts them into the DataFrame format needed by Brenda. What's left? Looking at the script, this is the last line we haven't covered: q4_nps = pd.concat([email,mobile,web]). We could use this line of code, but we'll see more code repetition if we get CSVs from other sources or time periods.

To make sure our code is scalable, we're going to write a function called combine_nps_csvs() that takes in a dictionary. Python dictionaries have key:value pairs. In our case, the CSV's name and source type will be the key and value, respectively. That way, we can define a dictionary with as many NPS files as we have and run it through combine_nps_csvs(). For each file, we'll check that it's valid using check_csv(), and if it is, we'll use convert_csv_to_df() to convert it into a DataFrame. At the start of the function, we'll define an empty DataFrame called combined, and every time a CSV is successfully converted, we'll concatenate it to combined.

def combine_nps_csvs(csvs_dict):
    # Define combined as an empty DataFrame:
    combined = pd.DataFrame()

    # Iterate over csvs_dict to get the name and source of the CSVs:
    for name, source in csvs_dict.items():
        # Check if the csv is valid:
        if check_csv(name):
            # Convert the CSV using convert_csv_to_df():
            temp = convert_csv_to_df(name, source)
            # Concatenate combined and temp:
            combined = pd.concat([combined, temp])

        # If the file is not valid, print a message with the CSV's name:
        else:
            print(name + " is not a valid file and will not be added.")

    # Return the combined DataFrame
    return combined


my_files = {
  "datasets/2020Q4_nps_email.csv": "email",
  "datasets/2020Q4_nps_mobile.csv": "mobile",
  "datasets/2020Q4_nps_web.csv": "web",
  "datasets/corrupted.csv": "social_media"
}

# Test the function on the my_files dictionary:
combine_nps_csvs(my_files)
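
Since datasets/corrupted.csv fails check_csv(), running this should print the message from the else branch, along the lines of:

datasets/corrupted.csv is not a valid file and will not be added.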

4. Detractors, Passives, and Promoters

We've summarized our colleague's script into one function: combine_nps_csvs()! Let's move on to analyzing the NPS data, such as actually calculating NPS. As a reminder, NPS works by asking "How likely is it that you would recommend our product to a friend or colleague?" with a rating scale of 0 to 10.

NPS ratings are categorized into three groups. Ratings from 0 to 6 are detractors, ratings of 7 or 8 are passives, and finally, ratings of 9 or 10 are promoters. There's more to analyzing NPS, but remember, functions should be small in scope and should just "do one thing". So before we get ahead of ourselves, let's write a simple function that takes an NPS rating and categorizes it into the appropriate group.

def categorize_nps(x):
    """ 
    Takes an NPS rating and outputs whether it is a "promoter", 
    "passive", "detractor", or "invalid" rating. "invalid" is
    returned when the rating is not between 0-10.

    Args:
        x: The NPS rating

    Returns:
        String: the NPS category or "invalid".
    """
    if 0 <= x <= 6:
        return 'detractor'
    elif 7 <= x <= 8:
        return 'passive'
    elif 9 <= x <= 10:
        return 'promoter'

    return 'invalid'


# Test our function:
categorize_nps(8)
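
We can also sanity-check the 'invalid' branch by passing ratings outside the 0-10 scale:

# Out-of-range ratings fall through to the final return:
print(categorize_nps(11))   # invalid
print(categorize_nps(-1))   # invalid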

5. Applying our function to a DataFrame

So we have a function that takes a score and outputs which NPS response group it belongs to. It would be great to have this as a column in our NPS DataFrames, similar to the source column we added. Since we've modularized our code with functions, all we need to do is edit our convert_csv_to_df() function and nest categorize_nps() into it. However, the way we'll nest categorize_nps() will be different from before. The pandas library has a handy method called apply(), which lets us apply a function to each column or row of a DataFrame.

def convert_csv_to_df(csv_name, source_type):    
    """
    Convert an NPS CSV into a DataFrame with columns for the source and NPS group.

    Args:
        csv_name (str): The name of the NPS CSV file.
        source_type (str): The source of the NPS responses.

    Returns:
         A DataFrame with the CSV data and columns: source and nps_group.
    """
    df = pd.read_csv(csv_name)
    df['source'] = source_type
    # New column nps_group which applies categorize_nps to nps_rating:
    df['nps_group'] = df['nps_rating'].apply(categorize_nps)
    return df


# Test the updated function with mobile data:
convert_csv_to_df("datasets/2020Q4_nps_mobile.csv", "mobile")

6. Calculating the Net Promoter Score

If we hadn't broken down our code into functions earlier, we would've had to edit our code in multiple places to add an nps_group column, increasing the chance of introducing errors. It also helps that each of our functions has a single responsibility, keeping our code flexible and easier to edit and debug.

Now we're in a good place to calculate the Net Promoter Score! This is calculated by subtracting the percentage of detractor ratings from the percentage of promoter ratings, in other words:

$ NPS = \frac{\text{\# of Promoter Ratings} - \text{\# of Detractor Ratings}}{\text{Total \# of Respondents}} \times 100 $
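
For example, if 100 respondents gave 50 promoter ratings, 30 passive ratings, and 20 detractor ratings, then $ NPS = \frac{50 - 20}{100} \times 100 = 30 $. Note that passives count toward the total number of respondents but don't appear in the numerator.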

We want to calculate the NPS across all sources, so we'll use combine_nps_csvs() from Task 3 to consolidate the source files. As expected, that will output a DataFrame which we'll use as an input for a new function we're going to write, calculate_nps().

def calculate_nps(dataframe):
    # Calculate the NPS score using the nps_group column 
    counts = dataframe['nps_group'].value_counts()
    detract = counts['detractor']
    promo = counts['promoter']
    
    # Return the NPS Score:
    return ((promo - detract) / counts.sum()) * 100


my_files = {
  "datasets/2020Q4_nps_email.csv": "email",
  "datasets/2020Q4_nps_web.csv": "web",
  "datasets/2020Q4_nps_mobile.csv": "mobile",
}


# Test the function on the my_files dictionary:
q4_nps = combine_nps_csvs(my_files)
calculate_nps(q4_nps)

7. Breaking down NPS by source

Is it good to have an NPS score around 10? The worst NPS score you can get is -100 when all respondents are detractors, and the best is 100 when all respondents are promoters. Depending on the industry of your service or product, average NPS scores vary a lot. However, a negative score is a bad sign because it means you have more unhappy customers than happy customers. Typically, a score over 50 is considered excellent, and above 75 is considered best in class.

Although our score is above 0, it's still on the lower end of the spectrum. The product team concludes that significantly increasing NPS across our customer base is a priority. Luckily, we have this NPS data that we can analyze further to find data-driven solutions. A good start would be breaking down the NPS score by source type. For instance, if people rate lower on the web than on mobile, that's evidence we need to improve the browser experience.

def calculate_nps_by_source(dataframe):
    # Group the DataFrame by source and apply calculate_nps():
    return dataframe.groupby(['source']).apply(calculate_nps)


my_files = {
  "datasets/2020Q4_nps_email.csv": "email",
  "datasets/2020Q4_nps_web.csv": "web",
  "datasets/2020Q4_nps_mobile.csv": "mobile",
}


# Test the function on the my_files dictionary:
q4_nps = combine_nps_csvs(my_files)
calculate_nps_by_source(q4_nps)

8. Adding docstrings

Interesting! The mobile responses have an NPS score of about -15, which is noticeably lower than the other two sources. There are a few ways we could continue our analysis from here. We could use the user_id column to reach out to users who rated poorly on the mobile app for user interviews. We could also break down NPS by source and date to see if there was a date when NPS clearly started decreasing - perhaps around the same time a bug or feature was released. With the functions we created, Brenda is in a good place to continue this work!

The last thing we'll discuss is docstrings. In Tasks 1, 2, 4, and 5, we included docstrings for convert_csv_to_df(), check_csv(), and categorize_nps(). However, we should include docstrings for all the functions we write so that others can better reuse and maintain our code. Docstrings tell readers what a function does, its arguments, its return value(s) if any, and any other useful information you want to include, like error handling. There are several standards for writing docstrings in Python, including: Numpydoc, Google style (the style chosen in this notebook), and reStructuredText.

To make sure Brenda and other colleagues can follow our work, we are going to add docstrings to the remaining undocumented functions: combine_nps_csvs(), calculate_nps(), and calculate_nps_by_source(). It's up to you how you want to write the docstrings - we'll only check that a docstring exists for each of these functions.

def combine_nps_csvs(csvs_dict):
    """
    Convert an NPS CSV into a DataFrame with columns for the source and NPS group.

    Args:
        csv_name (str): The name of the NPS CSV file.
        source_type (str): The source of the NPS responses.

    Returns:
         A DataFrame with the CSV data and columns: source and nps_group.
    """
    # Define combined as an empty DataFrame:
    combined = pd.DataFrame()

    # Iterate over csvs_dict to get the name and source of the CSVs:
    for name, source in csvs_dict.items():
        # Check if the csv is valid:
        if check_csv(name):
            # Convert the CSV using convert_csv_to_df():
            temp = convert_csv_to_df(name, source)
            # Concatenate combined and temp:
            combined = pd.concat([combined, temp])

        # If the file is not valid, print a message with the CSV's name:
        else:
            print(name + " is not a valid file and will not be added.")

    # Return the combined DataFrame:
    return combined


def calculate_nps(dataframe):
    """ Calculates the Net Promoter Score of a DataFrame of NPS responses.

    Args:
        dataframe (DataFrame): A DataFrame with an nps_group column.

    Returns:
        Float: the Net Promoter Score, between -100 and 100.
    """
    counts = dataframe['nps_group'].value_counts()
    detract = counts['detractor']
    promo = counts['promoter']
    # Return the NPS score:
    return ((promo - detract) / counts.sum()) * 100


def calculate_nps_by_source(dataframe):
    """ Calculates the NPS of each source in a DataFrame of NPS responses.

    Returns:
        Series: the NPS score of each source, indexed by source type.
    """
    # Group the DataFrame by source and apply calculate_nps():
    return dataframe.groupby(['source']).apply(calculate_nps)


