Feature Engineering for Fraud Detection

    Recent estimates suggest that credit card fraud caused losses totaling 28.58 billion dollars in 2020 (Nilson, 2021). Detecting fraud accurately protects customers' peace of mind and can prevent massive financial losses.

    The quality of predictions depends heavily on the data and features used. This template takes raw credit card data with standard features and engineers additional features to help with fraud prediction.

    Imports and data preparation

    The following cells install and import the packages needed to load and manipulate fraud detection data. They also load the example data and preview it.

    💡  Add or remove imports to suit your requirements.

    %%capture
    !pip install geopy
    import numpy as np
    import pandas as pd
    from datetime import date
    from geopy import distance

    Load in your data

    The cell below imports the credit card data, which contains typical raw credit card transaction features such as the transaction time, the merchant, the amount, the credit card, and customer details (see Bahnsen et al., 2016 for a list of common features).

    👇  To use your data, you will need to:

    • Upload a file and update the path variable.
      • Alternatively, if you have data in a database, you can add a SQL cell and connect to a custom integration.
    • Set the column that contains the transaction time.
    • Set any other columns that contain date data (you may need to update this after loading the data in).
    # Set path to data
    path = "data/fraud_data.csv"
    
    # Specify the transaction time column
    trans_time = "trans_date_trans_time"
    
    # Specify any additional date columns
    date_cols = ["dob", trans_time]
    
    # Read in the data as a DataFrame and set the index
    fraud_df = pd.read_csv(path, parse_dates=date_cols, index_col=trans_time).sort_index()
    
    # Preview the data
    fraud_df
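
To check how `parse_dates` and `index_col` interact before pointing the cell at your own file, the sketch below reads a small in-memory CSV the same way. The three rows and the `sample_csv` and `sample_df` names are made up for illustration; only the column labels mirror the template.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for data/fraud_data.csv
sample_csv = io.StringIO(
    "trans_date_trans_time,dob,amt\n"
    "2020-06-21 12:14:25,1968-03-19,2.86\n"
    "2020-06-21 12:13:00,1990-01-17,29.84\n"
    "2020-06-21 12:15:10,1970-10-21,41.28\n"
)

# Parse both date columns, use the transaction time as the index, and sort by it
sample_df = pd.read_csv(
    sample_csv,
    parse_dates=["dob", "trans_date_trans_time"],
    index_col="trans_date_trans_time",
).sort_index()

# Both the index and the dob column should now hold datetimes
index_is_datetime = isinstance(sample_df.index, pd.DatetimeIndex)
dob_is_datetime = pd.api.types.is_datetime64_any_dtype(sample_df["dob"])
```

If either check comes back `False` on your own data, the column probably uses a date format that pandas could not infer and may need explicit parsing.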

    Inspect the features and data types

    The first step is to inspect the available columns using the pandas method .info(). Its output also lets you review the data type of each column.

    # Print summary of the DataFrame
    fraud_df.info()

    Customer and Transaction Details

    Extracting age from date of birth

    As noted in Bahnsen et al. (2016), the customer's age is a common feature in raw credit card data. In the cell below, we add an age column based on the date of birth.

    If your data already contains the correct age information, you can skip this step.

    👇  Make sure to update the dob label below.

    # Specify the customer date of birth column
    dob = "dob"
    
    # Define a function to extract age (in completed years, as of today) from date of birth
    def age(date_of_birth):
        today = date.today()
        # Subtract one year if this year's birthday has not happened yet
        return (
            today.year
            - date_of_birth.year
            - ((today.month, today.day) < (date_of_birth.month, date_of_birth.day))
        )
    
    
    # Create a new column
    fraud_df["age"] = fraud_df[dob].apply(age)
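
The cell above measures age as of today. If the age at the time of each transaction is more useful for your model, a vectorized variant along the following lines may help. This is a sketch, not part of the original template: the `age_at` helper, the `toy` DataFrame, and the `age_at_transaction` column are hypothetical names.

```python
import pandas as pd


def age_at(reference, date_of_birth):
    """Return completed years between two aligned datetime Series."""
    # True where the birthday has not yet occurred in the reference year
    before_birthday = (reference.dt.month < date_of_birth.dt.month) | (
        (reference.dt.month == date_of_birth.dt.month)
        & (reference.dt.day < date_of_birth.dt.day)
    )
    return reference.dt.year - date_of_birth.dt.year - before_birthday.astype(int)


# Hypothetical data: transaction times and dates of birth
toy = pd.DataFrame(
    {
        "trans_time": pd.to_datetime(["2020-06-14", "2020-06-15", "2021-02-28"]),
        "dob": pd.to_datetime(["1990-06-15", "1990-06-15", "2000-02-29"]),
    }
)
toy["age_at_transaction"] = age_at(toy["trans_time"], toy["dob"])
```

In the template, the transaction time is the DataFrame index, so `fraud_df.index.to_series()` could serve as the reference Series.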

    Distance from merchant

    The next feature is the distance between the customer and the merchant. The reasoning is that transactions at merchants farther away from the customer may be a sign of suspicious activity.

    This step assumes you have the coordinates (latitude and longitude) of both the customer and the merchant. If you don't possess this information, you can skip this step.

    👇  Make sure to update the latitude and longitude labels below.

    # Specify the customer and merchant coordinate columns
    customer_lat, customer_long = "lat", "long"
    merchant_lat, merchant_long = "merch_lat", "merch_long"
    
    # Define a function to calculate the geodesic distance (in km) between the
    # customer and merchant coordinates of a single row
    def calculate_distance(row):
        return distance.geodesic(
            (row[customer_lat], row[customer_long]),
            (row[merchant_lat], row[merchant_long]),
        ).km
    
    # Create a new column for the distance
    fraud_df["distance_from_merchant"] = fraud_df.apply(calculate_distance, axis=1)
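
Applying a function row by row can be slow on large datasets. One possible alternative is a vectorized haversine calculation with NumPy, sketched below. Note that this swaps in a different technique: haversine treats the Earth as a sphere, so its results differ slightly from geopy's geodesic distance. The `haversine_km` helper, the `EARTH_RADIUS_KM` constant, and the `toy` data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between coordinates given in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Hypothetical coordinates using the template's column labels
toy = pd.DataFrame(
    {
        "lat": [0.0, 40.0],
        "long": [0.0, -75.0],
        "merch_lat": [0.0, 40.0],
        "merch_long": [1.0, -75.0],
    }
)
toy["distance_from_merchant"] = haversine_km(
    toy["lat"], toy["long"], toy["merch_lat"], toy["merch_long"]
)
```

The first row spans one degree of longitude along the equator (roughly 111 km); the second has identical coordinates, so its distance is zero.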

    Domestic transactions

    We can also check whether a transaction is domestic or not by comparing the country of the customer and the country of the merchant.

    Note that if you do not have this data but do have latitude and longitude information, you can calculate the country using the geopy library. See this tutorial for how to extract the country based on latitude and longitude.

    👇  Make sure to update the country labels below.

    # Specify the customer country and merchant country
    customer_country, merchant_country = "customer_country", "merchant_country"
    
    # Create new column indicating domestic transactions
    fraud_df["is_domestic"] = fraud_df[customer_country] == fraud_df[merchant_country]
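
A plain equality comparison marks a transaction as non-domestic whenever either country is missing. If you would rather keep such rows as unknown, one option is a nullable boolean column, as in this sketch. The `toy` data and the `both_known` and `same_country` names are hypothetical.

```python
import pandas as pd

# Hypothetical data: the third row has no customer country
toy = pd.DataFrame(
    {
        "customer_country": ["US", "US", None],
        "merchant_country": ["US", "CA", "US"],
    }
)

both_known = toy[["customer_country", "merchant_country"]].notna().all(axis=1)
same_country = toy["customer_country"] == toy["merchant_country"]

# Nullable boolean: <NA> where either country is missing
toy["is_domestic"] = same_country.astype("boolean").where(both_known)
```

Whether an unknown country should count as missing or as non-domestic depends on your data, so treat this as one possible convention.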

    Whole number transactions

    It can also be useful to determine whether the transaction amount is a whole number, which may indicate a suspicious transaction.

    👇  Make sure to update the transaction amount label below.

    # Specify the transaction amount column
    amt = "amt"
    
    # Create a column indicating whether the transaction amount is a whole number
    # (the modulo check also handles missing amounts, which evaluate to False)
    fraud_df["is_whole_number"] = fraud_df[amt] % 1 == 0
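
If your amounts come from calculations such as currency conversions, floating-point rounding can leave a value a hair away from an integer. A tolerance-based check is one way to handle that, sketched here on made-up amounts. The `1e-6` tolerance is an arbitrary assumption that you should adapt to your data.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts, including a near-integer value and a missing value
toy = pd.DataFrame({"amt": [10.0, 10.5, 19.999999999, np.nan]})

# Treat amounts within one millionth of an integer as whole; NaN compares as False
toy["is_whole_number"] = np.isclose(
    toy["amt"], np.round(toy["amt"]), rtol=0, atol=1e-6
)
```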