Course Notes: Fraud Detection in Python

    Fraud detection

    Imbalance techniques

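    Random oversampling copies existing elements of the minority class.

    SMOTE (Synthetic Minority Oversampling Technique) creates synthetic minority-class samples by interpolating between an observation and its nearest minority-class neighbours, rather than copying existing rows.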
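
    To make the idea of a synthetic sample concrete, here is a minimal NumPy sketch of the interpolation step. It is an illustration only, not imblearn's implementation; the helper name smote_like_sample and the toy points are assumptions made for this example.

```python
import numpy as np

def smote_like_sample(X_minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a random
    minority sample and one of its k nearest minority neighbours.
    Hypothetical helper for illustration, not part of imblearn."""
    rng = np.random.default_rng(rng)
    i = rng.integers(len(X_minority))
    x = X_minority[i]
    # Euclidean distance from x to every minority sample
    dists = np.linalg.norm(X_minority - x, axis=1)
    # Position 0 of the sorted order is x itself (distance 0), so skip it
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = X_minority[rng.choice(neighbours)]
    lam = rng.random()  # interpolation weight in [0, 1)
    return x + lam * (neighbour - x)

# Toy minority class: the four corners of the unit square (made-up data)
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_point = smote_like_sample(X_min, k=2, rng=0)
print(new_point)
```

    With these toy points, the two nearest neighbours of any corner are the adjacent corners, so the synthetic point always lands on an edge of the square: a new location, but one that stays between existing minority observations.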

    Which resampling method to use?

    • Random undersampling: discards majority-class data; computationally efficient.
    • Random oversampling: straightforward and simple, but the model is trained on many duplicates.
    • SMOTE: produces a more realistic and varied dataset, but you are training on synthetic ("fake") data.
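
    The two random methods in the list above can be sketched with plain NumPy index selection. This is a rough illustration on made-up toy data (95 non-fraud rows, 5 fraud rows); the variable names are assumptions for this example, and in practice imblearn provides ready-made samplers for both methods.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced data: 95 non-fraud (0) and 5 fraud (1); values are made up
y_toy = np.array([0] * 95 + [1] * 5)
X_toy = rng.normal(size=(100, 2))

idx_major = np.flatnonzero(y_toy == 0)
idx_minor = np.flatnonzero(y_toy == 1)

# Random undersampling: keep only as many majority rows as there are minority rows
keep_major = rng.choice(idx_major, size=len(idx_minor), replace=False)
idx_under = np.concatenate([keep_major, idx_minor])
X_under, y_under = X_toy[idx_under], y_toy[idx_under]

# Random oversampling: draw minority rows with replacement until the classes match
extra_minor = rng.choice(idx_minor, size=len(idx_major), replace=True)
idx_over = np.concatenate([idx_major, extra_minor])
X_over, y_over = X_toy[idx_over], y_toy[idx_over]

print(np.bincount(y_under))  # class counts after undersampling
print(np.bincount(y_over))   # class counts after oversampling
```

    Undersampling leaves only 10 rows in total, which shows how much data it throws away. Oversampling reaches 95 fraud rows, but at most 5 of them are distinct.
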
    !pip install imbalanced-learn
    # Import pandas and read csv
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    df = pd.read_csv("creditcard_sampledata.csv")
    
    # Explore the features available in your dataframe
    print(df.info())
    df = df.drop(['Unnamed: 0','Time'],axis=1)
    print(df.info())
    # Count the occurrences of fraud and no fraud and print them
    occ = df['Class'].value_counts()
    print(occ)
    
    # Print the ratio of fraud cases
    print(occ / len(df))
    
    def prep_data(df):
        # Features: columns 1..28 by position; label: column 29
        X = df.iloc[:, 1:29]
        X = np.array(X).astype(float)  # np.float was removed in NumPy 1.24
        y = df.iloc[:, 29]
        y = np.array(y).astype(float)
        return X, y
    
    def plot_data(X, y):
        plt.scatter(X[y == 0, 0], X[y == 0, 1], label="Class #0", alpha=0.5, linewidth=0.15)
        plt.scatter(X[y == 1, 0], X[y == 1, 1], label="Class #1", alpha=0.5, linewidth=0.15, c='r')
        plt.legend()
        return plt.show()
    from imblearn.over_sampling import SMOTE
    
    # Run the prep_data function
    X, y = prep_data(df)
    
    # Define the resampling method
    method = SMOTE()
    
    # Create the resampled feature set
    X_resampled, y_resampled = method.fit_resample(X,y)
    
    # Plot the resampled data
    plot_data(X_resampled,y_resampled )

    Compare SMOTE to original data

    In the last exercise, you saw that SMOTE produces many more observations of the minority class. To get a feel for what actually happened, compare those results with the original data: look at the value counts of the old and new labels, then plot the two datasets side by side. You'll use the pre-defined function compare_plot() for that, which takes the arguments X, y, X_resampled, y_resampled and method=''. The function draws a scatter plot of the original data next to one of the resampled data.

    def compare_plot(X,y,X_resampled,y_resampled, method):
        # Start a plot figure
        f, (ax1, ax2) = plt.subplots(1, 2)
        # sub-plot number 1, this is our normal data
        c0 = ax1.scatter(X[y == 0, 0], X[y == 0, 1], label="Class #0",alpha=0.5)
        c1 = ax1.scatter(X[y == 1, 0], X[y == 1, 1], label="Class #1",alpha=0.5, c='r')
        ax1.set_title('Original set')
        # sub-plot number 2, this is our oversampled data
        ax2.scatter(X_resampled[y_resampled == 0, 0], X_resampled[y_resampled == 0, 1], label="Class #0", alpha=.5)
        ax2.scatter(X_resampled[y_resampled == 1, 0], X_resampled[y_resampled == 1, 1], label="Class #1", alpha=.5,c='r')
        ax2.set_title(method)
        # some settings and ready to go
        plt.figlegend((c0, c1), ('Class #0', 'Class #1'), loc='lower center',
                      ncol=2, labelspacing=0.)
        #plt.tight_layout(pad=3)
        return plt.show()
    # Print the value_counts on the original labels y
    print(pd.Series(y).value_counts())
    
    # Print the value_counts on the resampled labels y_resampled
    print(pd.Series(y_resampled).value_counts())
    
    # Run compare_plot
    compare_plot(X,y, X_resampled, y_resampled, method='SMOTE')

    Fraud detection algorithms in action

    Exploring the traditional way to catch fraud

    In this exercise you're going to try finding fraud cases in our credit card dataset the "old way". First you'll define threshold values using common statistics, to split fraud and non-fraud. Then, use those thresholds on your features to detect fraud. This is common practice within fraud analytics teams.

    Statistical thresholds are often determined by looking at the mean values of observations. Let's start this exercise by checking whether feature means differ between fraud and non-fraud cases. Then, you'll use that information to create common sense thresholds. Finally, you'll check how well this performs in fraud detection.

    # Get the mean for each group
    print(df.groupby('Class').mean())
    
    # Implement a rule for stating which cases are flagged as fraud
    df['flag_as_fraud'] = np.where(np.logical_and(df['V1']<-3,df['V3']<-5), 1, 0)
    
    # Create a crosstab of flagged fraud cases versus the actual fraud cases
    print(pd.crosstab(df.Class, df.flag_as_fraud, rownames=['Actual Fraud'], colnames=['Flagged Fraud']))

    Using ML classification to catch fraud

    In this exercise you'll see what happens when you use a simple machine learning model on our credit card data instead.

    Do you think you can beat those results? Remember, the rule-based approach caught 22 of the 50 fraud cases and produced 16 false positives.
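
    To have a baseline to compare against, the counts quoted above can be turned into recall and precision. The short calculation below uses only those three numbers; the variable names are chosen for this example.

```python
# Counts quoted in the notes for the rule-based approach
true_positives = 22   # fraud cases the rule flagged
total_fraud = 50      # all actual fraud cases
false_positives = 16  # non-fraud cases the rule flagged

false_negatives = total_fraud - true_positives                   # fraud cases the rule missed
recall = true_positives / total_fraud                            # share of fraud caught
precision = true_positives / (true_positives + false_positives)  # share of flags that were fraud

print(f"missed fraud cases: {false_negatives}")  # 28
print(f"recall: {recall:.2f}")                   # 0.44
print(f"precision: {precision:.2f}")             # 0.58
```

    So the rule misses more than half of the fraud cases, and roughly four in ten of its flags are false alarms. These are the figures a classifier needs to improve on.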

    So with that in mind, let's implement a Logistic Regression model. If you have taken the class on supervised learning in Python, you should be familiar with this model. If not, you might want to refresh that at this point. But don't worry, you'll be guided through the structure of the machine learning model.

    # Logistic Regression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix, classification_report
    from sklearn.linear_model import LogisticRegression
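    # Features: the first 28 columns by position
    X = df.iloc[:, 0:28]
    # Select the label by name; positional index 30 can point at 'flag_as_fraud', which was appended above
    y = df['Class']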
    # Create the training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    
    # Fit a logistic regression model to our data
    model = LogisticRegression()
    model.fit(X_train, y_train)
    
    # Obtain model predictions
    predicted = model.predict(X_test)
    
    # Print the classification report and confusion matrix
    print('Classification report:\n', classification_report(y_test, predicted))
    conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
    print('Confusion matrix:\n', conf_mat)
    # Import imblearn's Pipeline, which accepts resampling steps alongside the model
    from imblearn.pipeline import Pipeline 
    
    # Define which resampling method and which ML model to use in the pipeline
    resampling = SMOTE()
    model = LogisticRegression()
    
    # Define the pipeline, tell it to combine SMOTE with the Logistic Regression model
    pipeline = Pipeline([('SMOTE', resampling), ('Logistic Regression', model)])