Fraud detection
Imbalance techniques
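Random oversampling duplicates observations from the minority class.
SMOTE (Synthetic Minority Oversampling Technique) creates new, synthetic minority-class samples by interpolating between an existing minority observation and its nearest minority-class neighbors.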
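To make "creates synthetic samples" concrete, here is a minimal sketch of the interpolation idea behind SMOTE in plain Python. The helper name smote_like_sample and the toy points are hypothetical, and this is an illustration of the idea only, not imblearn's implementation.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Return one synthetic point on the line segment between a random
    minority point and one of its k nearest minority neighbours.
    Hypothetical helper for illustration only."""
    if rng is None:
        rng = random.Random()
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base point
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    neighbour = rng.choice(others[:k])
    # Pick a random position between the base point and the chosen neighbour
    gap = rng.random()
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))

# Toy minority class with two features (made-up values)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
rng = random.Random(0)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print(point)
```

Every synthetic point falls between two real minority observations, so the new samples stay inside the region the minority class already occupies.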
Which resampling method to use?
- Random undersampling: computationally efficient, but throws away data.
- Random oversampling: straightforward and simple, but trains the model on many duplicates.
- SMOTE: produces a more realistic and varied dataset, but you are training on synthetic data.
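The trade-offs of the two random methods can be seen on a toy example. The sketch below uses only the standard library and made-up data (95 non-fraud rows, 5 fraud rows); in practice imblearn's resamplers would do this work.

```python
import random
from collections import Counter

rng = random.Random(42)
# Toy labelled data: 95 non-fraud (0) and 5 fraud (1) observations
data = [((rng.random(), rng.random()), 0) for _ in range(95)]
data += [((rng.random(), rng.random()), 1) for _ in range(5)]

majority = [row for row in data if row[1] == 0]
minority = [row for row in data if row[1] == 1]

# Random undersampling: keep only as many majority rows as there are minority rows
undersampled = rng.sample(majority, len(minority)) + minority

# Random oversampling: draw minority rows with replacement until the classes match
oversampled = majority + rng.choices(minority, k=len(majority))

print("original:    ", Counter(label for _, label in data))
print("undersampled:", Counter(label for _, label in undersampled))
print("oversampled: ", Counter(label for _, label in oversampled))
# The oversampled minority class contains no more distinct rows than before
distinct_minority = len(set(oversampled[len(majority):]))
print("distinct minority rows after oversampling:", distinct_minority)
```

Undersampling discards 90 of the 95 majority rows, while oversampling fills the minority class with repeats of the same 5 rows. Those are the costs named in the list above.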
!pip install imbalanced-learn
# Import pandas and read csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
df = pd.read_csv("creditcard_sampledata.csv")
# Explore the features available in your dataframe
print(df.info())
df = df.drop(['Unnamed: 0','Time'],axis=1)
print(df.info())
# Count the occurrences of fraud and no fraud and print them
occ = df['Class'].value_counts()
print(occ)
# Print the ratio of fraud cases
print(occ / len(df))
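def prep_data(df):
    # Features are selected by position: columns 1 to 28 (0-based)
    X = df.iloc[:, 1:29]
    X = np.array(X).astype(float)  # np.float was removed in NumPy 1.24; use the built-in float
    # Labels are selected by position: column 29 (0-based)
    y = df.iloc[:, 29]
    y = np.array(y).astype(float)
    return X, y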
def plot_data(X, y):
    plt.scatter(X[y == 0, 0], X[y == 0, 1], label="Class #0", alpha=0.5, linewidth=0.15)
    plt.scatter(X[y == 1, 0], X[y == 1, 1], label="Class #1", alpha=0.5, linewidth=0.15, c='r')
    plt.legend()
    return plt.show()
from imblearn.over_sampling import SMOTE
# Run the prep_data function
X, y = prep_data(df)
# Define the resampling method
method = SMOTE()
# Create the resampled feature set
X_resampled, y_resampled = method.fit_resample(X,y)
# Plot the resampled data
plot_data(X_resampled, y_resampled)
Compare SMOTE to original data
In the last exercise, you saw that SMOTE gives us more observations of the minority class. Let's compare those results to the original data to get a good feeling for what has actually happened. We'll look at the value counts of the old and new data again, and plot the two scatter plots side by side. You'll use the pre-defined function compare_plot() for that, which takes the following arguments: X, y, X_resampled, y_resampled, method=''. The function plots the original data in a scatter plot next to the resampled data.
def compare_plot(X, y, X_resampled, y_resampled, method):
    # Start a plot figure
    f, (ax1, ax2) = plt.subplots(1, 2)
    # sub-plot number 1, this is our normal data
    c0 = ax1.scatter(X[y == 0, 0], X[y == 0, 1], label="Class #0", alpha=0.5)
    c1 = ax1.scatter(X[y == 1, 0], X[y == 1, 1], label="Class #1", alpha=0.5, c='r')
    ax1.set_title('Original set')
    # sub-plot number 2, this is our oversampled data
    ax2.scatter(X_resampled[y_resampled == 0, 0], X_resampled[y_resampled == 0, 1], label="Class #0", alpha=.5)
    ax2.scatter(X_resampled[y_resampled == 1, 0], X_resampled[y_resampled == 1, 1], label="Class #1", alpha=.5, c='r')
    ax2.set_title(method)
    # some settings and ready to go
    plt.figlegend((c0, c1), ('Class #0', 'Class #1'), loc='lower center',
                  ncol=2, labelspacing=0.)
    #plt.tight_layout(pad=3)
    return plt.show()
# Print the value counts of the original labels y
print(pd.Series(y).value_counts())
# Print the value counts of the resampled labels y_resampled
print(pd.Series(y_resampled).value_counts())
# Run compare_plot
compare_plot(X,y, X_resampled, y_resampled, method='SMOTE')
Fraud detection algorithms in action
Exploring the traditional way to catch fraud
In this exercise you're going to try finding fraud cases in our credit card dataset the "old way". First you'll define threshold values using common statistics, to split fraud and non-fraud. Then, use those thresholds on your features to detect fraud. This is common practice within fraud analytics teams.
Statistical thresholds are often determined by looking at the mean values of observations. Let's start this exercise by checking whether feature means differ between fraud and non-fraud cases. Then, you'll use that information to create common sense thresholds. Finally, you'll check how well this performs in fraud detection.
# Get the mean for each group
df.groupby('Class').mean()
# Implement a rule for stating which cases are flagged as fraud
df['flag_as_fraud'] = np.where(np.logical_and(df['V1']<-3,df['V3']<-5), 1, 0)
# Create a crosstab of flagged fraud cases versus the actual fraud cases
print(pd.crosstab(df.Class, df.flag_as_fraud, rownames=['Actual Fraud'], colnames=['Flagged Fraud']))
Using ML classification to catch fraud
In this exercise you'll see what happens when you use a simple machine learning model on our credit card data instead.
Do you think you can beat those results? Remember, the rule-based approach caught 22 of the 50 fraud cases and produced 16 false positives.
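To have a baseline to compare the model against, the figures quoted above (22 of 50 fraud cases caught, 16 false positives) can be turned into precision and recall. This is plain arithmetic on those counts; the variable names are only illustrative.

```python
# Counts reported for the rule-based approach
true_positives = 22    # fraud cases the rule flagged correctly
total_fraud = 50       # all actual fraud cases
false_positives = 16   # non-fraud cases the rule flagged as fraud

false_negatives = total_fraud - true_positives
recall = true_positives / total_fraud
precision = true_positives / (true_positives + false_positives)
f1 = 2 * precision * recall / (precision + recall)

print(f"missed fraud cases: {false_negatives}")  # 28
print(f"recall:    {recall:.2f}")                # 0.44
print(f"precision: {precision:.2f}")             # 0.58
print(f"F1 score:  {f1:.2f}")                    # 0.50
```

These are the same quantities that classification_report prints for the fraud class, so the two approaches can be compared directly.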
So with that in mind, let's implement a Logistic Regression model. If you have taken the class on supervised learning in Python, you should be familiar with this model. If not, you might want to refresh that at this point. But don't worry, you'll be guided through the structure of the machine learning model.
# Logistic Regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
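# Features: the first 28 columns
X = df.iloc[:, 0:28]
# Target: select the label by name; flag_as_fraud was appended to df above, so positional indexing is fragile
y = df['Class']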
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Fit a logistic regression model to our data
model = LogisticRegression()
model.fit(X_train, y_train)
# Obtain model predictions
predicted = model.predict(X_test)
# Print the classification report and confusion matrix
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)
# Import the Pipeline class from imblearn; unlike sklearn's Pipeline, it accepts resampling steps
from imblearn.pipeline import Pipeline
# Define which resampling method and which ML model to use in the pipeline
resampling = SMOTE()
model = LogisticRegression()
# Define the pipeline, tell it to combine SMOTE with the Logistic Regression model
pipeline = Pipeline([('SMOTE', resampling), ('Logistic Regression', model)])