ML Classification Model: Fraud Detection Use Case with an Example

Fraud in the Financial Industry

Use Case Scenario / Problem Statement:

There is a lack of publicly available datasets on financial services, especially in the emerging mobile money transactions domain. Financial datasets are important to many researchers, and in particular to those of us doing research in fraud detection. Part of the problem is the intrinsically private nature of financial transactions, which is why no public datasets exist.

We present a synthetic dataset generated with the PaySim simulator as an approach to this problem. PaySim uses aggregated data from a private dataset to generate a synthetic dataset that resembles normal transaction activity, then injects malicious behaviour so that the performance of fraud detection methods can be evaluated against it.

Data Exploration:

### Required Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="darkgrid")
# Read the dataset downloaded from Kaggle
transaction_data = pd.read_csv("/Users/maheshg/Dropbox/Sample Datasets Kaggle/Financial Dataset Fraud/PS_20174392719_1491204439457_log.csv")
print(transaction_data.head())
print(transaction_data.columns)
# Check the number of unique destination accounts in the dataset
print(transaction_data['nameDest'].nunique())

# Check the number of unique originating accounts
print(transaction_data['nameOrig'].nunique())

# Drop the account identifier columns: these high-cardinality IDs carry no predictive signal and can hurt model accuracy
transaction_data = transaction_data.drop(labels=['nameOrig','nameDest'],axis = 1)
print(transaction_data.sample(5))

# Check how many transactions are fraudulent
print(transaction_data['isFraud'].value_counts())
### Plot the count of fraud vs. non-fraud transactions
plt.figure(figsize=(12,8))
sns.countplot(x = 'isFraud', data = transaction_data)
plt.show()
# Each step is one unit of simulation time; this plot shows how transactions are distributed across the steps
plt.figure(figsize=(12,8))
plt.ylim(0,80000)
sns.histplot(transaction_data['step'], kde=True)
plt.show()

Each step in the dataset represents one hour of simulation time, so taking the step value modulo 24 maps every transaction onto an hour of the day. This lets us see how many transactions fall into the fraud vs. non-fraud category over a daily cycle.

transaction_data['step'] = transaction_data['step'] % 24
print(transaction_data.head())

Plotting the values with the seaborn library: the diagram below shows the mean transaction amount per hour of the day for each transaction type.

plt.figure(figsize=(12,8))
sns.lineplot(x = 'step', y = 'amount',
             hue='type', errorbar=None,  # use ci=None on seaborn < 0.12
             estimator='mean', data= transaction_data)
plt.show()

The transaction data is then broken down further: this graph shows how many transactions occur at each hour of the day, in separate panels for non-fraud and fraud, drawn with seaborn's displot function 😄

sns.displot(data = transaction_data, x = 'step', col='isFraud')
plt.show()

Now we can compare the different transaction types in a grouped bar chart, where each type is split by the 'isFraud' label from the synthetic dataset:

plt.figure(figsize=(12,8))
sns.countplot(x = 'type',hue = 'isFraud',data = transaction_data)
plt.show()

Now filter the transaction data down to the only types in which fraud occurs, TRANSFER and CASH_OUT:

transaction_data = transaction_data.loc[(transaction_data['type'] == 'TRANSFER') | \
                   (transaction_data['type'] == 'CASH_OUT')]
print(transaction_data.shape)
# One-hot encode the transaction type, dropping the first level to avoid redundancy
transaction_data = pd.concat([transaction_data,
                              pd.get_dummies(transaction_data['type'],
                                             prefix='type', drop_first=True)],axis=1)
print(transaction_data.head())
# The encoded column replaces 'type'; 'isFlaggedFraud' is dropped as well
transaction_data = transaction_data.drop(labels= ['type','isFlaggedFraud'], axis=1)
print(transaction_data.head())

After filtering and dropping these columns, let's check how many observations remain in the dataset 😄
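A quick check (a minimal sketch; transaction_data here is the filtered frame from the steps above):

# Rows and columns remaining after the filtering above
print(transaction_data.shape)

# Class balance of the remaining observations
print(transaction_data['isFraud'].value_counts())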

Feature engineering: we add two new columns derived from the existing balance features. Each measures the discrepancy between the recorded balance change and the transaction amount; for a consistent transaction the discrepancy should be zero, so large values can hint at suspicious bookkeeping.

transaction_data['origBalanceDiscrepancy'] = \
    transaction_data.newbalanceOrig + transaction_data.amount - transaction_data.oldbalanceOrg
transaction_data['destBalanceDiscrepancy'] = \
    transaction_data.oldbalanceDest + transaction_data.amount - transaction_data.newbalanceDest
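
As a quick sanity check (a sketch, assuming the two discrepancy columns computed above), we can compare the average discrepancy per class:

# Mean balance discrepancy for non-fraud (0) vs. fraud (1) rows
print(transaction_data.groupby('isFraud')[['origBalanceDiscrepancy',
                                           'destBalanceDiscrepancy']].mean())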

Plotting fraud vs. non-fraud against the originator balance discrepancy, using the two features above:


sns.catplot(x = 'isFraud', y = 'origBalanceDiscrepancy', estimator=sum,
            kind='bar', hue='type_TRANSFER', data = transaction_data, aspect=2)
plt.show()

Now we are ready to use the train_test_split function from the scikit-learn library 😄

from sklearn.model_selection import train_test_split
# Separate the features from the target label
x = transaction_data.drop(['isFraud'], axis = 1)
y = transaction_data['isFraud']

# Hold out 40% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.40, random_state=123)
print(x_train.shape, y_train.shape)

from sklearn.linear_model import LogisticRegression
# Raise max_iter so the solver has room to converge on this dataset
logistic_clf = LogisticRegression(max_iter=1000)
logistic_clf.fit(x_train, y_train)
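
Before turning to the standard metrics, it helps to look at the raw counts of correct and incorrect predictions on the held-out test set. A minimal sketch using scikit-learn's confusion_matrix, assuming the logistic_clf model and the train/test split above:

from sklearn.metrics import confusion_matrix

# Rows are the true classes, columns the predicted classes:
# [[TN, FP],
#  [FN, TP]]
y_pred = logistic_clf.predict(x_test)
print(confusion_matrix(y_test, y_pred))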

The confusion matrix gives the four counts (true/false positives and negatives) that the standard metrics are built from. Because fraudulent transactions are only a tiny fraction of the dataset, accuracy alone can be misleading, so we also look at precision and recall:

Accuracy = (True Positive + True Negative) / Number of instances

Precision = True Positive / (True Positive + False Positive)

Recall = True Positive / (True Positive + False Negative)
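
A minimal sketch of computing these metrics with scikit-learn, reusing the y_pred predictions from the confusion-matrix snippet above:

from sklearn.metrics import accuracy_score, precision_score, recall_score

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))

On a dataset this imbalanced, recall on the fraud class is usually the number to watch, since it tells us what fraction of the actual fraud the model catches.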