Machine Learning in the Financial Industry - Fraud Detection

Classification for fraud detection

Overview:

  • Evaluating classification models using accuracy, precision and recall

  • Building a classification model for fraud detection on artificially generated data

Broad problem types addressed by machine learning

  • Classification

  • Regression

  • Clustering

  • Dimensionality reduction

Classification models are used to predict categories.
Example 1 - Checking whether an email is spam or not spam
Example 2 - Deciding whether to buy, sell or hold a stock
Example 3 - Classifying whether an image shows a car, a bike or a truck
Example 4 - Sentiment analysis: deciding whether a statement is positive, negative or neutral

Accuracy - Precision and Recall 😀 

Accuracy measures how often the model's predictions match the actual labels. Compare the predicted labels with the actual labels on a held-out test dataset: the more matches, the higher the accuracy score. A high score on the test dataset suggests the model is performing well.

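There are situations where accuracy is 99.99% and the model is still of little use. When one class is very rare, as fraud is, a model can reach that score by labelling every transaction as legitimate. In these scenarios we need to look at the confusion matrix and at precision and recall, which are defined below.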
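To make the point concrete, here is a minimal sketch using invented labels (10 fraud cases among 10,000 transactions, a ratio chosen purely for illustration). The "model" is a hypothetical baseline that never flags anything.

```python
# Toy labels, invented for illustration: 1 = fraud, 0 = legitimate.
actual = [0] * 9990 + [1] * 10

# A hypothetical "model" that predicts "legitimate" for every transaction.
predicted = [0] * len(actual)

# Accuracy: share of predictions that match the actual label.
matches = sum(a == p for a, p in zip(actual, predicted))
accuracy = matches / len(actual)

# How many of the real fraud cases did the model flag?
frauds_caught = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))

print(f"accuracy: {accuracy:.4f}")
print(f"frauds caught: {frauds_caught} of {sum(actual)}")
```

The accuracy looks excellent even though not a single fraud case is detected, which is why accuracy alone is a poor guide on heavily imbalanced data.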

Accuracy = (True Positive + True Negative) / Number of instances

Precision = True Positive / (True Positive + False Positive)

Recall = True Positive / (True Positive + False Negative)
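The three formulas above can be sketched in plain Python as follows. The function names and the toy labels are made up for this example, and returning 0.0 when a denominator is zero is a convention assumed here, not something the formulas prescribe.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(actual, predicted, positive=1):
    """Return accuracy, precision and recall as a dict."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Assumed convention: 0.0 when nothing was predicted positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Assumed convention: 0.0 when there are no actual positives.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels, invented for illustration: 1 = fraud, 0 = legitimate.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(actual, predicted)
print(metrics)
```

In this toy case the model finds 3 of the 4 fraud cases and raises 1 false alarm, so precision and recall are both 0.75 while accuracy is 0.8.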

Use Case Scenario (Kaggle):

There is a lack of publicly available datasets on financial services, especially in the emerging mobile money transactions domain. Financial datasets are important to many researchers, and in particular to us performing research in the domain of fraud detection. Part of the problem is the intrinsically private nature of financial transactions, which leads to no publicly available datasets.

We present a synthetic dataset generated using the simulator called PaySim as an approach to such a problem. PaySim uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.

Data Exploration and Code Overview

### Required Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="darkgrid")
# Read the dataset downloaded from Kaggle
transaction_data = pd.read_csv("/Users/maheshg/Dropbox/Sample Datasets Kaggle/Financial Dataset Fraud/PS_20174392719_1491204439457_log.csv")
print(transaction_data.head())
print(transaction_data.columns)
# Structure of the dataset
print(transaction_data.shape)
transaction_data.info()  # info() prints its summary itself
print(transaction_data.isnull().sum())  # number of missing values per column
# Number of unique destination accounts in the dataset
print(transaction_data['nameDest'].nunique())

# Number of unique originating accounts in the dataset
print(transaction_data['nameOrig'].nunique())
# Drop the account-name columns, which may hamper predictive power and the accuracy score
transaction_data = transaction_data.drop(columns=['nameOrig', 'nameDest'])
print(transaction_data.sample(5))
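# Count fraudulent versus legitimate transactions
print(transaction_data['isFraud'].value_counts())
# Plot the total amount per transaction type, split by fraud label and fraud flag.
# catplot creates its own figure, so it is sized with height/aspect rather than plt.figure.
# kind='bar' is added here so that estimator=sum takes effect.
sns.catplot(x='type', y='amount', estimator=sum, kind='bar',
            hue='isFraud', col='isFlaggedFraud',
            height=8, aspect=0.75,
            data=transaction_data)
plt.show()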
# Distribution of transactions over the 'step' column (in PaySim, a step is a unit of simulated time), showing how transaction volume is spread across the dataset
plt.figure(figsize=(12,8))
plt.ylim(0,80000)
sns.histplot(transaction_data['step'], kde=True)
plt.show()
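The isFraud counts printed above give only the overall class balance. One possible next step is to break the fraud rate down by transaction type. The sketch below does this on a tiny hand-made DataFrame that only mimics the `type`, `amount` and `isFraud` columns; the rows and the type values are placeholders, not real PaySim data, and the name `toy` is hypothetical.

```python
import pandas as pd

# Made-up rows that only mimic the column names used above (not real data).
toy = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "CASH_OUT", "CASH_OUT", "PAYMENT", "PAYMENT"],
    "amount": [1000.0, 250.0, 500.0, 75.0, 20.0, 40.0],
    "isFraud": [1, 0, 1, 0, 0, 0],
})

# Per transaction type: number of rows, number of frauds, and share of frauds.
summary = (
    toy.groupby("type")["isFraud"]
       .agg(transactions="count", frauds="sum", fraud_rate="mean")
       .sort_values("fraud_rate", ascending=False)
)
print(summary)
```

The same `groupby` call should work on `transaction_data` in place of `toy`, assuming the columns are named as in the code above.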