Data balancing for credit card fraud detection using complementary neural networks and SMOTE algorithm

(1)

Data Balancing for Credit Card Fraud Detection using Complementary Neural Networks and SMOTE Algorithm

by Vrushal Shah

A thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science (MSc.) in Computational Sciences

The Faculty of Graduate Studies Laurentian University Sudbury, Ontario, Canada

(2)

II

THESIS DEFENCE COMMITTEE/COMITÉ DE SOUTENANCE DE THÈSE Laurentian Université/Université Laurentienne

Faculty of Graduate Studies/Faculté des études supérieures

Title of Thesis

Titre de la thèse Data Balancing for Credit Card Fraud Detection using Complementary Neural Networks and SMOTE Algorithm

Name of Candidate

Nom du candidat Shah, Vrushal

Degree

Diplôme Master Science

Department/Program Date of Defence

Département/Programme Computational Sciences Date de la soutenance August 26, 2020

APPROVED/APPROUVÉ

Thesis Examiners/Examinateurs de thèse:

Dr. Kalpdrum Passi

(Supervisor/Directeur(trice) de thèse)

Dr. Ratvinder Grewal

(Committee member/Membre du comité)

Dr. Oumar Gueye

(Committee member/Membre du comité)

Approved for the Faculty of Graduate Studies Approuvé pour la Faculté des études supérieures Dr. David Lesbarrères

Monsieur David Lesbarrères Dr. Nirbhay Chaubey Dean, Faculty of Graduate Studies

(External Examiner/Examinateur externe) Doyen, Faculté des études supérieures

ACCESSIBILITY CLAUSE AND PERMISSION TO USE

I, Vrushal Shah, hereby grant to Laurentian University and/or its agents the non-exclusive license to archive and make accessible my thesis, dissertation, or project report in whole or in part in all forms of media, now or for the duration of my copyright ownership. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also reserve the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report. I further agree that permission for copying of this thesis in any manner, in whole or in part, for scholarly purposes may be granted by the professor or professors who supervised my thesis work or, in their absence, by the Head of the Department in which my thesis work was done. It is understood that any copying or publication or use of this thesis or parts thereof for financial gain shall not be allowed without my written permission. It is also understood that this copy is being made available in this form by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner.

(3)

III

Abstract

This Research presents an innovative approach towards detecting fraudulent credit card transactions. A commonly prevailing yet dominant problem faced in detection of fraudulent credit card transactions is the scarce occurrence of such fraudulent transactions with respect to legitimate (authorized) transactions. Therefore, any data that is recorded will always have a stark imbalance in the number of minority (fraudulent) and majority (legitimate) class samples. This imbalanced distribution of the training data among classes makes it hard for any learning algorithm to learn the features of the minority class. In this thesis work, we analyze the impact of applying class-balancing techniques on the training data namely oversampling (using SMOTE algorithm) for minority class and under sampling (using CMTNN) for majority class. The usage of most popular classification algorithms such as Artificial Neural Network (ANN), Support Vector Machine (SVM), Extreme Gradient Boosting (XGB), Logistic Regression (LR), Random Forest (RF) are processed on balanced data and which results to quantify the performance improvement provided by our approach. The experiments show that the hybrid approach which integrates Complementary Neural Network and Synthetic Minority Oversampling Technique gives a Quantitative performance in terms of Accuracy of 99% and 99.7% of AUC with Random Forest Classification Algorithm compared to simple undersampling and oversampling.

Key–Words: Fraud Detection, Complementary Neural Network, SMOTE, oversampling, under sampling, Class Imbalance

(4)

IV

Acknowledgments

I would like to thank my supervisor, Dr. Kalpdrum Passi, who helped and guided me step by step and imparted deep knowledge and was compassionate and kind with me throughout the completion of this project. I would also like to thank my family and friends who were always there to support me. I would not have been able to complete my work without their immense support.

(5)

V

List of Figures

Figure 1.1 Credit Card Fraud Reports in the United States [8] ... 4

Figure 2.1 Temporal feature extraction from transaction data ... 11

Figure 3.1 Class Imbalance (Showing Skewness) ... 14

Figure 4.1 System Flowchart ... 19

Figure 4.2 Proposed Pipeline ... 20

Figure 4.3 Area under the ROC curve ... 24

Figure 4.4 Data Balancing Network ... 25

Figure 4.5 Complementary Neural Network Design ... 26

Figure 4.6 Complementary Neural Networks ANN Architecture ... 27

Figure 4.7 Synthetic Minority Over-sampling Technique [23] ... 31

Figure 4.8 Scatter plot of raw data (before applying SMOTE) ... 32

Figure 4.9 Scatter plot of processed data (after applying SMOTE) ... 33

Figure 4.10 Classification Algorithms (used for final analysis) ... 34

Figure 4.11 Maximum Margin Hyperplanes (MMH) separating two classes in SVM [24] ... 36

Figure 4.12 Logistic or Sigmoid function [25] ... 38

Figure 4.13 Neural Network Architecture with 1-Hidden Layer [26] ... 39

Figure 4.14 Random Forest example with constituent Decision Trees [27] ... 41

Figure 5.1 Analysis of incremental training of CMTNN ... 46

Figure 5.2 Analysis of incremental training of CMTNN ... 47

(8)

1

Chapter 1 Introduction

1.1 Introduction

Digital payment has experienced revolutionary changes over the last decade and is expected to keep up that trend for many years to come. Payment systems (however primitive) have existed for a very long time in the past. Starting from the ancient barter systems for exchanging goods to the modern one-click payment systems, we have come a long way. But with significant disruptions arises the need for more significant regulations, especially when it is related to financial transactions. With the technological advancement leading in smarter payment systems, there is an imminent need for a fast-paced evolution of the security systems that regulate these payments. The need for safer and more secure transactions has been globally accepted and is an active area of research. For modern regulatory security frameworks, it is equally important that they be robust (to adapt to different payment processes) and scalable (to afford large quantities of transactions).

The need for such security systems is alleviated due to rising cases of online thefts related to fraudulent transactions, causing people to lose millions every year. The different types of online payment frauds include (but are not limited to):

1. Friendly Fraud 2. Triangulation 3. Clean fraud 4. Identity theft

These are discussed in the next section.

(9)

2 1.2 Online Payment Frauds

For the scope of this work, online payments can be broadly classified into "Card-present" and

"Card-not-present" transactions. In the former type of payments, the user (or the authorized cardholder) has to present the card physically at the time of the purchase (such as in-store purchases) to complete their transactions. In contrast, in the latter type of payments, the authorized user only verifies the security credentials and can perform a successful transaction without physically presenting the card. The Most conventional ways that fraudsters use to deceive merchants and customers are:

1. Friendly Fraud

In this type of fraud, a customer-first makes a digital purchase with their credit card and then communicates their credit card issuer to argue for the charges by behaving like they never made the payment or the payment was made, but they did not receive the

product/service. Obviously, not all chargebacks are fake – ordinarily; these cases may be legitimate. But, with very little evidence to verify the story, it has become a popular method for fraudulent activities in the last few years. It makes direct misfortune vendors as well as gets them penalized by the card issuer.

2. Triangulation

The Triangulation fraud method involves an unsuspecting customer, a fake online store, and a mechanism to steal data. In this scenario, once the customer has made a transaction of its purchase, then the fraudulent merchant immediately takes his card details and cancels the transaction. Since the transaction was never successful, most of such activities go unnoticed by automated systems. The fake merchant now has access to confidential card details of the consumer and may use it later.

3. Clean fraud

As discussed earlier, Friendly Fraud takes cover behind fake identities, or stolen data, hackers or fraudsters that go for an open theft usually have a great source of knowledge about the cardholders and their credit card information, and they use this customer

(10)

3

information to fool the systems. In this type of fraud, the criminal can be able to steal all- important real data and information and uses it to make a purchase that looks legitimate.

4. Identity theft

Another kind of online payment fraud that can be very basic is identity theft. In this type of fraud, the pretender secures the key details of the customer-holding credit card and its personally identifiable information and then uses it for fraudulent purchases on the Internet. What sets this type of fraud apart is that the fraudster does not even need the card credentials, but only personally identifiable information related to the consumer.

Now, while it can be challenging to tackle all the forms of fraudulent transactions with one single approach, we have made an effort to detect "Triangulation", "Friendly Fraud" and

"Identity Theft" using the transaction data analysis.

We have focused on the transaction data of credit cards, which is readily available for our analysis. Credit Card frauds are among the most natural and most rapidly rising scams in the modern world. While they provide a seamless service of executing monetary transactions, it also makes the task of detecting a fraudulent transaction equally harder. E-commerce and other similar online transactions account for a vast portion (roughly 80 %) of such fraudulent transactions.

With the growing number of credit card frauds (Figure 1.1), researchers have shown increasing interest related to fraud detection using classical machine learning approaches [1], [5], [7] as well as newer AI-based algorithms [6].

(11)

4

Figure 1.1 Credit Card Fraud Reports in the United States [8]

But one of the biggest bottlenecks in solving the classification problem (fraudulent transaction vs. legitimate transaction) is the imbalance in the transaction data. Since it is evident that the number of genuine transactions is far more than the number of fraudulent transactions, it becomes tough for any learning algorithm to be trained on the features of the minority class samples (fraudulent credit card transactions).

Several other challenges associated with credit card detection are namely fraudulent behavior profile are dynamic, that is fraudulent transactions tend to look like legitimate ones; credit card transaction datasets are rarely available and highly imbalanced (or skewed); optimal feature (variables) selection for the models; suitable metric to evaluate the performance of techniques on skewed credit card fraud data.

(12)

5 1.3 Imbalance Problem

The dataset used in this study is sourced from the ULB (Université Libre de Bruxelles) Machine Learning Group, and a description is found in [31]. The dataset holds credit card transactions made by European cardholders in September 2013. This dataset presents transactions that occurred in two days, consisting of 284,807 transactions. The positive class (fraud cases) make up 0.172% of the transaction data. The dataset is highly unbalanced and skewed towards the positive class

It is essential to understand that since over 99% of the transactions are non-fraudulent, an algorithm that always predicts that the transaction is non-fraudulent would achieve an accuracy higher than 99%. Thus, the accuracy metric in itself can be highly misleading in such cases of an extremely skewed dataset. Thus, our focus throughout this study has been on improving higher- order metrics like "AUC (Area under ROC curve)," "AUPRC: Area under Precision-Recall Curve," "F1 score," etc.

1.4 Objective of the Study & Outline of the thesis

The main objective of this study is to present an information-theory centric approach for handling Imbalanced Data Distribution using an Intelligent Sampling Network (or Data Balancing Network) based on CMTNN (complementary neural networks) and SMOTE (Synthetic Minority Over-sampling Technique). Different versions of the Data Balancing Network were tried and analyzed to achieve optimum performance. Firstly, different versions of the Data Balancing Network were compared for "the extent of Data Compression" and "the improvement in the accuracy metrics." Secondly these results were used to select the best version of the Data Balancing Network, which was later fixed in the pipeline and experimented with different classification algorithms to understand its impact on the predictions.

Imbalanced Binary Classification is a problem domain that has a lot of extremely well-written literature and yet is an unsolved problem. Through this study, we intend to register our contribution as well for this domain. This active research area is also closely linked to a lot of

(13)

6

practical scenarios that have significant implications in our lives. For the scope of this study, we have chosen to focus on the problem of "Credit Card Fraud."

⇒ Research Question (that our work aims to answer)

1. How do we handle and make use of imbalanced data to the tune of 0.172% minority class samples?

2. Can we develop a domain-independent technique to reduce the extent of the imbalance in the dataset?

3. How well do standard performance metrics like “Accuracy” and “Loss” fare when evaluating the Imbalanced Binary Classification problem?

4. What existing metrics or any new metric is best suited for performance evaluation in Imbalanced Binary Classification?

It is extremely important to understand at this stage that our research is not aimed at introducing a novel FDS (Fraud Detection System) for credit card fraud detection. In fact, the primary aim of our work is not aimed at proposing a solution to the problem of "Credit Card Fraud Detection".

The focus of this work is to present a novel "Data Balancing Network" architecture, which can easily be plugged into any Imbalanced Binary Classification problem. The said network will use Deep Learning (CMTNN) and Statistical Machine Learning (SMOTE) to balance out the number of samples for each of the classes without introducing redundant samples in the minority class or removing any crucially important samples from the majority class.

This thesis is organized in the following sequence:

● Chapter 1: An introduction of the problem that we aim to solve has been discussed, so that the reader is mindful of both the need and the benefits of our research work.

● Chapter 2: A literature review comparing existing state-of-the-art techniques is presented. The importance of our work in the context of the existing literature and their limitations is explained.

(14)

7

● Chapter 3: The dataset used in this study is explained and analyzed. And also data preprocessing is explained in this chapter

● Chapter 4: A detailed description and analysis of the Methods and Classifiers used are explained

● Chapter 5: This chapter includes compilation and interpretation of the results.

● Chapter 6: In the final chapter, we conclude our study along with suggestions for future work.

(15)

8

Chapter 2 Literature Review

In this chapter, we provide an account of the background literature related to the existing work in the domain of "Credit Card Fraud Detection," projecting a major focus on the handling of imbalanced transaction data.

In Chapter 1 we discussed a few reasons explaining why detecting fraud is such a difficult task.

In a broad sense, the single biggest drawback (positive drawback) in fraud detection is the fact that the number of genuine cases always highly outbalance the number of fraud cases. This leaves very little scope for analyzing and extracting patterns from fraudulent transactions.

A few more fundamental challenges in detecting fraudulent transactions have been very well- drafted in [15] and [16]:

● Any authentic, real-world data will always have an extremely large number of genuine transaction samples while containing only a handful of fraudulent cases. This is probably because fraudulent transactions, although they have increased rapidly over time but are a case of an anomaly in the regular genuine transactions. To make it worse, not all of the transactional data can be made available publicly due to the user-sensitive nature of such data. This makes it extremely difficult to create and validate any kind of detection system. To tackle this issue, people have tried to use statistical techniques such as oversampling, under sampling, and other such data manipulative methods. While these techniques do prove to be useful in prediction for some Machine Learning algorithms, there is an inherent risk of losing the authenticity of the dataset after applying these techniques.

● Another important issue to highlight is the practical implementation side. In the domain of banking, fraudulent transactions must be detected in real-time such that it can be blocked before it has been executed. Thus, implying the need for a robust and accurate FDS (Fraudulent Detection systems) which is also capable of providing real-time

(16)

9

classification results (classifying between genuine and fraud). It also needs to be easily scalable in order for it to not become obsolete sooner than later.

Based on these characteristic challenges it can be concluded that modern Statistical Machine Learning approaches like Support Vector Machine [5], Logistics Regression and Artificial Neural Network [6] are the best options to provide a coherent and wholesome solution.

For a very long time, Banks and almost all financial institutions have addressed fraud detection by modeling it through rule-based systems. The primitive rule-based Fraud Detection System is not of much use today as it is neither scalable nor does it settle well with the imbalance observed in the dataset. The underlying flaw with such modeling is perhaps more evident now than ever.

The banking institutions had not anticipated the violent growth and technological disruption that was about to come. With technological advancements in the banking industry, the nature of the transactions diversified and the payment services grew at an unprecedented level. The rule-based systems were not scalable and thus had to be replaced with newer machine-learning-based systems.

One of the most popular architectures for the Fraud Detection System was suggested by Bolten et al. [16]. Bolton defined an outlier as an instance (or a pattern) in a dataset that does not conform to the expected behavior. Outlier detection is also used by intrusion detection systems in the domain of network intrusion, where an intrusion detection system often uses signature/pattern-based or anomaly detection techniques to detect intrusions. While the idea was simple yet powerful, it still suffered an important drawback of being neither scalable nor robust to advancements and disruptions in the technology. Since then, there have been numerous un- supervised and supervised approaches [2], [9], [10], [11], [14] that have been able to solve the problem to some extent but many of them still lacked the balance of scalability and robustness.

Among un-supervised approaches, clustering [8] by Vaishali et al. was one of the earliest to have gained popularity but failed to keep up with the growing amount of transaction data as it became computation heavy and results deteriorated significantly with the growing number of transactions. The earliest supervised approaches as shown by Dhankhad in [5] include Support Vector Machine, Random Forest, Extreme Gradient Boosting (XGB), etc. have almost all shown considerable improvement over the pre-existing clustering approaches. Dhankhad in [5] also experimented with under sampling the majority class to show improvement in performance. It is

(17)

10

from this reference that we garner our motivation to first attack the class imbalance itself and analyze the extent of performance improvement.

For some context, if we look back to several years of development, we can easily notice that there have been strong shifts, both technological and behavioral, in banking technology and day- to-day banking transactions. With each wave of technological disruption, we observe newer and more enhanced payment structures, which effectively alter the characteristic properties of a transaction itself. Thus, in a sense, all newer transactions become an outlier to the existing corpus of genuine transactions. Thus, any method (algorithm, model) employed for fraud detection must also be resilient to the changing technological outlook.

It is also interesting to note that consumer behavioral patterns themselves also contain useful information that could be incorporated into the FDS for better and more robust detections. Thus, more advanced techniques like Hidden Markov Model, Artificial Neural Network [6], Support Vector Machine [5], etc. are better suited for handling such fraud detection tasks. Also, because there is often a lack of "fraud" labeled data available, generative methods are expected to be more robust for anomaly detection than discriminative methods. Discriminative methods try to optimize a decision rule that classifies data into categories. Such methods do not model the relationships between the data and the underlying system process. Generative methods, on the other hand, try to learn a model that describes the system process. Due to the class imbalance in the dataset, discriminative models fail to capture enough discriminating rules to be able to successfully classify the transaction samples.

To tackle this imbalance problem at the data layer itself so that it does not affect the subsequent learning model, researchers have tried a lot of data up-sampling and down-sampling approaches.

A random sampling approach is also used in [9, 10] and reports experimental results indicating that 50:50 artificially distribution of fraud/non-fraud training data generate classifiers with the highest true positive rate and low false-positive rate also hinting that reducing the class imbalance can result in much better performance.

Furthermore, a single transaction often provides not enough information to detect fraud, and thus often more sophisticated aggregate measures are employed in a Fraud Detection System. In Figure 2.1, examples of two aggregated features for transaction T with different time windows are illustrated. For instance, these features could represent the number of performed transactions

(18)

11

for the bank account within the past day or two days, respectively, for features x and y. Then x is two because only transactions T4 and T5 are within the time window of feature x. Similarly, for y, which contains five transactions within the time window of feature y.

Figure 2.1 Temporal feature extraction from transaction data

Important features are often selected by the help of experts. These features are then used to define rules which can then be used by rule-based detection systems, as discussed in [17]. The downside of such a system (as already discussed above) is that it requires many interactions with experts of domain knowledge to maintain the rules. Also, with the growing scale of the number of transactions, it soon becomes extremely hard to be able to appropriately adapt a rule-based system. Nevertheless, people have used popular rule-based techniques like association-rule mining to perform detections [18]. Another rule-based technique that has often been used is Random Forests [7].

To handle the problem of class imbalance along with the exploding amount of transaction data, we have to tackle it via data processing before feeding the data to a learning model. We need an efficient way to under-sample the majority class and up-sample the minority class to establish sufficient class-balance while also keeping the authenticity of the data intact.

In [11], the authors use stratified sampling to under-sample the legitimate records to a meaningful number. It experiments on 50:50, 10:90 and 1:99 distributions of fraud to legitimate cases and reports that the 10:90 distribution has the best performance, perhaps owing to its authenticity to the real-world data distribution.

Thus, we acknowledge two challenges in an undersampling approach. First, simply applying a random sampling approach does not solve the problem, and secondly, the resulting balance among the majority and the minority class will also impact the resultant performance of the classifier. To solve these challenges, we apply a comprehensive solution. We choose

(19)

12

Complementary Neural Network [3] for under sampling because it follows a learning-based approach i.e., it learns the features of the majority class (legitimate transactions) and then removes only those samples from the training data that have relatively low information content and would essentially be redundant in training the classifier. This automatically takes care of the second challenge as the resulting balance among the classes would be an optimum balance i.e., one containing maximum information. This learning-based approach also makes sure that the authenticity of the data is not affected. It retains those majority class samples that contain most information and introduces non-redundant newer minority class samples. This compressed data is then used for training multiple classifiers to observe the performance improvement provided by this approach. In this thesis, we also experiment with different degrees of under sampling, the results of which will be analyzed later on in this thesis.

(20)

13

Chapter 3 Data Processing

In this chapter, our discussion will be centered on the dataset and its processing that we have used for our research work. Please note that the research work that has been presented and the proposed "Data Balancing network" are both data-independent. Thus, they can be easily plugged into any other Binary Classification problem. The choice of our data is completely driven by choice of our use case. To present our proposed pipeline and analyze its effectiveness in solving the imbalanced class distribution problem, we had opted for the "Credit Card Fraud" detection problem. It is for this reason that we have chosen the popular credit card transaction dataset curated and managed by the ULB Machine Learning Group.

Going ahead in this chapter, we will look further into the details of the dataset and the underlying imbalanced class distribution problem. We will also discuss more the existing solutions to this imbalance problem and contrast and compare them to our proposed "Data Balancing Network."

3.1 Dataset Description

The dataset used in our study is sourced from the ULB (Université Libre de Bruxelles) Machine Learning Group, and a description is found in [14]. The dataset contains credit card transactions made by European cardholders in September 2013. This dataset presents transactions that occurred in two days, consisting of 284,807 transactions. The positive class (fraud cases) make up 0.172% of the transaction data. The dataset is highly unbalanced and skewed towards the positive class (as shown in Figure 4.1). It contains only numerical (continuous) input variables, which are as a result of a Principal Component Analysis (PCA) feature selection transformation resulting in 28 principal components. Thus, a total of 30 input features are utilized in this study.

The details and background information of the features could not be presented due to confidentiality issues. The time feature contains the seconds elapsed between each transaction and the first transaction in the dataset. The 'amount' feature is the transaction amount. Feature

(21)

14

'class' is the target class for the binary classification, and it takes value 1 for positive case (fraud) and 0 for negative case (non-fraud).

Figure 3.1 Class Imbalance (Showing Skewness)

It is important to understand that since over 99% of the transactions are non-fraudulent, an algorithm that always predicts that the transaction is non-fraudulent would achieve an accuracy higher than 99%. Thus, the accuracy metric in itself can be highly misleading in this case of the extremely skewed dataset. Thus, our focus throughout this study has been on improving higher- order metrics like "AUC: Area under ROC curve," "AUPRC: Area under Precision-Recall Curve", "F1 score" etc. While all of these metrics were employed during the analysis of Complementary Neural Network and Synthetic Minority Oversampling Technique performance, for visualization purposes and for clarity in the presentation of results, we will be considering only "AUC: Area under ROC curve" and "accuracy" metrics.

3.2 Class Imbalance problem

One of the toughest bottlenecks in credit card fraud detection arises due to the nature of the data.

While there are millions of samples of fraudulent transactions, there are also billions of genuine transactions. Thus, imbalance of the class distribution is inherently present due to the nature of the domain of the problem. The dataset that we have considered contains only 0.172% of samples belonging to the minority class, as shown in Figure 3.1

Negative, 284,807 positive, 490

(22)

15

It is also important to note that a lot of credit cards fraudulent transactions go unreported, and thus, it results in minority class samples being incorrectly labeled as majority class samples, which further increases the problem in classification. So, we are left with a dataset which, although huge in size, contains very few samples of the minority class (fraudulent transactions) while our main focus is directed at detecting the minority class samples correctly. Hence, any algorithm that we design must have a high "True Positive" fraction and low "False Negative"

fraction.

This also perfectly resonates with the practical implications of the detection/classification problem at hand. It is safe to assume that any bank would be more interested in detecting the fraud transactions more accurately while compromising (classifying some "Genuine"

transactions as "Fraud" transactions) on the correct classification of few genuine transactions.

3.3 Solutions to the imbalance problem

Learning from unbalanced datasets is a difficult task since most learning algorithms are not designed to cope with a large difference between the numbers of cases belonging to different classes [19]. There are several methods that deal with this problem, and we can easily distinguish between them based on the level that they operate on. Most of these techniques operate either on data level or on an algorithmic level. The authors in [6], [7] have presented a valuable resource on distinguishing and analyzing the data level techniques and algorithmic level techniques.

At the data level, the unbalanced strategies are used as a pre-processing step to rebalance the dataset or to remove the noise between the two classes, before any algorithm is applied. The aim of such methods is to analyze the underlying (imbalanced) data distribution and modify/process it to a desirable (balanced) data distribution. These methods are primarily focused on obtaining a reasonably well-balanced data distribution and then apply the classification algorithms as usual.

At the algorithmic level, algorithms are themselves adjusted to deal with the minority class detection. In this category, the data distribution is not the center of the focus, and thus no effects

(23)

16

are driven by the need to improve the data distribution itself. Instead, the focus is always to tune and modify the classification algorithm to obtain better classification scores. The classification algorithms are manipulated in a manner such that they learn to handle the stark skewness in the dataset as a part of their classification process.

Data level methods can be grouped into five main categories:

● Sampling,

● Ensemble,

● Cost-based,

● Distance-based

● Hybrid.

Within algorithmic methods instead, we can distinguish between:

● Classifiers that are specifically designed to deal with an unbalanced distribution

● Classifiers that minimize overall classification cost.

The latter is known in the literature as cost-sensitive classifiers [20]. Both data and algorithm level methods that are cost-sensitive target the unbalanced problem by using different misclassification costs for the minority and majority classes. At the data level, cost-based methods sample the data to reproduce the different costs associated with each class. Cost- sensitive classifiers instead directly minimize the costs by using cost specific loss function.

3.3.1 Our proposed solution

Standard data processing techniques like random under-sampling and replicative oversampling have existed among the research community for a long time and have also been successful in several instances. But in our particular case of fraud detection, the aim is not primarily to ease computation (this is usually where normally random sampling proves effective), and replicative oversampling can easily result in overfitting of our classifier. Therefore, in our case, we need a data-centric or information-theory centric approach to under-sampling and oversampling.

The proposed solution uses CMTNN for under-sampling the majority class while preserving training samples that provide crucial information for feature learning. For the oversampling part,

(24)

17

it uses the Synthetic Minority Oversampling Technique (SMOTE) to generate relevant samples for the minority class.

In the subsequent chapters, we will present the details of the research experiments that were carried out and also discuss in detail the underlying theory and motivation behind using the frameworks that were employed. We will look into the details of the architecture of the "Data Balancing Network" and also its building blocks "CMTNN: Complimentary Neural Networks”

and “SMOTE: Synthetic Minority Over-Sampling Technique”. We will also present a detailed account of the implementation of the above-mentioned two algorithms.

(25)

18

Chapter 4 Methods

4.1 Proposed Method

While we do not propose a novelty in the classification algorithm itself, we bring together different techniques of data analytics (Complementary Neural Network and Synthetic Minority Oversampling Technique) and Machine Learning (Support Vector Machine, Random Forest, Extreme Gradient Boosting, Logistics Regression, and Artificial Neural Network) in a systematic manner to tackle both data-level challenges and learning level challenges. As discussed in our experiments' credit card transaction data is highly skewed towards legitimate transactions as compared to fraudulent transactions.

The Flowchart of the overall work is described in Figure 4.1 which shows the flow of the data preprocessing followed by the balancing of the data by undersampling and oversampling using a hybrid approach of CMTNN and SMOTE which is followed by the classification process and to get the final output.

(26)

19

Figure 4.1: System Flowchart

In our proposed method, we have designed a two-block pipeline (Figure 4.2). The first stage deals with Data Balancing Network (balancing the skewness using under sampling and oversampling techniques), and the second stage deals with classification algorithms. The second stage is further divided into two sub-stages. The first sub-stage deals with classical machine learning algorithms, and the other sub-stage deals with advanced deep learning algorithms.

(27)

20

Figure 4.2: Proposed Pipeline

As shown in Figure 4.2, our primary focus has been on the first block of our pipeline, "Data Balancing Network." It is in this section that we perform the majority of the computations. This section although computationally intensive, the two processes i.e., under-sampling (CMTNN Network) and oversampling (SMOTE Algorithm) are independent of each other in the sequential sense and therefore can be parallelized (this will be part of our future work and thus will not be seen implemented as part of this thesis).

We discuss in brief how the “Data Balancing Network” was conceptualized and then explain the improvisations that were incorporated on the way. As explained, the focus of our research was to build a domain-independent "Data Balancing Network" to handle any kind of imbalanced binary classification dataset. The said network should also be simple enough to plug and play into any binary classification algorithm. Thus, considering all such requirements, we have structured our experimental procedures in the following manner:

1. Step 1: Decide upon a particular Binary Classification problem. For the scope of this research study, we have chosen “Credit Card Fraud” detection problem.

2. Step 2: Choose an authentic real-life dataset related to the chosen problem. (For our chosen problem, we decided to use “Credit Card Fraud” dataset provided by the ULB Machine Learning Group on the platform of Kaggle)

(28)

21

3. Step 3: Perform different variations of the proposed “Data Balancing Network” to get the final optimum network architecture and settings.

4. Step 4: After having decided upon the optimal version of the "Data Balancing Network,"

plug it with a few chosen binary classification techniques and analyze the improvement provided by the proposed network. For the scope of our research, we have chosen artificial neural network (ANN), support vector machine (SVM), Random Forest, XGBoost and Logistic Regression.

The experimental procedures enumerated above were diligently followed to establish a standard research framework. This helps to keep the research work easily replicable. Figure 4.1 gives a visual representation of the different stages of our experimental setup. We chose the "Credit Card Detection Problem" because it is one of the most highly imbalanced cases of binary classification, and also one of the most researched domains owing to the immense importance it holds in the real-life scenarios. We chose our dataset from the ULB Machine Learning Group repository(Kaggle). The said research group had curated an authentic version of the credit card transaction dataset containing only 0.172% samples for the class in the minority. We started experimenting with the most basic version of our proposed "Data balancing network" and kept on improving its performance in terms of reduction in the class imbalance. Once having attained a sufficiently good enough balancing of the dataset, we froze that model of the "Data Balancing Network". This best performing version of the network was then plugged together with different binary classification algorithms and used to perform binary classification on our dataset. The binary classification outputs were compared for the set of classification algorithms with and without the "Data Balancing Network" to analyze the improvement in performance provided by our "Data Balancing Network."

For our research, we have considered the following classification algorithms to evaluate the effect of our “Data Balancing Network”:

1. ANN (Artificial Neural Network) 2. SVM (Support Vector Machines) 3. Random Forest

4. XGBoost

5. Logistic Regression

(29)

22

The performance evaluation part holds great importance in our study. Due to the imbalanced nature of the dataset, standard performance metrics like "accuracy" and "Loss (performance metrics to analyse how bad the model’s prediction is)" fail to present a true picture of the performance of any algorithm. For example, the dataset that we have used contains only 0.017%

of fraudulent transactions. Thus, any classifier can achieve an accuracy score of 99.983% by classifying all the transactions as "Genuine." Therefore, for the scope of this study, and also based on the use case (Credit Card fraud detection), we have decided to keep our focus more on the "AUC (Area under ROC curve)” metric.

In the next section, we present our reasoning of using the "AUC (Area under the ROC curve)"

metric for determining the classification performance score. For each of the classification algorithms that were used in our study, we have recorded the "accuracy" and the "AUC" scores using three different versions of our "Data Balancing Network." The "accuracy" metric was recorded only for observation purposes and was not utilized to analyze the performance improvement provided by our "Data Balancing Network."

4.2 Evaluation Framework

In this section, we will focus on the performance evaluation metrics employed for our experiments. Primarily we will discuss the need for not using the standard "accuracy" and "loss"

values for scoring our system and we present our reasoning for using “AUC” as a good performance metric which reflects a true picture of the classification performance for Imbalanced Binary Classification problem.

“Accuracy” scores in Classification have been long used for evaluating the strength of a classifier. The reason for its popularity is 3-fold:

● It is easy to Calculate

● It is easy to interpret.

● It is easy to present an analysis based on a single number that sums up the performance strength of the complete system.

(30)

23

But in the case of Imbalanced Binary Classification, when the class distribution in the samples is severely skewed i.e., any one of the classes becomes extremely dominant owing to discriminatingly large numbers of samples of that class, "accuracy" scores become an unreliable metric of performance. The reason for this unreliability stems from the probability theory. Since one of the classes has an extremely large number of samples in the dataset, sometimes to the tune of 99.9%, probabilistically speaking, the chance of occurrence of that particular class has become as high as 99.9%. Thus, any naive algorithm can simply predict that class for all of the samples in the dataset and inaccurately achieve an accuracy score of 99.9%. This score, although looks impressive at first glance, is in no way close to the true performance measure of the algorithm, which is blindly predicting the occurrence-dominant class for all samples.

It is interesting to note that the "accuracy" metric is not at fault here. In fact, it still delivers the correct score. The "accuracy" score is still a true value of classification accuracy. But the predominant perception in the mind of a data-science or machine-learning researcher will lead him to believe that 99.9% is a great score, while actually, 99.9% is achievable without any thought and thus is probably not as great as it may look. Hence, perhaps in the actual case of imbalance classification, 99.999% might be a considerable score. But again, analyzing accuracy scores in only a range of 0.1% is not practically feasible and will introduce problems of its own.

Thus, the need arises for an evaluation metric:

● which presents a true picture of the performance strength of any given algorithm.

● which is simple enough to be used for critical comparison of relative performance strength of different classification algorithms

● which survives the test of Imbalanced Class problem and presents a reasonable score value focusing more on "False Negatives" and "True Negatives." (Assuming that the minority class is considered as the positive class)

Keeping all the above considerations in mind, we suggest using the "AUC" scores for model performance comparison. AUC is the area under the ROC curve. Since the ROC curve is framed inside a unit area square, the resultant AUC score is a real number ranging between 0.0 and 1.0.

One important reason for "AUC" scores to be better able to reflect true model performance strength for Classification when compared to "accuracy" scores is the way the two scores have been defined:

(31)

24

● Accuracy: For a given threshold on the prediction value of the model, it calculates the fraction of samples that have been correctly assigned to the class of their belonging, regardless of which class they belong to.

● AUC: it measures the probability that considering two random samples, one from the majority (negative) and one from the minority (positive) class, the classification algorithm will output a higher value for the sample from the positive class as compared to the sample from the negative class.

Figure 4.3 Area under the ROC curve

The accuracy depends on the threshold chosen, whereas the AUC considers all possible thresholds. Because of this, the AUC is better suited to present a "broader" view of the classification performance of any given algorithm.

Having established a strong understanding of the Research Framework, we next discuss the Data Balancing Network which targets our major focus presented in section 4.3

(32)

25 4.3 Data Balancing Network

In this section, we discuss the top-level overview of our "Data Balancing Network" (Figure 4.4).

We also explain our motivation for using this architecture and contrast it with other existing solutions to handling imbalance class distribution in a dataset. This section will set up the base for subsequent articles, which will further discuss each of the parts (CMTNN Network and SMOTE algorithm) of the "Data Balancing Network" and explain the implementation procedure that we have followed in building up this network.

Figure 4.4 Data Balancing Network

Standard data processing techniques like random under-sampling and replicative oversampling have existed among the research community for a long time and have also been successful in several instances. But in our particular case of fraud detection, the aim is not primarily to ease computation (this is usually where typically random sampling proves useful), and replicative oversampling can easily result in overfitting of our classifier. Therefore, in our case, we need a data-centric or information-theory centric approach to under-sampling and oversampling.

Thus, according to Figure 4.4 we have used CMTNN Network for under-sampling while preserving training samples that provide crucial information for feature learning. For the oversampling part, we have used the Synthetic Minority Oversampling Technique (SMOTE) to generate relevant samples for the minority class.

(33)

26 4.4 Complementary Neural Network (CMTNN)

In recent years, Complementary Neural Networks (CMTNN) based on feedforward neural networks have been proposed to deal with both binary and multiclass classification problems [12], [13]. Instead of considering only the correct information, CMTNN found both truth and falsity information in order to enhance the classification results. CMTNN consists of truth neural network and Falsity neural network created based on truth and falsity information, respectively.

It was found that CMTNN provides better performance compared to the traditional feedforward neural networks [12], [13]. Figure 4.5 shows the Complementary Neural Network design.

Figure 4.5 Complementary Neural Network Design

The Complementary Neural Network Design is designed in an efficient manner where the training samples are given to both the networks but the targets are complementary to each other.

Then the training samples are passed through learned function of each neural network and then combined with an aggregator function where the unnecessary samples are removed

In our design of the CMTNN (Figure 4.6), we train two Artificial Neural Networks “Truth NN”

and “Falsity NN”. Both networks are structurally identical in design shown in Figure 4.6 and follows the same ANN architecture inherently. The only difference is that Truth NN uses the

(34)

27

original targets/labels while training, whereas the Falsity NN uses complementary targets while training.

Figure 4.6 Complementary Neural Networks ANN Architecture

In our Architecture of CMTNN (Figure 4.6) describes the architecture of single neural network such as “Truth NN” and “Falsity NN”. Here we use three hidden layers and some functions such as Relu, Dropout and BCE loss which are explained below

Relu: Relu is one of the common activation function of Artificial Neural Network it is also known as rectified linear unit. The mathematical function for Relu is y=max(0,x) where x is a input to the neuron

Dropout: It is a function which is used as a regularization technique in Artificial Neural Network to prevent overfitting of values

BCE loss: The BCE loss is used when we have binary classification which works in the following way when we have one output node and need to classify into two different classes

4.4.1 Implementation of CMTNN network

To apply CMTNN for the under-sampling problem, Truth NN and Falsity NN are employed to detect misclassification patterns i.e. all the patterns classified incorrectly. The intersection of these incorrectly typed samples is then removed from our training data.

(35)

28

a. Truth NN is trained on the training data keeping the target values as it is to learn a classification function (T). Falsity NN is trained on the training data with target values complementary to that provided for Truth NN to learn a classification function (F).

b. Both the networks after reaching convergence are used to classify the majority class of the training data individually.

c. The misclassification patterns i.e., patterns for which the network output is different from the actual output (for Falsity NN actual output is complementary to the actual output of the Truth NN), are stored in Mtruth and Mfalse.

d. Then we find the intersection of Mtruth and Mfalse. These intersection samples are stored in Mfinal. The patterns in Mfinal, in a sense are output of the aggregated function (U).

This aggregated function is just a combination of intersection over the classification functions (T) and (F).

e. In the final step, we remove the patterns of Mfinal from our training dataset, thus effectively under-sampling our majority class.

4.4.2 Code Snippets for CMTNN Truth Neural Network

# Creating Torch dataset for training Truth NN train_ds_tnn =

torch.utils.data.TensorDataset(torch.tensor(xtrain).float(), torch.tensor(ytrain).float())

valid_ds_tnn = torch.utils.data.TensorDataset(torch.tensor(xtest).float(), torch.tensor(ytest).float())

# Creating Torch dataloader for training Truth NN bs =128 # Batch Size

train_dl_tnn = torch.utils.data.DataLoader(train_ds_tnn, batch_size=bs) valid_dl_tnn = torch.utils.data.DataLoader(valid_ds_tnn, batch_size=bs)

# Static Model Parameters n_input = xtrain.shape[1]

n_output = 1 n_hidden = 20

# Initializing the Truth NN model

(36)

29

TNN =

Classifier(n_input=n_input,n_hidden=n_hidden,n_output=n_output,drop_prob=0 .2)

# Static Training Parameters lr = 0.005 # Learning Rate

pos_weight = torch.tensor([5]) # weight adjustment for class imbalance opt = torch.optim.SGD(TNN.parameters(), lr=lr, momentum=0.9) #Optimizer loss_func = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight) #Loss function

#loss_func = torch.nn.BCELoss()

n_epoch = 50 # No. of epochs to run training

# Start Training

train_loss_tnn, val_loss_tnn =

train(n_epoch,TNN,loss_func,opt,train_dl_tnn,valid_dl_tnn)

In the first step of CMTNN we create Truth NN with original training samples and with a batch size of 128 which directly describes that number of samples in training in one iteration After that the initialization of TNN is done and static training parameters are given such as Learning Rate, Positive weight, Optimizer and Loss function. Also one of the important parameters epochs which describes the number of passes of training data iteration through the neural network is given.

Falsity Neural Network

# Preparing complement targets for "Falsity NN"

ytrain_fnn = 1-ytrain ytest_fnn = 1-ytest

# Creating Torch dataset for training Falsity NN train_ds_fnn =

torch.utils.data.TensorDataset(torch.tensor(xtrain).float(), torch.tensor(ytrain_fnn).float())

valid_ds_fnn = torch.utils.data.TensorDataset(torch.tensor(xtest).float(), torch.tensor(ytest_fnn).float())

# Creating Torch dataloader for training Falsity NN bs =128 # Batch Size

train_dl_fnn = torch.utils.data.DataLoader(train_ds_fnn, batch_size=bs) valid_dl_fnn = torch.utils.data.DataLoader(valid_ds_fnn, batch_size=bs)

(37)

30

# Static Model Parameters n_input = xtrain.shape[1]

n_output = 1 n_hidden = 20

# Initializing the "Falsity NN" model FNN =

Classifier(n_input=n_input,n_hidden=n_hidden,n_output=n_output,drop_prob=0 .2)

# Start Training

train_loss_fnn, val_loss_fnn =

train(n_epoch,FNN,loss_func,opt,train_dl_fnn,valid_dl_fnn)

The second step is similar to Truth Neural Network but here the complements of training targets are given as an input.

Combining Truth Neural Network and Falsity Neural Network

M_intersection = list(set(M_fnn).intersection(M_tnn)) M_union = list(set(M_fnn).union(M_tnn))

xtrain_cm = np.delete(xtrain, M_union, axis=0) ytrain_cm = np.delete(ytrain, M_union, axis=0)

Finally, both the neural networks are combined where intersection and union function is used where the union samples are deleted/removed/ undersampled, only the samples with intersection remain.

4.5 SMOTE Algorithm

One approach to addressing imbalanced datasets is to oversample the minority class. The most straightforward approach involves duplicating examples in the minority class, although these examples do not add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.

(38)

31

This technique increases the number of minority class instances by interpolation methods. The minority class instances that lie together are identified before they are employed to form new minority class instances. This technique is able to generate synthetic instances rather than replicate minority class instances; therefore, it can avoid the over-fitting problem.

4.5.1 Implementation of SMOTE algorithm

Synthetic Minority Over-sampling Technique (SMOTE) algorithm applies the KNN approach where it selects K-nearest neighbors, joins them, and creates the synthetic samples in the space (Figure 4.7). The algorithm takes the feature vectors and its nearest neighbors and computes the distance between these vectors. The difference is multiplied by a random number between (0, 1), and it is added back to feature. We have applied SMOTE only on our minority class samples to increase the number of minority class samples and make it equal to the remaining samples in the majority class (remaining after applying CMTNN).

Figure 4.7 Synthetic Minority Over-sampling Technique [23]

It is important to note that in our workflow, we start with a huge bias in the number of samples where the minority class constitutes only 0.172% of the total transaction samples. We then under sample the majority class by applying CMTNN and then oversample the minority class to match the number of samples in the reduced majority class.

(39)

32

To analyze the effect of SMOTE alone, first, we take a fraction of the raw data (we had used 60% of the information for this part). Then, we first generate a scatter plot of this fraction of data before applying SMOTE techniques. This will also form a variation of our "Data Balancing Network".

Figure 4.8 Scatter plot of raw data (before applying SMOTE)

Then we take this fraction of the raw data and run the SMOTE algorithm over it to obtain a balanced class distribution in the dataset's samples. A scatter plot is then generated again to analyse the class distribution of the data graphically.

Figure 4.9 shows a scatter plot for the class distribution of the dataset after restoring class balance on applying SMOTE.

(40)

33

Figure 4.9 Scatter plot of processed data (after applying SMOTE)

From the two scatter plots, it is clearly observable that SMOTE performs reasonably well in generating new samples to increase the number of samples in the minority class.

After our Major focus is done using CMTNN and SMOTE the balanced data is then applied to the classification Algorithms which will be discussed in Section 4.6.

4.6 Classification Algorithms used for final analysis

The next block in our pipeline (after the Data Balancing Network) is the classification algorithms block (Figure 4.10). For our experiments, we had narrowed down our list to five classifiers (SVM, Random Forest [7], XGBoost [10], Logistic Regression, ANN [6]). The data, after being

(41)

34

processed from the Data Balancing Network is then passed to the classification algorithms block.

Figure 4.10 shows the classification algorithms used in the final stage.

Figure 4.10 Classification Algorithms (used for final analysis)

The choice of Classifiers was made based on existing papers that have employed these classifiers for credit card fraud detection or similar tasks, and also on their frequent use in the research community. The ANN classifier follows a simple network architecture of “3 Hidden Layers”, “1 ReLU block”, and “1 Dropout Layer”. The final output of the “Dropout Layer” was passed through a “Sigmoid Layer'' and then fed to “Binary Cross-Entropy Loss Layer." Structurally, the architecture is identical to that of CMTNN Network.

We will keep our focus at presenting our results obtained from this study and extracting meaningful insights. In particular, we will focus on analyzing the improvement in AUC score provided by our “Data Balancing Network” for each of the classification algorithms.

Earlier, we had discussed how we could frame the "fraud detection" problem as a variant of

"Outlier detection". In this section, we will see a more generic and more popular method of framing the "fraud detection" problem as a classification problem. Wherein, it is assumed that you have two classes - "Genuine" and "Fraudulent" and you train a classifier on your dataset to learn how to assign a class to any given transaction event. Supervised learning approaches are best suited for this kind of problem.

In this thesis, we would be utilizing several supervised learning classifiers to demonstrate the effect of our "Data Balancing Network." As discussed above, the supervised learning approach is

(42)

35

a machine learning approach that utilizes training data with the target classes in the data set to produce an inferred function that pairs an input to the desired output value. Popular choices of supervised learning-based classifiers include:

 Support Vector Machine

 Logistic Regression

 Artificial Neural Networks

 Extreme Gradient Boosting

 Random Forest

Support Vector Machine and Neural Networks are one of the popular trainable classifiers for multi-dimensions and continuous features. They are particularly impressive in their ability to extract higher-order hidden patterns and features in the dataset. Neural network models and Support Vector Machine require huge training dataset sizes to achieve their maximum prediction accuracy, which again becomes a bottleneck as most of the data related to fraud detection suffers from the "Imbalance Problem."

4.6.1 Support Vector Machines (SVM)

In 1995, Cortes and Vapnik first introduced Support Vector Machines (SVM). This Machine Learning algorithm discovers a special kind of linear model, the maximum margin hyperplane.

The training instances are correctly classified by the separation of classes into a search space using hyperplane. The maximum margin hyperplane (MMH) is the one that gives the most significant separation between the classes – it comes no closer to any of the classes than it has to.

The instances that are closest to the maximum margin hyperplane – the ones with minimum distance to it – are called support vectors. There is always minimum one support vector for each class, and often there are more. The optimal hyperplane is found by maximizing the width of the margin. As shown in Figure 4.11 below, the margin is the distance between the separating hyperplane and the closest positive class and negative class

(43)

36

Figure 4.11 Maximum Margin Hyperplanes (MMH) separating two classes in SVM [24]

In situations that the classes are not entirely separable, the SVM algorithm finds the hyperplane that maximizes the margin while minimizing the misclassified instances.

There is some possibility where a nonlinear region can separate the classes more effectively.

Instead of fitting nonlinear curves to the data, SVM determines a dividing line with the help of a kernel function to map the data into a different space where a hyperplane can be used to do a linear separation. Kernel mapping function is compelling because it allows SVM models to perform separations even with very complex boundaries. An infinite number of kernel mapping functions can be used. Still, the Radial Basis Function (RBF) has been found to work well for a wide variety of applications, including credit card fraud [27].The transformation to a high- dimensional space is done by replacing every dot product in the SVM algorithm with the Gaussian radial basis (GRB) function kernel as follows:

, exp ∥ ∥ , 0

, .

where , is the kernel function, and φ(x) is the transformation function [28].

(44)

37 4.6.2 Logistic Regression

When the dependent variable takes only two values, and the independent variables are continuous, categorical, or both that’s when Logistic regression is used. The logistic regression method is ideal when classifying outcomes that only have two values because the logistic curve is limited to values between 0 and 1.

In credit card fraud detection, the dependent variable would take on a value of 0 (legitimate transaction) or 1 (fraudulent transaction). Unlike ordinary linear regression, however, logistic regression does not assume a linear relationship between the independent variables and the dependent variable. Nor does it imply that the dependent variable or the error terms are distributed normally.

We can call a Logistic Regression a Linear Regression model, but the Logistic Regression uses a more complex cost function, this cost function can be defined as the 'Sigmoid function' or also known as the 'logistic function' instead of a linear function. To map predicted values to probabilities, we use the Sigmoid function. The feature maps any real value into another value between 0 and 1. For our particular case, Logistic Regression would output probability values indicating the probability of a transaction belonging to the "Genuine" class or "Fraudulent" class.

(45)

38

Figure 4.12 Logistic or Sigmoid function [25]

Figure 4.12 shows a graphical representation of the "Sigmoid" function. Thus, the output of Logistic Regression is mapped to a value lying between 0 and 1. They are thus resulting in a probability of belonging to a particular class.

4.6.3 Neural Networks

Artificial Neural Networks (ANN) are computational models that try to mimic our bodies' biological neural networks and quickly adapt to change. This mathematical model consists of interconnected artificial neurons (nodes) that can receive one or more inputs and sums them to produce a prediction (output). A neuron has two modes of operation: training mode, and usage mode. In training mode, the neuron can be taught to associate a particular prediction with an input pattern. While in usage mode, if the neuron detects a taught input pattern its associated prediction is outputted.

(46)

39

Figure 4.13 Neural Network Architecture with 1-Hidden Layer [26]

A neural network consists of three essential layers:

Input Layer: As the name suggests, this layer accepts all the inputs provided by the programmer.

Hidden Layer: Layer between the input and the output layer is a set of layers known as Hidden layers. In this layer, computations are performed, which result in the output.

Output Layer: The inputs go through a series of transformations via the hidden layer which finally results in the output generation at this layer.

The effect of each input's contribution to the final prediction is dependent on the weight of the particular input. To determine a neural network that is an accurate predictor, appropriate weights for the connections must be determined. The most famous method used to determine the optimal connection weights is called backpropagation. This method was introduced by Rumelhart et al.

[32], and through their work artificial neural network research gained recognition in machine learning.

With the evolution of the technological landscape, computation has become more affordable and viable over the years, which has given a significant boost to the machine learning industry. This has also led to a multitude of innovations in the Neural Network space itself. Deeper and more