New Hybrid Data Preprocessing Technique for Highly Imbalanced Dataset

Esraa Faisal Malik; Khai Wah Khaw; XinYing Chew

doi:10.31577/cai_2022_4_981

Authors

Esraa Faisal Malik, School of Management, Universiti Sains Malaysia, 11800 Gelugor, Penang, Malaysia
Khai Wah Khaw, School of Management, Universiti Sains Malaysia, 11800 Gelugor, Penang, Malaysia
XinYing Chew, School of Computer Science, Universiti Sains Malaysia, 11800 Gelugor, Penang, Malaysia

Keywords:

Cost-sensitive learning, hybrid, imbalance dataset, resampling techniques

Abstract

Class imbalance is one of the most challenging problems in real-world datasets. Because the majority class greatly outnumbers the minority class, results can be misleading, as conventional machine learning algorithms were designed on the assumption of an equal class distribution. The purpose of this study is to build a hybrid data preprocessing approach that deals with class imbalance by combining resampling approaches with cost-sensitive learning (CSL) for fraud detection on a real-world dataset. The proposed hybrid approach consists of two steps. The first step compares several resampling approaches to find the technique with the highest performance on the validation set. The second step applies CSL with the optimal weight ratio to the resampled data from the first step. The hybrid technique was found to have a positive impact, with F2-measures of 0.987, 0.974, 0.847, and 0.853 for random forest (RF), decision tree (DT), XGBoost, and LightGBM (LGBM), respectively. Additionally, it obtained the highest prediction performance relative to the conventional methods.
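
The two-step procedure described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation. It assumes scikit-learn is available, uses a synthetic dataset in place of the real-world fraud data, and substitutes plain random oversampling and undersampling for the resampling approaches, which the abstract does not name. The helper names, the candidate weight list, and all hyperparameters are hypothetical. The F2-measure weights recall more heavily than precision: F2 = 5PR / (4P + R).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, highly imbalanced binary data (assumption: label 1 is the minority).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.97, 0.03], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)


def random_oversample(X, y, seed=0):
    """Duplicate minority rows until both classes have the same size."""
    X_min, X_maj = X[y == 1], X[y == 0]
    X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=seed)
    return (np.vstack([X_maj, X_up]),
            np.concatenate([np.zeros(len(X_maj), dtype=int),
                            np.ones(len(X_up), dtype=int)]))


def random_undersample(X, y, seed=0):
    """Drop majority rows until both classes have the same size."""
    X_min, X_maj = X[y == 1], X[y == 0]
    X_down = resample(X_maj, replace=False, n_samples=len(X_min), random_state=seed)
    return (np.vstack([X_down, X_min]),
            np.concatenate([np.zeros(len(X_down), dtype=int),
                            np.ones(len(X_min), dtype=int)]))


def validation_f2(X_tr, y_tr, minority_weight=1):
    """Fit a random forest with the given class weight; return F2 on validation."""
    clf = RandomForestClassifier(n_estimators=50, random_state=0,
                                 class_weight={0: 1, 1: minority_weight})
    clf.fit(X_tr, y_tr)
    return fbeta_score(y_val, clf.predict(X_val), beta=2, zero_division=0)


# Step 1: compare resampling candidates on the validation set.
resamplers = {"oversample": random_oversample, "undersample": random_undersample}
step1_scores = {}
for name, fn in resamplers.items():
    X_res, y_res = fn(X_train, y_train)  # resample the training split only
    step1_scores[name] = validation_f2(X_res, y_res)
best_name = max(step1_scores, key=step1_scores.get)

# Step 2: apply cost-sensitive learning on the best resampled data,
# searching a small hypothetical grid of minority-class weights.
X_best, y_best = resamplers[best_name](X_train, y_train)
weights = [1, 2, 5, 10]
step2_scores = {w: validation_f2(X_best, y_best, minority_weight=w) for w in weights}
best_weight = max(step2_scores, key=step2_scores.get)
best_f2 = step2_scores[best_weight]

print(f"best resampler: {best_name}, best minority weight: {best_weight}, "
      f"validation F2: {best_f2:.3f}")
```

Resampling is applied to the training split only, so the validation set keeps the original class distribution and the F2 scores remain comparable across candidates.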
