HS-CGK: A Hybrid Sampling Method for Imbalance Data Based on Conditional Tabular Generative Adversarial Network and K-Nearest Neighbor Algorithm

Authors

  • Xiaoyan Zhao, School of Information and Electronic Engineering, Shandong Technology and Business University, 264005 Yantai, China
  • Shaopeng Guan, School of Information and Electronic Engineering, Shandong Technology and Business University, 264005 Yantai, China
  • Yuewei Xue, School of Information and Electronic Engineering, Shandong Technology and Business University, 264005 Yantai, China
  • Hao Pan, School of Information and Electronic Engineering, Shandong Technology and Business University, 264005 Yantai, China

DOI:

https://doi.org/10.31577/cai_2024_1_213

Keywords:

Imbalanced data, class overlap, conditional tabular generative adversarial network, K-nearest neighbor algorithm, hybrid sampling

Abstract

Class imbalance in datasets can bias classification decisions toward majority-class samples. In addition, class overlap blurs classification boundaries and degrades the performance of classification algorithms. To address these issues, we propose a hybrid sampling method based on the conditional tabular generative adversarial network (CTGAN) and the K-nearest neighbor (KNN) algorithm. First, we introduce a CTGAN-based oversampling algorithm named DB-CTGAN. This algorithm filters out noisy and boundary samples using the density-based spatial clustering of applications with noise (DBSCAN) algorithm and then uses CTGAN to generate synthetic samples that conform to the real data distribution. Second, we combine the expanded set of fraudulent (minority-class) samples produced by DB-CTGAN with the normal (majority-class) samples and apply the KNN overlap undersampling algorithm to remove the samples in the overlap region, which addresses the class overlap problem. Experimental results on five real datasets show that, compared with eight sampling methods and evaluated with four standard classification models (Random Forest, Decision Tree, Support Vector Classification, and XGBoost), the proposed method significantly improves the F1, AUC, and G-mean metrics.
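The undersampling step and the G-mean metric mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the exact removal rule, so the code assumes that a majority-class sample lies in the overlap region when at least one of its k nearest neighbors belongs to the minority class. The function names `knn_overlap_undersample` and `g_mean`, the parameter `k`, and the label encoding (0 for normal, 1 for fraudulent) are hypothetical and chosen only for illustration. G-mean is computed in its usual form, the square root of sensitivity times specificity.

```python
import math


def knn_overlap_undersample(X, y, k=3, majority_label=0):
    """Drop majority samples that have a minority sample among their k nearest neighbors.

    Assumed rule (not taken from the paper): a majority sample is treated as
    overlapping if any of its k nearest neighbors carries a different label.
    Minority samples are always kept.
    """
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority_label:
            keep.append(i)
            continue
        # Euclidean distance from sample i to every other sample, nearest first.
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbor_labels = [y[j] for _, j in dists[:k]]
        if all(label == majority_label for label in neighbor_labels):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (recall on positives) and specificity."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy data: four normal samples in one cluster, three fraudulent samples in
# another, and one normal sample sitting inside the fraudulent cluster.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (4.8, 5.0),
     (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
y = [0, 0, 0, 0, 0, 1, 1, 1]

X_clean, y_clean = knn_overlap_undersample(X, y, k=3, majority_label=0)
```

In this toy example, only the normal sample at (4.8, 5.0) is removed, because all three of its nearest neighbors are fraudulent. In a full pipeline the DBSCAN filtering and CTGAN generation described in the abstract would run before this step, and they are not reproduced here.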

Published

2024-04-29

How to Cite

Zhao, X., Guan, S., Xue, Y., & Pan, H. (2024). HS-CGK: A Hybrid Sampling Method for Imbalance Data Based on Conditional Tabular Generative Adversarial Network and K-Nearest Neighbor Algorithm. COMPUTING AND INFORMATICS, 43(1), 213–239. https://doi.org/10.31577/cai_2024_1_213