An in-depth performance analysis of the oversampling techniques for high-class imbalanced dataset

Prasetyo Wibowo, Chastine Fatichah

Research output: Contribution to journal › Article › peer-review

9 Citations (Scopus)

Abstract

Class imbalance occurs when the majority and minority classes are not equally represented in a dataset. The degree of imbalance may range from mild to severe. High-class imbalance can distort the overall classification accuracy, because the model is likely to assign most of the data to the majority class. Such a model gives biased results, and its performance on the minority class often has little effect on the overall score. Oversampling is one way to deal with high-class imbalance, but only a few oversampling techniques are commonly used to address data imbalance. This study presents an in-depth performance analysis of oversampling techniques for the high-class imbalance problem. Oversampling balances the data in each class so that the evaluation results of the model are unbiased. We compared the performance of Random Oversampling (ROS), ADASYN, SMOTE, and Borderline-SMOTE, each combined with machine learning methods such as Random Forest, Logistic Regression, and k-Nearest Neighbor (KNN). The test results show that Random Forest with Borderline-SMOTE performs best among all the oversampling combinations, with an accuracy of 0.9997, precision of 0.9474, recall of 0.8571, F1-score of 0.9000, ROC AUC of 0.9388, and PR AUC of 0.8581.
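
As a rough illustration of the two basic ideas behind the techniques compared in the abstract, the following Python sketch shows random oversampling (duplicating minority rows) and a simplified SMOTE-style interpolation (creating synthetic minority rows between neighbours). It is not the authors' implementation: the function names `random_oversample` and `smote_like`, the toy data, and the choice `k=2` are assumptions made for this example. ADASYN and Borderline-SMOTE differ mainly in which minority points they choose to interpolate from, and are not shown.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate rows of smaller classes at random until all classes
    have as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    random minority point and one of its k nearest minority neighbours.
    A simplified sketch of the SMOTE idea, not a full implementation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Other minority points, sorted by squared Euclidean distance to base
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy data (hypothetical): 8 majority rows (label 0) and 3 minority rows (label 1)
X = [(float(i), float(i % 3)) for i in range(8)]
X += [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
y = [0] * 8 + [1] * 3

# Random oversampling: the minority class is padded with duplicates
X_ros, y_ros = random_oversample(X, y)

# SMOTE-style oversampling: the minority class is padded with synthetic rows
minority = [x for x, lab in zip(X, y) if lab == 1]
X_syn = smote_like(minority, n_new=5)
X_smote, y_smote = X + X_syn, y + [1] * len(X_syn)

print(Counter(y_ros), Counter(y_smote))
```

In both cases only the training data would normally be resampled; evaluating on an untouched test split keeps metrics such as precision, recall, and PR AUC meaningful for the minority class.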

Original language: English
Pages (from-to): 63-71
Number of pages: 9
Journal: Register: Jurnal Ilmiah Teknologi Sistem Informasi
Volume: 7
Issue number: 1
Publication status: Published - 2021

Keywords

  • Classification
  • Imbalanced dataset
  • Oversampling
  • Performance analysis
