TY - JOUR
T1 - An in-depth performance analysis of the oversampling techniques for high-class imbalanced dataset
AU - Wibowo, Prasetyo
AU - Fatichah, Chastine
N1 - Publisher Copyright:
© 2021, the author(s).
PY - 2021
Y1 - 2021
N2 - Class imbalance occurs when the distribution of classes between the majority and the minority classes is not the same. The degree of imbalance may vary from mild to severe. High-class imbalance may affect the overall classification accuracy, since the model is most likely to predict the data that fall within the majority class. Such a model gives biased results, and its predictions for the minority class often have little effect on the overall performance. Oversampling is one way to deal with high-class imbalance, but only a few techniques have been applied to this problem. This study aims at an in-depth performance analysis of oversampling techniques for addressing the high-class imbalance problem. Oversampling balances the data of each class so that modeling provides unbiased evaluation results. We compare the performance of the Random Oversampling (ROS), ADASYN, SMOTE, and Borderline-SMOTE techniques. Each oversampling technique is combined with machine learning methods such as Random Forest, Logistic Regression, and k-Nearest Neighbor (KNN). The test results show that Random Forest with Borderline-SMOTE gives the best results among all the oversampling techniques, with an accuracy of 0.9997, precision of 0.9474, recall of 0.8571, F1-score of 0.9000, ROC-AUC of 0.9388, and PR-AUC of 0.8581.
KW - Classification
KW - Imbalanced dataset
KW - Oversampling
KW - Performance analysis
UR - http://www.scopus.com/inward/record.url?scp=85102240982&partnerID=8YFLogxK
U2 - 10.26594/register.v7i1.2206
DO - 10.26594/register.v7i1.2206
M3 - Article
AN - SCOPUS:85102240982
SN - 2503-0477
VL - 7
SP - 63
EP - 71
JO - Register: Jurnal Ilmiah Teknologi Sistem Informasi
JF - Register: Jurnal Ilmiah Teknologi Sistem Informasi
IS - 1
ER -
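
A minimal sketch of the evaluation pipeline the abstract describes, assuming the imbalanced-learn and scikit-learn APIs; the synthetic dataset, split, and default hyperparameters are placeholders for illustration, not the paper's actual setup or code.

# Hypothetical sketch (assumed libraries: imbalanced-learn, scikit-learn).
# Oversample the training split only, then fit each classifier and report
# the metrics named in the abstract.
from imblearn.over_sampling import RandomOverSampler, ADASYN, SMOTE, BorderlineSMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

# Synthetic, highly imbalanced data standing in for the paper's dataset.
X, y = make_classification(n_samples=20000, weights=[0.995], flip_y=0,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

samplers = {"ROS": RandomOverSampler(), "ADASYN": ADASYN(),
            "SMOTE": SMOTE(), "Borderline-SMOTE": BorderlineSMOTE()}
models = {"Random Forest": RandomForestClassifier(),
          "Logistic Regression": LogisticRegression(max_iter=1000),
          "KNN": KNeighborsClassifier()}

for s_name, sampler in samplers.items():
    # Balance the classes in the training data only.
    X_bal, y_bal = sampler.fit_resample(X_tr, y_tr)
    for m_name, model in models.items():
        model.fit(X_bal, y_bal)
        pred = model.predict(X_te)
        proba = model.predict_proba(X_te)[:, 1]
        print(s_name, m_name,
              round(accuracy_score(y_te, pred), 4),
              round(precision_score(y_te, pred), 4),
              round(recall_score(y_te, pred), 4),
              round(f1_score(y_te, pred), 4),
              round(roc_auc_score(y_te, proba), 4),              # ROC-AUC
              round(average_precision_score(y_te, proba), 4))    # PR-AUC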