TY - GEN
T1 - Analyzing Oversampling and Machine Learning Approaches for Imbalanced Dataset Classification
AU - Navastara, Dini Adni
AU - Fatichah, Chastine
AU - Niza, Yulia
AU - Sari, Fiqey Indriati Eka
AU - Jalil, Muchamad Maroqi Abdul
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
AB - Imbalanced data, characterized by a substantial difference in data distribution between majority and minority classes, poses a critical challenge in predictive modeling. This disparity often leads to misclassification of the minority class, which may contain vital information for real-world applications. Addressing imbalanced data is therefore essential in critical classification scenarios. In this study, we analyzed oversampling and ensemble learning techniques for mitigating imbalanced data issues, using confusion matrices to measure the performance of these methods across several binary datasets. The best results include a recall of 0.6883 on the Haberman's Survival dataset using KNN classification with Borderline-SMOTE oversampling, a recall of 0.8391 on the COVID19 dataset using KNN classification with SMOTE oversampling, and a recall of 0.9476 on the Credit Card Fraud dataset using XGBoost classification with ROS oversampling. Performance outcomes depend on the characteristics of each dataset. This study provides insights for handling imbalanced data based on dataset characteristics, aiding predictive modeling in real-world scenarios.
KW - Classification
KW - Ensemble Learning
KW - Imbalanced Data
KW - Machine Learning
KW - Oversampling
UR - http://www.scopus.com/inward/record.url?scp=85197664251&partnerID=8YFLogxK
U2 - 10.1109/SCOReD60679.2023.10563710
DO - 10.1109/SCOReD60679.2023.10563710
M3 - Conference contribution
AN - SCOPUS:85197664251
T3 - 2023 IEEE 21st Student Conference on Research and Development, SCOReD 2023
SP - 27
EP - 32
BT - 2023 IEEE 21st Student Conference on Research and Development, SCOReD 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 21st IEEE Student Conference on Research and Development, SCOReD 2023
Y2 - 13 December 2023 through 14 December 2023
ER -