TY - GEN
T1 - Performance Analysis of Resampling and Ensemble Learning Methods on Diabetes Detection as Imbalanced Dataset
AU - Sari, Fiqey Indriati Eka
AU - Edlim, Frederick William
AU - Ramadhan, Fitrah Arie
AU - Muhtadin
AU - Navastara, Dini Adni
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
AB - Early detection of diabetes is essential to reducing its high mortality rate. Detection can be approached by estimating the likelihood of diabetes from the variables recorded in patient data. Diagnosing patients from medical data is challenging because such data are usually imbalanced, with negative cases severely outnumbering positive cases. To handle the imbalanced data, this paper designs an algorithm that combines resampling techniques with ensemble learning. The oversampling techniques are ADASYN, ROS, and SMOTE; the undersampling techniques are RUS, Tomek, and ENN. The combined techniques SMOTE-ENN and SMOTE-Tomek are also applied to the highly imbalanced diabetes dataset. The ensemble learning algorithms used are Random Forest, Bagging, AdaBoost, and XGBoost. In the experiments, the best performance is obtained with SMOTE-ENN and AdaBoost, which reaches a recall of 0.7330, although the F1 score of this model is 0.6459. The AdaBoost classifier also gives good and stable results across the various resampling techniques. With SMOTE-ENN, the recall of the model increased by 0.1819 and the F1 score decreased by 0.2000 relative to the original model. In medical diagnosis, higher sensitivity (recall) matters more than the F1 score because it reflects how many patients with the disease are correctly identified.
KW - diabetes
KW - ensemble learning
KW - imbalanced dataset
KW - resampling
UR - http://www.scopus.com/inward/record.url?scp=85142417051&partnerID=8YFLogxK
U2 - 10.1109/ICVEE57061.2022.9930467
DO - 10.1109/ICVEE57061.2022.9930467
M3 - Conference contribution
AN - SCOPUS:85142417051
T3 - 2022 5th International Conference on Vocational Education and Electrical Engineering: The Future of Electrical Engineering, Informatics, and Educational Technology Through the Freedom of Study in the Post-Pandemic Era, ICVEE 2022 - Proceeding
SP - 1
EP - 5
BT - 2022 5th International Conference on Vocational Education and Electrical Engineering
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 5th International Conference on Vocational Education and Electrical Engineering, ICVEE 2022
Y2 - 10 September 2022 through 11 September 2022
ER -