TY - GEN
T1 - Diabetes Prediction in Machine Learning using Feature Selection
T2 - 7th International Conference on Informatics and Computational Sciences, ICICoS 2024
AU - Asy'ari, Zulchair
AU - Hidayati, Shintami Chusnul
AU - Sarno, Riyanarto
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Diabetes poses a global threat, impacting both patient well-being and healthcare resources. Preventing diabetes has become a key focus in mitigating its impact in the medical field. This research leverages machine learning to predict diabetes in patients by utilizing a dataset of diabetic and non-diabetic individuals to identify patterns indicative of diabetes. The study uses the Diabetes 130-US hospitals dataset, covering the years 1999-2008, and analyzes its features, applies data preprocessing, selects relevant features, and addresses data imbalance through various sampling techniques to enhance prediction accuracy. The machine learning models employed in this research include Logistic Regression, K-Nearest Neighbors (KNN), Random Forest, and Support Vector Machine (SVM). The findings highlight the importance of feature selection in optimizing model performance, identifying key features to improve accuracy, and assessing the effectiveness of each model. A comparative analysis of the models using different sampling techniques demonstrates the exceptional performance of the Random Forest model, achieving 99.8% accuracy with the raw dataset. However, Logistic Regression shows potential for improvement, as its performance increases with the combination of various techniques, indicating its value for future enhancement of predictive capabilities.
KW - comparative analysis
KW - diabetes prediction
KW - feature selection
KW - machine learning
KW - sampling technique
UR - http://www.scopus.com/inward/record.url?scp=85202850054&partnerID=8YFLogxK
U2 - 10.1109/ICICoS62600.2024.10636933
DO - 10.1109/ICICoS62600.2024.10636933
M3 - Conference contribution
AN - SCOPUS:85202850054
T3 - Proceedings - International Conference on Informatics and Computational Sciences
SP - 221
EP - 226
BT - 2024 7th International Conference on Informatics and Computational Sciences, ICICoS 2024
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 17 July 2024 through 18 July 2024
ER -