TY - JOUR
T1 - Enhancing Prediction Accuracy in an Imbalanced Dataset of Dengue Infection Cases Using a Two-layer Ensemble Outlier Detection and Feature Selection Technique
AU - Fahmi, Amiq
AU - Purwitasari, Diana
AU - Sumpeno, Surya
AU - Purnomo, Mauridhi Hery
N1 - Publisher Copyright: © 2024, Intelligent Network and Systems Society. All Rights Reserved.
PY - 2024
Y1 - 2024
AB - Real-world datasets frequently contain considerable noise, which gives rise to outliers. Detecting and removing outliers in large, imbalanced datasets is a challenging research problem in machine learning, especially in healthcare, where accurate prediction is essential. Handling outliers properly is therefore important, because their presence in classification datasets makes predictive modelling harder and less accurate. This study proposes a method to enhance prediction accuracy on an imbalanced real-world health dataset of dengue infection cases. First, a two-layer ensemble method called IsFLOF, which combines an isolation forest (IsF) and a local outlier factor (LOF), is used to find and eliminate both global and local outliers. This approach addresses the limitations of each algorithm: IsF is sensitive to global outliers but tends to miss local ones, while LOF excels at local outlier detection but has high computational complexity. Second, once outlier removal has produced a dataset with correctly measured value distributions, resampling is applied to prevent prediction bias caused by class imbalance in the multi-class setting. Insignificant features are then filtered out to further refine the dataset. Finally, eight machine learning algorithms are used to test the robustness and effectiveness of the proposed method. The experimental results show that the AdaBoost classifier, combined with features selected by the Fast Correlation-Based Filter (FCBF), achieved 93.5% and 95.1% accuracy in training and testing, respectively. The proposed method is also tested and compared with recent methods in a broader setting, including on a public dataset of imbalanced hypothyroid cases. It achieved higher and more acceptable prediction accuracy than was obtained with the original and synthetic data.
KW - Classification accuracy
KW - Dengue infection cases
KW - Feature selection
KW - Imbalanced dataset
KW - Outlier detection
KW - Resampling
UR - http://www.scopus.com/inward/record.url?scp=85188175562&partnerID=8YFLogxK
U2 - 10.22266/ijies2024.0430.44
DO - 10.22266/ijies2024.0430.44
M3 - Article
AN - SCOPUS:85188175562
SN - 2185-310X
VL - 17
SP - 544
EP - 560
JO - International Journal of Intelligent Engineering and Systems
JF - International Journal of Intelligent Engineering and Systems
IS - 2
ER -