Enhancing Prediction Accuracy in an Imbalanced Dataset of Dengue Infection Cases Using a Two-layer Ensemble Outlier Detection and Feature Selection Technique

Amiq Fahmi, Diana Purwitasari, Surya Sumpeno, Mauridhi Hery Purnomo*

*Corresponding author for this work

Research output: Contribution to journal > Article > peer-review

Abstract

Real-world datasets frequently contain considerable noise, which gives rise to outliers. Detecting and removing outliers in large, imbalanced datasets is a challenging and active research problem in machine learning, especially in healthcare, where accurate prediction matters. Outliers must be handled properly, because their presence in classification datasets makes predictive modelling more difficult and less accurate. This study proposes methods to enhance prediction accuracy in an imbalanced real-world health dataset of dengue infection cases. First, a two-layer ensemble method called IsFLOF, which combines an isolation forest (IsF) and a local outlier factor (LOF), is used to find and eliminate both global and local outliers. This approach overcomes the limitations of each algorithm used alone: IsF is sensitive to global outliers but tends to miss local ones, while LOF excels at local outlier detection but has high computational complexity. Second, once eliminating the outliers has produced a dataset with correctly measured value distributions, a resampling process is applied to prevent prediction bias caused by imbalanced instance counts in the multi-class setting. Insignificant features are then filtered out to refine the dataset further. Finally, eight machine learning algorithms are used to test the robustness and effectiveness of the proposed method. The experimental results show that the AdaBoost classifier, combined with features selected by the Fast Correlation-Based Filter (FCBF), achieved 93.5% and 95.1% accuracy in training and testing, respectively. In a broader context, the proposed method is also tested and compared with recent methods, including on a public dataset of imbalanced hypothyroid cases. It showed higher, more acceptable prediction accuracy than was obtained with the original and synthetic data.
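The two-layer idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' IsFLOF implementation: the abstract does not specify how the layers are combined, so the ordering (IsF first, then LOF on the surviving rows), the `contamination` and `n_neighbors` values, the function name `two_layer_outlier_filter`, and the synthetic data are all assumptions made for illustration. It uses scikit-learn's `IsolationForest` and `LocalOutlierFactor`, both of which label outliers as -1 in `fit_predict`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor


def two_layer_outlier_filter(X, contamination=0.05, n_neighbors=20, random_state=0):
    """Return a boolean mask of the rows kept after both layers.

    Hypothetical helper: the layer order and the parameter defaults
    are assumptions, not values taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    keep = np.ones(len(X), dtype=bool)

    # Layer 1 (assumed order): the isolation forest flags global outliers as -1.
    isf = IsolationForest(contamination=contamination, random_state=random_state)
    keep[isf.fit_predict(X) == -1] = False

    # Layer 2: LOF runs only on the survivors and flags local outliers as -1.
    survivors = np.flatnonzero(keep)
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
    keep[survivors[lof.fit_predict(X[survivors]) == -1]] = False
    return keep


# Synthetic stand-in data: one dense cluster plus two obviously distant points.
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(200, 2))
far_points = np.array([[12.0, 12.0], [-11.0, 13.0]])
X = np.vstack([cluster, far_points])

mask = two_layer_outlier_filter(X)
print(f"kept {mask.sum()} of {len(X)} rows")
```

Running LOF only on the rows that survive the first layer keeps the more expensive neighbour search on a smaller set, which is one way to address the complexity concern the abstract raises. Resampling and feature selection would follow on `X[mask]`.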

Original language: English
Pages (from-to): 544-560
Number of pages: 17
Journal: International Journal of Intelligent Engineering and Systems
Volume: 17
Issue number: 2
Publication status: Published - 2024

Keywords

  • Classification accuracy
  • Dengue infection cases
  • Feature selection
  • Imbalanced dataset
  • Outlier detection
  • Resampling
