Noise-free sampling with majority framework for an imbalanced classification problem

Neni Alya Firdausanti*, Israel Mendonça, Masayoshi Aritsugi

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Class imbalance has been widely accepted as a significant factor that negatively impacts a machine learning classifier’s performance. One of the techniques to avoid this problem is to balance the data distribution by using sampling-based approaches, in which synthetic data is generated using the probability distribution of the classes. However, this process is sensitive to the presence of noise in the data, and the boundaries between the majority class and the minority class are blurred. Such phenomena shift the algorithm’s decision boundary away from the ideal outcome. In this work, we propose a hybrid framework with two primary objectives. The first is to address class distribution imbalance by synthetically increasing the data of the minority class, and the second is to devise an efficient noise reduction technique that improves the class-balancing algorithm. The proposed framework focuses on removing noisy elements from the majority class and, by doing so, provides more accurate information to the subsequent synthetic data generator algorithm. To evaluate the effectiveness of our framework, we employ the geometric mean (G-mean) as the evaluation metric. The experimental results show that our framework is capable of improving the prediction G-mean for eight classifiers across eleven datasets. Improvements range from 7.78% on the Loan dataset to 67.45% on the Abalone19_vs_10-11-12-13 dataset.
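
For readers unfamiliar with the metric, the following is a minimal sketch of the G-mean for a binary problem, using the standard definition as the square root of sensitivity times specificity. The function name `g_mean`, the `positive` parameter, and the toy labels are assumptions made for illustration; they are not taken from the paper, and the paper's own evaluation code may differ.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels.

    `positive` marks the minority class label (an illustrative assumption).
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Recall on each class; defined as 0.0 when a class is absent.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy data: 8 majority samples (0) and 2 minority samples (1).
y_true = [0] * 8 + [1] * 2
always_majority = [0] * 10                     # never predicts the minority class
one_hit = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]       # one false positive, one true positive

score_majority = g_mean(y_true, always_majority)
score_one_hit = g_mean(y_true, one_hit)
```

A classifier that always predicts the majority class is 80% accurate on this toy data yet has a G-mean of zero, because its minority-class recall is zero. This is why the G-mean is a common choice for imbalanced problems.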

Original language: English
Pages (from-to): 4011-4042
Number of pages: 32
Journal: Knowledge and Information Systems
Volume: 66
Issue number: 7
Publication status: Published - Jul 2024
Externally published: Yes

Keywords

  • Classification
  • Imbalance
  • Machine learning
  • Noisy data
  • Synthetic oversampling
