TY - JOUR
T1 - Noise-free sampling with majority framework for an imbalanced classification problem
AU - Firdausanti, Neni Alya
AU - Mendonça, Israel
AU - Aritsugi, Masayoshi
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
PY - 2024/7
Y1 - 2024/7
N2 - Class imbalance has been widely accepted as a significant factor that negatively impacts a machine learning classifier’s performance. One of the techniques to avoid this problem is to balance the data distribution by using sampling-based approaches, in which synthetic data is generated using the probability distribution of the classes. However, this process is sensitive to the presence of noise in the data, and the boundaries between the majority class and the minority class are blurred. Such phenomena shift the algorithm’s decision boundary away from the ideal outcome. In this work, we propose a hybrid framework for two primary objectives. The first objective is to address class distribution imbalance by synthetically increasing the data of a minority class, and the second objective is, to devise an efficient noise reduction technique that improves the class balance algorithm. The proposed framework focuses on removing noisy elements from the majority class, and by doing so, provides more accurate information to the subsequent synthetic data generator algorithm. To evaluate the effectiveness of our framework, we employ the geometric mean (G-mean) as the evaluation metric. The experimental results show that our framework is capable of improving the prediction G-mean for eight classifiers across eleven datasets. The range of improvements varies from 7.78% on the Loan dataset to 67.45% on the Abalone19_vs_10-11-12-13 dataset.
AB - Class imbalance has been widely accepted as a significant factor that negatively impacts a machine learning classifier’s performance. One of the techniques to avoid this problem is to balance the data distribution by using sampling-based approaches, in which synthetic data is generated using the probability distribution of the classes. However, this process is sensitive to the presence of noise in the data, and the boundaries between the majority class and the minority class are blurred. Such phenomena shift the algorithm’s decision boundary away from the ideal outcome. In this work, we propose a hybrid framework for two primary objectives. The first objective is to address class distribution imbalance by synthetically increasing the data of a minority class, and the second objective is, to devise an efficient noise reduction technique that improves the class balance algorithm. The proposed framework focuses on removing noisy elements from the majority class, and by doing so, provides more accurate information to the subsequent synthetic data generator algorithm. To evaluate the effectiveness of our framework, we employ the geometric mean (G-mean) as the evaluation metric. The experimental results show that our framework is capable of improving the prediction G-mean for eight classifiers across eleven datasets. The range of improvements varies from 7.78% on the Loan dataset to 67.45% on the Abalone19_vs_10-11-12-13 dataset.
KW - Classification
KW - Imbalance
KW - Machine learning
KW - Noisy data
KW - Synthetic oversampling
UR - http://www.scopus.com/inward/record.url?scp=85190303709&partnerID=8YFLogxK
U2 - 10.1007/s10115-024-02079-6
DO - 10.1007/s10115-024-02079-6
M3 - Article
AN - SCOPUS:85190303709
SN - 0219-1377
VL - 66
SP - 4011
EP - 4042
JO - Knowledge and Information Systems
JF - Knowledge and Information Systems
IS - 7
ER -