Abstract
SMOTE is a well-known algorithm for balancing training data by adding synthetic samples to the minority class. One stage of SMOTE finds the k nearest neighbors (kNN) of each minority sample, using Euclidean distance, as the basis for creating synthetic data. When a small number of attributes have a higher correlation value than the others, a kNN search using Euclidean distance that ignores this correlation may not find representative neighbors. This paper introduces AWH-SMOTE (Attribute Weighted and kNN Hub on SMOTE), which enhances SMOTE by improving neighbor and noise identification through attribute weighting, and by improving the selective sampling method using occurrence data in the kNN hub. The Wojna and Information Gain methods are used for attribute weighting. A small number of occurrences in the kNN hub results in more synthetic data being generated, so that minority data in dangerous regions are better represented. Nine public datasets from the KEEL repository are used to evaluate AWH-SMOTE. The evaluation shows that AWH-SMOTE performs better than other oversampling algorithms on minority precision and minority F-measure under both pruned and unpruned conditions. Information Gain as the attribute weighting method in AWH-SMOTE achieves the best performance in the unpruned condition, compared with other weighting methods, for minority recall, minority precision, and minority F-measure.
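
The two ideas named in the abstract, an attribute-weighted neighbor search and an occurrence count in the kNN hub, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the example weights, and in particular the allocation rule `1 / (1 + occurrences)` are assumptions chosen only to show that fewer hub occurrences lead to more synthetic samples. In AWH-SMOTE the weights would come from the Wojna or Information Gain methods, which are not reproduced here.

```python
import numpy as np

def weighted_knn(X, weights, k):
    """Indices of each row's k nearest neighbors under weighted Euclidean distance."""
    # Scaling each attribute by sqrt(w_j) yields sum_j w_j * (a_j - b_j)^2.
    Xw = X * np.sqrt(weights)
    diff = Xw[:, None, :] - Xw[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

def hub_occurrences(neighbors, n):
    """How often each point appears in the kNN lists of the other points."""
    return np.bincount(neighbors.ravel(), minlength=n)

def synthetic_allocation(occurrences, total):
    """Hypothetical rule (assumption): share proportional to 1 / (1 + occurrences)."""
    share = 1.0 / (1.0 + occurrences)
    # Rounded per point, so the counts may not sum exactly to `total`.
    return np.rint(total * share / share.sum()).astype(int)

# Toy data: the second attribute is treated as uninformative (weight 0).
X = np.array([[0.0, 0.0], [0.1, 10.0], [3.0, 0.0]])
neighbors = weighted_knn(X, weights=np.array([1.0, 0.0]), k=1)
occurrences = hub_occurrences(neighbors, n=len(X))
allocation = synthetic_allocation(occurrences, total=11)
```

With equal weights, the nearest neighbor of the first point is the third point; with the second attribute weighted to zero it becomes the second point, which is the kind of change in neighborhood that attribute weighting is meant to produce.
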
Original language | English |
---|---|
Pages (from-to) | 423-444 |
Number of pages | 22 |
Journal | International Journal of Innovative Computing, Information and Control |
Volume | 15 |
Issue number | 2 |
Publication status | Published - Apr 2019 |
Keywords
- AWH-SMOTE
- Attribute weighting
- Information gain
- Noise
- Wojna
- kNN hub