Enhancing the performance of smote algorithm by using attribute weighting scheme and new selective sampling method for imbalanced data set

Research output: Contribution to journal › Article › peer-review

19 Citations (Scopus)

Abstract

SMOTE is a well-known algorithm for balancing training data by adding synthetic data to the minority class. One of the stages in SMOTE is finding the k nearest neighbors (kNN), using Euclidean distance, as the basis for creating synthetic data. When a small number of attributes have higher correlation values than the others, finding kNN with Euclidean distance without considering this correlation may not yield representative neighbors. This paper introduces AWH-SMOTE (Attribute Weighted and kNN Hub on SMOTE), which enhances SMOTE by improving neighbor and noise identification through attribute weighting, and by improving the selective sampling method using occurrence data in the kNN hub. The Wojna and Information Gain methods are used for attribute weighting. A small number of occurrences in the kNN hub results in more synthetic data being generated, so that minority data in dangerous regions are better represented. Nine public datasets from the KEEL repository are used to evaluate AWH-SMOTE. The evaluation shows that AWH-SMOTE performs better than other oversampling algorithms on minority precision and minority F-measure in both pruned and unpruned conditions. Information Gain as the attribute weighting method in AWH-SMOTE achieves the best performance in the unpruned condition, compared with other weighting methods, for minority recall, minority precision, and minority F-measure.
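The two ideas in the abstract, attribute-weighted neighbor search and hub-driven allocation of synthetic samples, can be illustrated with a minimal Python sketch. This is not the authors' AWH-SMOTE implementation. The function names, the inverse-occurrence allocation rule, and the restriction of the neighbor search to minority samples are all assumptions made for illustration. The attribute weights are taken as given (for example, precomputed Information Gain scores), and noise identification is omitted.

```python
import numpy as np

def weighted_knn(X, weights, k):
    """Indices of the k nearest neighbours of each row under a
    weighted Euclidean distance: sum_j w_j * (a_j - b_j)^2."""
    Xw = X * np.sqrt(weights)  # scaling by sqrt(w) yields the weighted distance
    d2 = ((Xw[:, None, :] - Xw[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    return np.argsort(d2, axis=1)[:, :k]

def hub_counts(neighbors, n):
    """How often each point occurs in the k-NN lists of the other points."""
    return np.bincount(neighbors.ravel(), minlength=n)

def allocate(counts, n_synthetic):
    """ASSUMED rule (not from the paper): each point's share of synthetic
    samples is inversely proportional to 1 + its occurrence count."""
    share = 1.0 / (1.0 + counts)
    share /= share.sum()
    alloc = np.floor(share * n_synthetic).astype(int)
    remainder = int(n_synthetic - alloc.sum())
    order = np.argsort(-share, kind="stable")
    alloc[order[:remainder]] += 1  # hand leftovers to the largest shares
    return alloc

def weighted_smote(X_min, weights, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by SMOTE-style interpolation
    between a point and one of its weighted nearest neighbours.
    Requires k < len(X_min)."""
    rng = np.random.default_rng(seed)
    nbrs = weighted_knn(X_min, weights, k)
    alloc = allocate(hub_counts(nbrs, len(X_min)), n_synthetic)
    synthetic = []
    for i, m in enumerate(alloc):
        for _ in range(m):
            j = rng.choice(nbrs[i])  # pick one neighbour at random
            gap = rng.random()       # interpolation factor in [0, 1)
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Hypothetical minority samples and attribute weights, for illustration only.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [1.0, 1.0], [0.5, 0.5], [5.0, 5.0]])
weights = np.array([0.8, 0.2])
new_points = weighted_smote(X_min, weights, n_synthetic=10, k=2)
```

In this sketch, a point that rarely appears in other points' neighbor lists receives a larger share of the synthetic samples, which mirrors the abstract's statement that few hub occurrences lead to more synthetic data.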

Original language: English
Pages (from-to): 423-444
Number of pages: 22
Journal: International Journal of Innovative Computing, Information and Control
Volume: 15
Issue number: 2
Publication status: Published - Apr 2019

Keywords

  • AWH-SMOTE
  • Attribute weighting
  • Information gain
  • Noise
  • Wojna
  • kNN hub
