TY - JOUR
T1 - Comparison between selective sampling and random undersampling for classification of customer defection using support vector machine
AU - Kuswanto, Heri
AU - Sarumaha, Yogi
AU - Ohwada, Hayato
N1 - Publisher Copyright: © 2017 Heri Kuswanto, Yogi Sarumaha and Hayato Ohwada.
PY - 2017
Y1 - 2017
AB - The quality of a product determines customer loyalty, which can be measured by conducting a survey. Company 'X', which sells three kinds of product (low, medium and high price), collected a very large dataset through an online survey and recorded customer defection and customer characteristics. The measured variables are Update Accumulation, Product Price, Customer Type, Delivery Status and Customer Defection. The data have an imbalanced response, which can give a misleading classification accuracy if analyzed using standard approaches. Selective Sampling (SS) and Random Undersampling (RU) were applied to draw a sample from the imbalanced response in order to obtain more balanced data, and a Support Vector Machine (SVM) was then applied to classify the sampled data. The performance of SS-SVM and RU-SVM in classifying the sampled data was evaluated and compared with the result of classifying the raw dataset. RU yields an exactly balanced (50%:50%) response class, while SS reduces the imbalance significantly (to around 52%:48%). Nevertheless, SS-SVM outperforms RU-SVM in efficiency: it shortens the classification process by 3 to 20 h relative to RU-SVM, with a slightly different accuracy rate. Moreover, SS-SVM preserves the basic characteristics of the raw data better than RU-SVM.
KW - Defection
KW - Imbalance
KW - SVM
KW - Sampling
UR - http://www.scopus.com/inward/record.url?scp=85029762860&partnerID=8YFLogxK
U2 - 10.3844/jcssp.2017.355.362
DO - 10.3844/jcssp.2017.355.362
M3 - Article
AN - SCOPUS:85029762860
SN - 1549-3636
VL - 13
SP - 355
EP - 362
JO - Journal of Computer Science
JF - Journal of Computer Science
IS - 8
ER -