TY - GEN
T1 - Feature Selection and Sensitivity Analysis of Oversampling in Big and Highly Imbalanced Bank's Credit Data
AU - Kurniawan, Aznovri
AU - Rifa'i, Ahmad
AU - Nafis, Moch Abdillah
AU - Andriaswuri, Nimas Sefrida
AU - Patria, Harry
AU - Purwitasari, Diana
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
AB - Machine learning has evolved into a multidisciplinary field over the last few years and has gained popularity in big data analytics, including in the banking industry. Numerous methods can be used for predictive analytics through supervised machine learning, for either regression or classification problems. In the banking industry, credit quality is a core focus, since it is reviewed regularly by regulators and affects banks' profitability. This research gives recommendations on how to select an appropriate machine learning technique and how to perform feature selection and sensitivity analysis on a bank's credit data that contains more than one million records and is highly imbalanced, i.e., 97.5% of the data falls in one category. Several supervised machine learning classification methods, including the application of SMOTE (synthetic minority oversampling technique), are applied, and their computational results are compared and summarized. The resulting recommendation for big and extremely imbalanced datasets is the Tree Ensemble method with SMOTE, with the computational issue solved through data sampling without significantly reducing accuracy. It is also concluded that an optimum number of features increases model accuracy; however, a significant reduction in the number of features does not necessarily increase model accuracy. The research is expected to be useful for the banking industry, especially in credit portfolio analytics, and for other industries with big and imbalanced datasets, in performing predictive analytics to support business objectives. Further research could cover more in-depth analytics for the decision-making process in banking.
KW - credit data
KW - feature selection
KW - imbalanced dataset
KW - machine learning
KW - sensitivity analysis
UR - http://www.scopus.com/inward/record.url?scp=85141576783&partnerID=8YFLogxK
U2 - 10.1109/ICoICT55009.2022.9914889
DO - 10.1109/ICoICT55009.2022.9914889
M3 - Conference contribution
AN - SCOPUS:85141576783
T3 - 2022 10th International Conference on Information and Communication Technology, ICoICT 2022
SP - 35
EP - 40
BT - 2022 10th International Conference on Information and Communication Technology, ICoICT 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 10th International Conference on Information and Communication Technology, ICoICT 2022
Y2 - 2 August 2022 through 3 August 2022
ER -