TY - GEN
T1 - The Effects of Feature Selection and Balancing Dataset to Improve IoT-Based IDS Using Machine Learning
AU - Putro, Iwan Handoyo
AU - Ahmad, Tohari
AU - Ijtihadie, Royyana Muslim
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - The amount of data has increased dramatically over the past decade, which has made classification more complicated, particularly when the data are unevenly distributed, that is, when one majority class has more instances than the others. In this situation, standard classifiers tend to assign all samples to the majority class and can disregard the minority class entirely. Imbalanced datasets are frequently encountered in IDS research because benign traffic naturally outnumbers cyber threats. Researchers often apply the SMOTE technique to this challenge, particularly the oversampling method. Nonetheless, other SMOTE techniques, namely undersampling and combined sampling, are rarely implemented. In this work, we attempt to balance the RT IoT 2022 dataset with wrapper feature selection and to-fold cross-validation. The results were then evaluated using several machine learning classifiers, including K-nearest neighbors, Naive Bayes, Decision Tree, Random Forest, Support Vector Machines, and Adaptive Boosting. The results indicate that, in terms of accuracy, Random Forest outperforms the other classifiers in the oversampling, undersampling, and combined-sampling experiments, at 99.09%, 98.65%, and 99.97%, respectively.
AB - The amount of data has increased dramatically over the past decade, which has made classification more complicated, particularly when the data are unevenly distributed, that is, when one majority class has more instances than the others. In this situation, standard classifiers tend to assign all samples to the majority class and can disregard the minority class entirely. Imbalanced datasets are frequently encountered in IDS research because benign traffic naturally outnumbers cyber threats. Researchers often apply the SMOTE technique to this challenge, particularly the oversampling method. Nonetheless, other SMOTE techniques, namely undersampling and combined sampling, are rarely implemented. In this work, we attempt to balance the RT IoT 2022 dataset with wrapper feature selection and to-fold cross-validation. The results were then evaluated using several machine learning classifiers, including K-nearest neighbors, Naive Bayes, Decision Tree, Random Forest, Support Vector Machines, and Adaptive Boosting. The results indicate that, in terms of accuracy, Random Forest outperforms the other classifiers in the oversampling, undersampling, and combined-sampling experiments, at 99.09%, 98.65%, and 99.97%, respectively.
KW - IDS
KW - RT IoT 2022
KW - SMOTE
KW - cross-validation
KW - imbalanced dataset
KW - machine learning
UR - https://www.scopus.com/pages/publications/105002211772
U2 - 10.1109/ONCON62778.2024.10931560
DO - 10.1109/ONCON62778.2024.10931560
M3 - Conference contribution
AN - SCOPUS:105002211772
T3 - 2024 IEEE 3rd Industrial Electronics Society Annual On-Line Conference, ONCON 2024
BT - 2024 IEEE 3rd Industrial Electronics Society Annual On-Line Conference, ONCON 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 3rd IEEE Industrial Electronics Society Annual On-Line Conference, ONCON 2024
Y2 - 8 December 2024 through 10 December 2024
ER -