TY - GEN
T1 - Undersampling Data Augmentation for BotNet Classification
AU - Sierra, Evelyn
AU - Ahmad, Tohari
AU - Putra, Muhammad Aidiel Rachman
N1 - Publisher Copyright: ©2024 IEEE.
PY - 2024
Y1 - 2024
AB - Botnets pose significant threats to cybersecurity by exploiting large networks of compromised devices for malicious activities. Traditional botnet detection methods often struggle with the evolving tactics employed by botnet operators, leading to high false positive rates and reduced detection accuracy. The data used to train the detection system are often an obstacle to good performance because such datasets are typically imbalanced. For example, the dataset used in this study has 0.97 background, 0.23 normal, and 0.07 botnet activities; each activity comprises network traffic and its labels. This study addresses the problem by first applying text preprocessing to obtain clean labels for the network traffic in the dataset. It then employs RandomUnderSampler to ensure that the samples for each label reach 2000 data points. Subsequently, classification experiments are conducted using the Random Forest, Decision Tree, Support Vector Machine, k-Nearest Neighbor, and Logistic Regression methods. The results indicate that Random Forest with a RandomUnderSampler ratio of 3:2:1 achieves the highest accuracy, 0.97. In addition, the model achieves a precision of 0.97, a recall of 0.95, and an F1 score of 0.96.
KW - Botnet
KW - Computer Security
KW - Information Security
KW - Network Infrastructure
KW - Network Security
UR - http://www.scopus.com/inward/record.url?scp=85212864958&partnerID=8YFLogxK
U2 - 10.1109/ICCCNT61001.2024.10723957
DO - 10.1109/ICCCNT61001.2024.10723957
M3 - Conference contribution
AN - SCOPUS:85212864958
T3 - 2024 15th International Conference on Computing Communication and Networking Technologies, ICCCNT 2024
BT - 2024 15th International Conference on Computing Communication and Networking Technologies, ICCCNT 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 15th International Conference on Computing Communication and Networking Technologies, ICCCNT 2024
Y2 - 24 June 2024 through 28 June 2024
ER -