TY - GEN
T1 - Source Code Statement Classification using ANTLR and Random Forest
AU - Putro, Hanson Prihantoro
AU - Yuhana, Umi Laili
AU - Yuniarno, Eko Mulyanto
AU - Purnomo, Mauridhi Hery
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
AB - In software development, source code analysis is essential for software maintenance. One of the metrics used to assess software quality is software complexity, which can be calculated by analyzing source code statements such as declaration, expression, and control statements. However, analyzing source code statements with a popular existing technique such as NLP requires a more structured form of the data, which can be obtained by converting the source code into a token stream using a programming language processor. This study aims to find the best source code statement classification among six observed models and two source code forms. The models are Decision Tree, Naïve Bayes, SVM, kNN, the Rocchio Algorithm, and Random Forest. The source code forms are the raw source code and the token stream object. The dataset consists of 562 rows of source code statements collected from an open-source Java project in a public repository. ANTLR was used to transform the raw source code into the token stream object. TF-IDF, an NLP technique, was then applied to the dataset to extract the source code features. Six machine learning models were then built and evaluated to compare their classification performance. Random Forest was the best model, with the highest accuracy among the machine learning models compared. Moreover, the token stream produced by ANTLR outperformed the raw source code as the input form. Using the Random Forest model with ANTLR, source code statements were predicted with 96.1% accuracy. Furthermore, a per-class evaluation of this multiclass classification showed that no declaration statement was predicted as a control statement, and vice versa (100% accuracy), but some mispredictions were observed in the declaration-expression and expression-control pairs. Nevertheless, the Random Forest model's high accuracy and precision make it suitable for classifying source code statements.
KW - ANTLR
KW - Java statement
KW - Random Forest
KW - classification
KW - source code analysis
UR - http://www.scopus.com/inward/record.url?scp=85171138008&partnerID=8YFLogxK
U2 - 10.1109/ISITIA59021.2023.10220999
DO - 10.1109/ISITIA59021.2023.10220999
M3 - Conference contribution
AN - SCOPUS:85171138008
T3 - 2023 International Seminar on Intelligent Technology and Its Applications: Leveraging Intelligent Systems to Achieve Sustainable Development Goals, ISITIA 2023 - Proceeding
SP - 60
EP - 65
BT - 2023 International Seminar on Intelligent Technology and Its Applications
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 24th International Seminar on Intelligent Technology and Its Applications, ISITIA 2023
Y2 - 26 July 2023 through 27 July 2023
ER -