Source Code Statement Classification using ANTLR and Random Forest

Hanson Prihantoro Putro, Umi Laili Yuhana, Eko Mulyanto Yuniarno, Mauridhi Hery Purnomo

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In software development, source code analysis is essential for software maintenance. One of the metrics used to assess software quality is software complexity, which can be calculated by analyzing source code statements such as declaration, expression, and control statements. However, analyzing source code statements with a popular existing technique such as NLP requires a more structured form of data, which can be obtained by converting the source code into a token stream using a programming language processor. This study aims to identify the best-performing source code statement classification among six models and two source code forms. The models are Decision Tree, Naïve Bayes, SVM, kNN, Rocchio Algorithm, and Random Forest. The source code forms are the raw source code and the token stream object. The dataset consists of 562 source code statements collected from an open-source Java project in a public repository. ANTLR was used to transform the raw source code into the token stream object. Features were then extracted from the dataset using TF-IDF, an NLP technique. Finally, the six machine learning models were built and evaluated to compare their classification performance. Random Forest was the best model, with the highest accuracy among the six. Moreover, the token stream produced by ANTLR gave better results than the raw source code. The model predicted source code statements with 96.1% accuracy using Random Forest and ANTLR. The multiclass classification was also evaluated per class pair: no declaration statement was predicted as a control statement, or vice versa (100% accuracy for that pair), but some mispredictions were observed in the declaration-expression and expression-control pairs. Nevertheless, the Random Forest model's high accuracy and precision make it suitable for classifying source code statements.
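The tokenize-then-weight steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the regex-based `to_token_stream` function and its `KEYWORDS` set are hypothetical stand-ins for an ANTLR-generated Java lexer, the three sample statements are invented for illustration, and `tf_idf` uses relative term frequency with one common smoothed IDF formula, which may differ from the weighting used in the paper.

```python
import math
import re
from collections import Counter

# ASSUMPTION: a tiny keyword set and regex stand in for an ANTLR-generated
# Java lexer. A real ANTLR lexer covers the full Java grammar.
KEYWORDS = {"int", "double", "boolean", "String", "if", "else",
            "for", "while", "return", "new"}
TOKEN_RE = re.compile(
    r"[A-Za-z_]\w*|\d+(?:\.\d+)?|==|!=|<=|>=|&&|\|\||[^\sA-Za-z_0-9]"
)


def to_token_stream(statement):
    """Map each lexeme of a statement to a coarse token-type name."""
    tokens = []
    for lexeme in TOKEN_RE.findall(statement):
        if lexeme in KEYWORDS:
            tokens.append(lexeme.upper())      # keywords keep their identity
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append("IDENTIFIER")        # variable names are abstracted away
        elif lexeme[0].isdigit():
            tokens.append("NUMBER")            # literal values are abstracted away
        else:
            tokens.append(lexeme)              # operators and punctuation as-is
    return tokens


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    tf  = term count / document length
    idf = ln((1 + N) / (1 + df)) + 1   (a smoothed variant)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))              # count each term once per document
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


# Illustrative statements (not taken from the paper's dataset):
# a declaration, an expression, and a control statement.
statements = ["int count = 0;", "count = count + 1;", "if (count > 0) {"]
streams = [to_token_stream(s) for s in statements]
vectors = tf_idf(streams)
```

In this sketch the token `IDENTIFIER` appears in every statement and therefore receives the lowest IDF, while tokens such as `INT` or `IF` appear in only one statement and are weighted more heavily. This suggests one reason a token stream might be a more informative input than raw text. The resulting vectors would then be passed to a classifier such as Random Forest, which is not shown here.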

Original language: English
Title of host publication: 2023 International Seminar on Intelligent Technology and Its Applications
Subtitle of host publication: Leveraging Intelligent Systems to Achieve Sustainable Development Goals, ISITIA 2023 - Proceeding
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 60-65
Number of pages: 6
ISBN (Electronic): 9798350313956
Publication status: Published - 2023
Event: 24th International Seminar on Intelligent Technology and Its Applications, ISITIA 2023 - Hybrid, Surabaya, Indonesia
Duration: 26 Jul 2023 → 27 Jul 2023

Publication series

Name: 2023 International Seminar on Intelligent Technology and Its Applications: Leveraging Intelligent Systems to Achieve Sustainable Development Goals, ISITIA 2023 - Proceeding

Conference

Conference: 24th International Seminar on Intelligent Technology and Its Applications, ISITIA 2023
Country/Territory: Indonesia
City: Hybrid, Surabaya
Period: 26/07/23 → 27/07/23

Keywords

  • ANTLR
  • Java statement
  • Random Forest
  • classification
  • source code analysis
