TY - JOUR
T1 - Application of a combination between Principal Component Analysis and Logistic Regression Based on Support Vector Machine on Educational Data Mining with Overlapping Data Problem
AU - Mutrofin, Siti
AU - Maisarah, Maisarah
AU - Widodo, Slamet
AU - Ginardi, Raden Venantius Hari
AU - Fatichah, Chastine
N1 - Publisher Copyright:
© Published under licence by IOP Publishing Ltd.
PY - 2020/7/1
Y1 - 2020/7/1
N2 - In 2019, the government of the Republic of Indonesia issued a zoning-based policy for New Student Admissions (PPDB) covering elementary school (SD) through high school (SMA), especially public schools. The policy is documented in Permendikbud No. 51/2018. It aims to ensure equal access to education and to keep prospective students from focusing only on favorite schools. However, the policy raises new problems. One is that a prospective student with a medium UN (National Examination) score, or who lives a medium distance from the destination school, has very little chance of being accepted there. The situation is worse when prospective students do not know the lowest score and the farthest distance the destination school will accept: they choose schools by guessing rather than from valid data, so their chances of acceptance are very small. This research focuses on Educational Data Mining for PPDB at public high schools (SMA) in Jombang in the 2019/2020 academic year, and aims to help prospective students predict their destination school from their own grades and home distance using data mining classification techniques. However, another problem emerged in this study: overlapping data, where the same value appears in more than one class. For example, a prospective student of SMA Negeri 2 Jombang (SMAN 2 Jombang) scored 80 in Bahasa Indonesia, the same score as a student from SMA Negeri 3 Jombang (SMAN 3 Jombang). The overlap affects not just one record but almost all of the data. The data set comprised 600 records: 308 from PPDB 2019 at SMAN 2 Jombang and the remaining 292 from SMAN 3 Jombang. The attributes were the home distance from the destination school, the overall UN score, and the UN scores for Mathematics, Natural Sciences, Bahasa Indonesia, and English.
The algorithm combined Principal Component Analysis (PCA) with a Logistic Regression (LR)-based Support Vector Machine (SVM) using an ANOVA kernel. Validation used 10-fold cross-validation, and performance was evaluated by accuracy, precision, and recall. The proposed combination achieved an accuracy of 94.33%, a precision of 96.28%, and a recall of 92.53%, better than the same approach without PCA (70.83% accuracy, 69.62% precision, and 76.62% recall). With PCA, the data could be viewed from another angle that separates one class from the others. Even though 100% of the data overlapped, no records were exactly the same across all attributes.
UR - http://www.scopus.com/inward/record.url?scp=85087894571&partnerID=8YFLogxK
U2 - 10.1088/1757-899X/874/1/012018
DO - 10.1088/1757-899X/874/1/012018
M3 - Conference article
AN - SCOPUS:85087894571
SN - 1757-8981
VL - 874
JO - IOP Conference Series: Materials Science and Engineering
JF - IOP Conference Series: Materials Science and Engineering
IS - 1
M1 - 012018
T2 - 2019 International Conference on Engineering, Technologies, and Applied Sciences, ICETsAS 2019
Y2 - 17 October 2019 through 18 October 2019
ER -