TY - JOUR
T1 - Graph Model and Deep Learning for Topic Labels in Classifying Short Texts of Scientific Article Titles
AU - Sumpeno, Surya
AU - Purwitasari, Diana
AU - Farandy, Bastian
AU - Navastara, Dini Adni
AU - Purnomo, Mauridhi Hery
N1 - Publisher Copyright: © 2022, International Journal of Intelligent Engineering and Systems. All Rights Reserved.
PY - 2022/4
Y1 - 2022/4
AB - In an article search system, topic categorization can guide researchers to the appropriate documents among the abundance of available scientific articles. Because short texts are easily obtained from the internet, they are preferred over full-text articles as the data collection for the search system. However, topic label scarcity becomes a problem, especially when preparing the system in a cold-start situation or without predefined topic categories. To overcome the label scarcity, typical topic analysis uses the statistical, unsupervised Latent Dirichlet Allocation (LDA), which identifies clusters of words, or topics, based on the word distribution. For ease of use, the words in the LDA topics are inspected manually, and some are set as topic names. Precision dropped when the LDA topics were used to tag other articles, through a categorization or classification approach, while preparing the search system. The precision values could be affected by identifying too many LDA topics. Because the same words then appear in different topics, the contexts in the LDA results may overlap, leading to many false positives. The problem addressed here is obtaining classification results with comparable accuracy and precision values when the data has no topic labels and the identified topics overlap in context. The proposed solution considers the relations between words when identifying topics in order to differentiate the word context. Therefore, our contribution is to investigate LDA together with the relationships between words represented as a graph, using a prevalent deep learning neural network model called Graph Convolutional Network (GCN), for automatically determining the topics before examining them in classification tasks. Guided by the proposed framework, we synthesize training samples to make the dataset for LDA topics more similar in contexts. The empirical analysis through the experiments thoroughly evaluated the LDA topics as a baseline, comparing the results of statistical-based (LDA) and deep-learning-based (Deep LDA) topic identification to ensure the topic quality. Then, we compared the usage of GCN with other frequently used text classification methods.
KW - Graph convolutional network
KW - Latent Dirichlet allocation
KW - Short text classification
KW - Topic modeling
UR - http://www.scopus.com/inward/record.url?scp=85126808530&partnerID=8YFLogxK
U2 - 10.22266/ijies2022.0430.41
DO - 10.22266/ijies2022.0430.41
M3 - Article
AN - SCOPUS:85126808530
SN - 2185-310X
VL - 15
SP - 454
EP - 463
JO - International Journal of Intelligent Engineering and Systems
JF - International Journal of Intelligent Engineering and Systems
IS - 2
ER -