Abstract

In an article search system, topic categorization can guide researchers to the appropriate documents among the abundance of available scientific articles. Because short texts are easily obtained from the internet, they are often preferred over full-text articles when collecting data for such a system. However, topic labels are scarce, especially when the system is prepared in a cold-start situation, that is, without predefined topic categories. To overcome this scarcity, typical topic analysis uses Latent Dirichlet Allocation (LDA), a statistical unsupervised method that identifies clusters of words, or topics, based on word distributions. For ease of use, the words in each LDA topic are inspected manually, and some are chosen as topic names. Precision drops, however, when the LDA topics are used to tag other articles through a categorization or classification approach while preparing the search system. One possible influence is an excessive number of identified LDA topics: because the same words appear in different topics, contexts in the LDA results can overlap, leading to many false positives. The problem addressed here is how to obtain classification results with comparable accuracy and precision when the data has no topic labels and the identified topics overlap in context. The proposed solution considers the relations between words when identifying topics, in order to differentiate word contexts. Our contribution is therefore to investigate LDA together with word relationships represented as a graph, using a prevalent deep learning model, the Graph Convolutional Network (GCN), to determine topics automatically before examining them in classification tasks. Guided by the proposed framework, we synthesize training samples so that the dataset for the LDA topics is more similar in context. The experiments use the LDA topics as a baseline and compare statistical (LDA) and deep-learning-based (Deep LDA) topic identification to assess topic quality. We then compare GCN with other frequently used text classifiers.
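To make the idea of "relationships between words as a graph" concrete, the following is a minimal sketch of one graph-convolution step over a word co-occurrence graph. It is an illustration under stated assumptions, not the authors' implementation: the toy titles, the rule that two words are linked when they share a title, the helper names, and the fixed weights are all hypothetical. Only the propagation rule, ReLU(D^-1/2 (A + I) D^-1/2 H W), is the standard GCN layer.

```python
import math
from itertools import combinations

# Hypothetical toy titles; not taken from the paper's dataset.
titles = [
    "graph neural network classification",
    "topic model short text",
    "graph topic model",
]

def build_cooccurrence_graph(docs):
    """Return (vocab, adjacency); words that share a title are linked (assumed rule)."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    adj = [[0.0] * n for _ in range(n)]
    for d in docs:
        for a, b in combinations(sorted(set(d.split())), 2):
            adj[index[a]][index[b]] = 1.0
            adj[index[b]][index[a]] = 1.0
    return vocab, adj

def normalize_adjacency(adj):
    """Compute D^-1/2 (A + I) D^-1/2, the usual GCN propagation matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # always >= 1 because of self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]

def gcn_layer(norm_adj, features, weights):
    """One propagation step: ReLU(norm_adj @ features @ weights)."""
    hidden = matmul(matmul(norm_adj, features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]

vocab, adj = build_cooccurrence_graph(titles)
norm_adj = normalize_adjacency(adj)

# One-hot input features; the weights are fixed illustrative values
# (a trained model would learn them from labeled or pseudo-labeled data).
n = len(vocab)
features = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
weights = [[0.1 * ((i + j) % 3 + 1) for j in range(2)] for i in range(n)]

# Each word now has a 2-dimensional vector mixed from its graph neighbours.
embeddings = gcn_layer(norm_adj, features, weights)
```

After this step, each word's vector is a normalized average over itself and the words it co-occurs with, so a word such as "graph" is represented differently depending on its neighbours. This is one way word context can be separated when the same word appears in several LDA topics.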

Original language: English
Pages (from-to): 454-463
Number of pages: 10
Journal: International Journal of Intelligent Engineering and Systems
Volume: 15
Issue number: 2
Publication status: Published - Apr 2022

Keywords

  • Graph convolutional network
  • Latent dirichlet allocation
  • Short text classification
  • Topic modeling

Title

Graph Model and Deep Learning for Topic Labels in Classifying Short Texts of Scientific Article Titles
