Hybrid conditional random fields and k-means for named entity recognition on Indonesian news documents

Joan Santoso*, Esther Irawati Setiawan, Eko Mulyanto Yuniarno, Mochamad Hariadi, Mauridhi Hery Purnomo*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

11 Citations (Scopus)

Abstract

Hybrid approaches have been widely used in several Natural Language Processing tasks, including Named Entity Recognition (NER). This research proposes a NER system for Indonesian news documents using a hybrid of Conditional Random Fields (CRF) and K-Means. The hybrid approach clusters word embeddings with K-Means and uses the resulting cluster as a feature in the CRF. Word embedding is a word representation technique that can capture the semantic meaning of words. The clustering result from K-Means shows that words with similar meanings are grouped in the same cluster. We believe this feature can improve the performance of the baseline model by adding the semantic relatedness of words through the cluster features. The word embeddings in this research are Indonesian Word2Vec embeddings. The dataset consists of 51,241 entities from Indonesian online news. We conducted experiments by dividing the corpus into training and testing datasets using percentage splitting, with four scenarios: 60-40, 70-30, 80-20, and 90-10. The best performance of our model was achieved in the 60-40 scenario, with an F1-score of around 87.18%, an improvement of about 5.01% over the baseline models. We also compare our proposed method with two models from previous research, BILSTM and BILSTM-CRF. The experiments show that our model achieves better performance, with a best improvement of around 4.3%.
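The idea of turning word embeddings into a discrete CRF feature can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional toy vectors stand in for Indonesian Word2Vec embeddings, the plain k-means routine and the names `kmeans`, `token_features`, and `word_to_cluster` are hypothetical, and the feature keys are only an assumed baseline feature set. The resulting per-token dictionaries are the kind of input a CRF toolkit typically accepts.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means; returns one cluster ID per input vector.

    Centroids are initialised from the first k vectors so the sketch
    is deterministic (an assumption made for illustration only).
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Toy vectors standing in for Word2Vec embeddings (hypothetical values).
toy_embeddings = {
    "jakarta": [0.90, 0.10],
    "surabaya": [0.80, 0.20],
    "bandung": [0.85, 0.15],
    "joko": [0.10, 0.90],
    "susi": [0.20, 0.80],
    "budi": [0.15, 0.85],
}

words = list(toy_embeddings)
cluster_ids = kmeans([toy_embeddings[w] for w in words], k=2)
word_to_cluster = dict(zip(words, cluster_ids))


def token_features(sentence, i, word_to_cluster):
    """Feature dict for token i: assumed baseline features plus the cluster ID."""
    word = sentence[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        # Cluster ID as a categorical feature; unseen words fall back to "UNK".
        "cluster": str(word_to_cluster.get(word.lower(), "UNK")),
    }


sentence = ["Budi", "tiba", "di", "Jakarta"]
features = [token_features(sentence, i, word_to_cluster) for i in range(len(sentence))]
```

In this toy setup the city names end up in one cluster and the person names in the other, so the CRF would see a shared `cluster` value for words with related meanings even when their surface forms differ.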

Original language: English
Pages (from-to): 233-245
Number of pages: 13
Journal: International Journal of Intelligent Engineering and Systems
Volume: 13
Issue number: 3
Publication status: Published - 2020

Keywords

  • CRF
  • Hybrid approach
  • Indonesian
  • K-means
  • Named entity recognition
  • Word2Vec
