Hybrid approaches have been widely used in several Natural Language Processing (NLP) tasks, including Named Entity Recognition (NER). This research proposes a NER system for Indonesian news documents that combines Conditional Random Fields (CRF) with K-Means. The hybrid approach clusters word embeddings with K-Means and uses the resulting cluster assignments as features in the CRF. Word embedding is a word representation technique that can capture the semantic meaning of words. The K-Means clustering results show that words with similar meanings are grouped into the same cluster. We believe these cluster features can improve the performance of the baseline model by adding information about the semantic relatedness of words. The word embeddings in this research come from an Indonesian Word2Vec model. The dataset consists of 51,241 entities from Indonesian online news. We conducted experiments by dividing the corpus into training and testing sets using percentage splitting, with four scenarios: 60-40, 70-30, 80-20, and 90-10. Our model achieved its best performance in the 60-40 scenario, with an F1-score of around 87.18%, an improvement of about 5.01% over the baseline model. We also compared our proposed method with two models from previous research, BILSTM and BILSTM-CRF. The experiments show that our model achieves better performance, with a best improvement of around 4.3%.
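As a rough illustration of the idea of using K-Means cluster assignments of word embeddings as CRF features, the following minimal sketch may help. It is not the authors' implementation. The two-dimensional vectors in `EMBEDDINGS` are made-up toy values standing in for Indonesian Word2Vec vectors, the small pure-Python `kmeans` is a plain Lloyd's algorithm used only to keep the example self-contained, and the feature names (`cluster`, `prev.cluster`, and so on) are hypothetical. Training the CRF itself is not shown.

```python
import random

# Toy 2-D "embeddings" (made-up values for illustration only;
# the paper uses vectors from an Indonesian Word2Vec model).
EMBEDDINGS = {
    "jakarta": (0.9, 0.1),
    "bandung": (0.8, 0.2),
    "surabaya": (0.85, 0.15),
    "joko": (0.1, 0.9),
    "widodo": (0.2, 0.8),
    "susilo": (0.15, 0.85),
}


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(vector, centroids):
    """Index of the centroid closest to the given vector."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vector, centroids[i]))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest_centroid(v, centroids)].append(v)
        # Recompute each centroid as the mean of its group;
        # keep the old centroid if a group happens to be empty.
        centroids = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def cluster_of(word, centroids):
    """Cluster ID of a word as a string, or "OOV" if it has no embedding."""
    vector = EMBEDDINGS.get(word.lower())
    if vector is None:
        return "OOV"
    return str(nearest_centroid(vector, centroids))


def token_features(sentence, i, centroids):
    """Feature dict for token i: surface features plus cluster features."""
    word = sentence[i]
    features = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "cluster": cluster_of(word, centroids),  # hypothetical feature name
    }
    if i > 0:
        features["prev.cluster"] = cluster_of(sentence[i - 1], centroids)
    else:
        features["BOS"] = True  # beginning of sentence
    return features


CENTROIDS = kmeans(list(EMBEDDINGS.values()), k=2)
SENTENCE = ["Joko", "Widodo", "mengunjungi", "Bandung"]
FEATURES = [token_features(SENTENCE, i, CENTROIDS) for i in range(len(SENTENCE))]
```

In this sketch, words that fall into the same cluster share a `cluster` value, so a CRF trained on such feature dictionaries could generalize from a word seen in training to an unseen word with a nearby embedding. Words without an embedding receive the placeholder value `"OOV"`, which is an assumption of this example rather than something stated in the abstract.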
|Number of pages|13|
|Journal|International Journal of Intelligent Engineering and Systems|
|Publication status|Published - 2020|
- Hybrid approach
- Named entity recognition