Context-sensitive normalization of social media text in bahasa Indonesia based on neural word embeddings

Renny Pradina Kusumawardani*, Stezar Priansya, Faizal Johan Atletiko

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

7 Citations (Scopus)

Abstract

We present our work on the normalization of social media texts in Bahasa Indonesia. To capture the contextual meaning of tokens, we create neural word embeddings using word2vec, trained on over a million social media messages representing a mix of domains and degrees of linguistic deviation from standard Bahasa Indonesia. For each token to be normalized, the embeddings are used to generate candidates from vocabulary words. To select among these candidates, we use a score that combines their contextual similarity to the token, gauged by their proximity in the embedding vector space, with their orthographic similarity, measured using the Levenshtein and Jaro-Winkler distances. For the normalization of individual words, we observe that detecting whether a token actually represents an incorrectly spelled word is at least as important as finding the correct normalization. However, in the task of normalizing entire messages, the system achieves a highest accuracy of 79.59%, suggesting that our approach is promising and worthy of further exploration. We also discuss some observations on the use of neural word embeddings in processing informal Bahasa Indonesia texts, especially those from social media.
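The candidate scoring described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so everything specific here is an assumption made for illustration: the equal-weight combination, the `alpha` parameter, the function names, and the tiny hand-made vectors, which stand in for real word2vec embeddings and are not data from the paper.

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def jaro(a: str, b: str) -> float:
    """Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit, b_hit = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)

def cosine(u, v) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def rank_candidates(token, candidates, vectors, alpha=0.5):
    """Rank candidates by a weighted mix of contextual and orthographic similarity.

    ASSUMPTION: the weighting (alpha) and the plain average of the two string
    similarities are illustrative choices, not the paper's formula.
    """
    scored = []
    for cand in candidates:
        contextual = cosine(vectors[token], vectors[cand])
        orthographic = (levenshtein_similarity(token, cand)
                        + jaro_winkler(token, cand)) / 2
        scored.append((alpha * contextual + (1 - alpha) * orthographic, cand))
    return sorted(scored, reverse=True)

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "bgt": [0.9, 0.1, 0.0],      # informal token to normalize
    "banget": [0.8, 0.2, 0.1],
    "bagus": [0.1, 0.9, 0.2],
    "begitu": [0.3, 0.3, 0.8],
}
ranking = rank_candidates("bgt", ["banget", "bagus", "begitu"], toy_vectors)
best = ranking[0][1]
```

In a real pipeline the candidate list would come from nearest neighbours in the trained embedding space, and a separate check would decide whether the token needs normalizing at all, which the abstract notes is at least as important as the ranking itself.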

Original language: English
Pages (from-to): 105-117
Number of pages: 13
Journal: Procedia Computer Science
Volume: 144
Publication status: Published - 2018
Event: 3rd International Neural Network Society Conference on Big Data and Deep Learning, INNS BDDL 2018 - Sanur, Bali, Indonesia
Duration: 17 Apr 2018 to 19 Apr 2018

Keywords

  • Bahasa Indonesia
  • Deep Learning
  • Normalization
  • Social Media
  • Word Embeddings
  • Word2Vec
