Abstract

Bilingual lexicons are essential resources in natural language processing (NLP) and information retrieval (IR). Automatic bilingual lexicon acquisition typically relies on large parallel corpora, which can be scarce or even unavailable for many languages. Other resources can also be used to build bilingual lexicons, such as comparable corpora (aligned documents) and monolingual corpora, which are easy to obtain and available in any language, including resource-limited languages. Hence, this paper proposes a two-stage framework that learns bilingual lexicons from monolingual corpora, enhanced with comparable corpora, without any additional resources. The two stages are comparable dictionary building and monolingual mapping. Comparable dictionary building creates a coarse dictionary from comparable corpora using a topic modeling approach. Monolingual mapping then uses that coarse dictionary as seed initialization for bi-directional projection learning. Using comparable corpora in this way removes the need for a bilingual dictionary. Experiments were conducted on three language pairs: English→Indonesian, English→Arabic, and Arabic→Indonesian. The results show that the proposed method improves accuracy over learning from monolingual corpora alone and outperforms previous methods.
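
The monolingual mapping stage can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes word embeddings are already trained for each language, that the coarse dictionary from the first stage is available as index pairs (`seed_pairs`, a hypothetical name), and it swaps in a plain orthogonal Procrustes mapping in a single direction, with no bi-directional learning or hubness correction.

```python
import numpy as np


def normalize_rows(m):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.maximum(norms, 1e-12)


def learn_orthogonal_map(src_vecs, tgt_vecs):
    """Orthogonal Procrustes: the orthogonal W minimizing ||src_vecs @ W - tgt_vecs||_F."""
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt


def induce_lexicon(src_emb, tgt_emb, seed_pairs):
    """Return, for every source word index, the nearest target word index.

    seed_pairs is an assumed list of (source_index, target_index) tuples,
    standing in for the coarse dictionary built from comparable corpora.
    """
    src = normalize_rows(src_emb)
    tgt = normalize_rows(tgt_emb)
    src_idx = [s for s, _ in seed_pairs]
    tgt_idx = [t for _, t in seed_pairs]
    w = learn_orthogonal_map(src[src_idx], tgt[tgt_idx])
    sims = (src @ w) @ tgt.T  # cosine similarity of mapped source vs. target
    return sims.argmax(axis=1)
```

Plain nearest-neighbor retrieval like this is where the hubness problem listed in the keywords tends to show up, because a few target words become the nearest neighbor of many source words.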

Original language: English
Pages (from-to): 379-391
Number of pages: 13
Journal: International Journal of Intelligent Engineering and Systems
Volume: 13
Issue number: 5
Publication status: Published - 1 Oct 2020

Keywords

  • Bilingual lexicon
  • Comparable corpora
  • Enhanced-mono
  • Hubness problem
  • Linear mapping
  • Monolingual corpora

Title: Exploiting comparable corpora to enhance bilingual lexicon induction from monolingual corpora