The aim of this research is to automatically extract glosses for Indonesian synonym sets (synsets) using a supervised learning approach. The main sources are collections of web documents that contain glosses of the synonym sets. The proposed method has three main phases: preprocessing, feature extraction, and classification. The preprocessing phase includes large-scale fetching of the web document collection, extraction of raw text, text clean-up, and extraction of sentences that are possible gloss candidates. In the feature extraction phase, seven features are extracted from each gloss candidate: the position of the sentence in its paragraph, the frequency of the sentence in the document collection, the number of words in the sentence, the number of important words in the sentence, the number of characters in the sentence, the number of gloss sentences from the same word, and the number of nouns in the sentence. Lastly, in the classification phase, a supervised learning method accepts or rejects each candidate as a true gloss based on those seven features. The paper shows that the proposed system acquired 6,520 Indonesian synset glosses, with average accuracies of 74.06% and 75.40% using a decision tree and a backpropagation feedforward neural network, respectively. Given the number of glosses acquired, which is significant for Indonesian words, the supervised learning approach used in this research is expected to help accelerate the construction of lexical databases such as WordNet for other languages.
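As a rough illustration of the feature extraction phase, the seven features listed above could be computed along the following lines. This is a minimal sketch, not the authors' implementation: the `Candidate` structure, the `STOPWORDS` and `NOUNS` sets, and the reading of "important words" as non-stopwords are all assumptions made for the example, and a real system would presumably use a proper Indonesian stopword list and part-of-speech tagger.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical resources: the abstract does not say which stopword list or
# part-of-speech tagger was used, so tiny hand-made sets stand in for them.
STOPWORDS = {"adalah", "yang", "dan", "di", "atau", "untuk"}
NOUNS = {"hewan", "mamalia", "rumah", "peliharaan"}


@dataclass
class Candidate:
    word: str      # synset word the sentence may define
    sentence: str  # candidate gloss sentence
    position: int  # 0-based index of the sentence within its paragraph


def extract_features(cand, all_candidates):
    """Return the seven features for one gloss candidate as a list."""
    tokens = cand.sentence.lower().rstrip(".").split()
    sentence_freq = Counter(c.sentence for c in all_candidates)
    per_word = Counter(c.word for c in all_candidates)
    return [
        cand.position,                                  # position in paragraph
        sentence_freq[cand.sentence],                   # frequency in collection
        len(tokens),                                    # number of words
        sum(1 for t in tokens if t not in STOPWORDS),   # "important" words (assumed: non-stopwords)
        len(cand.sentence),                             # number of characters
        per_word[cand.word],                            # candidate sentences for the same word
        sum(1 for t in tokens if t in NOUNS),           # number of nouns (lexicon lookup)
    ]


candidates = [
    Candidate("kucing", "Kucing adalah hewan mamalia peliharaan.", 0),
    Candidate("kucing", "Kucing adalah hewan mamalia peliharaan.", 2),
    Candidate("anjing", "Anjing adalah hewan mamalia.", 0),
]
features = extract_features(candidates[0], candidates)
print(features)
```

Feature vectors of this shape, paired with accept/reject labels, could then be passed to any supervised classifier, such as the decision tree or feedforward neural network named in the abstract.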
|Number of pages|10|
|Journal|IAENG International Journal of Computer Science|
|Publication status|Published - 2015|
- Gloss acquisition
- Indonesian language
- Supervised learning
- WordNet