Sentence Extraction Based on Sentence Distribution and Part of Speech Tagging for Multi-document Summarization

Agus Zainal Arifin*, Moch Zawaruddin Abdullah, Ahmad Wahyu Rosyadi, Desepta Isna Ulumi, Aminul Wahib, Rizka Wakhidatus Sholikah

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Automatic multi-document summarization needs to find representative sentences not only by using sentence distribution to select the most important sentences but also by considering how informative the terms in a sentence are. Sentence distribution is suitable for obtaining important sentences by determining frequent and well-spread words in the corpus, but it ignores the grammatical information that indicates instructive content. The presence or absence of informative content in a sentence can be indicated by grammatical information, which is carried by part of speech (POS) labels. In this paper, we propose a new sentence weighting method that incorporates sentence distribution and POS tagging for multi-document summarization. Similarity-based Histogram Clustering (SHC) is used to cluster sentences in the data set. Cluster ordering is based on cluster importance to determine the important clusters. Sentence extraction based on sentence distribution and POS tagging is introduced to extract the representative sentences from the ordered clusters. The results of the experiment on the Document Understanding Conferences (DUC) 2004 are compared with those of the Sentence Distribution Method. Our proposed method achieved better results, with increases of 5.41% on ROUGE-1 and 0.62% on ROUGE-2.
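The idea of weighting a sentence by both how well spread its terms are and what POS labels they carry can be illustrated with a minimal sketch. This is not the weighting formula from the paper, which the abstract does not give. The POS weights, the averaging scheme, the function names, and the toy cluster below are all assumptions made for illustration, and the input is assumed to be already tokenized and POS-tagged.

```python
from collections import Counter

# Hypothetical POS weights: assumed for illustration, not taken from the paper.
POS_WEIGHT = {"NOUN": 1.0, "VERB": 0.8, "ADJ": 0.5}
DEFAULT_POS_WEIGHT = 0.1  # assumed weight for every other tag


def term_spread(cluster):
    """Fraction of sentences in the cluster that contain each term."""
    counts = Counter()
    for sentence in cluster:
        # Count each term once per sentence, regardless of repeats.
        counts.update({term for term, _ in sentence})
    return {term: n / len(cluster) for term, n in counts.items()}


def sentence_score(sentence, spread):
    """Mean of (term spread * POS weight) over the tagged terms of a sentence."""
    if not sentence:
        return 0.0
    total = sum(
        spread[term] * POS_WEIGHT.get(tag, DEFAULT_POS_WEIGHT)
        for term, tag in sentence
    )
    return total / len(sentence)


def pick_representative(cluster):
    """Index of the highest-scoring sentence in one cluster."""
    spread = term_spread(cluster)
    return max(range(len(cluster)), key=lambda i: sentence_score(cluster[i], spread))


# Toy cluster of (term, POS tag) pairs; the tags are supplied by hand here.
cluster = [
    [("flood", "NOUN"), ("hit", "VERB"), ("city", "NOUN")],
    [("the", "DET"), ("flood", "NOUN"), ("was", "VERB"), ("severe", "ADJ")],
    [("officials", "NOUN"), ("said", "VERB"), ("so", "ADV")],
]

best = pick_representative(cluster)
print(best, cluster[best])
```

In this sketch a sentence ranks higher when its terms recur across the cluster and carry tags assumed to signal content, such as nouns and verbs. In a full pipeline the clusters would come from a clustering step such as SHC, and the tags from a POS tagger.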

Original language: English
Pages (from-to): 843-851
Number of pages: 9
Journal: Telkomnika (Telecommunication Computing Electronics and Control)
Volume: 16
Issue number: 2
Publication status: Published - Apr 2018

Keywords

  • multi-document summarization
  • POS tagging
  • sentence distribution
