TY - JOUR
T1 - Sentence extraction based on sentence distribution and part of speech tagging for multi-document summarization
AU - Arifin, Agus Zainal
AU - Abdullah, Moch Zawaruddin
AU - Rosyadi, Ahmad Wahyu
AU - Ulumi, Desepta Isna
AU - Wahib, Aminul
AU - Sholikah, Rizka Wakhidatus
N1 - Publisher Copyright: © 2018 Universitas Ahmad Dahlan.
PY - 2018/4
Y1 - 2018/4
N2 - Automatic multi-document summarization must select representative sentences based not only on sentence distribution, which identifies the most important sentences, but also on how informative the terms in each sentence are. Sentence distribution is well suited to finding important sentences because it identifies words that are frequent and well spread across the corpus, but it ignores the grammatical information that signals informative content. Whether a sentence carries informative content can be indicated by grammatical information, which is conveyed by part-of-speech (POS) labels. In this paper, we propose a new sentence weighting method that combines sentence distribution and POS tagging for multi-document summarization. Similarity-based Histogram Clustering (SHC) is used to cluster the sentences in the data set. The clusters are then ordered by cluster importance to identify the important ones. Sentence extraction based on sentence distribution and POS tagging then selects the representative sentences from the ordered clusters. Experimental results on the Document Understanding Conferences (DUC) 2004 data set are compared with those of the Sentence Distribution Method. The proposed method achieved better results, with improvements of 5.41% on ROUGE-1 and 0.62% on ROUGE-2.
KW - Multi-document summarization
KW - POS tagging
KW - Sentence distribution
UR - http://www.scopus.com/inward/record.url?scp=85048837064&partnerID=8YFLogxK
U2 - 10.12928/TELKOMNIKA.v16i2.8431
DO - 10.12928/TELKOMNIKA.v16i2.8431
M3 - Article
AN - SCOPUS:85048837064
SN - 1693-6930
VL - 16
SP - 843
EP - 851
JO - Telkomnika (Telecommunication Computing Electronics and Control)
JF - Telkomnika (Telecommunication Computing Electronics and Control)
IS - 2
ER -