Study of parameters of the nearest neighbour shared algorithm on clustering documents

Alvida Mustika Rukmi, Daryono Budi Utomo, Neni Imro'atus Sholikhah

Research output: Contribution to journal › Conference article › peer-review

2 Citations (Scopus)

Abstract

Document clustering is one way of automatically managing documents, extracting document topics, and quickly filtering information. The text-mining preprocessing for document clustering consists of keyword extraction using Rapid Automatic Keyphrase Extraction (RAKE) and representation of each document as a concept vector using Latent Semantic Analysis (LSA). Clustering is then performed on the preprocessed data so that documents with similar topics fall in the same cluster. The Shared Nearest Neighbour (SNN) algorithm is a clustering method based on the number of nearest neighbours that documents share. Its parameters are k, the number of nearest neighbour documents; ϵ, the number of shared nearest neighbour documents; and MinT, the minimum number of similar documents that can form a cluster. The SNN algorithm is characterised by this shared-neighbour property. Each cluster is formed by keywords that are shared by its documents, and a cluster can be built from more than one keyword if the frequency of those keywords in the documents is also high. The choice of parameter values affects the clustering results. A higher value of k increases the number of neighbour documents of each document, so the similarity among neighbouring documents is lower and the accuracy of each cluster is also low. A higher value of ϵ causes each document to keep only neighbour documents with high similarity when building a cluster, which also leaves more unclassified documents (noise). A higher value of MinT decreases the number of clusters, since groups of similar documents smaller than MinT cannot form a cluster. The parameters of the SNN algorithm thus determine the performance of the clustering result and the amount of noise (unclustered documents). The Silhouette coefficient is almost the same across many experiments, above 0.9, which means that the SNN algorithm works well with different parameter values.
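The role of the three parameters can be illustrated with a minimal Python sketch. It is a simplified reading of shared-nearest-neighbour clustering and not the authors' implementation. It makes three assumptions: documents are already concept vectors, two documents are linked when they appear in each other's k-nearest-neighbour lists and share at least ϵ neighbours, and a connected group of linked documents becomes a cluster only if it has at least MinT members. The names `cosine`, `snn_cluster`, `eps`, `min_t` and the toy vectors in `docs` are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def snn_cluster(vectors, k, eps, min_t):
    """Return one label per document; -1 marks noise (unclustered)."""
    n = len(vectors)

    # Step 1: k nearest neighbours of each document (itself excluded).
    knn = []
    for i in range(n):
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(vectors[i], vectors[j]),
            reverse=True,
        )
        knn.append(set(ranked[:k]))

    # Step 2: link mutual neighbours that share at least eps neighbours.
    links = {i: set() for i in range(n)}
    for i in range(n):
        for j in knn[i]:
            if i in knn[j] and len(knn[i] & knn[j]) >= eps:
                links[i].add(j)
                links[j].add(i)

    # Step 3: connected groups with at least min_t documents become clusters.
    labels = [-1] * n
    seen = set()
    next_label = 0
    for start in range(n):
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in links[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        if len(component) >= min_t:
            for node in component:
                labels[node] = next_label
            next_label += 1
    return labels


# Toy concept vectors (assumed): two well-separated groups of four documents.
docs = [
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.8, 0.2],
    [0.0, 1.0], [0.1, 0.9], [0.05, 0.95], [0.2, 0.8],
]
print(snn_cluster(docs, k=3, eps=2, min_t=3))
```

On this toy data, `k=3, eps=2, min_t=3` separates the two groups. Raising `eps` to 3 or `min_t` to 5 leaves every document as noise, and raising `k` to 7 merges both groups into one cluster, which mirrors the parameter effects described in the abstract.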

Original language: English
Article number: 012061
Journal: Journal of Physics: Conference Series
Volume: 974
Issue number: 1
Publication status: Published - 22 Mar 2018
Event: 3rd International Conference on Mathematics: Pure, Applied and Computation, ICoMPAC 2017 - Surabaya, Indonesia
Duration: 1 Nov 2017 → 1 Nov 2017
