Balinese story texts dataset for narrative text analyses

I. Made Satria Bimantara, Diana Purwitasari*, Ngurah Agus Sanjaya ER, Putu Gede Suarya Natha

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Automatic narrative text analysis is gaining traction as artificial intelligence-based computational linguistic tools, such as named entity recognition systems and natural language processing (NLP) toolkits, become more prevalent. Character identification is the first stage in narrative text analysis; however, it is difficult because characters appear in diverse forms and have distinctive characteristics that vary among regions. More challenging analyses, such as role classification, emotion and personality profiling, and character network development, all depend on successful character identification first. Because so many annotated English datasets exist, computational linguistic tools focus mostly on English literature. By contrast, few tools are available for analyzing Balinese story texts because datasets for this low-resource language are scarce. This study presents the first annotated Balinese story texts dataset for narrative text analyses, consisting of four sub-datasets for character identification, alias clustering (named entity linking, alias resolution), and character classification. The dataset is a compilation of 120 manually annotated Balinese stories from books and public websites, spanning multiple genres such as folk tales, fairy tales, fables, and mythology. Two Balinese native speakers, including an expert in sociolinguistics and macrolinguistics, annotated the dataset using predetermined guidelines set by an expert. The inter-annotator agreement (IAA) is calculated using Cohen's Kappa coefficient, the Jaccard similarity coefficient, and the mean F1-score to measure the level of agreement between annotators and the consistency and reliability of the dataset. The first sub-dataset consists of 89,917 annotated words with five labels referring to the Balinese character named entities. Each character entity's appearance in 6,634 sentences is further annotated in the second sub-dataset.
These two sub-datasets can be used for character identification at the word and sentence levels. The third sub-dataset, intended for alias clustering, annotates the list of character groups, where each group collects the various aliases of one character entity. It contains 930 character groups from the 120 story texts, an average of 7 to 8 character groups per story. In the fourth sub-dataset, 848 of the 930 character groups in the third sub-dataset have been categorized as protagonists or antagonists. Protagonists make up most of these groups (66.16%), and antagonists make up the rest (33.84%). The fourth sub-dataset can be used for computational classification of characters into two roles, protagonist and antagonist. These datasets have the potential to advance research in narrative text analyses, especially on computational linguistic tools and advanced machine learning (ML) and deep learning (DL) models for low-resource languages. They can also be used for further research, including character network development, character relationship extraction, and character classification beyond protagonist and antagonist.
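As an illustration of two of the agreement measures named in the abstract, the following is a minimal Python sketch of Cohen's Kappa over word-level labels and the Jaccard similarity between two annotators' alias sets. It is not the authors' implementation. The label names ("CHAR", "O"), the alias placeholders, and the function names are hypothetical and chosen only for this example; the actual label set and file format of the dataset are not given in the abstract.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def jaccard(items_a, items_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two annotation sets."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 1.0  # two empty annotations agree trivially
    return len(a & b) / len(a | b)


# Hypothetical toy annotations (not taken from the dataset).
annotator_a = ["CHAR", "O", "O", "CHAR", "O", "O"]
annotator_b = ["CHAR", "O", "CHAR", "CHAR", "O", "O"]
kappa = cohen_kappa(annotator_a, annotator_b)

aliases_a = {"alias_1", "alias_2"}
aliases_b = {"alias_1", "alias_2", "alias_3"}
overlap = jaccard(aliases_a, aliases_b)

print(f"kappa={kappa:.3f} jaccard={overlap:.3f}")
```

Kappa corrects raw agreement for agreement expected by chance, which matters when one label (such as a non-character tag) dominates the word-level annotations. Jaccard suits set-valued annotations such as the aliases assigned to one character group.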

Original language: English
Article number: 110781
Journal: Data in Brief
Volume: 56
Publication status: Published - Oct 2024

Keywords

  • Alias clustering
  • Automatic narrative text understanding
  • Character classification
  • Character extraction
  • Character identification
  • Computational linguistic
  • Named entity linking
  • Named entity recognition
