TY - JOUR
T1 - Balinese story texts dataset for narrative text analyses
AU - Bimantara, I. Made Satria
AU - Purwitasari, Diana
AU - ER, Ngurah Agus Sanjaya
AU - Natha, Putu Gede Suarya
N1 - Publisher Copyright: © 2024 The Author(s)
PY - 2024/10
Y1 - 2024/10
N2 - Automatic narrative text analysis is gaining traction as artificial intelligence-based computational linguistic tools, such as named entity recognition systems and natural language processing (NLP) toolkits, become more prevalent. Character identification is the first stage in narrative text analysis; however, it is difficult because of the diversity of appearances and distinctive characteristics among regions. More challenging analyses, such as role classification, emotion and personality profiling, and character network development, depend on successful character identification. Because annotated English datasets are abundant, computational linguistic tools focus mostly on English literature. In contrast, few tools exist for analyzing Balinese story texts because datasets for this low-resource language are scarce. This study presents the first annotated Balinese story texts dataset for narrative text analyses, consisting of four sub-datasets for character identification, alias clustering (named entity linking, alias resolution), and character classification. The dataset is a compilation of 120 manually annotated Balinese stories from books and public websites, spanning multiple genres such as folk tales, fairy tales, fables, and mythology. Two Balinese native speakers, including an expert in sociolinguistics and macrolinguistics, annotated the dataset using predetermined guidelines set by an expert. The inter-annotator agreement (IAA) is calculated using Cohen's Kappa coefficient, the Jaccard similarity coefficient, and the mean F1-score to measure the level of agreement between annotators and the consistency and reliability of the dataset. The first sub-dataset consists of 89,917 annotated words with five labels referring to Balinese character named entities. Each character entity's appearance in 6,634 sentences is further annotated in the second sub-dataset. These two sub-datasets can be used for character identification at the word and sentence levels. The third sub-dataset annotates the list of character groups, which are groups of the various aliases of each character entity, for alias clustering purposes. It contains 930 character groups from 120 story texts, with each story text containing an average of 7 to 8 character groups. In the fourth sub-dataset, 848 of the 930 character groups from the third sub-dataset have been categorized as protagonists or antagonists. Protagonists (66.16 %) make up most of these character groups, and antagonists (33.84 %) make up the rest. The fourth sub-dataset can be used for computational classification of characters into two roles, protagonist and antagonist. These datasets have the potential to improve research in narrative text analyses, especially in the areas of computational linguistic tools and advanced machine learning (ML) and deep learning (DL) models for low-resource languages. They can also be used for further research, including character network development, character relationship extraction, and character classification beyond protagonist and antagonist.
KW - Alias clustering
KW - Automatic narrative text understanding
KW - Character classification
KW - Character extraction
KW - Character identification
KW - Computational linguistic
KW - Named entity linking
KW - Named entity recognition
UR - http://www.scopus.com/inward/record.url?scp=85201285246&partnerID=8YFLogxK
U2 - 10.1016/j.dib.2024.110781
DO - 10.1016/j.dib.2024.110781
M3 - Article
AN - SCOPUS:85201285246
SN - 2352-3409
VL - 56
JO - Data in Brief
JF - Data in Brief
M1 - 110781
ER -