TY - JOUR
T1 - A Comparison of Transformer and BiLSTM Based BioNER Model with Self-Training on Low-Resource Language Texts of Online Health Consultation
AU - Purwitasari, Diana
AU - Abdillah, Abid Famasya
AU - Juanita, Safitri
AU - Purnama, I. Ketut Eddy
AU - Purnomo, Mauridhi Hery
N1 - Publisher Copyright:
© 2023, Intelligent Network and Systems Society. All Rights Reserved.
PY - 2023
Y1 - 2023
N2 - More people have become accustomed to online health consultations (OHC) since the COVID-19 pandemic, whether to reassure themselves about their health conditions or to seek other treatment options. An OHC system could use named entity recognition (NER), specifically biomedical NER (BioNER) for health-related texts, to extract entities from posting history and help users find information. The named entities (NEs) could be terms for parts of the human anatomy experiencing discomfort or terms describing disease symptoms. However, OHC posts, especially user questions, are often informal, sometimes long, sentences that may contain incorrect medical terms, since the users are mostly not trained medical professionals; this can lead to out-of-vocabulary (OOV) problems. Although the long short-term memory (LSTM) architecture is known for its strength in modelling sequential data such as text, even its bidirectional version (BiLSTM) has difficulty handling such long sentences. A transformer model could overcome these problems. Another problem is the scarcity of annotated data in low-resource-language OHC texts, despite the abundance of raw data crawled from OHC platforms. To augment the training data, our process includes a self-training approach, a form of semi-supervised learning, in the data preparation stage to improve the BioNER model. In preparing the BioNER model, this work compares two embedding strategies, stacked embeddings for the BiLSTM-based model and fine-tuning for the transformer-based model, and defines pseudo-label filtering to reduce noise from self-training. Although the empirical experiments used Indonesian OHC texts as a case of a low-resource language, owing to our familiarity with it, the procedures in this work apply to Latin-alphabet-based languages in general. We also examined other BioNER model variants and applied topic modelling to verify the entities extracted by the resulting BioNER model, validating the procedures. The results indicate that our framework, which prepares labelled data from raw texts using self-training with a confidence threshold of 0.85 to create the BioNER model, achieves F1 scores of 0.732 and 0.838 for the BiLSTM-based and transformer-based models, respectively.
AB - More people have become accustomed to online health consultations (OHC) since the COVID-19 pandemic, whether to reassure themselves about their health conditions or to seek other treatment options. An OHC system could use named entity recognition (NER), specifically biomedical NER (BioNER) for health-related texts, to extract entities from posting history and help users find information. The named entities (NEs) could be terms for parts of the human anatomy experiencing discomfort or terms describing disease symptoms. However, OHC posts, especially user questions, are often informal, sometimes long, sentences that may contain incorrect medical terms, since the users are mostly not trained medical professionals; this can lead to out-of-vocabulary (OOV) problems. Although the long short-term memory (LSTM) architecture is known for its strength in modelling sequential data such as text, even its bidirectional version (BiLSTM) has difficulty handling such long sentences. A transformer model could overcome these problems. Another problem is the scarcity of annotated data in low-resource-language OHC texts, despite the abundance of raw data crawled from OHC platforms. To augment the training data, our process includes a self-training approach, a form of semi-supervised learning, in the data preparation stage to improve the BioNER model. In preparing the BioNER model, this work compares two embedding strategies, stacked embeddings for the BiLSTM-based model and fine-tuning for the transformer-based model, and defines pseudo-label filtering to reduce noise from self-training. Although the empirical experiments used Indonesian OHC texts as a case of a low-resource language, owing to our familiarity with it, the procedures in this work apply to Latin-alphabet-based languages in general. We also examined other BioNER model variants and applied topic modelling to verify the entities extracted by the resulting BioNER model, validating the procedures. The results indicate that our framework, which prepares labelled data from raw texts using self-training with a confidence threshold of 0.85 to create the BioNER model, achieves F1 scores of 0.732 and 0.838 for the BiLSTM-based and transformer-based models, respectively.
KW - Low resource language
KW - Named entity recognition
KW - Online health consultation texts
KW - Semi-supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85177423493&partnerID=8YFLogxK
U2 - 10.22266/ijies2023.1231.18
DO - 10.22266/ijies2023.1231.18
M3 - Article
AN - SCOPUS:85177423493
SN - 2185-310X
VL - 16
SP - 213
EP - 224
JO - International Journal of Intelligent Engineering and Systems
JF - International Journal of Intelligent Engineering and Systems
IS - 6
ER -