TY - GEN
T1 - Early Detection of Long COVID Symptoms from Social Media Using BERT
AU - Hermawan, Alfado Rafly
AU - Hafidz, Irmasari
AU - Rangkuti, Rahmah Yasinta
AU - Latiffianti, Effi
AU - Rakhmawati, Nur Aini
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - The wide variability of Long COVID symptoms often hampers effective disease management. These symptoms can persist long after infection and often go unrecorded in medical records, particularly in mild cases that do not require hospitalization. Social media texts offer a diverse source of information on emerging health conditions, and Natural Language Processing (NLP) techniques are essential for leveraging these data to improve our understanding of Long COVID. This study introduces an NLP approach that uses BERT-based models to detect Long COVID symptoms in social media posts. Data were collected from the social media platform Twitter using the keyword #LongCovid, followed by multi-stage preprocessing that included text cleaning and lexicon-based text filtering. Three models (BERT, BioBERT, and Bio+Clinical BERT) were fine-tuned and evaluated based on their F1 scores. The experimental results demonstrate that the general BERT model outperformed the domain-specific BioBERT and Bio+Clinical BERT models in multi-label text classification, achieving the highest F1 score of 89.90%. This finding highlights BERT's suitability for processing informal language on social media platforms and suggests that general-purpose language models may be more effective for health surveillance on these platforms than models pretrained on medical data. Our study contributes to academic understanding by demonstrating effective identification of symptoms, particularly Long COVID symptoms, from social media data for health monitoring.
KW - BERT
KW - Long COVID
KW - SDG 3
KW - Symptoms Detection
KW - Text Classification
KW - Twitter
UR - https://www.scopus.com/pages/publications/85217239511
U2 - 10.1109/DASA63652.2024.10836286
DO - 10.1109/DASA63652.2024.10836286
M3 - Conference contribution
AN - SCOPUS:85217239511
T3 - 2024 International Conference on Decision Aid Sciences and Applications, DASA 2024
BT - 2024 International Conference on Decision Aid Sciences and Applications, DASA 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 International Conference on Decision Aid Sciences and Applications, DASA 2024
Y2 - 11 December 2024 through 12 December 2024
ER -