TY - JOUR
T1 - Two-stage dimensional emotion recognition by fusing predictions of acoustic and text networks using SVM
AU - Atmaja, Bagus Tris
AU - Akagi, Masato
N1 - Publisher Copyright: © 2020 Elsevier B.V.
PY - 2021/2
Y1 - 2021/2
N2 - Automatic speech emotion recognition (SER) by a computer is a critical component for more natural human-machine interaction. As in human-human interaction, the capability to perceive emotion correctly is essential to taking further steps in a particular situation. One issue in SER is whether it is necessary to combine acoustic features with other data such as facial expressions, text, and motion capture. This research proposes to combine acoustic and text information by applying a late-fusion approach consisting of two steps. First, deep learning systems are trained separately on acoustic and text features. Second, the prediction results from the deep learning systems are fed into a support vector machine (SVM) to predict the final regression score. Furthermore, the task in this research is dimensional emotion modeling, because it can enable deeper analysis of affective states. Experimental results show that this two-stage, late-fusion approach obtains higher performance than any one-stage processing, with a linear correlation from one-stage to two-stage processing. This late-fusion approach improves on previous early-fusion results, as measured by concordance correlation coefficient scores.
AB - Automatic speech emotion recognition (SER) by a computer is a critical component for more natural human-machine interaction. As in human-human interaction, the capability to perceive emotion correctly is essential to taking further steps in a particular situation. One issue in SER is whether it is necessary to combine acoustic features with other data such as facial expressions, text, and motion capture. This research proposes to combine acoustic and text information by applying a late-fusion approach consisting of two steps. First, deep learning systems are trained separately on acoustic and text features. Second, the prediction results from the deep learning systems are fed into a support vector machine (SVM) to predict the final regression score. Furthermore, the task in this research is dimensional emotion modeling, because it can enable deeper analysis of affective states. Experimental results show that this two-stage, late-fusion approach obtains higher performance than any one-stage processing, with a linear correlation from one-stage to two-stage processing. This late-fusion approach improves on previous early-fusion results, as measured by concordance correlation coefficient scores.
KW - Affective computing
KW - Automatic speech emotion recognition
KW - Bimodal fusion
KW - Dimensional emotion
KW - Late fusion
UR - http://www.scopus.com/inward/record.url?scp=85097345821&partnerID=8YFLogxK
U2 - 10.1016/j.specom.2020.11.003
DO - 10.1016/j.specom.2020.11.003
M3 - Article
AN - SCOPUS:85097345821
SN - 0167-6393
VL - 126
SP - 9
EP - 21
JO - Speech Communication
JF - Speech Communication
ER -