TY - GEN
T1 - Emotion Recognition from Video Frame Sequence using Face Mesh and Pre-Trained Models of Convolutional Neural Network
AU - Adi, Derry Pramono
AU - Yuniarno, Eko Mulyanto
AU - Wulandari, Diah Puspito
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Emotions are subjective cognitive experiences with psychological and physiological components that express a wide range of feelings, thoughts, and behaviors in human interaction. They can be conveyed through several channels, such as facial expressions, tone of voice, and behavior. Deep Learning (DL) research on emotion recognition has focused largely on facial expressions, with facial-expression images commonly used as input to DL models. Unfortunately, most DL models for Facial Emotion Recognition (FER) operate on static images, which cannot capture how an expression unfolds over time; a static image of a facial expression is insufficient for recognizing emotion, and a sequence of images from a video is required instead. In this study, we extract MediaPipe's face mesh feature, a state-of-the-art set of multidimensional expression key points, from each image in the video frame sequence. We then feed the image sequence data into pre-trained Convolutional Neural Network (CNN) models. The data come from The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), restricted to the emotion classes 'Anger,' 'Fearful,' 'Happy,' and 'Sad.' For this specific FER task, the most accurate pre-trained CNN model was VGG-19, at 92.8% accuracy, and the fastest was SqueezeNet, with a runtime of ∼2.3 seconds.
AB - Emotions are subjective cognitive experiences with psychological and physiological components that express a wide range of feelings, thoughts, and behaviors in human interaction. They can be conveyed through several channels, such as facial expressions, tone of voice, and behavior. Deep Learning (DL) research on emotion recognition has focused largely on facial expressions, with facial-expression images commonly used as input to DL models. Unfortunately, most DL models for Facial Emotion Recognition (FER) operate on static images, which cannot capture how an expression unfolds over time; a static image of a facial expression is insufficient for recognizing emotion, and a sequence of images from a video is required instead. In this study, we extract MediaPipe's face mesh feature, a state-of-the-art set of multidimensional expression key points, from each image in the video frame sequence. We then feed the image sequence data into pre-trained Convolutional Neural Network (CNN) models. The data come from The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), restricted to the emotion classes 'Anger,' 'Fearful,' 'Happy,' and 'Sad.' For this specific FER task, the most accurate pre-trained CNN model was VGG-19, at 92.8% accuracy, and the fastest was SqueezeNet, with a runtime of ∼2.3 seconds.
KW - Convolutional Neural Network
KW - Face Mesh
KW - Facial Emotion Recognition
KW - Video Frame Sequence
UR - http://www.scopus.com/inward/record.url?scp=85171190757&partnerID=8YFLogxK
U2 - 10.1109/ISITIA59021.2023.10221117
DO - 10.1109/ISITIA59021.2023.10221117
M3 - Conference contribution
AN - SCOPUS:85171190757
T3 - 2023 International Seminar on Intelligent Technology and Its Applications: Leveraging Intelligent Systems to Achieve Sustainable Development Goals, ISITIA 2023 - Proceeding
SP - 353
EP - 358
BT - 2023 International Seminar on Intelligent Technology and Its Applications
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 24th International Seminar on Intelligent Technology and Its Applications, ISITIA 2023
Y2 - 26 July 2023 through 27 July 2023
ER -
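
Editor's note: below is a minimal sketch of the feature-extraction step the abstract describes (MediaPipe face mesh key points per video frame). It uses the MediaPipe legacy Python "solutions" API and OpenCV for frame decoding; it is an illustration under stated assumptions, not the authors' implementation. The function name, the video path, and the frame budget are hypothetical.

# Sketch (not from the paper): extract MediaPipe face mesh landmarks
# from a video frame sequence. Assumes the legacy mediapipe.solutions
# API; paths and frame counts are illustrative.
import cv2
import mediapipe as mp
import numpy as np

mp_face_mesh = mp.solutions.face_mesh

def extract_face_mesh_sequence(video_path, max_frames=30):
    """Return an array of shape (n_frames, 468, 3) of face mesh key points."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp_face_mesh.FaceMesh(static_image_mode=False,
                               max_num_faces=1) as face_mesh:
        while cap.isOpened() and len(frames) < max_frames:
            ok, frame_bgr = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV decodes to BGR.
            results = face_mesh.process(
                cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
            if results.multi_face_landmarks:
                landmarks = results.multi_face_landmarks[0].landmark
                frames.append([(lm.x, lm.y, lm.z) for lm in landmarks])
    cap.release()
    return np.asarray(frames, dtype=np.float32)

# Usage with a hypothetical RAVDESS clip path:
# seq = extract_face_mesh_sequence("ravdess/Actor_01/clip.mp4")
# seq.shape  # e.g. (30, 468, 3): frames x landmarks x (x, y, z)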