TY - GEN
T1 - Transformer-CNN Automatic Hyperparameter Tuning for Speech Emotion Recognition
AU - Gumelar, Agustinus Bimo
AU - Yuniarno, Eko Mulyanto
AU - Adi, Derry Pramono
AU - Setiawan, Rudi
AU - Sugiarto, Indar
AU - Purnomo, Mauridhi Hery
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Given the high number of hyperparameters in deep learning models, there is a need to automatically tune such models for specific research cases. Deep learning models require hyperparameters because they substantially influence the model's behavior. As a result, optimizing any given model with a hyperparameter optimization technique will significantly improve model efficiency. This paper discusses a hyperparameter-optimized Speech Emotion Recognition (SER) research case using a Transformer-CNN deep learning model. Each speech sample from the RAVDESS dataset, which contains 1,536 speech samples (192 samples for each of eight emotion classes), is transformed into spectrogram data. We use the Gaussian Noise augmentation technique to reduce overfitting on the training data. After augmentation, the RAVDESS dataset yields a total of 2,400 emotional speech samples (300 samples for each of eight emotion classes). For the SER model, we combine the Transformer and CNN for temporal and spatial speech feature processing. However, our Transformer-CNN must be thoroughly tested, as different hyperparameter settings result in varying accuracy. We experiment with Naive Bayes to optimize many hyperparameters of the Transformer-CNN, whether categorical or numerical, such as learning rate, dropout, activation function, weight initialization, number of epochs, and even the train-test split ratio. Consequently, our automatically tuned Transformer-CNN achieves 97.3% accuracy.
AB - Given the high number of hyperparameters in deep learning models, there is a need to automatically tune such models for specific research cases. Deep learning models require hyperparameters because they substantially influence the model's behavior. As a result, optimizing any given model with a hyperparameter optimization technique will significantly improve model efficiency. This paper discusses a hyperparameter-optimized Speech Emotion Recognition (SER) research case using a Transformer-CNN deep learning model. Each speech sample from the RAVDESS dataset, which contains 1,536 speech samples (192 samples for each of eight emotion classes), is transformed into spectrogram data. We use the Gaussian Noise augmentation technique to reduce overfitting on the training data. After augmentation, the RAVDESS dataset yields a total of 2,400 emotional speech samples (300 samples for each of eight emotion classes). For the SER model, we combine the Transformer and CNN for temporal and spatial speech feature processing. However, our Transformer-CNN must be thoroughly tested, as different hyperparameter settings result in varying accuracy. We experiment with Naive Bayes to optimize many hyperparameters of the Transformer-CNN, whether categorical or numerical, such as learning rate, dropout, activation function, weight initialization, number of epochs, and even the train-test split ratio. Consequently, our automatically tuned Transformer-CNN achieves 97.3% accuracy.
KW - Automatic Hyperparameter Tuning
KW - Naive Bayes Optimization
KW - Speech Emotion Recognition
KW - Transformer-CNN
UR - http://www.scopus.com/inward/record.url?scp=85135913770&partnerID=8YFLogxK
U2 - 10.1109/IST55454.2022.9827732
DO - 10.1109/IST55454.2022.9827732
M3 - Conference contribution
AN - SCOPUS:85135913770
T3 - IST 2022 - IEEE International Conference on Imaging Systems and Techniques, Proceedings
BT - IST 2022 - IEEE International Conference on Imaging Systems and Techniques, Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 IEEE International Conference on Imaging Systems and Techniques, IST 2022
Y2 - 21 June 2022 through 23 June 2022
ER -