Predicting valence and arousal by aggregating acoustic features for acoustic-linguistic information fusion

Bagus Tris Atmaja, Yasuhiro Hamada, Masato Akagi

Research output: Conference contribution (chapter in book/report/conference proceeding), peer-reviewed

5 Citations (Scopus)


This paper presents an evaluation of acoustic feature aggregation and acoustic-linguistic feature combination for predicting valence and arousal from speech. First, chunk-level acoustic features were aggregated into story-level features. We evaluated mean and maximum aggregation for these acoustic features and compared the results with the baseline, which used majority-voting aggregation. Second, the aggregated acoustic features were combined with linguistic features to predict valence and arousal categories: low, medium, or high. The unimodal result using acoustic feature aggregation improved over the majority-voting baseline on the development partition for the same acoustic feature set. The bimodal results (combining acoustic and linguistic information at the feature level) improved both development and test scores over the official baseline. This combination of acoustic-linguistic information targets speech-based applications, where both acoustic and linguistic features can be extracted from the speech modality alone.
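The two steps the abstract describes (pooling chunk-level acoustic features into a story-level vector, then concatenating them with linguistic features for early fusion) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names and array shapes are assumptions.

```python
import numpy as np

def aggregate_chunks(chunk_features, method="mean"):
    """Pool per-chunk acoustic features into one story-level vector.

    chunk_features: (n_chunks, n_features) array-like, one row per chunk.
    method: "mean" or "max", the two aggregation schemes compared in the paper
    (the baseline instead aggregated chunk-level *predictions* by majority vote).
    """
    feats = np.asarray(chunk_features, dtype=float)
    if method == "mean":
        return feats.mean(axis=0)
    if method == "max":
        return feats.max(axis=0)
    raise ValueError(f"unknown aggregation method: {method}")

def fuse_features(acoustic_vec, linguistic_vec):
    """Feature-level (early) fusion: concatenate the story-level acoustic
    vector with the linguistic feature vector before classification."""
    return np.concatenate([acoustic_vec, linguistic_vec])
```

A downstream classifier (e.g. predicting low/medium/high valence or arousal) would then be trained on the fused vectors.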

Original language: English
Title of host publication: 2020 IEEE Region 10 Conference, TENCON 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 5
ISBN (Electronic): 9781728184555
Publication status: Published - 16 Nov 2020
Event: 2020 IEEE Region 10 Conference, TENCON 2020 - Virtual, Osaka, Japan
Duration: 16 Nov 2020 - 19 Nov 2020

Publication series

Name: IEEE Region 10 Annual International Conference, Proceedings/TENCON
ISSN (Print): 2159-3442
ISSN (Electronic): 2159-3450


Conference: 2020 IEEE Region 10 Conference, TENCON 2020
City: Virtual, Osaka


  • Affective computing
  • Arousal
  • Feature aggregation
  • Feature fusion
  • Valence


