Development of under-resourced Bahasa Indonesia speech corpus

Elok Cahyaningtyas, Dhany Arifianto

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

6 Citations (Scopus)

Abstract

Although Bahasa Indonesia is used by about 263 million people worldwide, it is classified as an under-resourced language. In this paper we outline the development of a speech corpus of casual Bahasa Indonesia sentences, which contains a speech database and its transcription. First, we selected casual Bahasa Indonesia sentences from movie and drama transcripts and formed 1029 declarative sentences and 500 question sentences. To avoid local dialect, we hired six professional radio news readers to utter the sentences in a sound-proof booth. Segmentation and labeling were then performed to create a transcription that includes the time label of each individual phoneme. To ensure the quality of the database, we manually inspected the waveform and the frequency content of the individual sentences using spectrograms. The results suggest that the speech corpus may be used for speech processing projects such as speech recognition and speech synthesis. In ongoing research, we are developing high-quality speech synthesis, namely speaker adaptation and speaker averaging.

Original language: English
Title of host publication: Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1097-1101
Number of pages: 5
ISBN (Electronic): 9781538615423
Publication status: Published - 2 Jul 2017
Event: 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017 - Kuala Lumpur, Malaysia
Duration: 12 Dec 2017 to 15 Dec 2017

Publication series

Name: Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017
Volume: 2018-February

Conference

Conference: 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017
Country/Territory: Malaysia
City: Kuala Lumpur
Period: 12/12/17 to 15/12/17

Keywords

  • Bahasa Indonesia
  • labeling
  • segmentation
  • speech corpus
  • under-resourced language
