Preserving Sasak Dialectal Features in English to Sasak Machine Translation through Locked Tokenization with Transformer Models

Arik Aranta, Arif Djunaidy*, Nanik Suciati

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This paper presents a translation engine for Sasak, a low-resource language with several dialects, including Kuto-Kete, Ngento-Ngente, Meno-Mene, Ngeno-Ngene, and Meriak-Meriku. Existing machine translation systems fail to preserve Sasak dialects, producing output that lacks fluency. Preserving the dialects is challenging because their diversity requires a dataset that covers many dialectal variants. Sasak was chosen as the study language because it shows significant dialect variation on a relatively small island and can serve as an example for similar issues in Indonesia. Transformer and sequence-to-sequence models are appropriate and widely used for machine translation, but on their own they can produce inconsistent dialects in the output. A method is therefore needed to maintain dialect consistency during translation. This study builds a Transformer model for translating English into Sasak and adds a lock tokenization method intended to preserve the dialect characteristics of the regional language in the output. The work includes collecting and creating a dataset that reflects the variation among Sasak dialects, and developing an algorithm that recognizes and maintains the linguistic features unique to each dialect. The study recorded a total of 105,327 English-Indonesian word pairs and achieved an average validation accuracy (val-accuracy) of 0.8562 for English-Indonesian translation and 0.8408 for Sasak translation. These findings show that lock tokenization can improve translation accuracy and contextual relevance, a significant contribution to the development of machine translation systems capable of handling languages with multiple dialects.
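The abstract does not describe how lock tokenization is implemented. One possible reading is that dialect-bearing words are kept as single, unsplittable tokens, while all other words go through ordinary subword splitting. The Python sketch below illustrates that reading only and should not be taken as the authors' algorithm. The names `LOCKED_WORDS`, `fallback_split`, and `locked_tokenize` are hypothetical. The locked strings are placeholders borrowed from the dialect names above, not a verified Sasak lexicon. The fixed-size splitter merely stands in for a trained subword tokenizer.

```python
from typing import List, Set

# Hypothetical locked vocabulary: placeholder strings borrowed from the dialect
# names in the abstract, not a verified Sasak lexicon.
LOCKED_WORDS: Set[str] = {"ngeno", "ngene", "meno", "mene"}


def fallback_split(word: str, size: int = 3) -> List[str]:
    """Stand-in for a trained subword tokenizer: cut a word into fixed-size pieces."""
    pieces = [word[i:i + size] for i in range(0, len(word), size)]
    # Prefix continuation pieces with "##", as WordPiece-style vocabularies do.
    return [p if i == 0 else "##" + p for i, p in enumerate(pieces)]


def locked_tokenize(sentence: str, locked: Set[str] = LOCKED_WORDS) -> List[str]:
    """Lowercase and split on whitespace, keeping locked words whole."""
    tokens: List[str] = []
    for word in sentence.lower().split():
        if word in locked:
            tokens.append(word)  # locked: emitted as one token, never split
        else:
            tokens.extend(fallback_split(word))  # everything else is split
    return tokens


if __name__ == "__main__":
    print(fallback_split("ngeno"))           # the same word without locking
    print(locked_tokenize("ngeno example"))  # locked word stays intact
```

Under this assumption, a locked word maps to a single vocabulary entry, so the decoder cannot assemble it from pieces shared with another dialect's forms. Whether the paper's method works this way is not stated in the abstract.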

Original language: English
Title of host publication: 2024 International Seminar on Intelligent Technology and Its Applications
Subtitle of host publication: Collaborative Innovation: A Bridging from Academia to Industry towards Sustainable Strategic Partnership, ISITIA 2024 - Proceeding
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 19-24
Number of pages: 6
Edition: 2024
ISBN (Electronic): 9798350378573
Publication status: Published - 2024
Event: 25th International Seminar on Intelligent Technology and Its Applications, ISITIA 2024 - Hybrid, Mataram, Indonesia
Duration: 10 Jul 2024 → 12 Jul 2024

Conference

Conference: 25th International Seminar on Intelligent Technology and Its Applications, ISITIA 2024
Country/Territory: Indonesia
City: Hybrid, Mataram
Period: 10/07/24 → 12/07/24

Keywords

  • linguistic diversity
  • low-resource language
  • semantic adaptation
  • sequence-to-sequence model
  • transformer model
  • translation engine
