Improvement of Tradition Dance Classification Process Using Video Vision Transformer based on Tubelet Embedding

Edy Mulyanto, Eko Mulyanto Yuniarno, Oddy Virgantara Putra, Isa Hafidz, Ardyono Priyadi, Mauridhi H. Purnomo*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Image processing has extensively addressed object detection, classification, clustering, and segmentation challenges. At the same time, the use of computers with complex video datasets has spurred various strategies for classifying videos automatically, particularly for recognizing traditional dances. This research proposes an advancement in classifying traditional dances by implementing a Video Vision Transformer (ViViT) that relies on tubelet embedding. The authors used IDEEH-10, a dataset of videos showcasing traditional dances, and the ViViT artificial neural network model for video classification. The video representation is generated by projecting spatiotemporal tokens onto the transformer layer. An embedding strategy is then used to improve the classification accuracy for traditional dance videos. The proposed concept treats a video as a sequence of tubelets that are mapped into tubelet embeddings. Tubelet handling adds a tubelet attention (TA) layer, a cross-attention (CA) layer, and management of tubelet duration and scale. The test results show that the proposed approach classifies traditional dance videos better than the LSTM, GRU, and RNN methods, with or without data balancing. Experiments with 5 folds showed a loss between 0.003 and 0.011, with an average loss of 0.0058. The experiments also produced an accuracy between 98.68 and 100 percent, for an average accuracy of 99.216 percent. This result is the best among the compared methods. ViViT with tubelet embedding achieves good accuracy with low loss, so it can be used for dance video classification.
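As a rough illustration of the tubelet embedding idea described in the abstract, the following minimal NumPy sketch cuts a video into non-overlapping spatiotemporal blocks (tubelets), flattens each block, and projects it linearly into a token vector. It is not the authors' implementation: the function name `tubelet_embed`, the clip shape, the tubelet size, and the embedding dimension are all assumptions chosen for the example, and the sketch leaves out positional embeddings as well as the TA and CA layers.

```python
import numpy as np


def tubelet_embed(video, tubelet_size, weight, bias):
    """Map a video to a sequence of tubelet tokens (illustrative sketch).

    video:        array of shape (T, H, W, C)
    tubelet_size: (t, h, w), the extent of one tubelet in frames and pixels
    weight:       projection matrix of shape (t * h * w * C, D)
    bias:         vector of shape (D,)
    Returns an array of shape (num_tubelets, D).
    """
    T, H, W, C = video.shape
    t, h, w = tubelet_size
    if T % t or H % h or W % w:
        raise ValueError("video dimensions must be divisible by the tubelet size")

    nt, nh, nw = T // t, H // h, W // w
    # Split every axis into (number of tubelets, tubelet extent).
    x = video.reshape(nt, t, nh, h, nw, w, C)
    # Move the tubelet indices to the front: (nt, nh, nw, t, h, w, C).
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten each tubelet into one vector.
    flat = x.reshape(nt * nh * nw, t * h * w * C)
    # A linear projection turns each flattened tubelet into a token.
    return flat @ weight + bias


# Hypothetical sizes, chosen only for this example.
rng = np.random.default_rng(0)
video = rng.random((16, 64, 64, 3), dtype=np.float32)  # 16 RGB frames of 64x64
tubelet_size = (2, 16, 16)
embed_dim = 128
in_dim = 2 * 16 * 16 * 3

weight = (0.02 * rng.standard_normal((in_dim, embed_dim))).astype(np.float32)
bias = np.zeros(embed_dim, dtype=np.float32)

tokens = tubelet_embed(video, tubelet_size, weight, bias)
print(tokens.shape)
```

With these assumed sizes the clip yields 8 x 4 x 4 = 128 tubelets, so the transformer would receive a sequence of 128 tokens. In a trained model the projection weights are learned, and the same operation is often written as a 3D convolution whose kernel size and stride both equal the tubelet size.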

Original language: English
Pages (from-to): 530-545
Number of pages: 16
Journal: International Journal of Intelligent Engineering and Systems
Volume: 17
Issue number: 4
Publication status: Published - 2024

Keywords

  • Tubelet embedding
  • Video classification
  • Video vision transformer
