TY - JOUR
T1 - Improvement of Traditional Dance Classification Process Using Video Vision Transformer based on Tubelet Embedding
AU - Mulyanto, Edy
AU - Yuniarno, Eko Mulyanto
AU - Putra, Oddy Virgantara
AU - Hafidz, Isa
AU - Priyadi, Ardyono
AU - Purnomo, Mauridhi H.
N1 - Publisher Copyright:
© 2024, Intelligent Network and Systems Society. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Image processing has extensively addressed object detection, classification, clustering, and segmentation challenges. At the same time, the growth of complex video datasets has spurred various strategies for automatic video classification, particularly for recognizing traditional dances. This research proposes an improvement in classifying traditional dances by implementing a Video Vision Transformer (ViViT) based on tubelet embedding. The authors used IDEEH-10, a dataset of traditional dance videos, and applied the ViViT neural network model for video classification. The video representation is generated by projecting spatiotemporal tokens onto the transformer layers. An embedding strategy is then used to improve the classification accuracy of traditional dance videos. The proposed concept treats a video as a sequence of tubelets that are mapped into tubelet embeddings. Tubelet management adds a tubelet attention (TA) layer, a cross attention (CA) layer, and control of tubelet duration and scale. Test results show that the proposed approach classifies traditional dance videos better than the LSTM, GRU, and RNN methods, with or without data balancing. Experiments with 5 folds yielded losses between 0.003 and 0.011, with an average loss of 0.0058, and accuracies between 98.68 and 100 percent, with an average accuracy of 99.216 percent. This is the best result among the compared methods. ViViT with tubelet embedding achieves high accuracy with low loss, so it can be used for dance video classification.
AB - Image processing has extensively addressed object detection, classification, clustering, and segmentation challenges. At the same time, the growth of complex video datasets has spurred various strategies for automatic video classification, particularly for recognizing traditional dances. This research proposes an improvement in classifying traditional dances by implementing a Video Vision Transformer (ViViT) based on tubelet embedding. The authors used IDEEH-10, a dataset of traditional dance videos, and applied the ViViT neural network model for video classification. The video representation is generated by projecting spatiotemporal tokens onto the transformer layers. An embedding strategy is then used to improve the classification accuracy of traditional dance videos. The proposed concept treats a video as a sequence of tubelets that are mapped into tubelet embeddings. Tubelet management adds a tubelet attention (TA) layer, a cross attention (CA) layer, and control of tubelet duration and scale. Test results show that the proposed approach classifies traditional dance videos better than the LSTM, GRU, and RNN methods, with or without data balancing. Experiments with 5 folds yielded losses between 0.003 and 0.011, with an average loss of 0.0058, and accuracies between 98.68 and 100 percent, with an average accuracy of 99.216 percent. This is the best result among the compared methods. ViViT with tubelet embedding achieves high accuracy with low loss, so it can be used for dance video classification.
KW - Tubelet embedding
KW - Video classification
KW - Video vision transformer
UR - http://www.scopus.com/inward/record.url?scp=85199701184&partnerID=8YFLogxK
U2 - 10.22266/IJIES2024.0831.41
DO - 10.22266/IJIES2024.0831.41
M3 - Article
AN - SCOPUS:85199701184
SN - 2185-310X
VL - 17
SP - 530
EP - 545
JO - International Journal of Intelligent Engineering and Systems
JF - International Journal of Intelligent Engineering and Systems
IS - 4
ER -