TY - GEN
T1 - Combination of DenseNet and BiLSTM Model for Indonesian Image Captioning
AU - Navastara, Dini Adni
AU - Ansori, Dwinanda Bagoes
AU - Suciati, Nanik
AU - Akbar, Zulfiqar Fauzul
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Humans can capture images of the surrounding environment using a camera, but the camera alone cannot turn those images into representative information. Feature extraction is performed on the image to identify the objects it contains, and these objects can be turned into information through image captioning. A model trained with machine learning is required to transform a collection of objects into informative words. Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM) are models that can retain relevant information stored over long sequences while discarding irrelevant information. The datasets used are Flickr30k and an original dataset collected at several sidewalk points in Surabaya. Training on these datasets produces an image captioning model, which is evaluated using the BLEU score to measure the degree of correspondence between the model's captions and the reference captions. The results show that the best model is trained in Indonesian, uses DenseNet-201 for feature extraction (encoder), and uses a decoder consisting of one LSTM layer and two BiLSTM layers with attention, tanh activation, and the Adam optimizer, achieving BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 0.518, 0.320, 0.165, and 0.080, respectively.
AB - Humans can capture images of the surrounding environment using a camera, but the camera alone cannot turn those images into representative information. Feature extraction is performed on the image to identify the objects it contains, and these objects can be turned into information through image captioning. A model trained with machine learning is required to transform a collection of objects into informative words. Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM) are models that can retain relevant information stored over long sequences while discarding irrelevant information. The datasets used are Flickr30k and an original dataset collected at several sidewalk points in Surabaya. Training on these datasets produces an image captioning model, which is evaluated using the BLEU score to measure the degree of correspondence between the model's captions and the reference captions. The results show that the best model is trained in Indonesian, uses DenseNet-201 for feature extraction (encoder), and uses a decoder consisting of one LSTM layer and two BiLSTM layers with attention, tanh activation, and the Adam optimizer, achieving BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 0.518, 0.320, 0.165, and 0.080, respectively.
KW - BLEU
KW - BiLSTM
KW - DenseNet
KW - Image Captioning
UR - http://www.scopus.com/inward/record.url?scp=85186497817&partnerID=8YFLogxK
U2 - 10.1109/ICAMIMIA60881.2023.10427729
DO - 10.1109/ICAMIMIA60881.2023.10427729
M3 - Conference contribution
AN - SCOPUS:85186497817
T3 - 2023 International Conference on Advanced Mechatronics, Intelligent Manufacture and Industrial Automation, ICAMIMIA 2023 - Proceedings
SP - 994
EP - 999
BT - 2023 International Conference on Advanced Mechatronics, Intelligent Manufacture and Industrial Automation, ICAMIMIA 2023 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 International Conference on Advanced Mechatronics, Intelligent Manufacture and Industrial Automation, ICAMIMIA 2023
Y2 - 14 November 2023 through 15 November 2023
ER -