TY - GEN
T1 - Empowering the Independence of the Visually Impaired using Vision-Language Models
AU - Hidayati, Qory
AU - Kusuma, Hendra
AU - Attamimi, Muhammad
N1 - Publisher Copyright: ©2025 IEEE.
PY - 2025
Y1 - 2025
AB - Visually impaired individuals often face difficulties in accessing printed text due to limited braille materials and costly assistive technologies. To address these challenges, this study proposes a real-time assistive system based on a Vision-Language Model (VLM), specifically LLaMA 3.2-90B-Vision, that automatically extracts textual content from images and vocalizes it. The system integrates image description, Optical Character Recognition (OCR), and Text-to-Speech (TTS) components to convert visual information into speech output. Implemented in a high-performance environment with an Intel Core i5 processor and an NVIDIA GeForce RTX 2050, and using a Logitech C310 HD webcam for image capture, the system ensures fast and accurate processing. Evaluation results show a faithfulness score of 0.926, precision of 0.938, answer correctness of 0.870, and context recall of 0.914, confirming the system’s reliability under varied environmental conditions. Comparative evaluations against baseline systems such as Tesseract+TTS and BLIP-2 demonstrate the superiority of the proposed system in transcription accuracy and contextual understanding, particularly due to its closed-loop validation mechanism. While the system shows promising results in simulation, its performance in real-world deployment remains to be validated. Future work includes incorporating multilingual support, automatic language detection, and deployment on mobile platforms.
KW - Assistive Technology
KW - Optical Character Recognition (OCR)
KW - Text-to-Speech (TTS)
KW - Vision Language Model (VLM)
KW - Visually Impaired
KW - accessibility
UR - https://www.scopus.com/pages/publications/105032054861
U2 - 10.1109/ICITDA68167.2025.11332350
DO - 10.1109/ICITDA68167.2025.11332350
M3 - Conference contribution
AN - SCOPUS:105032054861
T3 - Proceedings of ICITDA 2025 - 10th International Conference on Information Technology and Digital Applications
BT - Proceedings of ICITDA 2025 - 10th International Conference on Information Technology and Digital Applications
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2025 10th International Conference on Information Technology and Digital Applications, ICITDA 2025
Y2 - 6 November 2025 through 7 November 2025
ER -