
Empowering the Independence of the Visually Impaired using Vision-Language Models

  • Institut Teknologi Sepuluh Nopember

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Visually impaired individuals often have difficulty accessing printed text because braille materials are limited and assistive technologies are costly. To address these challenges, this study proposes a real-time assistive system based on a Vision-Language Model (VLM), specifically LLaMA 3.2-90B-Vision, that automatically extracts textual content from images and reads it aloud. The system integrates image description, Optical Character Recognition (OCR), and Text-to-Speech (TTS) components to convert visual information into speech output. Implemented in a high-performance environment with an Intel Core i5 processor and an NVIDIA GeForce RTX 2050 GPU, and using a Logitech C310 HD webcam for image capture, the system provides fast and accurate processing. Evaluation results show a faithfulness score of 0.926, precision of 0.938, answer correctness of 0.870, and context recall of 0.914, supporting the system's reliability under varied environmental conditions. Comparative evaluations against baseline systems such as Tesseract+TTS and BLIP-2 show that the proposed system achieves higher transcription accuracy and better contextual understanding, owing in particular to its closed-loop validation mechanism. Although the system shows promising results in simulation, its performance in real-world deployment remains to be validated. Future work includes multilingual support, automatic language detection, and deployment on mobile platforms.
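The flow described in the abstract (capture an image, extract its text with a VLM, check the result, then speak it) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say how the closed-loop validation works, so the sketch assumes a simple retry loop in which a rejected draft is passed back to the model. The names `read_aloud`, `ReadingResult`, `transcribe`, `validate`, and `speak` are all hypothetical stand-ins for the VLM, the checking pass, and the TTS engine.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReadingResult:
    text: str        # last transcription produced
    attempts: int    # number of transcription passes used
    validated: bool  # True if the validator accepted the text


def read_aloud(
    image: bytes,
    transcribe: Callable[[bytes, Optional[str]], str],
    validate: Callable[[bytes, str], bool],
    speak: Callable[[str], None],
    max_attempts: int = 3,
) -> ReadingResult:
    """Transcribe an image, re-check the draft, and speak it once accepted.

    Assumption: "closed-loop validation" is modelled here as a retry loop.
    The three callables are hypothetical placeholders, not a real API.
    """
    previous: Optional[str] = None
    text = ""
    for attempt in range(1, max_attempts + 1):
        # Hand the rejected draft back so the model can try to correct it.
        text = transcribe(image, previous).strip()
        if text and validate(image, text):
            speak(text)
            return ReadingResult(text, attempt, True)
        previous = text
    # No draft passed validation, so nothing is spoken.
    return ReadingResult(text, max_attempts, False)
```

Passing the components in as callables keeps the sketch independent of any particular model or speech library, and it lets the loop be exercised with stub functions before a camera or GPU is involved.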

Original language: English
Title of host publication: Proceedings of ICITDA 2025 - 10th International Conference on Information Technology and Digital Application
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798331594039
DOIs
Publication status: Published - 2025
Event: 2025 10th International Conference on Information Technology and Digital Applications, ICITDA 2025 - Yogyakarta, Indonesia
Duration: 6 Nov 2025 - 7 Nov 2025

Publication series

Name: Proceedings of ICITDA 2025 - 10th International Conference on Information Technology and Digital Application

Conference

Conference: 2025 10th International Conference on Information Technology and Digital Applications, ICITDA 2025
Country/Territory: Indonesia
City: Yogyakarta
Period: 6/11/25 - 7/11/25

Keywords

  • Assistive Technology
  • Optical Character Recognition (OCR)
  • Text-to-Speech (TTS)
  • Vision Language Model (VLM)
  • Visually Impaired
  • accessibility
