DiLLeMa: An extensible and scalable framework for distributed large language models (LLMs) inference on multi-GPU clusters

  • Institut Teknologi Sepuluh Nopember

Research output: Contribution to journal › Article › peer-review

Abstract

The increasing demand for scalable and responsive Large Language Model (LLM) applications has accelerated the need for distributed inference systems capable of handling high concurrency and heterogeneous GPU resources. This paper introduces DiLLeMa, an extensible framework for distributed LLM deployment on multi-GPU clusters, designed to improve inference efficiency through workload parallelization and adaptive resource management. Built upon the Ray distributed computing framework, DiLLeMa orchestrates LLM inference across multiple nodes while maintaining balanced GPU utilization and low response latency. The system integrates a FastAPI-based backend for coordination and API management, a React-based frontend for interactive access, and a vLLM inference engine optimized for high-throughput execution. Complementary modules for data preprocessing, semantic embedding, and vector-based retrieval further enhance contextual relevance during response generation. Illustrative examples demonstrate that DiLLeMa effectively reduces inference latency and scales efficiently.
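
As a minimal sketch of the serving pattern the abstract outlines, the hypothetical Python snippet below deploys a FastAPI ingress on Ray Serve and delegates generation to a vLLM engine. The model name, replica count, route, and sampling settings are illustrative assumptions, not DiLLeMa's actual configuration.

    # Hypothetical sketch: FastAPI ingress on Ray Serve, backed by vLLM.
    # Model, replica count, and sampling settings are assumptions for
    # illustration and are not taken from the DiLLeMa paper.
    from fastapi import FastAPI
    from ray import serve
    from vllm import LLM, SamplingParams

    app = FastAPI()

    @serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
    @serve.ingress(app)
    class LLMServer:
        def __init__(self):
            # Each replica holds its own vLLM engine; Ray Serve places
            # replicas on GPUs across the cluster and balances requests.
            self.engine = LLM(model="facebook/opt-125m")  # assumed model

        @app.post("/generate")
        def generate(self, prompt: str) -> dict:
            params = SamplingParams(temperature=0.7, max_tokens=256)
            result = self.engine.generate([prompt], params)[0]
            return {"text": result.outputs[0].text}

    # Deploy across the Ray cluster; requests to /generate are spread
    # over the replicas.
    serve.run(LLMServer.bind(), route_prefix="/")

In this division of labor, Ray Serve handles cross-node scheduling and replica placement while vLLM performs continuous batching within each replica, which mirrors the coordination-versus-execution split the abstract describes.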

Original language: English
Article number: 102537
Journal: SoftwareX
Volume: 33
DOIs:
Publication status: Published - Feb 2026

Keywords

  • Chatbot systems
  • Distributed computing
  • GPU clusters
  • Large language models (LLMs)
