Abstract
The increasing demand for scalable and responsive Large Language Model (LLM) applications has intensified the need for distributed inference systems capable of handling high concurrency and heterogeneous GPU resources. This paper introduces DiLLeMa, an extensible framework for distributed LLM deployment on multi-GPU clusters, designed to improve inference efficiency through workload parallelization and adaptive resource management. Built upon the Ray distributed computing framework, DiLLeMa orchestrates LLM inference across multiple nodes while maintaining balanced GPU utilization and low-latency responses. The system integrates a FastAPI-based backend for coordination and API management, a React-based frontend for interactive access, and a vLLM inference engine optimized for high-throughput execution. Complementary modules for data preprocessing, semantic embedding, and vector-based retrieval further enhance contextual relevance during response generation. Illustrative examples demonstrate that DiLLeMa effectively reduces inference latency and scales efficiently.
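The abstract describes a request flow in which prompts are balanced across GPU workers and a vector-retrieval step supplies context before generation. The stdlib-only sketch below mimics that flow with stand-in functions; every name, the round-robin policy, and the toy document store are illustrative assumptions, not DiLLeMa's actual API (the real system schedules with Ray and generates with vLLM).

```python
# Hypothetical sketch of a DiLLeMa-style pipeline: retrieve context via
# cosine similarity, then dispatch prompts round-robin across workers.
# All identifiers here are illustrative stand-ins, not the paper's code.
import math
from concurrent.futures import ThreadPoolExecutor

# Toy vector store standing in for the semantic-embedding/retrieval module.
DOCS = {
    "doc1": [1.0, 0.0, 0.0],
    "doc2": [0.0, 1.0, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, k=1):
    """Return the k most similar document ids (vector-store stand-in)."""
    ranked = sorted(DOCS, key=lambda d: cosine(query_vec, DOCS[d]), reverse=True)
    return ranked[:k]

def infer_on_gpu(gpu_id, prompt, context):
    """Stand-in for an inference call pinned to one GPU worker."""
    return f"gpu{gpu_id}: [{','.join(context)}] {prompt}"

def serve(requests, num_gpus=2):
    """Round-robin prompts across workers, retrieving context first."""
    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        futures = [
            pool.submit(infer_on_gpu, i % num_gpus, prompt, retrieve(vec))
            for i, (prompt, vec) in enumerate(requests)
        ]
        return [f.result() for f in futures]

results = serve([("Q1", [0.9, 0.1, 0.0]), ("Q2", [0.1, 0.9, 0.0])])
# Q1 lands on gpu0 with doc1 as context; Q2 on gpu1 with doc2.
```

In the actual system, `infer_on_gpu` would correspond to a vLLM generation call and the round-robin loop to Ray's placement of tasks across cluster nodes; the sketch only shows the shape of the coordination.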
| Original language | English |
|---|---|
| Article number | 102537 |
| Journal | SoftwareX |
| Volume | 33 |
| DOIs | |
| Publication status | Published - Feb 2026 |
Keywords
- Chatbot systems
- Distributed computing
- GPU clusters
- Large language models (LLMs)
Press/Media
Researchers from Institut Teknologi Sepuluh Nopember Report on Findings in Computer Software [DiLLeMa: An extensible and scalable framework for distributed large language models (LLMs) inference on multi-GPU clusters]
Shiddiqi, A. M., Navastara, D. A., Akbar, R. J. & Ijtihadie, R. M.
16/02/26
1 item of Media coverage