Next-generation LLM inference: scalable, multimodal, and composable
| Author: | |
|---|---|
| Resource type: | master's thesis |
| Publication date: | 2025 |
| Country: | Spain |
| Institution: | Universitat Politècnica de Catalunya (UPC) |
| Repository: | UPCommons. Portal del coneixement obert de la UPC |
| Language: | English |
| OAI Identifier: | oai:upcommons.upc.edu:2117/448586 |
| Online access: | https://hdl.handle.net/2117/448586 |
| Access Level: | embargoed access |
| Keywords: | Machine learning; Sistemes d'IA; Inferència amb LLMs; Intel·ligència artificial; Aprenentatge automàtic; Systems for AI; LLM inference; Artificial intelligence; Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Aprenentatge automàtic |
| Summary: | Global interest in LLM research and development has sparked a race in both algorithmic and systems development. The pre-training paradigm, brought forth by the scaling revolution, seems to be reaching the point of diminishing returns. As such, inference is becoming increasingly important, enabling the next generation of models: multimodal reasoners with massive context lengths. In this work, we present a production-grade LLM inference system created for the coming generation of AI applications. Designed and built from the ground up, it is uncompromisingly efficient and scalable, supporting extreme context lengths and dynamically allocating resources across hundreds of GPUs. Featuring elegant abstractions, our system is robust yet simple, and it advances the state of the art in inference systems with novel engineering solutions. |