Next-generation LLM inference: scalable, multimodal, and composable

Bibliographic Details
Author: Antoñanzas Acero, Jesús Maria
Resource type: master's thesis
Publication date: 2025
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier: oai:upcommons.upc.edu:2117/448586
Online access: https://hdl.handle.net/2117/448586
Access Level: embargoed access
Keywords: Machine learning
Sistemes d'IA
Inferència amb LLMs
Intel·ligència artificial
Aprenentatge automàtic
Systems for AI
LLM inference
Artificial intelligence
Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Aprenentatge automàtic
Description
Summary: Global interest in LLM research and development has sparked a race in both algorithmic and systems development. The pre-training paradigm, brought forth by the scaling revolution, seems to be reaching the point of diminishing returns. As such, inference is becoming increasingly important, enabling the next generation of models: multimodal reasoners with massive context lengths. In this work, we present a production-grade LLM inference system created for the coming generation of AI applications. Designed and built from the ground up, it is uncompromisingly efficient and scalable, supporting extreme context lengths and dynamically allocating resources across hundreds of GPUs. Featuring elegant abstractions, our system is robust yet simple and advances the state of the art in inference systems with novel engineering solutions.