Auto-scaling a video-conference platform with Reinforcement learning


Bibliographic Details
Author: Roy Campderrós, Francesc
Resource type: master's thesis
Publication date: 2021
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier: oai:upcommons.upc.edu:2117/360547
Online access: https://hdl.handle.net/2117/360547
Access level: open access
Keywords: Reinforcement learning
Teleconferences
Auto-scaling
Video-conference
Aprenentatge per reforç
Teleconferències
Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Aprenentatge automàtic
Description
Summary: Video-conferencing platforms, like other distributed services, are expected to scale horizontally. Workload is not constant in many applications, so fixing the number of servers beforehand will probably result in either poor quality of service when load is too high or wasted resources when load is too low. From the service provider's point of view, both situations are undesirable. On the one hand, providers may be penalised for not delivering sufficient quality of service to their users. On the other hand, underused servers are inefficient, as more running servers imply higher electricity or renting costs. This auto-scaling capability is therefore crucial to optimising monthly expenses. In this work we develop an auto-scaling algorithm based on reinforcement learning (RL) that adjusts the computing capacity of a distributed video-conference platform such as Jitsi, and we compare it with simple threshold-based methods (TBM), which many cloud providers offer as their default auto-scaling service. We perform this comparison under different synthetic workload patterns. Since video-conferencing platforms consume a lot of computing resources and we want to analyse different high loads, the comparison is done with simulations. We show that RL performs better than TBM in terms of money spent in all the scenarios evaluated, and that the gap between them widens as the workload pattern becomes more complex.