Evaluating the impact of task aggregation in workflows with shared resources environments



Bibliographic details
Author: Giménez de Castro Marciani, Manuel (ORCID iD 0000-0002-9852-3322)
Resource type: Master's thesis
Publication date: 2023
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI identifier: oai:upcommons.upc.edu:2117/404041
Online access: https://hdl.handle.net/2117/404041
Access level: open access
Keywords: Workflow -- Management
Earth sciences
batch scheduling
high performance computing
earth sciences
workflows
shared resources environments
task aggregation
queuing systems
scientific computing
UPC subject areas::Computer science::Computer architecture
Description
Abstract: We study the relative impact of task aggregation, or wrapping, a technique for computational workflows that bundles several jobs into a single submission to a remote scheduler. Experiments in the Earth Science community can be lengthy and comprise several steps with many dependencies. The community has traditionally focused on improving the performance of the models, while the overall execution of the workflow, including the time spent in queue, has received little attention. To reduce queue time, the developers of Autosubmit, a workflow manager for climate, weather forecast, and air quality simulations, introduced task aggregation, or wrapping. Our objective is to assess whether this technique does indeed reduce the total queue time of a workflow.

The complex interplay between the dynamic usage of the machine and the scheduler policy is central to our analysis and constitutes the main challenge of this work. We therefore conduct a detailed study of the scheduling policy of the popular Slurm Workload Manager and a statistical characterization of the usage of the two simulated machines, LUMI and cea-curie. On this basis, we run two sets of experiments: a simulation with dynamic workloads, where job arrival times matter, using a workflow composed of multiple jobs; and a simulation with a static workload, where all jobs are submitted at the same time, in which we vary the job and user factors that influence scheduling.

Results show that aggregation is beneficial in most cases for vertically organized workflows, that is, chains of submissions in which each job depends on the previous one. For horizontally arranged workflows, where jobs have no dependencies, aggregation may instead worsen the queue time, depending on the user's past usage and the machine's current state.
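The contrast between vertical and horizontal workflows can be sketched with a toy model. The Python below is a hypothetical illustration, not Autosubmit's implementation or the thesis's simulator. It assumes that a vertical wrapper requests the widest job's nodes for the summed wallclock, that a horizontal wrapper requests the summed nodes for the longest wallclock, and that queue waits are supplied as inputs. All names (`Request`, `vertical_wrapper`, `horizontal_wrapper`, `chain_queue_time`, `parallel_queue_time`) are made up for this sketch, and the numbers in the usage section are illustrative, not results.

```python
# Hypothetical sketch, not Autosubmit's code: it only shows how wrapping
# changes (a) the resources requested and (b) how queue waits add up.
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    nodes: int        # nodes requested from the scheduler
    wallclock_s: int  # requested wallclock limit, in seconds


def vertical_wrapper(jobs: Sequence[Request]) -> Request:
    # Assumption: chained jobs run one after another inside the wrapper,
    # so it needs the widest job's nodes for the summed wallclock.
    return Request(max(j.nodes for j in jobs), sum(j.wallclock_s for j in jobs))


def horizontal_wrapper(jobs: Sequence[Request]) -> Request:
    # Assumption: independent jobs run side by side inside the wrapper,
    # so it needs the summed nodes for the longest wallclock.
    return Request(sum(j.nodes for j in jobs), max(j.wallclock_s for j in jobs))


def chain_queue_time(waits: Sequence[float]) -> float:
    # Unwrapped vertical chain: each job is submitted only after its
    # predecessor finishes, so every job pays its own queue wait.
    return sum(waits)


def parallel_queue_time(waits: Sequence[float]) -> float:
    # Unwrapped horizontal group submitted at once (assuming equal
    # runtimes): the group is delayed by the last job to start.
    return max(waits)


if __name__ == "__main__":
    jobs = [Request(nodes=2, wallclock_s=3600)] * 3
    print(vertical_wrapper(jobs))    # Request(nodes=2, wallclock_s=10800)
    print(horizontal_wrapper(jobs))  # Request(nodes=6, wallclock_s=3600)
    # Hypothetical waits in seconds, for illustration only.
    print(chain_queue_time([600, 900, 300]))   # 1800
    print(parallel_queue_time([60, 120, 90]))  # 120
```

Under these assumptions, a wrapper pays a single queue wait in place of one wait per job. Whether that single wait beats the sum of waits (vertical) or the largest wait (horizontal) depends on how the scheduler treats the longer or wider request, which is the question the thesis evaluates.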