A new methodology for datascience automation in javaKlass: PTDD


Bibliographic Details
Author: Varela Agrelo, Jordi
Resource type: master's thesis
Publication date: 2023
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: Spanish
OAI Identifier: oai:upcommons.upc.edu:2117/400523
Online access: https://hdl.handle.net/2117/400523
Access level: open access
Keywords: Computer software
Java (Computer program language)
Data sets
Programari
Java (Llenguatge de programació)
Conjunts de dades
Àrees temàtiques de la UPC::Informàtica::Sistemes d'informació
Description
Summary: Data science research is a multidisciplinary activity in which people with different backgrounds and skills (mathematicians, physicists, computer scientists, etc.) often work together to design and build software that implements research results. Long-term projects face stability, maintainability, scalability, and reproducibility challenges when a large number of developers are involved, when very old code coexists with new code, and when the development team faces high volatility due to the inherent nature of research teams, often related to financial issues. JavaKLASS is a data science software system developed through 30 years of research under the leadership of Karina Gibert and her team of more than 25 researchers and developers from different backgrounds. The system is a Java desktop application that needs to evolve into a new version that can be used from different interfaces, including batch usage. In this thesis, we design and build an end-to-end scripting language for javaKLASS, so that scripts can execute the various data science processes supported by javaKLASS in different ways: either called from the current javaKLASS graphical interface or from a batch process. Implementing this scripting language also opens the door to defining a set of scripts that can be used to intensively test the stability of the code as new developers extend the functionality of the system. These test scripts provide a mechanism for a comprehensive, automated testing process and introduce a new methodology that we call Process Testing Driven Development (PTDD). This methodology is intended to ensure that new developments do not break existing functionality and to add robustness to future developments and software upgrades. In the long term, these tests will also support software refactoring activities.