Enriching Process-Related UI Logs Via Screenshot-Based Activity Labeling by Using Vision-Language Models

Bibliographic Details
Authors: Rodríguez Ruíz, Antonio; Martínez Rojas, Antonio; González Enríquez, José; Jiménez Ramírez, Andrés; Agostinelli, S.
Resource type: article
Status: Published version
Publication date: 2026
Country: Spain
Institution: Universidad de Sevilla (US)
Repository: idUS. Depósito de Investigación de la Universidad de Sevilla
OAI Identifier: oai:dnet:idus________::f8b9b69b9a21119f010df07b428aeb1c
Online access: https://hdl.handle.net/11441/186461
https://doi.org/10.1007/s12599-026-00990-6
Access level: open access
Keywords: User interface
UI hierarchy
Screenshot-based
Vision-language models
Activity labeling
Task Mining
Robotic process automation
Description
Summary: Robotic Process Mining (RPM) uses User Interface (UI) logs as a source of information for analyzing the processes to be automated. UI logs record user interactions with the graphical UI of an information system during the execution of a process and encapsulate a large amount of data. Prior research has proposed methods that interpret UI logs by exploiting the structured information available on-screen (e.g., the DOM tree of a Web page), which makes it easier for analysts to interpret the processes behind the logs. However, in environments where such structured information is not available (e.g., virtualized environments), understanding user actions and high-level activities from the elements that users interact with remains an unsolved challenge. This limitation hinders the application of RPM techniques in these environments and requires human intervention to analyze and understand the actions recorded in the UI logs. To address this challenge, the authors propose a framework that leverages screenshot-based techniques to generate semantic descriptions of user actions and, from them, accurate descriptions of high-level activities, relying solely on the information available in the UI logs. In an organizational context, this approach enables RPA analysts and process managers to analyze user interaction logs and better understand the business processes that are candidates for automation. The approach is evaluated on a manually labeled dataset of screenshots from realistic desktop applications. The results show that the method can effectively generate semantic descriptions of user actions, which in turn enable more precise descriptions of the high-level activities carried out by the user.