Algorithm for thematic analysis of digital documents

Bibliographic Details
Authors: Polo Bautista, Luis Roberto; Martínez Acevedo, Karen Vanessa
Resource type: article
Status: Published version
Publication date: 2021
Country: Mexico
Institution: UNIVERSIDAD NACIONAL AUTÓNOMA DE MÉXICO
Repository: Investigación Bibliotecológica: Archivonomía, Bibliotecología e Información
Language: Spanish
OAI Identifier: oai:ojs.pkp.sfu.ca:article/58419
Online access: http://rev-ib.unam.mx/ib/index.php/ib/article/view/58419
Access Level: open access
Keywords: Latent Dirichlet Allocation
Algorithms
Thematic Analysis
Digital Documents
Asignación Latente de Dirichlet
Algoritmos
Análisis Temático
Documentos Digitales
Description
Summary: The article presents an algorithm for assigning subject areas to digital documents, intended as a support tool for thematic analysis within the organization of information and for use in developing controlled vocabularies. The methodology applied Optical Character Recognition (OCR) and Latent Dirichlet Allocation (LDA) as the main tools for developing an algorithm in the Python programming language, which reads PDF files in order to obtain the main themes of a textual corpus. The results of applying the algorithm demonstrate its usefulness in indexing as a system for identifying and extracting relevant topics from a specific document in electronic format, and for allowing information professionals to automate processes. The article concludes that the algorithm can be used to develop alternative access points based on the content of texts.
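
The summary names OCR and LDA but does not state which libraries, parameters, or preprocessing the authors used. As a rough illustration of the topic-extraction step only, the sketch below fits LDA with a small collapsed Gibbs sampler written in plain Python. It assumes the text has already been extracted from the PDF files (the OCR step is not shown), and every name in it (`tokenize`, `lda_gibbs`, `top_terms`, `STOP_WORDS`, the sample texts, and the values of `alpha`, `beta`, and `num_topics`) is hypothetical and not taken from the article.

```python
import random
from collections import Counter

# Hypothetical stop-word list for illustration; a real run would use a fuller one.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words and tokens under 3 letters."""
    words = "".join(c.lower() if c.isalpha() else " " for c in text).split()
    return [w for w in words if w not in STOP_WORDS and len(w) > 2]


def lda_gibbs(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs is a list of token lists. Returns (topic_word, doc_topic), where
    topic_word[k] counts the words assigned to topic k and doc_topic[d][k]
    counts the tokens of document d assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    doc_topic = [[0] * num_topics for _ in docs]
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            topic_word[k][w] += 1
            topic_total[k] += 1
            doc_topic[d][k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                doc_topic[d][k] -= 1
                # Resample its topic in proportion to
                # (doc-topic count + alpha) * (topic-word count + beta) / (topic size + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                topic_word[k][w] += 1
                topic_total[k] += 1
                doc_topic[d][k] += 1
    return topic_word, doc_topic


def top_terms(topic_word, n=5):
    """Return up to n most frequent words per topic, skipping zero counts."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


if __name__ == "__main__":
    # Stand-ins for text that an OCR step would have extracted from PDF files.
    texts = [
        "Library catalog indexing and controlled vocabulary for the library",
        "Indexing documents with a controlled vocabulary in the catalog",
        "Python algorithm for topic models and text processing",
        "Topic models in Python: an algorithm for processing text",
    ]
    docs = [tokenize(t) for t in texts]
    topic_word, doc_topic = lda_gibbs(docs, num_topics=2)
    for k, terms in enumerate(top_terms(topic_word)):
        print(f"Topic {k}: {', '.join(terms)}")
```

The top terms per topic are the kind of output an indexer could review as candidate subject terms. On real corpora a maintained LDA implementation would normally be preferable to a hand-written sampler; the plain-Python version is used here only to keep the example self-contained.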