Applied NLP and ML for the detection of inappropriate text in a communications platform


Bibliographic Details
Author: Urrutia Zubikarai, Aitor
Resource type: master's thesis
Publication date: 2020
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier: oai:upcommons.upc.edu:2117/336082
Online access: https://hdl.handle.net/2117/336082
Access level: open access
Keywords: Machine learning
Natural language processing (Computer science)
Natural Language Processing
Deep Learning
Word Embedding
Model Ensemble
Aprenentatge automàtic
Tractament del llenguatge natural (Informàtica)
Àrees temàtiques de la UPC::Informàtica
Description
Summary: As communication platforms have expanded, people have become used to exchanging information and communicating through them, for both social and business purposes. A known problem is that many people exploit the anonymity these platforms provide to use inappropriate and offensive language. The company NextreT S.L. has built a communication platform aimed at business use. Because the company wants to keep offensive language off this platform, natural language processing tools in a big data environment are used to analyze each written text and to detect and, if required, remove inappropriate and offensive language. In this project, a variety of state-of-the-art techniques are analyzed, compared, and then tested on a completely new Spanish-language data set created from the Tellfy App and a Twitter corpus. Initially, different word encoding methods are tested, including word embeddings such as Word2Vec and FastText. In addition, different hyperparameter configurations are checked, as well as model performance with different data sizes. Finally, after a forward feature selection phase, model ensemble techniques are tested. These tests show that the combination of features used is very important for increasing model performance, and that the word representation technique is closely related to model performance. Furthermore, the training sets need to be as representative and as large as possible. Finally, even after testing several complex Deep Neural Network models, more traditional Logistic Regression models can offer better performance.
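To make the summary's closing point concrete, the following is a minimal sketch of a Logistic Regression text classifier, the kind of traditional model the summary compares against Deep Neural Networks. It is not the thesis pipeline: it assumes a plain bag-of-words encoding in place of Word2Vec or FastText, trains with simple stochastic gradient descent, and uses a handful of hypothetical toy messages rather than the Tellfy App and Twitter data. All function names (`tokenize`, `build_vocab`, `vectorize`, `train`, `is_inappropriate`) are illustrative.

```python
import math
from collections import Counter

# Hypothetical toy messages (1 = inappropriate, 0 = acceptable).
# These are NOT from the thesis data set; accents are omitted for simplicity.
TRAIN = [
    ("eres un idiota", 1),
    ("callate imbecil", 1),
    ("que idiota eres", 1),
    ("gracias por la reunion", 0),
    ("nos vemos en la reunion", 0),
    ("gracias por tu ayuda", 0),
]


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, very simple tokenizer)."""
    return text.lower().split()


def build_vocab(texts):
    """Map each distinct token to a feature index."""
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Sparse bag-of-words vector: {feature index: count}; unknown tokens are dropped."""
    counts = Counter(tokenize(text))
    return {vocab[t]: c for t, c in counts.items() if t in vocab}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Probability that the sparse vector x belongs to the inappropriate class."""
    z = bias + sum(weights[i] * v for i, v in x.items())
    return sigmoid(z)


def train(samples, vocab, epochs=200, lr=0.5):
    """Fit logistic regression by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(vocab)
    bias = 0.0
    data = [(vectorize(t, vocab), y) for t, y in samples]
    for _ in range(epochs):
        for x, y in data:
            err = predict_proba(x, weights, bias) - y  # gradient of log loss w.r.t. z
            for i, v in x.items():
                weights[i] -= lr * err * v
            bias -= lr * err
    return weights, bias


vocab = build_vocab(t for t, _ in TRAIN)
weights, bias = train(TRAIN, vocab)


def is_inappropriate(text, threshold=0.5):
    """Flag a message when its predicted probability reaches the threshold."""
    return predict_proba(vectorize(text, vocab), weights, bias) >= threshold
```

Calling `is_inappropriate("eres idiota")` flags the message, while `is_inappropriate("gracias por la ayuda")` does not. With a realistic corpus, the bag-of-words step would be the natural place to swap in the embedding-based encodings the summary mentions.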