Applied NLP and ML for the detection of inappropriate text in a communications platform
| Author: | |
|---|---|
| Resource type: | master's thesis |
| Publication date: | 2020 |
| Country: | Spain |
| Institution: | Universitat Politècnica de Catalunya (UPC) |
| Repository: | UPCommons. Portal del coneixement obert de la UPC |
| Language: | English |
| OAI Identifier: | oai:upcommons.upc.edu:2117/336082 |
| Online access: | https://hdl.handle.net/2117/336082 |
| Access Level: | open access |
| Keywords: | Machine learning; Natural language processing (Computer science); Natural Language Processing; Deep Learning; Word Embedding; Model Ensemble; Aprenentatge automàtic; Tractament del llenguatge natural (Informàtica); Àrees temàtiques de la UPC::Informàtica |
| Summary: | With the expansion of communication platforms, people have become used to exchanging information and communicating through them, for both social and business purposes. It is a known problem that many people exploit the anonymity these platforms provide to use inappropriate and offensive language. The company NextreT S.L. has built a communication platform aimed at business use. Because the company wants to keep offensive language off this platform, natural language processing tools in a big data environment are used to analyze each written text and to detect and, if required, remove inappropriate and offensive language. In this project, a variety of state-of-the-art techniques are analyzed, compared, and then tested on a completely new Spanish-language data set created from the Tellfy App and a Twitter corpus. First, different word encoding methods are tested, including word embeddings such as Word2Vec and FastText. In addition, different hyperparameter configurations are checked, as well as model performance with different data sizes. Finally, after a forward feature selection phase, model ensemble techniques are tested. These tests show that the combination of features used strongly affects model performance, as does the choice of word representation technique. Furthermore, the training sets need to be as representative and as large as possible. Finally, compared with the complex deep neural network models tested, more traditional logistic regression models can offer better performance. |
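The summary mentions a forward feature selection phase and notes that the combination of features matters. As a rough illustration of that general idea only, the following sketch shows greedy forward selection in plain Python. It is not the procedure used in the thesis: the function name `forward_select`, the feature group names, and the scoring table `TOY_POINTS` are all hypothetical, and the scores are made-up integers standing in for a validation metric.

```python
from typing import Callable, FrozenSet, Iterable, List


def forward_select(
    candidates: Iterable[str],
    score: Callable[[FrozenSet[str]], int],
) -> List[str]:
    """Greedy forward selection: repeatedly add the candidate that most
    improves the score, and stop when no addition improves it."""
    remaining = list(candidates)
    selected: List[str] = []
    best = score(frozenset())
    while remaining:
        # Score every one-step extension of the current selection.
        trials = [(score(frozenset(selected + [c])), c) for c in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score <= best:
            break  # no remaining candidate helps
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected


# Hypothetical feature groups with made-up scores (NOT results from the thesis).
TOY_POINTS = {
    "word_ngrams": 30,
    "char_ngrams": 20,
    "embeddings": 10,
    "text_length": 0,
}


def toy_score(features: FrozenSet[str]) -> int:
    """Toy stand-in for a validation metric, in integer points."""
    total = sum(TOY_POINTS[f] for f in features)
    # Assumed redundancy: the two n-gram groups overlap, so using both
    # is worth less than the sum of their individual contributions.
    if {"word_ngrams", "char_ngrams"} <= features:
        total -= 15
    return total


chosen = forward_select(TOY_POINTS, toy_score)
print(chosen)  # ['word_ngrams', 'embeddings', 'char_ngrams']
```

In this toy setup, `embeddings` is picked before `char_ngrams` even though it scores lower on its own, because the assumed overlap between the two n-gram groups reduces their joint value. In practice, the score function would train and validate a model on each candidate feature set.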