Phishing URL Detection: A Real-Case Scenario Through Login URLs

Bibliographic Details
Authors: Sanchez Paniagua, Manuel; Fidalgo Fernández, Eduardo; Alegre Gutiérrez, Enrique; Al Nabki, Mohamed Wesam; González Castro, Víctor
Resource type: article
Status: Version updated since publication
Publication date: 2022
Country: Spain
Institution: Universidad de León
Repository: BULERIA. Repositorio Institucional de la Universidad de León
OAI Identifier: oai:buleria.unileon.es:10612/22944
Online access: https://ieeexplore.ieee.org/document/9759382
https://hdl.handle.net/10612/22944
Access Level: open access
Keywords: Computer Science
Systems Engineering
Cybercrime
Login
Machine learning
Phishing detection
URL
1203.04 Artificial Intelligence
1209.03 Data Analysis
Description
Summary: [EN] Phishing is a social engineering cyberattack in which criminals deceive users to obtain their credentials through a login form that submits the data to a malicious server. In this paper, we compare machine learning and deep learning techniques to present a method capable of detecting phishing websites through URL analysis. In most current state-of-the-art solutions for phishing detection, the legitimate class is made up of homepages and does not include login forms. In contrast, we use login-page URLs in both classes because we consider this much more representative of a real-case scenario, and we demonstrate that existing techniques obtain a high false-positive rate when tested with URLs from legitimate login pages. Additionally, we use datasets from different years to show how model accuracy decreases over time: we train a base model on old datasets and test it on recent URLs. We also perform a frequency analysis of current phishing domains to identify the techniques phishers use in their campaigns. To support these claims, we have created a new dataset named Phishing Index Login URL (PILU-90K), which is composed of 60K legitimate URLs, including index and login websites, and 30K phishing URLs. Finally, we present a Logistic Regression model which, combined with Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction, obtains 96.50% accuracy on the introduced login URL dataset.
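As an illustration of the general idea named in the summary (TF-IDF features fed to a Logistic Regression classifier), the following is a minimal sketch using only the Python standard library. It is not the authors' implementation. The tokenization rule, the smoothed IDF formula, the gradient-descent settings, and the six toy URLs (built on reserved example domains) are all assumptions made for this sketch; none of them comes from the paper or from PILU-90K.

```python
import math
import re
from collections import Counter


def tokenize(url):
    # Assumption: lowercase and split on non-alphanumeric characters.
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def fit_idf(docs):
    # Smoothed inverse document frequency computed from the training URLs.
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf(doc, idf):
    # Term frequency times IDF, L2-normalized; unseen tokens are ignored.
    tf = Counter(tok for tok in doc if tok in idf)
    vec = {tok: cnt * idf[tok] for tok, cnt in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}


def predict_proba(vec, weights, bias):
    # Logistic (sigmoid) function applied to the linear score.
    z = bias + sum(weights.get(tok, 0.0) * v for tok, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))


def train(vectors, labels, epochs=300, lr=1.0):
    # Full-batch gradient descent on the logistic loss (no regularization).
    weights, bias, n = {}, 0.0, len(vectors)
    for _ in range(epochs):
        grad_w, grad_b = Counter(), 0.0
        for vec, y in zip(vectors, labels):
            err = predict_proba(vec, weights, bias) - y
            grad_b += err
            for tok, v in vec.items():
                grad_w[tok] += err * v
        bias -= lr * grad_b / n
        for tok, g in grad_w.items():
            weights[tok] = weights.get(tok, 0.0) - lr * g / n
    return weights, bias


# Hypothetical toy data: label 0 = legitimate login URL, 1 = phishing URL.
urls = [
    "https://accounts.example.com/login",
    "https://www.example.org/user/signin",
    "https://mail.example.net/auth/login",
    "http://example-secure-login.verify-account.test/update",
    "http://login.example.com.account-verify.test/confirm",
    "http://secure-update.test/example/login/verify.php",
]
labels = [0, 0, 0, 1, 1, 1]

docs = [tokenize(u) for u in urls]
idf = fit_idf(docs)
vectors = [tfidf(d, idf) for d in docs]
weights, bias = train(vectors, labels)
predictions = [int(predict_proba(v, weights, bias) >= 0.5) for v in vectors]
```

A toy set this small only shows the mechanics; it says nothing about the accuracy figures reported in the summary. In practice one would more likely rely on a library such as scikit-learn (for example its TfidfVectorizer and LogisticRegression classes) and evaluate on held-out URLs, ideally collected later than the training data, to observe the accuracy decay over time that the summary describes.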