Phishing URL Detection: A Real-Case Scenario Through Login URLs
| Authors: | |
|---|---|
| Resource type: | article |
| Status: | Version updated since publication |
| Publication date: | 2022 |
| Country: | Spain |
| Institution: | Universidad de León |
| Repository: | BULERIA. Repositorio Institucional de la Universidad de León |
| OAI Identifier: | oai:buleria.unileon.es:10612/22944 |
| Online access: | https://ieeexplore.ieee.org/document/9759382 https://hdl.handle.net/10612/22944 |
| Access Level: | open access |
| Keywords: | Computer science; Systems engineering; Cybercrime; Login; Machine learning; Phishing detection; URL; 1203.04 Artificial Intelligence; 1209.03 Data Analysis |
| Summary: | [EN] Phishing is a social engineering cyberattack in which criminals deceive users to obtain their credentials through a login form that submits the data to a malicious server. In this paper, we compare machine learning and deep learning techniques to present a method capable of detecting phishing websites through URL analysis. In most current state-of-the-art phishing detection solutions, the legitimate class is made up of homepages that do not include login forms. In contrast, we use login-page URLs in both classes because we consider this much more representative of a real-case scenario, and we demonstrate that existing techniques obtain a high false-positive rate when tested with URLs from legitimate login pages. Additionally, we use datasets from different years to show how model accuracy decreases over time, training a base model on old datasets and testing it on recent URLs. We also perform a frequency analysis of current phishing domains to identify the techniques phishers use in their campaigns. To support these claims, we have created a new dataset named Phishing Index Login URL (PILU-90K), which is composed of 60K legitimate URLs, including index and login websites, and 30K phishing URLs. Finally, we present a Logistic Regression model which, combined with Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction, obtains 96.50% accuracy on the introduced login URL dataset. |
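
The summary names TF-IDF feature extraction followed by Logistic Regression as the best-performing combination. The sketch below illustrates that general idea in plain Python and is not the authors' implementation. Several things here are assumptions: the six toy URLs are hypothetical and are not taken from PILU-90K, the tokenization (splitting on non-alphanumeric characters) is chosen for simplicity, the IDF uses a common smoothed variant, and the model is fitted with basic stochastic gradient descent rather than a library solver.

```python
import math
import re
from collections import Counter

def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens (assumed tokenization)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

def fit_idf(docs):
    """Smoothed IDF per token: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}

def tfidf(doc, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen tokens are ignored."""
    counts = Counter(tok for tok in doc if tok in idf)
    vec = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(vec, weights, bias):
    """Estimated probability that the URL belongs to the phishing class."""
    z = bias + sum(weights.get(tok, 0.0) * v for tok, v in vec.items())
    return sigmoid(z)

def train(vectors, labels, epochs=500, lr=1.0):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            err = predict_proba(vec, weights, bias) - y  # gradient of log loss w.r.t. z
            bias -= lr * err
            for tok, v in vec.items():
                weights[tok] = weights.get(tok, 0.0) - lr * err * v
    return weights, bias

# Hypothetical toy data: label 0 = legitimate login URL, 1 = phishing URL.
urls = [
    "https://www.example.com/login",
    "https://accounts.example.org/signin",
    "https://shop.example.net/account/login",
    "http://example-com.secure-update.test/login.php",
    "http://verify-account.example.invalid/signin/confirm.php",
    "http://secure-login.update.test/example/index.php",
]
labels = [0, 0, 0, 1, 1, 1]

docs = [tokenize(u) for u in urls]
idf = fit_idf(docs)
weights, bias = train([tfidf(d, idf) for d in docs], labels)

for url in urls:
    p = predict_proba(tfidf(tokenize(url), idf), weights, bias)
    print(f"{p:.3f}  {url}")
```

Both classes in the toy data contain login-style paths such as `login` and `signin`, mirroring the summary's point that the legitimate class should include login pages, so the classifier cannot rely on the mere presence of a login keyword. With only six samples the printed scores say nothing about real-world accuracy.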