Application of machine learning methods to predict phytoplankton blooms and determine microbial biomarkers using marine microbiomes
| Author: | |
|---|---|
| Resource type: | master's thesis |
| Publication date: | 2023 |
| Country: | Spain |
| Institution: | Universitat Oberta de Catalunya (UOC) |
| Repository: | O2, institutional repository of the UOC |
| OAI Identifier: | oai:openaccess.uoc.edu:10609/149868 |
| Online access: | http://hdl.handle.net/10609/149868 |
| Access Level: | open access |
| Keywords: | coastal blooms; biomarkers; random forest; Machine learning -- TFM; Aprenentatge automàtic -- TFM |
| Summary: | Understanding the relationship between bacterioplankton and coastal phytoplankton blooms is key to understanding the functioning of coastal ecosystems, which are the most productive areas for fisheries. With that knowledge, we could predict, and perhaps mitigate, the effects of global change or contamination events in these productive ecosystems. However, these microbial communities are governed by very complex relationships. In addition, the data used to study bacterioplankton diversity (Amplicon Sequence Variants of the 16S rRNA gene) are high-dimensional, sparse, and noisy. In this project, Random Forest classifiers based on diversity data were used to predict coastal phytoplankton blooms and to search for their biomarkers. After data from two oceanographic campaigns were joined, samples were classified as bloom or normal depending on their total chlorophyll concentration. The resulting dataset was high-dimensional (166 instances, 7593 features) and imbalanced (31 bloom instances, 135 normal). To reduce dimensionality, biological features with relative abundances below 0.01 were either removed or grouped into clusters at the genus level. Random Forest models were trained and tuned with a grid search over the number of features included in the individual trees. The process was repeated using one hundred different splits of the data into train and test groups to ensure that the results were representative. Good performance values (kappa, sensitivity, and specificity > 0.8) were achieved only after using the synthetic minority oversampling technique to level the number of instances between the two categories. Using those models, the most important features, ranked according to the predictive error rate of features, were selected as biomarkers. |
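
The summary mentions two ideas that a short example may help make concrete: levelling the bloom and normal classes with synthetic minority oversampling, and scoring models by kappa, sensitivity, and specificity. The following is a minimal, standard-library-only sketch and not the thesis code: the function names, the random toy data, the three-feature dimensionality, and the confusion-matrix counts are all hypothetical, and the oversampler is a simplified SMOTE-style interpolation rather than any particular library's implementation. Only the class sizes (31 bloom, 135 normal) come from the summary.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: math.dist(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def kappa_sensitivity_specificity(tp, fn, fp, tn):
    """Cohen's kappa, sensitivity, and specificity from a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    observed = (tp + tn) / total
    # Agreement expected by chance, computed from the row and column marginals
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total**2
    kappa = (observed - expected) / (1 - expected)
    sensitivity = tp / (tp + fn)  # fraction of true blooms detected
    specificity = tn / (tn + fp)  # fraction of normal samples recognised
    return kappa, sensitivity, specificity


# Toy stand-in for the minority (bloom) class: 31 samples, 3 made-up features.
data_rng = random.Random(42)
bloom = [[data_rng.random() for _ in range(3)] for _ in range(31)]

# Add enough synthetic bloom samples to match the 135 normal samples.
synthetic = smote_like_oversample(bloom, n_new=135 - 31)
balanced_bloom = bloom + synthetic
print(len(balanced_bloom))

# Hypothetical test-set counts, used only to exercise the metric function.
kappa, sensitivity, specificity = kappa_sensitivity_specificity(tp=9, fn=1, fp=2, tn=38)
print(round(kappa, 3), sensitivity, specificity)
```

In a setting like the one described, oversampling would normally be applied to the training portion of each split only, so that synthetic points never leak into the test data; whether the thesis did so is not stated in the summary.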