Application of machine learning methods to predict phytoplankton blooms and determine microbial biomarkers using marine microbiomes

Bibliographic Details
Author: Fernandez-Gonzalez, Nuria
Resource type: master's thesis
Publication date: 2023
Country: Spain
Institution: Universitat Oberta de Catalunya (UOC)
Repository: O2, institutional repository of the UOC
OAI Identifier: oai:openaccess.uoc.edu:10609/149868
Online access: http://hdl.handle.net/10609/149868
Access Level: open access
Keywords: coastal blooms
biomarkers
random forest
Machine learning -- TFM
Aprenentatge automàtic -- TFM
Description
Summary: Understanding the relationship between bacterioplankton and coastal phytoplankton blooms is key to understanding how coastal ecosystems, the most productive areas for fisheries, function. With that knowledge, we could predict, and perhaps mitigate, the effects of global change or contamination events in these productive ecosystems. However, these microbial communities are governed by very complex relationships. In addition, the data used to study bacterioplankton diversity (Amplicon Sequence Variants of the 16S rRNA gene) are high-dimensional, sparse, and noisy. In this project, Random Forest classifiers based on diversity data were used to predict coastal phytoplankton blooms and to search for their biomarkers. After merging data from two oceanographic campaigns, samples were classified as bloom or normal according to their total chlorophyll concentration. The resulting dataset was high-dimensional (166 instances, 7593 features) and imbalanced (31 bloom instances, 135 normal). To reduce dimensionality, biological features with relative abundances below 0.01 were removed, or they were grouped into clusters at genus level. Random Forest models were trained and tuned with a grid search over the number of features included in the individual trees. The process was repeated with one hundred different train/test splits to ensure that the results were representative. Good performance (kappa, sensitivity, and specificity > 0.8) was achieved only after applying the synthetic minority oversampling technique (SMOTE) to balance the number of instances between the two classes. Using those models, the most important features, ranked by the predictive error rate of features, were selected as biomarkers.
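The summary's point about class imbalance can be illustrated with a minimal Python sketch. It is not the thesis code: the function names `binary_metrics` and `smote_like`, the toy two-feature data, and the choice of `k` are assumptions made here for illustration, and the oversampler is a simplified SMOTE-style interpolation rather than a library implementation. Only the class counts (31 bloom, 135 normal) come from the record.

```python
import math
import random


def binary_metrics(y_true, y_pred, positive="bloom"):
    """Return (kappa, sensitivity, specificity) for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    # Agreement expected by chance, computed from the marginal totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return kappa, sensitivity, specificity


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows, each interpolated between a minority row
    and one of its k nearest minority neighbours (Euclidean distance).
    Simplified illustration; needs at least two minority rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Class counts from the summary: 31 bloom samples, 135 normal samples.
y_true = ["bloom"] * 31 + ["normal"] * 135

# A trivial classifier that always answers "normal".
always_normal = ["normal"] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, always_normal)) / len(y_true)
kappa, sens, spec = binary_metrics(y_true, always_normal)
print(f"accuracy={accuracy:.3f} kappa={kappa:.3f} "
      f"sensitivity={sens:.3f} specificity={spec:.3f}")

# Hypothetical two-feature bloom samples, invented for this example.
toy_bloom = [[0.0, 0.2], [0.1, 0.3], [0.2, 0.1], [0.3, 0.4]]
# 135 - 31 = 104 synthetic rows would bring the bloom class up to 135.
new_rows = smote_like(toy_bloom, n_new=135 - 31, k=3)
print(len(new_rows), "synthetic bloom rows")
```

On these counts the always-normal classifier reaches an accuracy of about 0.81 while its kappa and sensitivity are both 0, which is why the summary reports kappa, sensitivity, and specificity rather than accuracy alone. In a real workflow, oversampling would normally be applied to the training split only, so that synthetic rows never leak into the test data.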