GPress: a framework for querying general feature format (GFF) files and expression files in a compressed form

Motivation: Sequencing data are often summarized at different annotation levels for further analysis, generally using the general feature format (GFF) or its descendants, gene transfer format (GTF) and GFF3. Existing utilities for accessing these files, like gffutils and gffread, do not focus on red...


Bibliographic Details
Authors: Hernaez-Arrazola, M. (Mikel); Meng, Q. (Qingxi); Ochoa-Álvarez, I. (Idoia)
Resource type: article
Publication date: 2020
Country: Spain
Institution: Universidad de Navarra
Repository: Dadun. Depósito Académico Digital de la Universidad de Navarra
Language: English
OAI Identifier: oai:dadun.unav.edu:10171/113614
Online access: https://hdl.handle.net/10171/113614
Access level: open access
Keywords: GPress
General feature format (GFF)
Files
Compressed form
id ES_be9be2a0f745f7104f2704780cd533cf
oai_identifier_str oai:dadun.unav.edu:10171/113614
network_acronym_str ES
network_name_str España
dc.title.none.fl_str_mv GPress: a framework for querying general feature format (GFF) files and expression files in a compressed form
dc.creator.none.fl_str_mv Hernaez-Arrazola, M. (Mikel)
Meng, Q. (Qingxi)
Ochoa-Álvarez, I. (Idoia)
dc.contributor.none.fl_str_mv Dadun. Depósito Académico Digital Universidad de Navarra
dc.subject.none.fl_str_mv GPress
General feature format (GFF)
Files
Compressed form
description Motivation: Sequencing data are often summarized at different annotation levels for further analysis, generally using the general feature format (GFF) or its descendants, gene transfer format (GTF) and GFF3. Existing utilities for accessing these files, like gffutils and gffread, do not focus on reducing the storage space, and in some cases increase it significantly. We propose GPress, a framework for querying GFF files in a compressed form. GPress can also incorporate and compress expression files from both bulk and single-cell RNA-Seq experiments, supporting simultaneous queries on both the GFF and expression files. In brief, GPress applies transformations to the data, which are then compressed with the general lossless compressor BSC. To support queries, GPress compresses the data in blocks and creates several index tables for fast retrieval.

Results: We tested GPress on several GFF files of different organisms, and showed that it achieves on average a 61% reduction in size with respect to gzip (the current de facto compressor for GFF files) while being able to retrieve all annotations for a given identifier or a range of coordinates in a few seconds (when run on a common laptop). In contrast, gffutils provides faster retrieval but doubles the size of the GFF files. When additionally linking an expression file, we show that GPress can reduce its size by more than 68% when compared to gzip (for both bulk and single-cell RNA-Seq experiments), while still retrieving the information within seconds. Finally, applying BSC to the data streams generated by GPress instead of to the original file shows a size reduction of more than 44% on average.
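
The block-plus-index idea mentioned in the abstract can be illustrated with a small sketch. This is not GPress's actual data layout or API: it uses Python's zlib as a stand-in for BSC, indexes only the start coordinate, and the function names and block size are hypothetical choices made for the illustration. It shows how compressing sorted GFF lines in fixed-size blocks, together with a table of each block's first start coordinate, lets a range query decompress only the blocks that can contain matches.

```python
import bisect
import zlib


def compress_in_blocks(records, block_size=4):
    """Compress (start, gff_line) records, sorted by start, in fixed-size blocks.

    Returns the compressed blocks and an index table holding the first
    start coordinate of each block. zlib is a stand-in for BSC here.
    """
    blocks, index = [], []
    for i in range(0, len(records), block_size):
        chunk = records[i:i + block_size]
        payload = "\n".join(line for _, line in chunk).encode("utf-8")
        blocks.append(zlib.compress(payload))
        index.append(chunk[0][0])
    return blocks, index


def query_range(blocks, index, lo, hi):
    """Return GFF lines whose start lies in [lo, hi].

    Only blocks that may overlap the range are decompressed.
    """
    # Step back one block from the first block starting at or after lo,
    # because the previous block may still hold records >= lo.
    first = max(bisect.bisect_left(index, lo) - 1, 0)
    hits = []
    for b in range(first, len(blocks)):
        if index[b] > hi:
            break  # every later block starts beyond the range
        for line in zlib.decompress(blocks[b]).decode("utf-8").split("\n"):
            start = int(line.split("\t")[3])  # GFF column 4 is the start
            if lo <= start <= hi:
                hits.append(line)
    return hits


def make_line(start, end, gene_id):
    """Build a toy nine-column GFF line (all values are made up)."""
    return "\t".join(
        ["chr1", "demo", "gene", str(start), str(end), ".", "+", ".", f"ID={gene_id}"]
    )


# Ten toy records with starts 100, 200, ..., 1000.
records = [(s, make_line(s, s + 50, f"g{i}")) for i, s in enumerate(range(100, 1100, 100))]
blocks, index = compress_in_blocks(records, block_size=4)
hits = query_range(blocks, index, 350, 620)
```

In this toy run the query touches only the first two of the three blocks. Smaller blocks mean less data to decompress per query but usually a worse compression ratio, which is the kind of trade-off a block-based scheme has to balance.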
publishDate 2020
dc.date.none.fl_str_mv 2020
2020-01-01
dc.type.none.fl_str_mv journal article
http://purl.org/coar/resource_type/c_6501
dc.type.openaire.fl_str_mv info:eu-repo/semantics/article
format article
dc.identifier.none.fl_str_mv https://hdl.handle.net/10171/113614
dc.language.none.fl_str_mv English
eng
dc.rights.none.fl_str_mv open access
http://purl.org/coar/access_right/c_abf2
dc.rights.openaire.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Oxford University Press
dc.source.none.fl_str_mv reponame:Dadun. Depósito Académico Digital de la Universidad de Navarra
instname:Universidad de Navarra