GPress: a framework for querying general feature format (GFF) files and expression files in a compressed form
Motivation: Sequencing data are often summarized at different annotation levels for further analysis, generally using the general feature format (GFF) or its descendants, gene transfer format (GTF) and GFF3. Existing utilities for accessing these files, like gffutils and gffread, do not focus on red...
| Authors: | Hernaez-Arrazola, M. (Mikel); Meng, Q. (Qingxi); Ochoa-Álvarez, I. (Idoia) |
|---|---|
| Resource type: | article |
| Publication date: | 2020 |
| Country: | Spain |
| Institution: | Universidad de Navarra |
| Repository: | Dadun. Depósito Académico Digital de la Universidad de Navarra |
| Language: | English |
| OAI Identifier: | oai:dadun.unav.edu:10171/113614 |
| Online access: | https://hdl.handle.net/10171/113614 |
| Access Level: | open access |
| Keywords: | GPress; General feature format (GFF); Files; Compressed form |
| id |
ES_be9be2a0f745f7104f2704780cd533cf |
|---|---|
| oai_identifier_str |
oai:dadun.unav.edu:10171/113614 |
| network_acronym_str |
ES |
| network_name_str |
Spain |
| dc.title.none.fl_str_mv |
GPress: a framework for querying general feature format (GFF) files and expression files in a compressed form |
| dc.creator.none.fl_str_mv |
Hernaez-Arrazola, M. (Mikel); Meng, Q. (Qingxi); Ochoa-Álvarez, I. (Idoia) |
| dc.contributor.none.fl_str_mv |
Dadun. Depósito Académico Digital Universidad de Navarra |
| dc.subject.none.fl_str_mv |
GPress; General feature format (GFF); Files; Compressed form |
| description |
Motivation: Sequencing data are often summarized at different annotation levels for further analysis, generally using the general feature format (GFF) or its descendants, gene transfer format (GTF) and GFF3. Existing utilities for accessing these files, like gffutils and gffread, do not focus on reducing the storage space, and in some cases significantly increase it. We propose GPress, a framework for querying GFF files in a compressed form. GPress can also incorporate and compress expression files from both bulk and single-cell RNA-Seq experiments, supporting simultaneous queries on both the GFF and expression files. In brief, GPress applies transformations to the data, which are then compressed with the general lossless compressor BSC. To support queries, GPress compresses the data in blocks and creates several index tables for fast retrieval. Results: We tested GPress on several GFF files of different organisms and showed that it achieves on average a 61% reduction in size with respect to gzip (the current de facto compressor for GFF files) while being able to retrieve all annotations for a given identifier or a range of coordinates in a few seconds (when run on a common laptop). In contrast, gffutils provides faster retrieval but doubles the size of the GFF files. When additionally linking an expression file, we show that GPress can reduce its size by more than 68% when compared to gzip (for both bulk and single-cell RNA-Seq experiments), while still retrieving the information within seconds. Finally, applying BSC to the data streams generated by GPress instead of to the original file shows a size reduction of more than 44% on average. |
| publishDate |
2020 |
| dc.date.none.fl_str_mv |
2020 2020-01-01 |
| dc.type.none.fl_str_mv |
journal article http://purl.org/coar/resource_type/c_6501 |
| dc.type.openaire.fl_str_mv |
info:eu-repo/semantics/article |
| format |
article |
| dc.identifier.none.fl_str_mv |
https://hdl.handle.net/10171/113614 |
| dc.language.none.fl_str_mv |
English eng |
| language_invalid_str_mv |
English |
| language |
eng |
| dc.rights.none.fl_str_mv |
open access http://purl.org/coar/access_right/c_abf2 |
| dc.rights.openaire.fl_str_mv |
info:eu-repo/semantics/openAccess |
| eu_rights_str_mv |
openAccess |
| dc.format.none.fl_str_mv |
application/pdf |
| dc.publisher.none.fl_str_mv |
Oxford University Press |
| dc.source.none.fl_str_mv |
reponame:Dadun. Depósito Académico Digital de la Universidad de Navarra instname:Universidad de Navarra |
| instname_str |
Universidad de Navarra |
| reponame_str |
Dadun. Depósito Académico Digital de la Universidad de Navarra |