GPU implementation of bitplane coding with parallel coefficient processing for high performance image compression


Bibliographic Details
Authors: Enfedaque Montes, Pablo; Aulí Llinàs, Francesc (ORCID 0000-0002-3208-9957); Moure, Juan C (ORCID 0000-0001-6697-0331)
Resource type: article
Publication date: 2017
Country: Spain
Institution: Universitat Autònoma de Barcelona
Repository: Dipòsit Digital de Documents de la UAB
Language: English
OAI Identifier: oai:ddd.uab.cat:183585
Online access: https://ddd.uab.cat/record/183585
https://dx.doi.org/urn:doi:10.1109/TPDS.2017.2657506
Access Level: open access
Keywords: Image coding
SIMD computing
Graphics processing unit (GPU)
Compute unified device architecture (CUDA)
Description
Summary: The fast compression of images is a requisite in many applications like TV production, teleconferencing, or digital cinema. Many of the algorithms employed in current image compression standards are inherently sequential. High performance implementations of such algorithms often require specialized hardware like field-programmable gate arrays. Graphics Processing Units (GPUs) do not commonly achieve high performance on these algorithms because they do not exhibit fine-grain parallelism. Our previous work introduced a new core algorithm for wavelet-based image coding systems, tailored for massively parallel architectures and called bitplane coding with parallel coefficient processing (BPC-PaCo). This paper introduces the first high performance, GPU-based implementation of BPC-PaCo. A detailed analysis of the algorithm aids its implementation in the GPU. The main insights behind the proposed codec are an efficient thread-to-data mapping, smart memory management, and the use of efficient cooperation mechanisms to enable inter-thread communication. Experimental results indicate that the proposed implementation matches the requirements for high resolution (4K) digital cinema in real time, yielding speedups of 30x with respect to the fastest implementations of current compression standards. Also, a power consumption evaluation shows that our implementation consumes 40x less energy for equivalent performance than state-of-the-art methods.
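
As a rough illustration of the term "bitplane coding" used in the summary, the sketch below splits integer coefficients into a sign bit and magnitude bitplanes, most significant plane first. It is not BPC-PaCo and not the authors' GPU codec: the function names `bitplanes` and `reconstruct` are hypothetical, and the entropy coding, context modelling, and thread-to-data mapping described in the paper are omitted. It only shows that, within one bitplane, each coefficient's bit can be extracted independently of the others, which is the kind of per-coefficient work a parallel design could distribute.

```python
def bitplanes(coefficients):
    """Split integer coefficients into sign bits and magnitude bitplanes.

    Returns (signs, planes); planes[0] is the most significant bitplane.
    Illustrative helper only, not part of any real codec API.
    """
    magnitudes = [abs(c) for c in coefficients]
    signs = [1 if c < 0 else 0 for c in coefficients]
    # Number of planes needed for the largest magnitude (0 for empty input).
    num_planes = max(magnitudes, default=0).bit_length()
    planes = []
    for p in range(num_planes - 1, -1, -1):
        # Each coefficient's bit in plane p depends only on that coefficient.
        planes.append([(m >> p) & 1 for m in magnitudes])
    return signs, planes


def reconstruct(signs, planes):
    """Rebuild the coefficients from sign bits and all magnitude bitplanes."""
    magnitudes = [0] * len(signs)
    for plane in planes:
        # Shift in one more bit per coefficient, most significant plane first.
        magnitudes = [(m << 1) | b for m, b in zip(magnitudes, plane)]
    return [-m if s else m for m, s in zip(magnitudes, signs)]


signs, planes = bitplanes([5, -3, 0, 12])
restored = reconstruct(signs, planes)
```

Processing planes from most to least significant is what lets a bitplane coder refine coefficient values progressively; a real codec would entropy-code each plane rather than store the raw bits.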