GPU implementation of bitplane coding with parallel coefficient processing for high performance image compression

Bibliographic details
Authors: Enfedaque Montes, Pablo; Aulí Llinàs, Francesc (ORCID 0000-0002-3208-9957); Moure, Juan C (ORCID 0000-0001-6697-0331)
Document type: article
Publication date: 2017
Country: Spain
Resources: Universitat Autònoma de Barcelona
Repository: Dipòsit Digital de Documents de la UAB
Language: English
OAI Identifier: oai:ddd.uab.cat:183585
Online access: https://ddd.uab.cat/record/183585
https://dx.doi.org/urn:doi:10.1109/TPDS.2017.2657506
Access level: Open access
Keywords: Image coding
SIMD computing
Graphics processing unit (GPU)
Compute unified device architecture (CUDA)
Description
Abstract: The fast compression of images is a requisite in many applications like TV production, teleconferencing, or digital cinema. Many of the algorithms employed in current image compression standards are inherently sequential. High performance implementations of such algorithms often require specialized hardware like field-programmable gate arrays (FPGAs). Graphics Processing Units (GPUs) do not commonly achieve high performance on these algorithms because the algorithms do not exhibit fine-grain parallelism. Our previous work introduced a new core algorithm for wavelet-based image coding systems, tailored for massively parallel architectures and called bitplane coding with parallel coefficient processing (BPC-PaCo). This paper introduces the first high performance, GPU-based implementation of BPC-PaCo. A detailed analysis of the algorithm aids its implementation in the GPU. The main insights behind the proposed codec are an efficient thread-to-data mapping, smart memory management, and the use of efficient cooperation mechanisms to enable inter-thread communication. Experimental results indicate that the proposed implementation meets the requirements for high resolution (4K) digital cinema in real time, yielding speedups of 30x with respect to the fastest implementations of current compression standards. In addition, a power consumption evaluation shows that our implementation consumes 40x less energy for equivalent performance than state-of-the-art methods.
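
Illustrative note: the abstract names bitplane coding without showing what a bitplane pass looks like. The sketch below is a minimal, generic illustration in Python and is not the BPC-PaCo algorithm or the authors' GPU codec. It omits context modeling and entropy coding entirely. The function names `bitplane_encode` and `bitplane_decode` and the tuple layout of each plane are hypothetical, chosen only for this example. It shows that, within one bitplane, each coefficient's bit can be computed independently of the others, which is the kind of per-coefficient work a parallel implementation could plausibly assign to separate threads.

```python
def bitplane_encode(coeffs):
    """Split signed integer coefficients into bitplanes, most significant first.

    Returns a list of (plane_index, bits, signs) tuples. `signs` maps the index
    of each coefficient that first becomes nonzero in this plane to True if the
    coefficient is negative. This layout is a hypothetical choice for the sketch.
    """
    magnitudes = [abs(c) for c in coeffs]
    num_planes = max(magnitudes, default=0).bit_length()
    significant = [False] * len(coeffs)
    planes = []
    for p in range(num_planes - 1, -1, -1):
        # Each bit depends only on its own coefficient, so this step has no
        # dependency between coefficients (assumption: one worker per coefficient).
        bits = [(m >> p) & 1 for m in magnitudes]
        signs = {}
        for i, bit in enumerate(bits):
            if bit and not significant[i]:
                significant[i] = True
                signs[i] = coeffs[i] < 0
        planes.append((p, bits, signs))
    return planes


def bitplane_decode(planes, count):
    """Rebuild the signed coefficients from the planes produced above."""
    magnitudes = [0] * count
    negative = [False] * count
    for p, bits, signs in planes:
        for i, bit in enumerate(bits):
            magnitudes[i] |= bit << p
        for i, is_negative in signs.items():
            negative[i] = is_negative
    return [-m if n else m for m, n in zip(magnitudes, negative)]


if __name__ == "__main__":
    sample = [5, -3, 0, 12, -7]
    encoded = bitplane_encode(sample)
    assert bitplane_decode(encoded, len(sample)) == sample
```

Decoding only the first few planes yields a coarser approximation of every coefficient, which is why bitplane schemes are commonly used for progressive, quality-scalable coding.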