Image based accurate body pose and shape estimation from 2D landmarks

Bibliographic Details
Author: Villota Pismag, John Kelly
Resource type: master's thesis
Publication date: 2025
Country: Spain
Institution: Universitat Politècnica de Catalunya (UPC)
Repository: UPCommons. Portal del coneixement obert de la UPC
Language: English
OAI Identifier: oai:upcommons.upc.edu:2117/430202
Online access: https://hdl.handle.net/2117/430202
Access level: open access
Keywords: Computer vision
Three-dimensional display systems
Deep learning (Machine learning)
Estimació de la Posició Humana 3D
SMPLX
Elevació de 2D a 3D
Detecció de Punts Clau
Fites Volumètriques
Aprenentatge Profund
Modelatge del Cos Humà
3D Human Pose Estimation
2D-to-3D Lifting
Keypoint Detection
Volumetric Landmarks
Deep Learning
Human Body Modeling
Visió per ordinador
Visualització tridimensional (Informàtica)
Aprenentatge profund
Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Aprenentatge automàtic
Description
Summary: Estimating the 3D pose and shape of the human body from a single 2D image remains one of the most challenging and actively researched problems in computer vision. This task has wide-ranging applications in human-computer interaction, virtual and augmented reality, sports analysis, and medical diagnostics. The inherent complexity lies in the ambiguities introduced by projecting a three-dimensional structure into two dimensions, where information related to depth, occlusions, and body shape variation is lost. This master’s thesis addresses this challenge by proposing a novel and modular pipeline that integrates recent advances in 2D keypoint detection with parametric human body models, specifically focusing on enhancing the precision of the 2D-to-3D lifting process. The proposed pipeline is structured around three core components: (1) the selection and training of high-performance 2D keypoint detectors, (2) the definition and enhancement of body landmarks beyond the conventional joint definitions, and (3) the adaptation of SMPLifyX to incorporate these additional landmarks for improved 3D body fitting. For the 2D detection stage, the performance of several state-of-the-art architectures (ViTPose, HRNet, and PVTv2) is evaluated and compared using the BEDLAM dataset, which offers ground-truth annotations in SMPLX format. These detectors are evaluated on the basis of their PCK accuracy and their ability to generalize across different datasets, such as MS COCO and SSP-3D. A central contribution of this work is the introduction of a novel set of volumetric landmarks, carefully designed to capture additional information about the shape of the human body. These landmarks are defined through an intuitive anatomical approach and are used to supplement traditional joint definitions during both training and evaluation.
The resulting extended keypoint sets, comprising 14, 39 and 67 landmarks, are shown to improve the fidelity of the 2D-to-3D lifting process, particularly when integrated into the objective function of SMPLifyX. This modified optimization process uses the additional information provided by the landmarks to better constrain the estimation of shape and pose, especially in scenarios where standard joints alone might be insufficient. Quantitative evaluation is conducted using standard metrics in the field, Mean Per Joint Position Error (MPJPE) and Mean Per Vertex Position Error (MPVPE). Experiments using the 3DPW dataset, with SMPLX annotations provided by the BEDLAM project, demonstrate that the inclusion of landmarks significantly improves reconstruction accuracy, often outperforming state-of-the-art methods under equivalent conditions. A qualitative analysis is also performed using SSP-3D, highlighting how landmark-enriched detection contributes to improved shape and pose realism, particularly for non-standard body types. In summary, this thesis contributes both a practical pipeline for improved monocular human body estimation and a methodological framework for integrating anatomical landmarks into existing models. The results confirm that the precision of 2D detection is crucial for accurate 3D estimation and that enriching the input representation with volumetric cues leads to measurable improvements. This work thus advances the field’s understanding of how 2D-to-3D lifting can be refined and opens new avenues for future exploration in multimodal body modeling.
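The evaluation metrics named in the summary (PCK for 2D detection, MPJPE and MPVPE for 3D reconstruction) can be illustrated with a minimal NumPy sketch. This is not the thesis's evaluation code: the function names `mean_point_error` and `pck`, the reference size, and the `alpha` threshold are assumptions made for illustration, and the summary does not state which alignment step (for example pelvis or Procrustes alignment) or PCK threshold the author applied.

```python
import numpy as np


def mean_point_error(pred, gt):
    """Mean Euclidean distance between corresponding 3D points.

    pred, gt: arrays of shape (N, 3). With joints as input this is MPJPE;
    with mesh vertices as input it is MPVPE. No alignment is applied here
    (an assumption; evaluation protocols often align the root joint first).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def pck(pred_2d, gt_2d, ref_size, alpha=0.5):
    """Fraction of 2D keypoints within alpha * ref_size of the ground truth.

    ref_size is a per-image normalizer such as a bounding-box or head size;
    which one to use, and the value of alpha, are assumptions of this sketch.
    """
    pred_2d = np.asarray(pred_2d, dtype=float)
    gt_2d = np.asarray(gt_2d, dtype=float)
    dist = np.linalg.norm(pred_2d - gt_2d, axis=-1)
    return float((dist <= alpha * ref_size).mean())


# Toy data: four 3D points, each predicted 5 cm away from the ground truth.
gt_3d = np.zeros((4, 3))
pred_3d = gt_3d + np.array([0.03, 0.0, 0.04])
error_3d = mean_point_error(pred_3d, gt_3d)

# Toy data: four 2D keypoints with pixel errors of 1, 2, 6 and 0.
gt_2d = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
pred_2d = gt_2d + np.array([[1, 0], [0, 2], [6, 0], [0, 0]])
score = pck(pred_2d, gt_2d, ref_size=10.0, alpha=0.5)
```

On the toy data, every 3D point is off by the same distance, so the mean error equals that distance, and three of the four 2D keypoints fall within the 5-pixel threshold. Because MPJPE and MPVPE share one formula and differ only in the point set, adding landmarks to the fitting objective can be assessed with the same function on joints and on vertices.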