Image based accurate body pose and shape estimation from 2D landmarks
| Author: | |
|---|---|
| Resource type: | master's thesis |
| Publication date: | 2025 |
| Country: | Spain |
| Institution: | Universitat Politècnica de Catalunya (UPC) |
| Repository: | UPCommons. Portal del coneixement obert de la UPC |
| Language: | English |
| OAI Identifier: | oai:upcommons.upc.edu:2117/430202 |
| Online access: | https://hdl.handle.net/2117/430202 |
| Access level: | open access |
| Keywords: | Computer vision; Three-dimensional display systems; Deep learning (Machine learning); 3D Human Pose Estimation; SMPLX; 2D-to-3D Lifting; Keypoint Detection; Volumetric Landmarks; Human Body Modeling; UPC subject areas::Computer science::Artificial intelligence::Machine learning |
| Abstract: | Estimating the 3D pose and shape of the human body from a single 2D image remains one of the most challenging and actively researched problems in computer vision. The task has wide-ranging applications in human-computer interaction, virtual and augmented reality, sports analysis, and medical diagnostics. Its inherent complexity lies in the ambiguities introduced by projecting a three-dimensional structure onto two dimensions, where information about depth, occlusions, and body shape variation is lost. This master's thesis addresses the challenge by proposing a novel, modular pipeline that integrates recent advances in 2D keypoint detection with parametric human body models, focusing specifically on the precision of the 2D-to-3D lifting process. The pipeline is structured around three core components: (1) the selection and training of high-performance 2D keypoint detectors, (2) the definition and enhancement of body landmarks beyond the conventional joint definitions, and (3) the adaptation of SMPLifyX to incorporate these additional landmarks for improved 3D body fitting. For the 2D detection stage, the performance of several state-of-the-art architectures (ViTPose, HRNet, and PVTv2) is evaluated and compared using the BEDLAM dataset, which offers ground-truth annotations in SMPLX format. The detectors are assessed on their PCK accuracy and on their ability to generalize across different datasets, such as MS COCO and SSP-3D. A central contribution of this work is a novel set of volumetric landmarks, designed to capture additional information about the shape of the human body. These landmarks are defined through an intuitive anatomical approach and supplement the traditional joint definitions during both training and evaluation. The resulting extended keypoint sets, comprising 14, 39, and 67 landmarks, are shown to improve the fidelity of the 2D-to-3D lifting process, particularly when integrated into the objective function of SMPLifyX. The modified optimization uses the additional information provided by the landmarks to better constrain the estimation of shape and pose, especially in scenarios where standard joints alone might be insufficient. Quantitative evaluation uses two standard metrics in the field: Mean Per Joint Position Error (MPJPE) and Mean Per Vertex Position Error (MPVPE). Experiments on the 3DPW dataset, with SMPLX annotations provided by the BEDLAM project, demonstrate that including the landmarks significantly improves reconstruction accuracy, often outperforming state-of-the-art methods under equivalent conditions. A qualitative analysis on SSP-3D highlights how landmark-enriched detection contributes to improved shape and pose realism, particularly for non-standard body types. In summary, this thesis contributes both a practical pipeline for improved monocular human body estimation and a methodological framework for integrating anatomical landmarks into existing models. The results confirm that the precision of 2D detection is crucial for accurate 3D estimation and that enriching the input representation with volumetric cues leads to measurable improvements. This work thus advances the field's understanding of how 2D-to-3D lifting can be refined and opens new avenues for future exploration in multimodal body modeling. |
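The abstract names three evaluation metrics (PCK, MPJPE, and MPVPE) without defining them. The following is a minimal sketch of their usual textbook definitions, not the thesis's evaluation code. The function names, the `alpha` and `ref_size` parameters, and the choice of normalizing the PCK threshold by a reference size are assumptions made for illustration; any alignment step applied before measuring 3D error (for example root or Procrustes alignment) is omitted because the record does not specify one.

```python
import math


def mean_point_error(pred, gt):
    """Mean Euclidean distance between corresponding 3D points.

    Passing joints gives MPJPE; passing mesh vertices gives MPVPE.
    Hypothetical helper: no alignment is applied to the inputs.
    """
    if not pred or len(pred) != len(gt):
        raise ValueError("pred and gt must be non-empty and equally long")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def pck(pred_2d, gt_2d, ref_size, alpha=0.2):
    """Fraction of 2D keypoints within alpha * ref_size of ground truth.

    ref_size (e.g. a bounding-box or torso length in pixels) and the
    default alpha are assumptions; the record does not state them.
    """
    if not pred_2d or len(pred_2d) != len(gt_2d):
        raise ValueError("pred_2d and gt_2d must be non-empty and equally long")
    threshold = alpha * ref_size
    hits = sum(math.dist(p, g) <= threshold for p, g in zip(pred_2d, gt_2d))
    return hits / len(pred_2d)


# Toy data only: two 3D points that are off by 3 and 4 units.
pred_3d = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_3d = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error_3d = mean_point_error(pred_3d, gt_3d)

# Toy data only: three 2D keypoints that are off by 2, 0 and 10 pixels.
pred_kp = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
gt_kp = [(10.0, 12.0), (20.0, 20.0), (30.0, 40.0)]
score = pck(pred_kp, gt_kp, ref_size=20.0, alpha=0.2)
```

Because MPJPE and MPVPE differ only in which point set is compared, a single helper covers both; the same function could also be applied to the 14-, 39-, or 67-landmark sets mentioned in the abstract, provided predicted and ground-truth points are listed in the same order.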