Understanding Human Interaction Through 4D Motion Reconstruction
Researcher: Jyun-Ting Song. Thesis Committee: Kris Kitani (chair), Fernando De La Torre, Shubham Tulsiani, Erica Weng
By Ashlyn Lacovara
Understanding human behavior requires more than modeling people in isolation. In everyday life, humans constantly interact—with one another through physical contact, and with objects through grasping, pushing, and fine hand manipulation. Capturing these interactions accurately is essential for progress in fields such as robotics, virtual and augmented reality, animation, and human-centered artificial intelligence.
Carnegie Mellon University’s Robotics Institute is addressing this challenge by developing markerless, multi-view 4D human reconstruction systems that can capture full-body motion over time—even during close human-human and human-object interactions where existing methods often fail.
For decades, large-scale human motion datasets have relied on marker-based motion capture systems. These systems attach reflective markers to a person’s body and track them using multiple infrared cameras, producing extremely accurate 3D reconstructions. While effective in controlled studios, marker-based systems have major drawbacks: they restrict natural movement, require specialized environments, and introduce visual artifacts that can bias vision-based learning models toward the markers themselves rather than true human shape and motion.
Markerless motion capture aims to remove these constraints. Among markerless approaches, vision-based multi-view capture has emerged as a powerful alternative. By synchronizing and calibrating multiple cameras around a scene, these systems reconstruct 3D human pose and geometry directly from video, enabling more natural motion capture and producing raw visual data useful for many downstream tasks.
However, vision-based systems face a critical limitation: they break down during interaction.
When people interact closely—wrestling, dancing, fencing, or manipulating objects—body parts frequently block one another from view. Hands disappear behind torsos, limbs overlap, and objects obscure fingers. These effects, known as occlusion and truncation, cause standard pose-estimation pipelines to fail.
Most existing systems rely on detecting 2D body joints in individual camera views and triangulating them into 3D. When joints are missing or ambiguous in 2D, errors compound during triangulation, leading to unstable or incomplete reconstructions—especially in fine-grained regions like hands and fingers. As a result, many methods either drop entire people from the reconstruction or produce physically implausible meshes during contact. This research directly tackles these failure modes.
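To make the standard triangulation step concrete, here is a minimal sketch of linear (DLT) triangulation of a single joint across calibrated views, with a confidence threshold standing in for whatever filtering a real pipeline applies. The function and parameter names are illustrative placeholders, not the system described in this article.

```python
import numpy as np

def triangulate_joint(proj_mats, points_2d, confidences, min_conf=0.3):
    """Linear (DLT) triangulation of one joint from several camera views.

    proj_mats:   list of 3x4 camera projection matrices (calibrated views)
    points_2d:   list of (x, y) detections of the same joint, one per view
    confidences: per-view detection confidence in [0, 1]
    Returns the 3D point, or None if too few reliable views remain,
    which is exactly where occlusion makes the estimate unstable.
    """
    rows = []
    for P, (x, y), c in zip(proj_mats, points_2d, confidences):
        if c < min_conf:                      # occluded or ambiguous: drop this view
            continue
        rows.append(c * (x * P[2] - P[0]))    # confidence-weighted DLT constraints
        rows.append(c * (y * P[2] - P[1]))
    if len(rows) < 4:                         # fewer than two usable views
        return None
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)               # least-squares solution in null space
    X = vt[-1]
    return X[:3] / X[3]                        # dehomogenize
```

When occlusion leaves fewer than two reliable views, the joint simply cannot be recovered this way, and a single wrong 2D detection in a remaining view shifts the 3D estimate directly; this is the failure mode the proposed pipeline is built to avoid.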
Rather than treating interaction as an edge case, the proposed system is designed around it. The approach uses a small, mobile, markerless multi-camera setup, combining first-person and third-person views to capture complex interactions in real-world environments. All cameras are synchronized and calibrated into a shared, gravity-aligned coordinate system, allowing precise 3D reasoning without requiring a studio or physical markers.
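As a rough illustration of what a shared, gravity-aligned coordinate system involves, the sketch below rotates a set of world-to-camera extrinsics so that an estimated up direction becomes the world +z axis. The up-vector source and all names here are assumptions made for the example; the article does not specify the actual calibration procedure.

```python
import numpy as np

def gravity_align(up_vec, cam_rotations, cam_translations):
    """Rotate a reconstruction so an estimated 'up' direction becomes world +z.

    up_vec:           3-vector, e.g. a floor-plane normal or the negative of a
                      gravity reading (assumed input for this sketch)
    cam_rotations:    list of 3x3 world-to-camera rotation matrices
    cam_translations: list of 3-vector world-to-camera translations
    Returns the alignment rotation and the updated extrinsics.
    """
    u = up_vec / np.linalg.norm(up_vec)
    z = np.array([0.0, 0.0, 1.0])
    axis = np.cross(u, z)
    s, c = np.linalg.norm(axis), float(np.dot(u, z))
    if s < 1e-8:                               # already aligned, or exactly flipped
        R_align = np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    else:
        k = axis / s
        K = np.array([[0, -k[2], k[1]],
                      [k[2], 0, -k[0]],
                      [-k[1], k[0], 0]])
        R_align = np.eye(3) + s * K + (1 - c) * (K @ K)   # Rodrigues formula
    # A world point maps as x_new = R_align @ x_old, so world-to-camera
    # extrinsics compose as R_cam @ R_align.T; translations are unchanged.
    new_R = [R @ R_align.T for R in cam_rotations]
    return R_align, new_R, list(cam_translations)
```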
The processing pipeline operates in multiple stages:
- Multi-View Instance Segmentation: Each camera view first separates individual people at the pixel level, even when bodies overlap. This step is critical for distinguishing who is who during physical contact.
- Segmentation-Conditioned 2D Pose Estimation: Instead of estimating body joints in isolation, pose estimation is conditioned on the segmentation masks. This allows the system to reason about missing or fully occluded body parts and to disambiguate joints when multiple people are in close proximity.
- Temporal 3D Pose Forecasting: When joints are momentarily invisible, the system uses motion information from earlier frames to predict where they should be (see the first sketch after this list). This temporal feedback loop stabilizes tracking through severe occlusion and fast movement.
- Multi-View 3D Reconstruction: Cleaned and disambiguated 2D poses are triangulated across views to recover consistent 3D skeletons over time.
- Physics-Aware Mesh Fitting: Finally, a parametric human body model is fitted to the 3D skeletons using multi-stage optimization. The system explicitly minimizes mesh interpenetration (see the second sketch after this list), ensuring that reconstructed bodies remain physically plausible during contact.
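The temporal forecasting step can be illustrated with a deliberately simple constant-velocity predictor that fills in joints hidden in the current frame. The real system relies on learned motion models, so treat this as a toy stand-in that only shows the idea.

```python
import numpy as np

def fill_occluded_joints(prev_pose, last_pose, current_pose, visible_mask):
    """Replace occluded joints in the current frame with a constant-velocity
    forecast from the two preceding frames.

    prev_pose, last_pose, current_pose: (J, 3) arrays of 3D joint positions
    visible_mask: (J,) boolean array; False marks joints lost to occlusion
    """
    predicted = last_pose + (last_pose - prev_pose)    # assume motion continues
    filled = current_pose.copy()
    filled[~visible_mask] = predicted[~visible_mask]   # keep observed joints as-is
    return filled

# Example: joint 1 is hidden this frame, so it is carried forward along its motion.
prev = np.array([[0.0, 0.0, 1.0], [0.5, 0.0, 1.2]])
last = np.array([[0.1, 0.0, 1.0], [0.6, 0.0, 1.2]])
curr = np.array([[0.2, 0.0, 1.0], [np.nan, np.nan, np.nan]])
print(fill_occluded_joints(prev, last, curr, np.array([True, False])))
```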
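For the physics-aware fitting stage, one common way to penalize interpenetration is to approximate each body with simple volumes and charge a cost whenever volumes from the two bodies overlap. The sphere-proxy penalty below is a hedged simplification along those lines, not the optimization term actually used in this work.

```python
import numpy as np

def interpenetration_penalty(centers_a, radii_a, centers_b, radii_b):
    """Simplified contact penalty between two bodies approximated by spheres.

    centers_*: (N, 3) sphere centers placed along each body (e.g. at joints)
    radii_*:   (N,) sphere radii
    The penalty grows quadratically with the overlap between any sphere of
    body A and any sphere of body B, so minimizing it pushes the two
    surfaces apart until they only touch.
    """
    diff = centers_a[:, None, :] - centers_b[None, :, :]   # pairwise offsets
    dist = np.linalg.norm(diff, axis=-1)                   # (N_a, N_b) distances
    overlap = np.maximum(0.0, radii_a[:, None] + radii_b[None, :] - dist)
    return float(np.sum(overlap ** 2))
```

In a full fitting pipeline, a term like this would be minimized jointly with data terms that keep the mesh consistent with the triangulated skeletons, which is what keeps reconstructed bodies from passing through one another during contact.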
Together, these steps enable the extraction of 4D human meshes—full 3D geometry evolving over time—even in scenarios that defeat conventional motion capture.
Using this pipeline, the researchers created Harmony4D, a large-scale, in-the-wild dataset focused on close human-human interactions.
Harmony4D contains:
- Over 1.6 million multi-view images
- More than 3.3 million visible human instances
- Ground-truth camera parameters, 2D and 3D poses, tracking identities, and full 3D human meshes
- Vertex-level contact annotations between interacting bodies
Unlike prior datasets, Harmony4D emphasizes natural movement, subject diversity, and real-world settings, making it far more representative of everyday human interaction.
Together, Harmony4D and Contact4D demonstrate that the primary bottleneck in modeling human interaction is not model design alone, but the lack of realistic, large-scale data captured during interaction. When existing reconstruction methods are fine-tuned on these datasets, they show dramatic improvements in occlusion handling, contact reasoning, and mesh quality—often outperforming methods explicitly designed for interaction.
By combining novel multi-view reconstruction pipelines with large, diverse datasets, this work lays critical groundwork for building AI systems that understand humans as they actually move and interact in the real world—not just in controlled labs.
