JointHOI: Jointly Generating Contact Maps Enhances Hand Object Interaction Generation

Mingyeong Song¹, Jungbin Cho²˒³, Jisoo Kim², Ananya Bal³, Kartik Sharma³, Youngjae Yu⁴, Laszlo A. Jeni³, Junhyug Noh¹
¹Ewha Womans University   ²Yonsei University   ³Carnegie Mellon University   ⁴Seoul National University

JointHOI generates physically grounded bimanual interactions from diverse text prompts.


Abstract

Text-driven hand–object interaction (HOI) generation is gaining attention for immersive applications and robotics, yet producing physically plausible interactions remains challenging. Even when individual motions appear natural, small contact errors can cause conspicuous artifacts such as floating and interpenetration. Prior methods mitigate these issues using explicit contact cues or implicit grasp priors, but typically rely on multi-stage pipelines and fail to model temporally evolving contact.

We present JointHOI, a single-stage diffusion framework that jointly generates 3D hand–object motion and dynamic, distance-based contact maps from text. Treating contact as an auxiliary inner modality lets the model learn contact–motion coupling during training. At inference, contact-guided sampling enforces consistency between the generated contact maps and the geometry implied by the generated motion, improving temporal stability and reducing penetration and floating.


Method

To address the limitations of prior multi-stage or implicit approaches, we propose JointHOI, a framework that treats contact not as a post-processing signal, but as an inner modality of hand–object motion.

We jointly model 3D hand–object motion and dynamic, distance-based contact maps within a unified diffusion framework.
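To make "dynamic, distance-based contact map" concrete, the sketch below computes a per-frame, per-vertex contact value from hand and object geometry. This page does not give the exact contact definition, so the exponential falloff, the `tau` length scale, and the function name `distance_contact_map` are assumptions chosen for illustration rather than the authors' formulation.

```python
import numpy as np

def distance_contact_map(hand_verts, obj_points, tau=0.01):
    """Soft contact value per hand vertex and frame (illustrative only).

    hand_verts: (T, V, 3) hand vertex positions over T frames.
    obj_points: (T, P, 3) object surface samples over the same frames.
    tau:        assumed length scale (same units as the inputs).
    Returns an array of shape (T, V) with values in (0, 1];
    1 means the vertex coincides with an object point.
    """
    # Pairwise offsets between every hand vertex and object point: (T, V, P, 3)
    diff = hand_verts[:, :, None, :] - obj_points[:, None, :, :]
    dist = np.linalg.norm(diff, axis=-1)   # (T, V, P)
    nearest = dist.min(axis=-1)            # distance to closest object point, (T, V)
    return np.exp(-nearest / tau)          # assumed falloff, not from the paper

# Toy data: 2 frames, 3 hand vertices, 4 object points on a line.
T, V, P = 2, 3, 4
obj = np.zeros((T, P, 3))
obj[..., 0] = np.arange(P) * 0.1
hand = np.zeros((T, V, 3))
hand[:, 1, 2] = 0.01   # vertex 1 hovers 1 cm above the object
hand[:, 2, 2] = 1.0    # vertex 2 is far away
cmap = distance_contact_map(hand, obj)
```

Because the map is recomputed for every frame, it changes as the hand approaches, grasps, and releases the object, which is the sense in which such a map is dynamic.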

  • Joint Generation. We introduce a single-stage diffusion model that jointly generates 3D hand–object motion together with dynamic, distance-based contact maps. By modeling motion and contact within a unified generative process, the model learns explicit contact–motion coupling and captures their spatiotemporal co-evolution.
  • Contact Inner Guidance (CIG). At inference time, the generated contact maps serve as an inner guidance signal that steers the denoising process. CIG enforces consistency between the predicted contact and the geometry implied by the generated motion, reducing artifacts such as interpenetration, floating, and unstable grasps.
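The consistency idea behind CIG can be illustrated with a small sketch: measure the mismatch between a predicted contact map and the contact implied by the current geometry, then move the hand vertices along the negative gradient of that mismatch. Everything here is an assumption made for illustration: the squared-error energy, the exponential contact model, the analytic gradient, the `step` size, and the function names. It applies one update to clean geometry and is not the sampler described in the paper, where guidance would act inside the denoising loop.

```python
import numpy as np

def implied_contact(hand_verts, obj_points, tau=0.01):
    """Contact implied by geometry, plus unit directions object -> hand vertex."""
    diff = hand_verts[:, :, None, :] - obj_points[:, None, :, :]   # (T, V, P, 3)
    dist = np.linalg.norm(diff, axis=-1)                           # (T, V, P)
    idx = dist.argmin(axis=-1)                                     # nearest point, (T, V)
    nearest = np.take_along_axis(dist, idx[..., None], axis=-1)[..., 0]
    offset = np.take_along_axis(diff, idx[..., None, None], axis=2)[:, :, 0, :]
    unit = offset / np.maximum(nearest, 1e-8)[..., None]           # avoid divide-by-zero
    return np.exp(-nearest / tau), unit

def consistency_energy(contact_pred, contact_geo):
    """Assumed energy: half the squared mismatch between the two contact maps."""
    return 0.5 * np.sum((contact_geo - contact_pred) ** 2)

def guidance_step(hand_verts, obj_points, contact_pred, tau=0.01, step=1e-4):
    """One gradient-descent update of the hand vertices on the energy above."""
    contact_geo, unit = implied_contact(hand_verts, obj_points, tau)
    # Chain rule: dE/dh = (c_geo - c_pred) * dc_geo/dh, dc_geo/dh = -(c_geo / tau) * unit
    coeff = (contact_geo - contact_pred) * (-contact_geo / tau)    # (T, V)
    return hand_verts - step * coeff[..., None] * unit

# Toy case: the contact map says "touching", but the vertex floats 2 cm away.
obj = np.zeros((1, 1, 3))
hand = np.array([[[0.0, 0.0, 0.02]]])
pred = np.ones((1, 1))
before = consistency_energy(pred, implied_contact(hand, obj)[0])
guided = guidance_step(hand, obj, pred)
after = consistency_energy(pred, implied_contact(guided, obj)[0])
```

In this toy case the update pulls the floating vertex toward the object and lowers the energy. An unsigned distance cannot distinguish a vertex inside the object from one outside it, so handling interpenetration would need additional terms that this sketch does not include.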

This design enables physically plausible and temporally stable hand–object interactions directly from text, without requiring multi-stage pipelines or post-hoc refinement.


Experiments

Quantitative Results

Qualitative Results


Videos

Comparison with Baselines

Additional Qualitative Results


BibTeX

@inproceedings{song2026jointhoi,
  title     = {JointHOI: Jointly Generating Contact Maps Enhances Hand Object Interaction Generation},
  author    = {Song, Mingyeong and Cho, Jungbin and Kim, Jisoo and Bal, Ananya and Sharma, Kartik and Yu, Youngjae and Jeni, Laszlo A. and Noh, Junhyug},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2026},
}