Text-driven hand–object interaction (HOI) generation is gaining attention for immersive applications and robotics, yet producing physically plausible interactions remains challenging. Even when individual motions appear natural, small contact errors can cause conspicuous artifacts such as floating and interpenetration. Prior methods mitigate these issues using explicit contact cues or implicit grasp priors, but typically rely on multi-stage pipelines and fail to model temporally evolving contact.
We present JointHOI, a single-stage diffusion framework that jointly generates 3D hand–object motion and dynamic, distance-based contact maps from text. By treating contact as an auxiliary inner modality, joint generation enables the model to learn contact–motion coupling during training. At inference, contact-guided sampling enforces consistency between generated contact maps and motion-implied geometry, improving temporal stability and reducing penetration and floating.
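As a rough illustration of what a distance-based contact map and a contact-to-geometry consistency check might look like, the sketch below computes a per-frame contact value for each hand point from its distance to the nearest object point, then measures how far a predicted contact sequence deviates from the contact implied by the motion. This is not the authors' implementation: the function names (`contact_map`, `contact_sequence`, `consistency_error`), the exponential mapping `exp(-d / sigma)`, the `sigma` value, and the mean-squared-error measure are all assumptions chosen for clarity.

```python
import math

def contact_map(hand_pts, obj_pts, sigma=0.01):
    """Contact value in (0, 1] for each hand point in one frame.

    Assumption: contact = exp(-d / sigma), where d is the distance from the
    hand point to its nearest object point. The source does not specify the
    exact distance-to-contact mapping.
    """
    values = []
    for h in hand_pts:
        d = min(math.dist(h, o) for o in obj_pts)  # nearest-object distance
        values.append(math.exp(-d / sigma))
    return values

def contact_sequence(hand_seq, obj_seq, sigma=0.01):
    """Dynamic contact map: one contact_map per frame of the motion."""
    return [contact_map(h, o, sigma) for h, o in zip(hand_seq, obj_seq)]

def consistency_error(pred_contact_seq, hand_seq, obj_seq, sigma=0.01):
    """Mean squared gap between predicted contact and motion-implied contact.

    Hypothetical stand-in for a consistency term that guided sampling
    could try to reduce; the actual guidance objective may differ.
    """
    implied = contact_sequence(hand_seq, obj_seq, sigma)
    total, count = 0.0, 0
    for pred_frame, implied_frame in zip(pred_contact_seq, implied):
        for p, g in zip(pred_frame, implied_frame):
            total += (p - g) ** 2
            count += 1
    return total / count

# Toy example: one hand point approaching one object point over two frames.
hand_seq = [[(0.0, 0.0, 0.02)], [(0.0, 0.0, 0.0)]]
obj_seq = [[(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]]
implied = contact_sequence(hand_seq, obj_seq)
error_if_always_touching = consistency_error([[1.0], [1.0]], hand_seq, obj_seq)
```

In this toy case the implied contact rises as the hand point reaches the object, and a prediction that claims full contact in both frames yields a nonzero error because the first frame's geometry still shows a gap. A gap of this kind corresponds to the floating artifact; the sketch does not detect interpenetration, which would need signed distances.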
To address the limitations of prior multi-stage or implicit approaches, we propose JointHOI, a framework that treats contact as an inner modality of hand–object motion rather than as a post-processing signal.
We jointly model 3D hand–object motion and dynamic, distance-based contact maps within a unified diffusion framework.
This design enables physically plausible and temporally stable hand–object interactions directly from text, without requiring multi-stage pipelines or post-hoc refinement.
@inproceedings{song2026jointhoi,
title = {{JointHOI}: Jointly Generating Contact Maps Enhances Hand Object Interaction Generation},
author = {Song, Mingyeong and Cho, Jungbin and Kim, Jisoo and Bal, Ananya and Sharma, Kartik and Yu, Youngjae and Jeni, Laszlo A. and Noh, Junhyug},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2026},
}