Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation
Code (coming soon) | arXiv

We propose a structure and contact-aware representation that acts as an additional interaction-oriented generative supervision signal. By learning to jointly generate videos and this representation at a large scale, our model captures interaction patterns consistent with physical constraints, enabling strong generalization to complex open-world interactions, even with unseen non-rigid objects.
Abstract
Generating realistic hand-object interaction (HOI) videos is a significant challenge due to the difficulty of modeling physical constraints (e.g., contact and occlusion between hands and manipulated objects). Current methods use an HOI representation as an auxiliary generative objective to guide video synthesis. However, existing 2D and 3D representations pose a dilemma: neither guarantees scalability and interaction fidelity at the same time. To address this limitation, we propose a structure and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structural context without requiring 3D annotations. This interaction-oriented and scalable supervision signal enables the model to learn fine-grained interaction physics and generalize to open-world scenarios. To fully exploit the proposed representation, we introduce a joint-generation paradigm with a share-and-specialization strategy that generates the interaction-oriented representation and the video together. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two real-world datasets in generating physically realistic and temporally coherent HOI videos. Furthermore, our approach generalizes well to challenging open-world scenarios, highlighting the benefit of our scalable design.
- We propose a structure and contact-aware representation as a scalable and interaction-oriented supervisory signal that guides the model to capture fine-grained interaction physics. We curate this representation for over 100k HOI videos, facilitating large-scale training.
- We introduce a joint-generation paradigm with a share-and-specialization strategy that generates the proposed HOI representation and videos simultaneously, mitigating multi-stage error accumulation.
- Extensive experiments demonstrate that our method generates physically realistic HOI videos, surpassing state-of-the-art methods on two real-world datasets and showing strong generalization to open-world scenarios.
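Since the code is not yet released, the share-and-specialization idea can only be sketched from the description above: a shared trunk processes fused tokens, and two specialized heads decode the video and the HOI representation from the same features. The class and dimension names below are illustrative assumptions, not the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)


def linear(x, w, b):
    # Simple affine layer reused by the trunk and both heads.
    return x @ w + b


class SharedAndSpecialized:
    """Hypothetical sketch of a share-and-specialization design:
    one shared trunk, two specialized output heads (video + HOI
    representation). All sizes are placeholder assumptions."""

    def __init__(self, dim=16, video_dim=8, repr_dim=4):
        self.w_shared = rng.standard_normal((dim, dim)) * 0.1
        self.b_shared = np.zeros(dim)
        self.w_video = rng.standard_normal((dim, video_dim)) * 0.1
        self.b_video = np.zeros(video_dim)
        self.w_repr = rng.standard_normal((dim, repr_dim)) * 0.1
        self.b_repr = np.zeros(repr_dim)

    def forward(self, tokens):
        # Shared features drive both outputs, so gradients from the
        # representation head also shape the video pathway.
        h = np.tanh(linear(tokens, self.w_shared, self.b_shared))
        video = linear(h, self.w_video, self.b_video)  # video head
        hoi = linear(h, self.w_repr, self.b_repr)      # representation head
        return video, hoi


model = SharedAndSpecialized()
tokens = rng.standard_normal((10, 16))  # 10 fused tokens of dim 16
video_out, repr_out = model.forward(tokens)
print(video_out.shape, repr_out.shape)  # (10, 8) (10, 4)
```

Joint decoding from a shared trunk is what lets the representation act as supervision during generation rather than as a separate pipeline stage, which is how the description above motivates avoiding multi-stage error accumulation.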
Our HOI Representation
Joint Generation of Video and Representation
Experimental Results
“Use the knife to cut the plate.”
CogVideoX
Wan2.1
FLOVD
SCARC (ours)
SCARW (ours)
“Use the eraser to brush the bowl.”
CogVideoX
Wan2.1
FLOVD
SCARC (ours)
SCARW (ours)
“Put out something from the bowl using the spatula.”
CogVideoX
Wan2.1
FLOVD
SCARC (ours)
SCARW (ours)
“Move the regular notebook next to the pink mug.”
CogVideoX
Wan2.1
FLOVD
SCARC (ours)
SCARW (ours)
“Place the red ham in the stockpot.”
CogVideoX
Wan2.1
FLOVD
SCARC (ours)
SCARW (ours)
Generalization to Open-world Scenarios
“Pick up the purple cloth and wipe the green bowl.”
CogVideoX
Wan2.1
FLOVD
SCARW (ours)
“Move the orange carrot toy into the semi-transparent glass cup.”
CogVideoX
Wan2.1
FLOVD
SCARW (ours)
“Move the purple grape toy into the semi-transparent glass cup.”
CogVideoX
Wan2.1
FLOVD
SCARW (ours)
“Move the green patterned earbud case onto the silver laptop.”
CogVideoX
Wan2.1
FLOVD
SCARW (ours)
“Pick up the small purple grape toy and put it in the light green bowl.”
CogVideoX
Wan2.1
FLOVD
SCARW (ours)