Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation

The Hong Kong University of Science and Technology (Guangzhou), Huawei Cloud
Code (coming soon) arXiv

We propose a structure and contact-aware representation that acts as an additional interaction-oriented generative supervision signal. By learning to jointly generate videos and this representation at scale, our model captures interaction patterns consistent with physical constraints, enabling strong generalization to complex open-world interactions, even with unseen non-rigid objects.

Abstract

Generating realistic hand-object interaction (HOI) videos is a significant challenge due to the difficulty of modeling physical constraints (e.g., contact and occlusion between hands and manipulated objects). Current methods use HOI representations as an auxiliary generative objective to guide video synthesis. However, 2D and 3D representations pose a dilemma: neither can simultaneously guarantee scalability and interaction fidelity. To address this limitation, we propose a structure and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structure context without 3D annotations. This interaction-oriented and scalable supervision signal enables the model to learn fine-grained interaction physics and generalize to open-world scenarios. To fully exploit the proposed representation, we introduce a joint-generation paradigm with a share-and-specialization strategy that generates interaction-oriented representations and videos simultaneously. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two real-world datasets in generating physics-realistic and temporally coherent HOI videos. Furthermore, our approach exhibits strong generalization to challenging open-world scenarios, highlighting the benefit of our scalable design.

  • We propose a structure and contact-aware representation as a scalable and interaction-oriented supervisory signal that guides the model to capture fine-grained interaction physics. We curate this representation for over 100k HOI videos, facilitating large-scale training.
  • We introduce a joint-generation paradigm with a share-and-specialization strategy that generates the proposed HOI representation and videos simultaneously, mitigating multi-stage error accumulation.
  • Extensive experiments demonstrate that our method generates physics-realistic HOI videos, surpassing state-of-the-art methods on two real-world datasets and showing strong generalization to open-world scenarios.

Our HOI Representation

Generated hand-object interaction sequences
Overview of our structure and contact-aware representation curation pipeline. It begins with (a) Segmentation Extraction, where a CoT-guided VLM grounds the hand and object in the input RGB video and SAM2 generates HOI masks. Next, (b) Contact Region Estimation produces contact-augmented hand-object contours by computing the contact region from the intersection of the dilated hand and object contours. In parallel, (c) Video Depth Estimation generates a dense depth map sequence that provides holistic structure context. Finally, the contact-augmented contours are alpha-blended onto the depth maps to form the final HOI representation.
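To make the curation steps concrete, below is a minimal per-frame sketch of (b) contact region estimation and the final alpha-blending onto the depth map. It assumes binary hand/object masks from SAM2 and a normalized depth map from the video depth estimator; the dilation size, contour colors, and blend weight are illustrative placeholders, not the paper's exact settings.

```python
# Minimal per-frame sketch: contact region from dilated mask intersection,
# then contact-augmented contours alpha-blended onto the depth map.
# Kernel size, colors, and alpha are illustrative assumptions.
import cv2
import numpy as np

def build_hoi_frame(hand_mask, obj_mask, depth, dilate_px=7, alpha=0.6):
    """hand_mask, obj_mask: HxW uint8 in {0, 1}; depth: HxW float32 in [0, 1]."""
    kernel = np.ones((dilate_px, dilate_px), np.uint8)

    # (b) Contact region: intersection of the dilated hand and object masks.
    hand_dil = cv2.dilate(hand_mask, kernel)
    obj_dil = cv2.dilate(obj_mask, kernel)
    contact = cv2.bitwise_and(hand_dil, obj_dil)

    # Draw hand and object contours on an empty canvas (colors are arbitrary).
    canvas = np.zeros((*hand_mask.shape, 3), np.uint8)
    for mask, color in ((hand_mask, (0, 255, 0)), (obj_mask, (255, 0, 0))):
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        cv2.drawContours(canvas, contours, -1, color, thickness=2)
    canvas[contact > 0] = (0, 0, 255)  # highlight the contact region

    # Alpha-blend the contact-augmented contours onto the depth map (c).
    depth_bgr = cv2.cvtColor((depth * 255).astype(np.uint8), cv2.COLOR_GRAY2BGR)
    blended = (alpha * canvas + (1 - alpha) * depth_bgr).astype(np.uint8)
    overlay = np.where(canvas.any(axis=-1, keepdims=True), blended, depth_bgr)
    return overlay  # one frame of the structure and contact-aware representation
```

Applying this per frame over the depth map sequence would yield a video-level HOI representation of the kind used as the generative supervision signal.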

Joint Generation of Video and Representation

Pipeline diagram of our JointHOI model
The joint-generation paradigm of our method. Given an observed image and a task description, our framework jointly generates a video and its corresponding HOI representation. The core technical novelty lies in the Hierarchical Joint Denoiser, which co-denoises visual and interaction tokens within a unified latent space. First, the Shared Semantics module enforces cross-modal consistency via an alignment loss (maximizing cosine similarity) to capture shared semantics such as spatial layout and temporal dynamics. Then, the Specialized Details module adds a learnable interaction embedding to capture modality-specific details. Finally, the denoised visual and interaction tokens are passed through the VAE decoder to reconstruct both outputs.
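As an illustration of the share-and-specialization idea, here is a minimal PyTorch sketch, assuming the visual and interaction latents are token sequences of the same shape. The single shared transformer block, the module names, and the placement of the learnable interaction embedding are simplifying assumptions rather than the paper's exact architecture.

```python
# Sketch of share-and-specialization: shared weights plus a cosine-similarity
# alignment term, then a modality-specific offset for interaction tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedAndSpecialized(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        # One block whose weights are shared by both modalities.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        # Learnable embedding marking interaction tokens as a separate modality.
        self.interaction_embed = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, vis_tokens, int_tokens):
        """vis_tokens, int_tokens: (B, N, dim) noisy latents of the two modalities."""
        # Shared semantics: both token streams pass through the same weights.
        vis_shared = self.shared_block(vis_tokens)
        int_shared = self.shared_block(int_tokens)

        # Alignment loss: maximize cosine similarity between paired tokens
        # so the two streams agree on spatial layout and temporal dynamics.
        align_loss = 1.0 - F.cosine_similarity(vis_shared, int_shared, dim=-1).mean()

        # Specialized details: interaction tokens receive a modality-specific
        # offset before the (omitted) modality-specific layers and VAE decoding.
        int_specialized = int_shared + self.interaction_embed
        return vis_shared, int_specialized, align_loss
```

In this reading, align_loss would be added to the denoising objectives of the video and representation streams so they are optimized jointly; the exact loss weighting is not shown here.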

Experimental Results

Observed Image:
Task Description:

“Use the knife to cut the plate.”

CogVideoX

Wan2.1

FLOVD

SCARC (ours)

SCARW (ours)

Observed Image:
Task Description:

“Use the eraser to brush the bowl.”

CogVideoX

Wan2.1

FLOVD

SCARC (ours)

SCARW (ours)

Observed Image:
Task Description:

“Put out something from the bowl using the spatula.”

CogVideoX

Wan2.1

FLOVD

SCARC (ours)

SCARW (ours)

Observed Image:
Task Description:

“Move the regular notebook next to the pink mug.”

CogVideoX

Wan2.1

FLOVD

SCARC (ours)

SCARW (ours)

Observed Image:
Task Description:

“Place the red ham in the stockpot.”

CogVideoX

Wan2.1

FLOVD

SCARC (ours)

SCARW (ours)

Generalization to Open-world Scenarios

Observed Image:
Task Description:

“Pick up the purple cloth and wipe the green bowl.”

CogVideoX

Wan2.1

FLOVD

SCARW (ours)

Observed Image:
Task Description:

“Move the orange carrot toy into the semi-transparent glass cup.”

CogVideoX

Wan2.1

FLOVD

SCARW (ours)

Observed Image:
Task Description:

“Move the purple grape toy into the semi-transparent glass cup.”

CogVideoX

Wan2.1

FLOVD

SCARW (ours)

Observed Image:
Task Description:

“Move the green patterned earbud case onto the silver laptop.”

CogVideoX

Wan2.1

FLOVD

SCARW (ours)

Observed Image:
Task Description:

“Pick up the small purple grape toy and put it in the light green bowl.”

CogVideoX

Wan2.1

FLOVD

SCARW (ours)