arxiv:2603.15614

Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion

Published on Mar 16 · Submitted by Zhenghong Zhou on Mar 17
Abstract

AI-generated summary: Tri-Prompting presents a unified framework for video diffusion that enables joint control of scene composition, multi-view subject consistency, and motion, achieving superior performance in identity preservation and 3D consistency.

Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. This lack of a unified architecture makes it difficult to support versatile, jointly controllable video generation. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To balance controllability and visual realism, we further propose an inference-time ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into arbitrary scenes and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.
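The abstract's dual-condition motion module conditions background motion on 3D tracking points and foreground subjects on downsampled RGB cues. Below is a minimal PyTorch sketch of that idea; the module name, the mask-based fusion, and all shapes are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a "dual-condition motion module" (assumed design, not
# the paper's implementation): background motion is encoded from rasterized
# 3D tracking points, foreground motion from downsampled RGB cues, and the
# two streams are fused into one conditioning signal for a ControlNet branch.
import torch
import torch.nn as nn

class DualConditionMotionModule(nn.Module):
    def __init__(self, hidden_dim: int = 320):
        super().__init__()
        # Encode 3D tracking points rendered as 3-channel point maps (assumption).
        self.track_encoder = nn.Conv2d(3, hidden_dim, kernel_size=3, padding=1)
        # Encode downsampled RGB cues of the foreground subject.
        self.rgb_encoder = nn.Conv2d(3, hidden_dim, kernel_size=3, padding=1)
        # Learned 1x1 fusion of the two streams.
        self.fuse = nn.Conv2d(2 * hidden_dim, hidden_dim, kernel_size=1)

    def forward(self, track_maps, rgb_cues, fg_mask):
        # track_maps, rgb_cues: (B, 3, H, W); fg_mask: (B, 1, H, W) in [0, 1].
        # Route tracking features to the background, RGB features to the subject.
        bg_feat = self.track_encoder(track_maps) * (1.0 - fg_mask)
        fg_feat = self.rgb_encoder(rgb_cues) * fg_mask
        # Concatenate and project; the output conditions the diffusion backbone.
        return self.fuse(torch.cat([bg_feat, fg_feat], dim=1))
```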

Community

Paper submitter

🎬 Tri-Prompting: Scene (where), Subject (who), and Motion (how), unified at last!

Current video diffusion models often struggle with fine-grained, joint control. We introduce Tri-Prompting, a unified framework that enables simultaneous control over scene composition, multi-view subject consistency, and motion.

Key Highlights:
🔹 Unified Control: Jointly manages scene, subject, and motion in one model.
🔹 Dual-Conditioning & Multi-View Subject Consistency: Separates foreground and background motion cues while preserving identity across views.
🔹 3D-Aware Applications & Strong Results: Enables multi-view subject insertion and manipulation, outperforming specialized baselines such as DaS and Phantom.

🔗 Demos: https://zhouzhenghong-gt.github.io/Tri-Prompting-Page/
🔗 Paper: https://arxiv.org/abs/2603.15614
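
The abstract also proposes an inference-time ControlNet scale schedule to balance controllability against visual realism. A common way to realize such a schedule is to apply strong control early in denoising, when layout and motion are being decided, and relax it later, when fine texture is refined. The cosine decay below is an assumed form for illustration, not the paper's published schedule.

```python
# Sketch of an inference-time ControlNet scale schedule (assumed cosine form).
import math

def controlnet_scale(step: int, num_steps: int,
                     start: float = 1.0, end: float = 0.2) -> float:
    """Decay the ControlNet conditioning scale across denoising steps:
    strong control early (layout/motion), weaker control late (texture),
    interpolated with a cosine curve from `start` down to `end`."""
    t = step / max(num_steps - 1, 1)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * t))

# Hypothetical usage: scale each ControlNet residual before it is added
# to the diffusion backbone (`controlnet` and `unet` are placeholders).
# for step in range(num_steps):
#     control = controlnet(latents, cond)
#     latents = unet(latents, control,
#                    scale=controlnet_scale(step, num_steps))
```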

