CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration

Abstract

We present CoMoGen, a controllable video generation framework that generates realistic interactive dynamics from a single binary mask sequence conditioned on an input image. CoMoGen introduces a lightweight MaskAdapter that encodes binary mask sequences into a latent residual signal, injected into the Multi Modal Diffusion Transformer (MMDiT) model through a cosine-weighted schedule. Unlike the hierarchical coarse-to-fine design of UNet architectures, MMDiT operates as a sequence of uniform transformer blocks, making it difficult to identify which layers are responsible for the motion generation. Therefore, we propose a novel way to determine "Motion Layers" operating in the attention space of MMDiT. We fine-tune the model by using Low-Rank Adaptation (LoRA) to the \layer, without requiring any architecture change in the MMDiT. This selective adaptation enables our method to focus on motion-critical components, yielding reduced computational cost. Despite its simplicity, CoMoGen enables precise subject motion and plausible interactions with surrounding humans, objects, and scenes. Comprehensive experiments on different datasets show that CoMoGen consistently outperforms prior controllable video generation methods and achieves state-of-the-art performance in motion fidelity and perceptual realism.

Layer Analysis

Attention Visualization

Attention Score Analysis

We generate 60 videos using our base video model. For each video, we then sample two sets of three layers: three randomly selected layers from Motion Layers, and three randomly selected layers from 11 lowest scoring layers among Non-Motion Layers. We report DAVIS benchmark metrics to measure mask alignment quality, and we additionally compute a VQA score to assess text–video semantic accuracy.

Generated Video

Skip Non-Motion Layers

Skip Motion Layers

	VQA Score
Generated Video	0.4540
Motion Layers	0.2742
Non Motion Layers	0.4068

VQA Score

Generated Video

0.4540

Motion Layers

0.2742

Non Motion Layers

0.4068

DAVIS Metrics	J-Mean	J-Recall	J-Decay	F-Mean	F-Recall	F-Decay	J&F	HOTA
Motion Layers	60.161	71.020	26.614	59.984	63.435	29.584	60.073	57.332
Non Motion Layers	77.768	91.156	17.726	81.133	88.844	19.297	79.451	82.313

DAVIS Metrics

J-Mean

J-Recall

J-Decay

F-Mean

F-Recall

F-Decay

J&F

HOTA

Motion Layers

60.161

71.020

26.614

59.984

63.435

29.584

60.073

57.332

Non Motion Layers

77.768

91.156

17.726

81.133

88.844

19.297

79.451

82.313

Method

BEHAVE - Human Object Interaction

BibTeX

@misc{meric2026comogen, title={CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration}, author={Adil Meric and Lin Geng Foo and Mert Kiray and Benjamin Busam and Rishabh Dabral and Christian Theobalt}, year={2026}, eprint={2605.22996}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2605.22996}, }