Spatio-Temporal Control for Masked Motion Synthesis (ICCV 2025)

[Previous Name] ControlMM: Controllable Masked Motion Generation

Ekkasit Pinyoanuntapong¹, Muhammad Usama Saleem¹, Korrawe Karunratanakul², Pu Wang¹, Hongfei Xue¹, Chen Chen³, Chuan Guo⁴, Junli Cao⁴, Jian Ren⁴, Sergey Tulyakov⁴
¹University of North Carolina at Charlotte, ²ETH Zurich, ³University of Central Florida, ⁴Snap Inc.

TL;DR

Masked Motion Models have shown better quality and speed than Motion Diffusion Models. However, current SOTA methods for motion control primarily rely on Motion Diffusion Models. MaskControl is the first to introduce controllability to Masked Motion Models, through two novel components:
  1. Logits Regularizer (a ControlNet-like module for masked models)
  2. Logits Optimization (inference-time guidance for masked models)
The non-differentiable nature of the quantized model is addressed via Differentiable Expectation Sampling. MaskControl achieves SOTA in both quality and control precision while supporting a wide range of applications.
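
To make the idea behind Differentiable Expectation Sampling and Logits Optimization concrete, here is a minimal toy sketch, not the authors' implementation. It assumes a tiny hypothetical codebook, replaces the hard token pick with a probability-weighted average of codebook vectors (so the result is differentiable with respect to the logits), and then runs plain gradient descent on the logits to reduce a squared error against a target embedding. The squared error stands in for a real control loss, which would be measured on decoded joint positions; all function names, the learning rate, and the step count are illustrative assumptions.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_embedding(logits, codebook):
    # Probability-weighted average of codebook vectors.
    # Unlike argmax or sampling, this is differentiable in the logits.
    probs = softmax(logits)
    dim = len(codebook[0])
    return [sum(p * c[d] for p, c in zip(probs, codebook)) for d in range(dim)]

def control_loss_and_grad(logits, codebook, target):
    # Toy control loss: squared error between expected embedding and target.
    # Returns the loss and its analytic gradient with respect to the logits.
    probs = softmax(logits)
    e = expected_embedding(logits, codebook)
    diff = [ei - ti for ei, ti in zip(e, target)]
    loss = sum(d * d for d in diff)
    g = [2.0 * d for d in diff]                      # d(loss)/d(e)
    g_dot_e = sum(gi * ei for gi, ei in zip(g, e))
    # d(e)/d(z_j) = p_j * (c_j - e), so d(loss)/d(z_j) = p_j * (g.c_j - g.e)
    grad = [p * (sum(gi * ci for gi, ci in zip(g, c)) - g_dot_e)
            for p, c in zip(probs, codebook)]
    return loss, grad

def optimize_logits(logits, codebook, target, lr=0.5, steps=200):
    # Inference-time guidance (assumed form): gradient descent on the logits only.
    logits = list(logits)
    for _ in range(steps):
        _, grad = control_loss_and_grad(logits, codebook, target)
        logits = [z - lr * gz for z, gz in zip(logits, grad)]
    return logits

# Hypothetical 3-entry codebook of 2-D embeddings; the target equals entry 1.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
target = [1.0, 0.0]
init_logits = [0.0, 0.0, 0.0]

loss_before, _ = control_loss_and_grad(init_logits, codebook, target)
tuned = optimize_logits(init_logits, codebook, target)
loss_after, _ = control_loss_and_grad(tuned, codebook, target)
print(loss_before, loss_after, tuned.index(max(tuned)))
```

In this toy setting the optimized logits shift probability mass toward the codebook entry that matches the target, which is the behavior one would hope for when a control signal steers token logits at inference time.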

Method


Compared to SOTA - Multiple Joints

a person crosses their arms for chest fly

MaskControl (ours)

OmniControl

MotionLCM

a person jumps in the air once

MaskControl (ours)

OmniControl

MotionLCM

a person walks in a circle clockwise

MaskControl (ours)

OmniControl

MotionLCM

a person walks forward and waves his hands

MaskControl (ours)

OmniControl

MotionLCM

Compared to SOTA - Pelvis Only

a person walks forward and waves his hands

MaskControl (ours)

GMD

a person dances to salsa music

MaskControl (ours)

GMD

a person walks forward and come back to the same position from where we started

MaskControl (ours)

GMD

Compared to STMC for Body Part Timeline Control

MaskControl (ours)

STMC

The STMC result contains no "pick something" action and no "wipe" action, and the character walks to the wrong side.

Dense Signals

the person draws a heart with hand

person walks down and up in a figure 8 pattern

A figure walks forward in a zig zag pattern

a person waves both his arms

someone is lifting something up

a person stands and waving

a man walks in a curved line with his hands at his sides

a person walks with support

a person walks

Sparse Signals

A person walks forward with their hands up in a surrender pose

person walks over and sits down in a chair.

A person jumps and kicks a football in the air with their head

A person walks forward, casually greeting others with a wave or hello

a man walks left and right

A person walks, pauses, and performs a high kick in the air.

Body Part Timeline Control

Upper Body (frames 0-120): a person puts hands in the air.
Left Foot (frames 0-60): a person kicks left legs.
Lower Body (frames 60-120): a person jumps forward.
Upper-body motion for frames 0 to 120 is generated from the prompt “a person puts hands in the air.” Lower-body motion is generated in two parts: frames 0 to 60 follow the prompt “a person kicks left legs.” and frames 60 to 120 follow the prompt “a person jumps forward.”
Upper Body (frames 0-120): the person is bending over forward
Left Foot (frames 0-120): shake with their left leg
Upper-body motion for frames 0 to 120 is generated from the prompt "the person is bending over forward". Simultaneously, lower-body motion for frames 0 to 120 is generated from the prompt "shake with their left leg".
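
One way to picture a body-part timeline is as a list of segments, each tying a body part and a frame range to a text prompt. The sketch below is purely illustrative: the `Segment` class, the `active_prompts` helper, and the half-open frame convention are assumptions made for this example, not part of MaskControl. It encodes the first timeline example in this section, following its description, which groups both the kick and the jump under the lower body.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    body_part: str
    start: int   # first frame, inclusive
    end: int     # last frame, exclusive (assumed convention)
    prompt: str

def active_prompts(timeline, frame):
    # Map each body part to the prompt that governs it at the given frame.
    return {seg.body_part: seg.prompt
            for seg in timeline
            if seg.start <= frame < seg.end}

timeline = [
    Segment("upper body", 0, 120, "a person puts hands in the air."),
    Segment("lower body", 0, 60, "a person kicks left legs."),
    Segment("lower body", 60, 120, "a person jumps forward."),
]

for frame in (30, 90):
    print(frame, active_prompts(timeline, frame))
```

Querying a frame in the first half returns the kick prompt for the lower body, while a frame in the second half returns the jump prompt; the upper-body prompt stays active throughout.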

Obstacle Avoidance

the man walks zig zag.
the man walks forward in a straight line.