Bridging the Gap between Human Motion and Action Semantics via Kinematic Phrases

Abstract

The goal of motion understanding is to establish a reliable mapping between motion and action semantics, but this is a challenging many-to-many problem. An abstract action semantic (e.g., walk forwards) can be conveyed by perceptually diverse motions (walking with arms up or swinging), while a single motion can carry different semantics depending on its context and intention. This makes an elegant mapping between them difficult. Previous attempts adopted direct-mapping paradigms with limited reliability. Moreover, current automatic metrics fail to provide reliable assessments of the consistency between motions and action semantics. We identify the source of these problems as the significant gap between the two modalities. To alleviate this gap, we propose Kinematic Phrases (KP), which capture the objective kinematic facts of human motion with proper abstraction, interpretability, and generality. With KP as a mediator, we can unify a motion knowledge base and build a motion understanding system. Meanwhile, KP can be automatically converted from motions and to text descriptions with no subjective bias, inspiring Kinematic Prompt Generation (KPG) as a novel automatic motion generation benchmark. In extensive experiments, our approach shows superiority over other methods.

Methods

The huge gap between motion and action semantics results in the many-to-many problem. We propose Kinematic Phrases (KP) as an intermediate representation to bridge the gap. KPs objectively capture human kinematic cues. They properly abstract diverse motions while remaining interpretable. As shown, the Phrases in the yellow box capture key patterns of "walk" across diverse motions.

The designed KP enables us to unify motion data in different formats and construct a large-scale knowledge base containing motion, text, and KP. We collect 192 thousand motion sequences from 12 datasets as our Kinematic Phrase Base, containing 48 thousand text descriptions with 7,000 words.

Six types of KP from four kinematic hierarchies are extracted from a motion sequence. A scalar indicator is calculated per Phrase per frame. Its sign categorizes the corresponding Phrase.
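The per-frame indicator and its sign can be illustrated with a minimal sketch. The specific indicator used here (the frame-to-frame displacement of one joint projected onto a reference axis), the names `displacement_indicator` and `categorize`, and the small dead-band `eps` that maps near-zero values to 0 are all assumptions made for illustration; they are not taken from the text above, which only states that a scalar is computed per Phrase per frame and that its sign categorizes the Phrase.

```python
from typing import List, Sequence


def displacement_indicator(
    positions: Sequence[Sequence[float]], axis: Sequence[float]
) -> List[float]:
    """Hypothetical indicator: displacement of one joint between consecutive
    frames, projected onto a reference axis (one scalar per frame pair)."""
    indicators = []
    for prev, curr in zip(positions, positions[1:]):
        delta = [c - p for c, p in zip(curr, prev)]
        indicators.append(sum(d * a for d, a in zip(delta, axis)))
    return indicators


def categorize(indicators: Sequence[float], eps: float = 1e-3) -> List[int]:
    """Map each scalar to +1, -1, or 0 by its sign.
    The dead-band eps is an assumption, not part of the description above."""
    return [1 if s > eps else -1 if s < -eps else 0 for s in indicators]


# Toy trajectory of a single joint: forward, forward, still, backward.
pelvis = [
    (0.0, 0.0, 0.0),
    (0.0, 0.0, 0.1),
    (0.0, 0.0, 0.2),
    (0.0, 0.0, 0.2),
    (0.0, 0.0, 0.1),
]
forward = (0.0, 0.0, 1.0)  # assumed forward direction
signs = categorize(displacement_indicator(pelvis, forward))
print(signs)
```

In this toy case the sign sequence reads as "moves forward, moves forward, stays, moves backward", which is the kind of objective, interpretable fact a Phrase is meant to encode.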

We train a motion-KP joint latent space in a self-supervised manner. KP is randomly masked during training, and reconstruction and alignment losses are adopted. The joint space can be applied to multiple tasks, including motion interpolation, modification, and generation.
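As a rough sketch of the training signal described above, the following shows random masking of a KP sequence and a combined reconstruction-plus-alignment objective. Everything concrete here is an assumption: masking is applied independently per entry by setting it to 0, both losses are mean squared errors, and `align_weight` is a hypothetical balancing factor. The encoders and decoders that would produce the latents and reconstructions are omitted.

```python
import random
from typing import List, Sequence


def mask_phrases(
    kp: Sequence[Sequence[int]], mask_prob: float, rng: random.Random
) -> List[List[int]]:
    """Independently replace each KP entry with 0 with probability mask_prob
    (the masking granularity is an assumption)."""
    return [[0 if rng.random() < mask_prob else v for v in frame] for frame in kp]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_loss(
    motion_latent: Sequence[float],
    kp_latent: Sequence[float],
    reconstruction: Sequence[float],
    target: Sequence[float],
    align_weight: float = 1.0,
) -> float:
    """Reconstruction term plus a term pulling the two latents together."""
    return mse(reconstruction, target) + align_weight * mse(motion_latent, kp_latent)


rng = random.Random(0)
kp_sequence = [[1, -1, 0], [1, 1, -1]]
masked = mask_phrases(kp_sequence, mask_prob=0.5, rng=rng)
loss = joint_loss([0.2, 0.4], [0.2, 0.0], [1.0, -1.0], [1.0, -1.0])
print(masked, loss)
```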

Kinematic Prompt Generation

Moreover, we propose a novel benchmark named Kinematic Prompt Generation (KPG) for motion generation. KPG evaluates whether models can generate motion sequences that correspond to specific kinematic facts given text prompts. Specifically, the model is required to generate motion sequences from text prompts converted from KP. The semantic consistency can then be directly examined by comparing the original KP with the KP converted from the generated motion sequence.
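The comparison step can be sketched as a simple agreement score between the KP behind the prompt and the KP extracted from the generated motion. The function name `kp_accuracy`, the choice to count every entry equally, and the requirement that both sequences have the same shape are assumptions for illustration; the benchmark's actual scoring rule is not specified here.

```python
from typing import Sequence


def kp_accuracy(
    reference: Sequence[Sequence[int]], generated: Sequence[Sequence[int]]
) -> float:
    """Fraction of (frame, Phrase) entries on which the two KP sequences agree."""
    if len(reference) != len(generated):
        raise ValueError("KP sequences must have the same number of frames")
    matches = 0
    total = 0
    for ref_frame, gen_frame in zip(reference, generated):
        if len(ref_frame) != len(gen_frame):
            raise ValueError("frames must have the same number of Phrases")
        for ref_sign, gen_sign in zip(ref_frame, gen_frame):
            matches += int(ref_sign == gen_sign)
            total += 1
    if total == 0:
        raise ValueError("KP sequences must not be empty")
    return matches / total


# KP used to build the text prompt vs. KP extracted from a generated motion.
prompt_kp = [[1, -1], [1, 0]]
generated_kp = [[1, -1], [-1, 0]]
score = kp_accuracy(prompt_kp, generated_kp)
print(score)
```

Because both sides of the comparison are derived automatically, a score of this kind needs no human annotation or learned evaluator.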

Experiment and user-study results show consistency between KP-inferred accuracy and user reviews. Our KP-based method provides superior performance on both automatic metrics and the user study.