Demo
Given an arbitrary scene and open-domain rigged objects, A3Syn synthesizes articulations that respect the affordance and context.
Abstract
Rigged objects are commonly used in artist pipelines, as they can flexibly adapt to different scenes and postures. However, articulating rigs into realistic, affordance-aware postures (e.g., following the context, respecting the physics and the personality of the object) remains time-consuming and relies heavily on labor from experienced artists. In this paper, we tackle this novel problem and design A3Syn. Given a context, such as the environment mesh and a text prompt of the desired posture, A3Syn synthesizes articulation parameters for arbitrary, open-domain rigged objects obtained from the Internet. The task is incredibly challenging due to the lack of training data, and we make no topological assumptions about the open-domain rigs. We propose using a 2D inpainting diffusion model and several control techniques to synthesize in-context affordance information. We then develop an efficient bone correspondence alignment using a combination of differentiable rendering and semantic correspondence. A3Syn converges stably, completes in minutes, and synthesizes plausible affordances for different combinations of in-the-wild object rigs and scenes.
Overview
Left. Our A3Syn takes four inputs: the scene geometry, a rigged object, a text prompt describing the desired articulation, and an approximate location at which to perform the pose. The goal is to solve for the object transformation and articulation parameters.
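To make the problem statement concrete, below is a minimal sketch of the task's inputs and outputs as a Python interface. All names, types, and shapes here are our own illustrative assumptions, not the actual A3Syn API.

    from dataclasses import dataclass
    from typing import Tuple
    import numpy as np

    @dataclass
    class ArticulationTask:
        scene_mesh_path: str          # environment geometry (e.g., a .glb file)
        rigged_object_path: str       # open-domain rig obtained from the Internet
        prompt: str                   # desired articulation, in natural language
        approx_location: Tuple[float, float, float]  # rough spot to perform the pose

    @dataclass
    class ArticulationSolution:
        root_transform: np.ndarray    # (4, 4) rigid placement of the object in the scene
        bone_rotations: np.ndarray    # (num_bones, 3) per-bone Euler angles, degrees

    # Example instantiation (paths and coordinates are hypothetical):
    task = ArticulationTask(
        scene_mesh_path="scene.glb",
        rigged_object_path="rabbit_rigged.glb",
        prompt="A brown rabbit in mid-leap as it jumps down from a wooden chair.",
        approx_location=(0.5, 0.0, 1.2),
    )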
Middle. Our first stage synthesizes a coarse proposal posture, then optimizes the single-view pixel-coordinate alignment with the current rest pose. The process is fully differentiable and efficient, combining differentiable rendering with semantic correspondence.
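As a toy illustration of this stage, the sketch below optimizes the joint angles of a planar two-bone chain so that its projected joint positions match fixed 2D target correspondences. The real system uses a differentiable renderer and learned semantic correspondences; both are replaced here by hand-written stand-ins, so this is only a hedged sketch of the optimization pattern, not the paper's implementation.

    import torch

    def forward_kinematics(angles, lengths):
        """Joint positions of a planar kinematic chain (angles in radians)."""
        pts, pos, heading = [torch.zeros(2)], torch.zeros(2), torch.zeros(())
        for a, l in zip(angles, lengths):
            heading = heading + a
            pos = pos + l * torch.stack([torch.cos(heading), torch.sin(heading)])
            pts.append(pos)
        return torch.stack(pts)  # (num_joints + 1, 2)

    lengths = [1.0, 0.8]
    angles = torch.zeros(2, requires_grad=True)                # rest pose
    # Stand-in for semantic correspondences extracted from the reference image:
    target = torch.tensor([[0.0, 0.0], [0.7, 0.7], [1.5, 0.6]])

    opt = torch.optim.Adam([angles], lr=0.05)
    for step in range(300):
        opt.zero_grad()
        # Align projected joint positions with the 2D correspondences.
        loss = ((forward_kinematics(angles, lengths) - target) ** 2).mean()
        loss.backward()
        opt.step()

    print("optimized joint angles (rad):", angles.detach())

Because the whole chain from bone angles to pixel coordinates is differentiable, plain gradient descent on the angles suffices; this is the property the caption refers to.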
Right. In the second stage, we use a combination of a grid prior and partial denoising to synthesize cross-view-consistent affordance references, then optimize the alignment across multiple views. In both stages, the optimization objective is equivalent to explicit 3D deformation, and we show that such an optimization converges steadily.
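Continuing the toy sketch, the snippet below illustrates why summing the same alignment loss over several calibrated views resolves single-view depth ambiguity. The two orthographic "views" and the target point are invented for illustration; the actual method renders cross-view-consistent references via the grid prior and partial denoising described above.

    import torch

    def end_effector(angles):
        """A single 3D endpoint on the unit sphere, driven by (yaw, pitch)."""
        yaw, pitch = angles
        return torch.stack([
            torch.cos(pitch) * torch.cos(yaw),
            torch.cos(pitch) * torch.sin(yaw),
            torch.sin(pitch),
        ])

    # Two orthographic projections: front view (x, y) and top view (x, z).
    views = [lambda p: p[:2], lambda p: p[[0, 2]]]
    target_3d = torch.tensor([0.6, 0.6, 0.529])        # illustrative ground truth
    targets_2d = [v(target_3d) for v in views]          # per-view correspondences

    angles = torch.zeros(2, requires_grad=True)
    opt = torch.optim.Adam([angles], lr=0.05)
    for step in range(400):
        opt.zero_grad()
        point = end_effector(angles)
        # Sum the alignment loss over all views; a single view would leave the
        # depth (here, the pitch) unconstrained.
        loss = sum(((v(point) - t) ** 2).mean() for v, t in zip(views, targets_2d))
        loss.backward()
        opt.step()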
Results
Jumping down from a wooden chair to the ground.
Running on a wooden bridge.
Climbing stairs.
Climbing a tree.
The affordance-aware articulations synthesized with our A3Syn. For each scene-prompt-location composition, we use three different objects to show that our algorithm adapts to arbitrary open-domain objects, maintains physical soundness, and is aware of the object semantics (e.g., the rabbit has a different jumping posture, and the cat and dog have different tail signatures). Most importantly, the same object adapts to distinctive postures according to different scene geometries, showing that our results capture the nuance of affordance: the complementarity between the animal and the environment.
Comparison
Comparisons. SDS shows limited pose change from the rest pose or creates unnaturally distorted limbs (e.g., the legs of the shiba inu and rabbit). Our method produces more natural postures, while the added articulation better reflects the affordance.
Ablation Study
Ablation study. We show a sample of a dog attempting to climb a tree from two views. Removing the bone rotation penalty (BR) causes unnatural limb bending, while omitting our second-stage multi-view alignment (MV) leads to floating due to single-view depth ambiguity. Combining all components leads to the best posture.
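For illustration, here is a minimal sketch of what a bone rotation penalty could look like, under the assumption that it regularizes joint angles toward the rest pose; the exact form used in the paper may differ.

    import torch

    def bone_rotation_penalty(angles_deg: torch.Tensor, rest_deg: torch.Tensor,
                              weight: float = 1e-3) -> torch.Tensor:
        """Squared deviation of per-bone Euler angles from the rest pose.

        Large deviations are penalized quadratically, discouraging the
        unnatural limb bending observed when the term is removed.
        """
        return weight * ((angles_deg - rest_deg) ** 2).sum()

    # Hypothetical angles for two bones; the -95 and 170 degree entries
    # dominate the penalty, pulling extreme bends back toward rest.
    angles = torch.tensor([[10.0, -95.0, 4.0], [2.0, 3.0, 170.0]])
    rest = torch.zeros(2, 3)
    print(bone_rotation_penalty(angles, rest))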
Convergence Analysis
Our approach converges steadily. We plot bone rotation in degrees (y-axis) against optimization iterations (x-axis); each line represents a unique bone. All methods use similar hyperparameters. Our approach (no learning rate decay) has a clear convergence direction. In contrast, SDS does not have a consistent convergence target, even with HiFA scheduling, which sets a low noise rate by the end of optimization. Adding learning rate scheduling mitigates the issue (still unstable at the end) but restricts the change in angle.
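For readers who want to produce this kind of diagnostic for their own runs, the sketch below plots per-bone rotation trajectories against iterations with matplotlib. The trajectories are synthetic stand-ins (exponential decay toward per-bone targets), not the actual optimization logs from the paper.

    import numpy as np
    import matplotlib.pyplot as plt

    iters = np.arange(500)
    rng = np.random.default_rng(0)
    targets = rng.uniform(-60, 60, size=8)                # one target angle per bone
    for target in targets:
        trajectory = target * (1 - np.exp(-iters / 120))  # steady convergence
        plt.plot(iters, trajectory, linewidth=1)

    plt.xlabel("optimization iteration")
    plt.ylabel("bone rotation (degrees)")
    plt.title("Per-bone rotation trajectories (synthetic illustration)")
    plt.show()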
Intermediate Visualization
Intermediate steps of the multi-view fine-grained alignment stage. Text prompt: "A brown rabbit in mid-leap as it jumps down from a wooden chair." In this example, the left hind leg of the rabbit is wrongly placed after the single-view placement stage (initial pose), appearing in an unnatural position. During the multi-view alignment stage, the posture is iteratively refined, with the left hind leg gradually adjusting to align more realistically with the action described in the prompt.
Bibtex
@article{yu2025towards,
  title   = {Towards Affordance-Aware Articulation Synthesis for Rigged Objects},
  author  = {Yu, Yu-Chu and Lin, Chieh Hubert and Lee, Hsin-Ying and Wang, Chaoyang and Wang, Yu-Chiang Frank and Yang, Ming-Hsuan},
  journal = {arXiv preprint},
  year    = {2025}
}
Acknowledgement
We thank the following artists for creating the 3D objects used in our work and for generously sharing them for free on Sketchfab.com: 3d modelling my cat: Fripouille by guillaume bolis, Cow NPC by Owlish Media, Horse Rigged(Game Ready) by abhayexe, Low poly fox running animation by dragonsnap, Shiba Inu Doggy by aaadragon, Rabbit Rigged by FourthGreen, Spongebob. Rigged by Eyeball, Patrick. Rigged by Eyeball, Venice city scene 1DAE08 Aaron Ongena by AaronOngena, 1DAE10 Quintyn Glenn City Scene Kyoto by Glenn.Quintyn, Low Poly Farm V2 by EdwiixGG. All 3D objects are licensed under CC Attribution.