Where2Act: From Pixels to Actions for
Articulated 3D Objects

Kaichun Mo*1     Leonidas J. Guibas1     Mustafa Mukadam2    
Abhinav Gupta2     Shubham Tulsiani2    

1 Stanford University     2 Facebook AI Research

*Work mostly done while interning at FAIR.

International Conference on Computer Vision (ICCV) 2021

[ArXiv] [Code] [Video] [Slides] [Poster] [BibTex]


One of the fundamental goals of visual perception is to allow agents to meaningfully interact with their environment. In this paper, we take a step towards that long-term goal -- we extract highly localized actionable information related to elementary actions such as pushing or pulling for articulated objects with movable parts. For example, given a drawer, our network predicts that applying a pulling force on the handle opens the drawer. We propose, discuss, and evaluate novel network architectures that, given image and depth data, predict the set of actions possible at each pixel and the regions over articulated parts that are likely to move under the force. We propose a learning-from-interaction framework with an online data sampling strategy that allows us to train the network in simulation (SAPIEN) and generalize across categories.
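The online data sampling strategy mentioned above can be sketched as a simple explore-exploit loop over pixels: some interactions are sampled uniformly at random, while the rest are biased toward pixels the current network already rates as actionable. This is a hypothetical simplification for illustration; the function name `sample_interaction_pixel` and the `eps` split are assumptions, not the exact procedure from the paper.

```python
import numpy as np

def sample_interaction_pixel(actionability, rng, eps=0.5):
    """Online data sampling sketch: with probability eps pick a pixel
    uniformly at random (exploration), otherwise pick proportionally to
    the network's current per-pixel actionability scores (exploitation).

    actionability: 1D array of non-negative per-pixel scores.
    Returns the index of the chosen pixel.
    """
    n = actionability.shape[0]
    if rng.random() < eps:
        return int(rng.integers(n))
    probs = actionability / actionability.sum()
    return int(rng.choice(n, p=probs))

rng = np.random.default_rng(0)
scores = np.array([0.05, 0.05, 0.8, 0.1])  # pixel 2 looks most actionable
picks = [sample_interaction_pixel(scores, rng, eps=0.2) for _ in range(1000)]
# Network-guided draws concentrate on the high-scoring pixel while the
# uniform draws keep collecting data everywhere.
```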

The Proposed Where2Act Task

Figure 1. Given as input an articulated 3D object, we learn to propose the actionable information for different robotic manipulation primitives (e.g. pushing, pulling): (a) the predicted actionability scores over pixels; (b) the proposed interaction trajectories, along with (c) their success likelihoods, for a selected pixel highlighted in red. We show two high-rated proposals (left) and two lower-rated ones (right), whose scores suffer from unfavorable interaction orientations and potential robot-object collisions.

5-min Video Presentation
Simulation Environment

Figure 2. We use SAPIEN as our main testbed. (a) Our interactive simulation environment: we show the local gripper frame by the red, green and blue axes, which correspond to the leftward, upward and forward directions respectively; (b) Six types of action primitives parametrized in the SE(3) space: we visualize each pre-programmed motion trajectory by showing the three key frames, where the time steps go from the transparent grippers to the solid ones, with 3x exaggerated motion ranges.
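A gripper pose in SE(3) is just a rotation plus a translation, and a pre-programmed primitive is a short sequence of such poses. The following minimal sketch builds the three key frames of a "push" trajectory by translating the gripper along its own forward axis; the choice of local +z as the forward axis and the 5 cm travel distance are illustrative assumptions, not values from the paper.

```python
import numpy as np

def make_pose(R, p):
    """Homogeneous 4x4 transform in SE(3) from rotation R and position p."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p
    return T

def push_trajectory(T_start, distance=0.05, n_steps=3):
    """Sketch of a pre-programmed 'push' primitive: translate the gripper
    along its own forward axis (local +z here, an assumption) while
    keeping its orientation fixed. Returns the key-frame poses."""
    R, p = T_start[:3, :3], T_start[:3, 3]
    frames = []
    for i in range(n_steps):
        step = distance * i / (n_steps - 1)
        offset = R @ np.array([0.0, 0.0, step])  # forward axis in world frame
        frames.append(make_pose(R, p + offset))
    return frames

T0 = make_pose(np.eye(3), np.array([0.1, 0.0, 0.3]))
frames = push_trajectory(T0)  # three key frames, as in Figure 2(b)
```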

Simulation Assets Visualization

Figure 3. We use the PartNet-Mobility dataset. We visualize one example for each of the 15 object categories we use in our work.

Network Architecture

Figure 4. Our network takes a 2D image or a 3D partial scan as input and extracts per-pixel features using (a) a UNet for 2D images and (b) PointNet++ for 3D point clouds. To decode the per-pixel actionable information, we propose three decoding heads: (c) an actionability scoring module Da that predicts a score; (d) an action proposal module Dr that proposes multiple gripper orientations sampled from Gaussian random noise; (e) an action scoring module Ds that rates the confidence for each proposal.
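To make the data flow through the three heads concrete, here is a minimal numpy sketch of one pixel's forward pass. All sizes (128-dim features, 32-dim noise, a 6-number orientation output) and the use of two-layer MLPs are assumptions for illustration; the actual heads in the paper are learned networks on top of the UNet/PointNet++ backbone.

```python
import numpy as np

rng = np.random.default_rng(1)
F = 128   # per-pixel feature dim (an assumption)
Z = 32    # noise dim fed to the proposal head (an assumption)
A = 6     # gripper-orientation representation size (an assumption)

def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU; stands in for each decoding head."""
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

def init(din, dhid, dout):
    return (rng.normal(0, 0.1, (din, dhid)), np.zeros(dhid),
            rng.normal(0, 0.1, (dhid, dout)), np.zeros(dout))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

Da = init(F, 64, 1)        # (c) actionability score per pixel
Dr = init(F + Z, 64, A)    # (d) proposal: feature + noise -> orientation
Ds = init(F + A, 64, 1)    # (e) success likelihood of a proposal

feat = rng.normal(size=F)                        # one pixel's feature
actionability = sigmoid(mlp(feat, *Da))[0]       # scalar in (0, 1)
z = rng.normal(size=Z)                           # sampled Gaussian noise
proposal = mlp(np.concatenate([feat, z]), *Dr)   # one gripper orientation
success = sigmoid(mlp(np.concatenate([feat, proposal]), *Ds))[0]
```

Sampling a fresh noise vector z for each call to Dr is what lets the network propose multiple distinct orientations for the same pixel.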

Action Scoring Prediction Results

Figure 5. We visualize the per-pixel action scoring predictions over the articulated parts for given gripper orientations. In each set of results, the two leftmost shapes, shown in blue, are test shapes from training categories, while the middle two shapes, highlighted in dark red, are shapes from unseen test categories. The rightmost columns show the results for the 2D experiments.

Action Proposal Prediction Results

Figure 6. We visualize the top-10 action proposal predictions (motion trajectories are 3x exaggerated) for some example test shapes under each action primitive. The bottom row presents cases where no action proposal is predicted, indicating that these pixels are not actionable under the given action primitives.
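Selecting top-10 proposals like this amounts to sampling many candidate actions per pixel, rating each with the scoring head, and keeping the best ones. The sketch below shows one plausible way to do that; the sample count, the 0.5 success threshold (below which a pixel yields no proposals, as in the bottom row), and the stand-in `propose`/`score` callables are all illustrative assumptions.

```python
import numpy as np

def top_k_proposals(feat, propose, score, k=10, n_samples=100,
                    threshold=0.5, rng=None):
    """Sample many candidate actions for one pixel, rate each with the
    scoring head, and keep the k best above a success threshold.
    `propose(feat, z)` and `score(feat, action)` stand in for the Dr and
    Ds heads; n_samples and threshold are illustrative assumptions."""
    rng = rng or np.random.default_rng()
    candidates = [propose(feat, rng.normal(size=32)) for _ in range(n_samples)]
    rated = [(score(feat, a), a) for a in candidates]
    rated = [(s, a) for s, a in rated if s >= threshold]  # drop low-confidence
    rated.sort(key=lambda sa: sa[0], reverse=True)        # best first
    return rated[:k]  # may be empty: the pixel is not actionable

# Toy stand-ins: the 'orientation' is the noise itself; its first
# coordinate plays the role of the success score.
top = top_k_proposals(np.zeros(128),
                      propose=lambda f, z: z,
                      score=lambda f, a: float(a[0]),
                      rng=np.random.default_rng(0))
```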

Real-world Results

Figure 7. We visualize our action scoring predictions over real-world 3D scans from the Replica dataset and one Google Scanned Object, as well as on a 2D real image. Here, results are shown over all pixels since we have no access to the articulated part masks. Though there is no guarantee for the predictions over pixels outside the articulated parts, the results make sense if the entire object is allowed to move.

Failure Cases

Figure 8. We visualize some interesting failure cases, which demonstrate the difficulty of the task and some ambiguous situations that are hard for the robot to resolve. For the pushing action, we show (a) an example where an invalid gripper-object collision at the initial state leads to a failed interaction, even though the interaction direction looks promising; (b) a failed interaction in which the part motion does not exceed the required threshold of 0.01, since the interaction direction is nearly orthogonal to the drawer surface; and (c) a case where the door is fully closed and thus not pushable, although the dataset does contain doors that can be pushed inward. For the pulling action, we present (d) a failed grasping attempt where the gripper is too small and the pot lid too heavy; (e) a case illustrating an intrinsic ambiguity: the robot cannot tell from which side the door opens; and (f) a failed pulling attempt because the switch toggle has already reached its maximal motion range.


This work was supported primarily by Facebook during Kaichun's internship, and also by NSF grant IIS-1763268, a Vannevar Bush faculty fellowship, and an Amazon AWS ML award. We thank Yuzhe Qin and Fanbo Xiang for their help in setting up the SAPIEN environment.