IKEA-Manual: Seeing Shape Assembly Step by Step

NeurIPS 2022 Datasets and Benchmarks Track

Ruocheng Wang 1    Yunzhi Zhang 1    Jiayuan Mao 2    Ran Zhang 3*    Chin-Yi Cheng 4*    Jiajun Wu 1
1Stanford University          2MIT CSAIL          3Autodesk          4Google Research
* Work done when working at Autodesk AI Lab.

[Paper] [BibTeX] [Dataset & Code Download] [Video]


Human-designed visual manuals are crucial components in shape assembly activities. They provide step-by-step guidance on how we should move and connect different parts in a convenient and physically-realizable way. While there has been an ongoing effort in building agents that perform assembly tasks, the information in human-designed manuals has been largely overlooked. We attribute this to 1) a lack of realistic 3D assembly objects that have paired manuals and 2) the difficulty of extracting structured information from purely image-based manuals. Motivated by this observation, we present IKEA-Manual, a dataset consisting of 102 IKEA objects paired with assembly manuals. We provide fine-grained annotations on the IKEA objects and assembly manuals, including decomposed assembly parts, assembly plans, manual segmentation, and 2D-3D correspondence between 3D parts and visual manuals. We illustrate the broad applicability of our dataset on four tasks related to shape assembly: assembly plan generation, part segmentation, pose estimation, and 3D part assembly.

Overview of IKEA-Manual


We present IKEA-Manual, a dataset for step-by-step understanding of shape assembly from 3D models and human-designed visual manuals. (a) IKEA-Manual contains 102 3D IKEA objects paired with human-designed visual manuals, where each object is decomposed into primitive assembly parts, shown in different colors, that match the manual. (b) The original IKEA manuals provide step-by-step guidance on the assembly process by showing images of how parts are connected. (c) We extract a high-level, tree-structured assembly plan from the visual manual, specifying how parts are connected during the assembly process. (d) For each step, we provide dense visual annotations such as 2D part segmentation and 2D-3D correspondence between 2D manual images and 3D parts.


We demonstrate the usefulness of our dataset on the following three tasks.

Manual Plan Generation

The task of manual plan generation refers to generating a plan that specifies the order of assembly actions. Here, we are interested in whether we can build algorithms that automatically generate assembly plans. We represent the assembly plan as a tree and develop several evaluation metrics to assess the accuracy of the generated assembly tree.

Part Segmentation

The task of part segmentation aims to help an agent understand visual manuals. To extract the assembly information from a manual, an agent must first locate the assembly parts illustrated in the manual and then infer their 3D positions by finding 2D-3D correspondences. Here we focus on the part-conditioned manual segmentation task: given a list of parts in their canonical space and a manual image that contains their 2D projections, we want to predict the segmentation mask for each part.
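A standard way to score such predictions is per-part mask IoU. The sketch below assumes predicted and ground-truth masks are boolean H x W arrays, one per part; the mask format is an assumption for illustration, not the dataset's released evaluation code.

```python
import numpy as np

def part_iou(pred_mask, gt_mask):
    """IoU between two boolean H x W masks for a single part."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter / union) if union > 0 else 1.0

def mean_part_iou(pred_masks, gt_masks):
    """Average IoU over corresponding (predicted, ground-truth) part masks."""
    return float(np.mean([part_iou(p, g) for p, g in zip(pred_masks, gt_masks)]))
```

Averaging over parts rather than pixels keeps small parts (screws, pegs) from being dominated by large panels in the score.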

Part Assembly

The task of part assembly aims to infer the assembled shape from canonical 3D assembly parts. Specifically, given a list of parts represented as point clouds, the goal is to predict a 3D pose for each part such that all parts in their predicted poses constitute a reasonable final shape.
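The prediction step can be sketched as applying a per-part SE(3) pose (rotation plus translation) to each canonical point cloud, and the result can be scored with a symmetric Chamfer distance against the ground-truth assembly. This is a minimal illustration of the setup, not the paper's evaluation code; a brute-force pairwise distance is used instead of a KD-tree for clarity.

```python
import numpy as np

def apply_pose(points, R, t):
    """Transform an (N, 3) point cloud by rotation R (3x3) and translation t (3,)."""
    return points @ R.T + t

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two (N, 3) point clouds.

    Brute-force: computes all pairwise squared distances, then averages
    each cloud's nearest-neighbor distance to the other.
    """
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise sq. dists
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

A predicted assembly is then the union of `apply_pose(part_i, R_i, t_i)` over all parts, compared against the ground-truth shape with `chamfer_distance`.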