METASCENES: Towards Automated Replica Creation for Real-world 3D Scans

1State Key Laboratory of General Artificial Intelligence, BIGAI, 2School of Computer Science & Technology, Beijing Institute of Technology, 3Dept. of Automation, Tsinghua University, 4School of Information Science & Technology, University of Science and Technology of China
✶ indicates equal contribution as first authors, † indicates equal contribution as secondary authors

💡Abstract

Embodied AI (EAI) research depends on high-quality and diverse 3D scenes to enable effective skill acquisition, sim-to-real transfer, and domain generalization. Recent 3D scene datasets remain limited in scalability due to their dependence on artist-driven designs and the difficulty of replicating the diversity of real-world objects. To address these limitations and automate the creation of simulatable 3D scenes, we present METASCENES, a large-scale 3D scene dataset constructed from real-world scans. It features 706 scenes with 15,366 objects spanning a wide array of types, arranged in realistic layouts with visually accurate appearances and physical plausibility. Leveraging recent advancements in object-level modeling, we provide each object with a curated set of candidates, ranked through human annotation for optimal replacement based on geometry, texture, and functionality. These annotations enable a novel multi-modal alignment model, SCAN2SIM, which facilitates automated and high-quality asset replacement. We further validate the utility of our dataset with two benchmarks: Micro-Scene Synthesis for small-object layout generation and cross-domain vision-language navigation (VLN). Results confirm the potential of METASCENES to advance EAI by supporting more generalizable agent learning and sim-to-real applications, introducing new possibilities for EAI research.

Automated replica creation, 3D indoor scene datasets, and sim-to-real transfer.

🏡Dataset

METASCENES is constructed in three sequential steps: (i) Collection, where we gather diverse 3D asset candidates for each real-world object in the scan; (ii) Annotation, where annotators rank and select the best-matching 3D asset for each object based on visual similarity and geometric fit; and (iii) Optimization, where selected assets undergo post-processing and global optimization to ensure full interactivity and physical plausibility in simulation environments.

3D Visualization

Note: Interactive 3D scene viewer. Click the button below to explore.

Real2Sim Comparison

📈Experiments

Automated Replica Creation

Optimal asset retrieval model

Based on the ground-truth optimal asset selection annotations in METASCENES, we train a multi-modal alignment model, SCAN2SIM, to retrieve the best-matching asset from each object's candidate set.
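At a high level, such a retrieval model embeds the scanned object and all candidate assets into a shared space and ranks candidates by similarity. The sketch below illustrates this ranking step only, with randomly chosen toy embeddings; the function name `rank_candidates` and the 4-D embedding space are hypothetical, not the actual SCAN2SIM architecture.

```python
import numpy as np

def rank_candidates(scan_embedding, candidate_embeddings):
    """Rank candidate assets by cosine similarity to the scanned object's embedding."""
    scan = scan_embedding / np.linalg.norm(scan_embedding)
    cands = candidate_embeddings / np.linalg.norm(candidate_embeddings, axis=1, keepdims=True)
    scores = cands @ scan                 # cosine similarity per candidate
    order = np.argsort(-scores)           # best-to-worst candidate indices
    return order, scores[order]

# Toy example: three candidate assets in a shared (hypothetical) 4-D embedding space.
scan_emb = np.array([1.0, 0.0, 0.0, 0.0])
cand_embs = np.array([
    [0.9, 0.1, 0.0, 0.0],   # close match
    [0.0, 1.0, 0.0, 0.0],   # orthogonal
    [0.5, 0.5, 0.0, 0.0],   # partial match
])
order, scores = rank_candidates(scan_emb, cand_embs)
print(order)  # → [0 2 1]
```

In practice, the embeddings would come from learned encoders over multiple modalities (e.g., geometry and texture), trained so that human-preferred assets score highest.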

Automated replica creation. We visualize the optimal asset selection results in METASCENES (left) and a digital replica automatically created via SCAN2SIM on ScanNet++, before (top) and after (bottom) physics-based optimization.

Micro-Scene Synthesis

Given the large furniture layout, we generate plausible arrangements of small objects on and around it.

Embodied Navigation in 3D scenes

Demonstration of the embodied agent performing goal-directed navigation in Habitat.

BibTeX

@inproceedings{yu2025metascenes,
  title={METASCENES: Towards Automated Replica Creation for Real-world 3D Scans},
  author={Yu, Huangyue and Jia, Baoxiong and Chen, Yixin and Yang, Yandan and Li, Puhao and Su, Rongpeng and Li, Jiaxin and Li, Qing and Liang, Wei and Zhu, Song-Chun and Liu, Tengyu and Huang, Siyuan},
  booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}