Embodied AI (EAI) research depends on high-quality and diverse 3D scenes to enable effective skill acquisition, sim-to-real transfer, and domain generalization. Recent 3D scene datasets remain limited in scalability due to their reliance on artist-driven designs and the difficulty of replicating the diversity of real-world objects. To address these limitations and automate the creation of simulatable 3D scenes, we present METASCENES, a large-scale 3D scene dataset constructed from real-world scans. It features 706 scenes with 15,366 objects spanning a wide array of types, arranged in realistic layouts with visually accurate appearances and physical plausibility. Leveraging recent advancements in object-level modeling, we provide each object with a curated set of candidate assets, ranked through human annotation to identify the optimal replacement based on geometry, texture, and functionality. These annotations enable a novel multi-modal alignment model, Scan2Sim, which facilitates automated, high-quality asset replacement. We further validate the utility of our dataset with two benchmarks: micro-scene synthesis for small-object layout generation and cross-domain vision-language navigation (VLN). Results confirm the potential of METASCENES to enhance EAI by supporting more generalizable agent learning and sim-to-real applications, introducing new possibilities for EAI research.
METASCENES is composed of three sequential steps: (i) Collection, where we gather diverse 3D asset candidates for each real-world object in the scan; (ii) Annotation, where annotators rank and select the best-matching 3D asset for each object based on visual similarity and geometric fit; and (iii) Optimization, where selected assets undergo post-processing and global optimization to ensure full interactivity and physical plausibility in simulation environments.
Based on the ground-truth optimal-asset annotations in METASCENES, we train a multi-modal alignment model, Scan2Sim, to retrieve the best-matching asset from each object's candidate set.
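At inference time, retrieval with an alignment model of this kind typically reduces to nearest-neighbor search in a shared embedding space. The sketch below is a hypothetical illustration (not the released Scan2Sim implementation): it assumes the scanned object and the candidate assets have already been encoded into fixed-length feature vectors, and selects the candidate with the highest cosine similarity.

```python
import numpy as np

# Hypothetical sketch of alignment-based asset retrieval: given an
# embedding of a scanned object and embeddings of candidate assets
# (stand-ins for learned multi-modal features), pick the candidate
# whose embedding is most similar under cosine similarity.

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale embeddings to unit length so a dot product equals cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve_best_asset(scan_emb, candidate_embs):
    """Return the index of the best-aligned candidate and all similarity scores."""
    q = l2_normalize(scan_emb)
    c = l2_normalize(candidate_embs)
    scores = c @ q  # shape: (num_candidates,)
    return int(np.argmax(scores)), scores

# Toy usage with random vectors in place of learned features.
rng = np.random.default_rng(0)
scan = rng.normal(size=64)
candidates = rng.normal(size=(5, 64))
candidates[2] = scan + 0.05 * rng.normal(size=64)  # near-duplicate of the scan
best, scores = retrieve_best_asset(scan, candidates)
print(best)  # the near-duplicate candidate, index 2, wins
```

In practice the embeddings would come from modality-specific encoders (e.g., geometry and appearance), and the annotated rankings provide the supervision for aligning them; this snippet only shows the retrieval step.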
Automated replica creation. We visualize the optimal asset selection results in METASCENES (left), and a digital replica automatically created via Scan2Sim on ScanNet++, before (top) and after physics-based optimization (bottom).
We generate layouts of small objects conditioned on the given large furniture.
Diverse results of the micro-scene synthesis. The model is capable of generating varied layouts for the same large furniture.
Demonstration of the embodied agent performing goal-directed navigation in Habitat.
@inproceedings{yu2025metascenes,
  title={METASCENES: Towards Automated Replica Creation for Real-world 3D Scans},
  author={Huangyue Yu and Baoxiong Jia and Yixin Chen and Yandan Yang and Puhao Li and Rongpeng Su and Jiaxin Li and Qing Li and Wei Liang and Song-Chun Zhu and Tengyu Liu and Siyuan Huang},
  booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}