METASCENES: Towards Automated Replica Creation for Real-world 3D Scans

1State Key Laboratory of General Artificial Intelligence, BIGAI, 2School of Computer Science & Technology, Beijing Institute of Technology, 3Dept. of Automation, Tsinghua University, 4School of Information Science & Technology, University of Science and Technology of China
✶ indicates equal contribution as first authors, † indicates equal contribution as secondary authors

💡Abstract

Embodied AI (EAI) research depends on high-quality and diverse 3D scenes to enable effective skill acquisition, sim-to-real transfer, and domain generalization. Recent 3D scene datasets remain limited in scalability due to their dependence on artist-driven designs and the difficulty of replicating the diversity of real-world objects. To address these limitations and automate the creation of simulatable 3D scenes, we present METASCENES, a large-scale 3D scene dataset constructed from real-world scans. It features 706 scenes with 15,366 objects spanning a wide array of types, arranged in realistic layouts with visually accurate appearances and physical plausibility. Leveraging recent advancements in object-level modeling, we provide each object with a curated set of candidates, ranked through human annotation for optimal replacement based on geometry, texture, and functionality. These annotations enable a novel multi-modal alignment model, SCAN2SIM, which facilitates automated and high-quality asset replacement. We further validate the utility of our dataset with two benchmarks: Micro-Scene Synthesis for small-object layout generation and cross-domain vision-language navigation (VLN). Results confirm the potential of METASCENES to advance EAI by supporting more generalizable agent learning and sim-to-real applications, introducing new possibilities for EAI research.

Automated replica creation, 3D indoor scene datasets, and sim-to-real transfer.

🏡Dataset

METASCENES is constructed in three sequential steps: (i) Collection, where we gather diverse 3D asset candidates for each real-world object in the scan; (ii) Annotation, where annotators rank and select the best-matching 3D asset for each object based on visual similarity and geometric fit; and (iii) Optimization, where selected assets undergo post-processing and global optimization to ensure full interactivity and physical plausibility in simulation environments.

3D Visualization

Note: Interactive 3D scene viewer. Click the button below to explore.

Real2Sim Comparison

📈Experiments

Automated Replica Creation

Optimal asset retrieval model

Based on the ground-truth optimal asset selection annotations in METASCENES, we train a multi-modal alignment model, SCAN2SIM, to retrieve the best-matching asset from each object's candidate set.
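At a high level, such a retrieval model embeds the scanned object and all candidate assets into a shared space and ranks candidates by similarity. The sketch below illustrates this ranking step only, with randomly chosen toy embeddings; the function name `rank_candidates` and the 4-D embedding space are hypothetical, not the actual SCAN2SIM architecture.

```python
import numpy as np

def rank_candidates(scan_embedding, candidate_embeddings):
    """Rank candidate assets by cosine similarity to the scanned object's embedding."""
    scan = scan_embedding / np.linalg.norm(scan_embedding)
    cands = candidate_embeddings / np.linalg.norm(candidate_embeddings, axis=1, keepdims=True)
    scores = cands @ scan                 # cosine similarity per candidate
    order = np.argsort(-scores)           # best-to-worst candidate indices
    return order, scores[order]

# Toy example: three candidate assets in a shared (hypothetical) 4-D embedding space.
scan_emb = np.array([1.0, 0.0, 0.0, 0.0])
cand_embs = np.array([
    [0.9, 0.1, 0.0, 0.0],   # close match
    [0.0, 1.0, 0.0, 0.0],   # orthogonal
    [0.5, 0.5, 0.0, 0.0],   # partial match
])
order, scores = rank_candidates(scan_emb, cand_embs)
print(order)  # → [0 2 1]
```

In practice, the embeddings would come from learned encoders over multiple modalities (e.g., geometry and texture), trained so that human-preferred assets score highest.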

Automated replica creation. We visualize the optimal asset selection results in METASCENES (left) and a digital replica automatically created via SCAN2SIM on ScanNet++, before (top) and after (bottom) physics-based optimization.

Micro-Scene Synthesis

Given the large furniture layout, we generate plausible arrangements of small objects on and around it.

Embodied Navigation in 3D scenes

Demonstration of the embodied agent performing goal-directed navigation in Habitat.

BibTeX

@inproceedings{yu2025metascenes,
  title={METASCENES: Towards Automated Replica Creation for Real-world 3D Scans},
  author={Yu, Huangyue and Jia, Baoxiong and Chen, Yixin and Yang, Yandan and Li, Puhao and Su, Rongpeng and Li, Jiaxin and Li, Qing and Liang, Wei and Zhu, Song-Chun and Liu, Tengyu and Huang, Siyuan},
  booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}