AeroScene: Progressive Scene Synthesis for Aerial Robotics

Nghia Vu†,2, Tuong Do†,1,2,3, Binh X. Nguyen2, Dzung Tran4, Hoan Nguyen5, Erman Tjiputra2, Quang D. Tran2, Hai-Nguyen Nguyen4, Anh Nguyen1
†Equal contribution
1University of Liverpool, UK
2AIOZ Ltd., Singapore
3National Tsing Hua University, Taiwan
4RMIT University, Vietnam Campus
5University of Information Technology, VNUHCM, Vietnam

Abstract

While generative models have shown substantial impact across multiple domains, their potential for scene synthesis remains underexplored in robotics. This gap is especially evident in drone simulators, where environments are still built largely by hand, making them time-consuming to create and difficult to scale. In this work, we introduce AeroScene, a hierarchical diffusion model for progressive 3D scene synthesis. Our approach leverages hierarchy-aware tokenization and multi-branch feature extraction to reason over both global layouts and local details, ensuring physical plausibility and semantic consistency. This makes AeroScene particularly suited to generating realistic scenes for aerial robotics tasks such as navigation, landing, and perching. We demonstrate its effectiveness through extensive experiments on our newly collected dataset and a public benchmark, showing that AeroScene significantly outperforms prior methods. Furthermore, we use AeroScene to generate a large-scale dataset of over 1,000 physics-ready, high-fidelity 3D scenes that can be directly integrated into NVIDIA Isaac Sim. Finally, we illustrate the utility of these generated environments on downstream drone navigation tasks.

Problem Statement

Drones are increasingly deployed for delivery, inspection, and surveillance, requiring them to navigate and operate within complex 3D environments. Synthesizing realistic scenes for such applications demands hierarchical layout generation: coarse-scale structures (e.g., rooms, terrains, building layouts) establish navigable flight corridors, while fine-scale details (e.g., obstacles, landing areas) ensure task-specific feasibility. However, existing scene creation methods for drone simulators rely primarily on manual design, making them difficult to scale, and they often fail to satisfy both physical and fidelity requirements such as unobstructed aerial navigation, accessible interaction areas (e.g., landing pads, inspection points), and coherent indoor-outdoor transitions.

In this paper, we propose AeroScene, a hierarchical diffusion-based framework for 3D scene generation tailored to drone tasks. Our approach operates across scales: coarse-scale synthesis generates high-level structures that preserve airspace and navigability, while fine-scale synthesis refines object placement and specifies drone-interaction areas for task execution. AeroScene includes Cross-scale Progressive Attention, which explicitly models dependencies across scales, ensuring that fine-scale details remain consistent with coarse-scale spatial structures. In addition, we design task-aware guidance functions that encourage collision-free plausibility, maintain semantic correlations in hierarchical order, and model relationships between indoor and outdoor objects, thereby aligning generated layouts with real-world aerial operation requirements. To support downstream tasks, scenes created by our method are directly embedded into NVIDIA Isaac Sim for physics-ready simulation.
Introduction figure
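The paper does not release its task-aware guidance functions, but the collision-avoidance idea can be illustrated with a simple penalty over axis-aligned bounding boxes. The sketch below is a hypothetical minimal version (object boxes parameterized by center and full extent); during denoising, a guidance step would push layouts toward zero penalty.

```python
import numpy as np

def collision_penalty(centers, sizes):
    """Sum of pairwise AABB overlap volumes for a set of objects.

    centers: (N, 3) array of box centers; sizes: (N, 3) full extents.
    Hypothetical stand-in for a collision-avoidance guidance term:
    zero when all boxes are disjoint, positive otherwise.
    """
    n = len(centers)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # Per-axis overlap of two AABBs; negative on any axis => separated.
            half = (sizes[i] + sizes[j]) / 2.0
            overlap = half - np.abs(centers[i] - centers[j])
            if np.all(overlap > 0):
                total += float(np.prod(overlap))
    return total
```

A guidance function of this form is differentiable almost everywhere in the box parameters, so its gradient can be added to the denoising update to steer samples toward collision-free layouts.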

Methodology

We present AeroScene, a hierarchical diffusion model for generating drone-navigable 3D indoor-outdoor scenes. The framework operates in two complementary scales: coarse-grained convolutional blocks extract high-level layout features (e.g., building footprints, terrain, major functional zones), while fine-grained blocks capture detailed object geometry and placement. These multi-scale features are fused via a novel Cross-scale Progressive Attention module that performs bidirectional top-down and bottom-up interactions, ensuring fine details remain semantically and spatially consistent with coarse structures. During denoising, the model incorporates task-aware guidance objectives — including collision avoidance (preventing object overlap and preserving free flight corridors), coarse-to-fine consistency, and semantic constraint alignment — to produce layouts that are both visually plausible and physically feasible for aerial navigation. The final denoised hierarchical layout is converted into a tokenized scene representation and rendered into realistic 3D environments compatible with physics simulation.
AeroScene methodology overview
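The exact architecture of the Cross-scale Progressive Attention module is not specified here; the following numpy sketch illustrates the bidirectional pattern described above under assumed token shapes (`coarse` as layout tokens, `fine` as object tokens, both hypothetical names).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head scaled dot-product attention from one scale to another."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def cross_scale_progressive_attention(coarse, fine):
    """Bidirectional coarse<->fine fusion in residual form (illustrative).

    coarse: (Nc, d) layout tokens; fine: (Nf, d) object tokens.
    """
    # Top-down: fine tokens read global structure from the coarse layout.
    fine = fine + cross_attend(fine, coarse)
    # Bottom-up: coarse tokens aggregate the refined local detail.
    coarse = coarse + cross_attend(coarse, fine)
    return coarse, fine
```

In a full model each direction would have learned projections and normalization; the residual form shown here is what keeps fine-scale updates anchored to the coarse structure.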

Experiments

The table below shows that AeroScene consistently outperforms existing scene synthesis approaches, including both autoregressive and diffusion-based baselines, on our custom drone-oriented dataset and the standard 3D-FRONT benchmark. It achieves the best scores on all five metrics: FID and KID for distribution fidelity, and CR, CFC, and SP for collision avoidance and navigability. We attribute these gains to the hierarchical diffusion design, the Cross-scale Progressive Attention module, and the task-aware guidance functions, which together yield realistic, collision-free, and drone-friendly indoor-outdoor layouts.
Quantitative results on scene synthesis.
Method          FID↓    KID↓    CR(%)↓  CFC↓   SP↓

Our Dataset
ATISS           45.2    0.032   12.5    0.21   3.8
Diffusion-SDF   38.7    0.028   10.1    0.18   3.5
DiffuScene      32.4    0.025    8.3    0.15   3.2
PhyScene        29.8    0.023    7.1    0.13   3.0
Ours            27.3    0.021    6.2    0.12   2.7

3D-FRONT Dataset
ATISS           42.1    0.030   11.8    0.19   3.6
Diffusion-SDF   35.6    0.026    9.4    0.16   3.3
DiffuScene      30.2    0.023    7.6    0.14   3.0
PhyScene        27.9    0.021    6.3    0.12   2.7
Ours            25.8    0.019    5.5    0.11   2.5

BibTeX

SOON...