Disentangled 3D Scene Generation with Layout Learning

Dave Epstein, Ben Poole, Ben Mildenhall, Alexei A. Efros, Aleksander Holynski
UC Berkeley
Google Research
ICML 2024

TL;DR: Text-to-3D generation of scenes decomposed automatically into objects, using only a pretrained diffusion model.

Results viewer: the composited scene shown alongside each of the discovered objects (1–4).

Choose a prompt:
“a chef rat on a tiny stool cooking a stew”
“a chicken hunting for easter eggs”
“a pigeon having some coffee and a bagel, reading the newspaper”
“two dogs in matching outfits paddling a kayak”
“a sloth on a beanbag with popcorn and a remote control”
“a bald eagle having a drink and a burger at the park”
“a bear wearing a flannel camping and reading a book by the fire”

Abstract

We introduce a method to generate 3D scenes that are disentangled into their component objects. This disentanglement is unsupervised, relying only on the knowledge of a large pretrained text-to-image model. Our key insight is that objects can be discovered by finding parts of a 3D scene that, when rearranged spatially, still produce valid configurations of the same scene.

Concretely, our method jointly optimizes multiple NeRFs from scratch – each representing its own object – along with a set of layouts that composite these objects into scenes. We then encourage these composited scenes to be in-distribution according to the image generator. We show that despite its simplicity, our approach successfully generates 3D scenes decomposed into individual objects, enabling new capabilities in text-to-3D content creation.
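
To make the structure of this optimization concrete, here is a rough Python sketch (toy stand-ins, not our actual implementation): ToyNeRF, composite, and sds_loss are placeholder names, layouts are simplified to per-object translations, and the loss is a dummy surrogate for score distillation from the pretrained diffusion model.

import torch
import torch.nn as nn

K, N = 3, 4            # K object NeRFs, N layouts
H = 32                 # toy hidden width

class ToyNeRF(nn.Module):
    """Stand-in for a per-object NeRF: maps 3D points to (color, density)."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, H), nn.ReLU(), nn.Linear(H, 4))

    def forward(self, pts):                        # pts: (..., 3)
        out = self.mlp(pts)
        return out[..., :3], out[..., 3:]          # rgb, sigma

nerfs = nn.ModuleList([ToyNeRF() for _ in range(K)])
# Each layout holds one transform per object, simplified here to a translation.
layouts = nn.Parameter(0.1 * torch.randn(N, K, 3))

def composite(layout, pts):
    """Crude stand-in for compositing: query each object NeRF at points shifted by
    that object's transform and blend colors by normalized density."""
    rgbs, sigmas = [], []
    for k, nerf in enumerate(nerfs):
        rgb, sigma = nerf(pts - layout[k])         # move object k into place
        rgbs.append(rgb)
        sigmas.append(torch.relu(sigma))
    rgbs, sigmas = torch.stack(rgbs), torch.stack(sigmas)
    weights = sigmas / (sigmas.sum(dim=0, keepdim=True) + 1e-6)
    return (weights * rgbs).sum(dim=0)             # blended colors, shape (..., 3)

def sds_loss(image, prompt):
    """Placeholder only: the real objective distills a pretrained text-to-image diffusion model."""
    return image.pow(2).mean()

opt = torch.optim.Adam(list(nerfs.parameters()) + [layouts], lr=1e-3)
for step in range(10):
    pts = torch.randn(4096, 3)                     # stand-in for samples along rays of a random view
    n = torch.randint(N, (1,)).item()              # sample one of the N layouts this step
    image = composite(layouts[n], pts)
    loss = sds_loss(image, "a chef rat on a tiny stool cooking a stew")
    opt.zero_grad()
    loss.backward()
    opt.step()

In the full method, each object is actually volume rendered and the layouts are richer per-object transforms; the sketch only preserves the structure of the joint optimization over NeRFs and layouts.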

Applications

Building scenes around objects

We can take advantage of our structured representation to learn a scene given a specific 3D asset, such as a particular cat or motorcycle, in addition to a text prompt. By freezing the asset's NeRF weights but not its layout weights, the model learns to arrange the provided asset in the context of the other objects it discovers. We show the full scenes the model creates, along with surface normals and a textureless render.
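
Schematically (toy modules, not our actual implementation), the freezing amounts to excluding the asset's weights from the optimizer while the other NeRFs and the layouts stay trainable:

import torch
import torch.nn as nn

def toy_nerf():
    """Stand-in for a NeRF MLP mapping 3D points to (color, density)."""
    return nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 4))

asset = toy_nerf()                                    # e.g. the provided cat, loaded with trained weights
others = nn.ModuleList([toy_nerf() for _ in range(2)])
layouts = nn.Parameter(torch.zeros(4, 3, 3))          # N=4 layouts x 3 objects x translation

asset.requires_grad_(False)                           # freeze the asset's geometry and appearance
opt = torch.optim.Adam(list(others.parameters()) + [layouts], lr=1e-3)
# The asset is still composited into every render, so gradients from the image-space
# objective reach its layout (where it goes) but never its NeRF weights (what it is).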

Choose an asset:
😾 grumpy cat
🏍 sport motorbike
Choose a prompt:
“a cat wearing a santa costume holding a present next to a miniature christmas tree”
“a cat wearing a hawaiian shirt and sunglasses, having a drink on a beach towel”
Results viewer: the composited scene, the locked asset, and the other discovered objects (1–3).

Separating a NeRF into objects

Given a NeRF representing a scene and a caption, layout learning is able to parse the scene into the objects it contains without any per-object supervision. We accomplish this by requiring renders of one of the N learned layouts to match the same view rendered from the target NeRF, using a simple L_2 reconstruction loss.
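
A rough sketch of this objective, with placeholder functions render_layout and render_target standing in for actual volume rendering:

import torch

def render_layout(n: int) -> torch.Tensor:
    """Placeholder for volume rendering the composited objects under layout n."""
    return torch.rand(64, 64, 3, requires_grad=True)

def render_target() -> torch.Tensor:
    """Placeholder for rendering the same camera view from the given scene NeRF."""
    return torch.rand(64, 64, 3)

n = 0                                                 # the layout designated to match the target scene
recon_loss = ((render_layout(n) - render_target()) ** 2).mean()   # simple L2 reconstruction term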

Choose a prompt:
“a bird having some sushi and sake”
“two cute cats wearing baseball uniforms playing catch”
Results viewer: the original NeRF, its reconstruction, and the discovered objects (1–4).

Automatically arranging existing objects

Allowing gradients to flow only into layout parameters while freezing a set of provided 3D assets results in reasonable object configurations, such as a chair tucked into a table with spaghetti on it, despite no such guidance being provided in the text conditioning.
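
Schematically, only the layout parameters are handed to the optimizer in this setting (a toy sketch, not our actual implementation):

import torch

layouts = torch.nn.Parameter(torch.zeros(4, 3, 3))    # N=4 layouts x 3 frozen assets x translation
opt = torch.optim.Adam([layouts], lr=1e-2)            # only layout parameters ever reach the optimizer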

Choose assets:
🍝 spaghetti / 🪑 chair / 🔴 table
🦆 rubber duck / 🛁 bathtub / 🚿 shower head
🖥️ monitor / ⌨️ keyboard / 🖱️ mouse
Sampling different layouts

Our method discovers different plausible arrangements for objects. Here, we optimize each example over N=4 layouts and show differences in composited scenes, e.g. flamingos wading inside vs. beside the pond.

Choose a prompt:
“two flamingos sipping on cocktails in a desert oasis”
“a robe, a pair of slippers, and a candle”

Analysis

Ablation

We optimize different variants of our model with K=3 NeRFs on a list of 30 prompts, each containing three objects. Training K NeRFs provides some decomposition, but most objects are scattered across 2 or 3 models. Learning one layout alleviates some of these issues, but only with multiple layouts do we see strong disentanglement. We show two representative examples of emergent objects to visualize these differences.

Choose a prompt:
“a backpack, water bottle, and bag of chips”
“a slice of cake, vase of roses, and bottle of wine”
Choose a model variant:
Learn K NeRFs
+ learn 1 layout
+ learn N layouts
Per-object SDS
Limitations

Layout learning inherits failure modes from SDS, such as the Janus problem. It also may undesirably group objects that always move together, such as a horse and its rider, or segment objects in undesired ways, such as breaking off an arm from the ninja's body. For certain prompts that generate many small objects, choosing K correctly is challenging, hurting disentanglement. Finally, in some cases, despite different initial values, layouts converge to very similar final configurations.

Choose a prompt:
“some astronauts forming a human pyramid riding a horse”
“two fancy llamas enjoying a tea party”
“a ninja slicing different fruit in mid-air with a katana”
“a monkey having a whiskey and a cigar, using a typewriter”

Citation

@misc{epstein2024disentangled,
      title={Disentangled 3D Scene Generation with Layout Learning},
      author={Dave Epstein and Ben Poole and Ben Mildenhall and Alexei A. Efros and Aleksander Holynski},
      year={2024},
      eprint={2402.16936},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgements

We thank Dor Verbin, Ruiqi Gao, Lucy Chai, and Minyoung Huh for their helpful comments, and Arthur Brussee for help with an NGP implementation. DE was partly supported by the PD Soros Fellowship. DE conducted part of this research at Google, with additional funding from an ONR MURI grant.