BlobGAN: Spatially Disentangled Scene Representations

1UC Berkeley
2Adobe Research


Abstract

We propose an unsupervised, mid-level representation for a generative model of scenes. The representation is mid-level in that it is neither per-pixel nor per-image; rather, scenes are modeled as a collection of spatial, depth-ordered “blobs” of features. Blobs are differentiably placed onto a feature grid that is decoded into an image by a generative adversarial network. Due to the spatial uniformity of blobs and the locality inherent to convolution, our network learns to associate different blobs with different entities in a scene and to arrange these blobs to capture scene layout.

Figure: BlobGAN generator architecture (left) and blob splatting (right). The architecture incorporates spatial blob maps into the forward pass of an image generator. The layout network F maps from Gaussian noise to blob features and parameters. Blobs are splatted onto a feature grid and decoded by G into a realistic image.
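
As a rough illustration of splatting, the sketch below composites depth-ordered blobs onto a small feature grid. It is a simplification under stated assumptions, not the released implementation: blobs are isotropic, opacity is assumed to be a sigmoid of the scale minus a scaled squared distance to the blob center, and the names `Blob`, `splat`, and `FALLOFF` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Blob:
    x: float              # center column, in [0, 1]
    y: float              # center row, in [0, 1]
    s: float              # scale; larger covers more area, very negative hides the blob
    feature: List[float]  # feature vector carried by the blob


FALLOFF = 0.02  # assumed constant controlling how quickly opacity decays with distance


def sigmoid(v: float) -> float:
    # Numerically stable logistic function.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def splat(blobs: List[Blob], background: List[float], size: int) -> List[List[List[float]]]:
    """Alpha-composite blobs, back to front, onto a size x size grid of feature vectors."""
    grid = [[list(background) for _ in range(size)] for _ in range(size)]
    for blob in blobs:  # list order is depth order: later blobs occlude earlier ones
        for r in range(size):
            for c in range(size):
                px, py = (c + 0.5) / size, (r + 0.5) / size
                d2 = (px - blob.x) ** 2 + (py - blob.y) ** 2
                alpha = sigmoid(blob.s - d2 / FALLOFF)  # assumed opacity model
                cell = grid[r][c]
                for k, f in enumerate(blob.feature):
                    cell[k] = alpha * f + (1.0 - alpha) * cell[k]
    return grid


blobs = [Blob(0.25, 0.5, 2.0, [1.0, 0.0]), Blob(0.75, 0.5, 2.0, [0.0, 1.0])]
grid = splat(blobs, background=[0.0, 0.0], size=8)
hidden = splat([Blob(0.5, 0.5, -10.0, [1.0, 0.0])], background=[0.0, 0.0], size=8)
```

Every operation here is smooth in the blob parameters, which is what allows placement to be trained by gradient descent. In this toy model a strongly negative scale drives opacity toward zero everywhere, so the blob effectively vanishes from the grid.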

We demonstrate this emergent behavior by showing that, despite training without any supervision, our method enables applications such as easy manipulation of scene objects (e.g. moving, removing, and restyling furniture), creation of feasible scenes given constraints (e.g. plausible rooms with drawers at a particular location), and parsing of real-world images into component parts. On a challenging multi-category dataset of indoor scenes, BlobGAN outperforms StyleGAN2 on image quality as measured by FID.

Object editing

By changing properties of individual blobs, we can manipulate discovered entities in images of scenes.
Move blobs
Changing blob position x changes the location of objects in a scene.
Resize blobs
Changing blob scale s changes objects' size. Setting s to a negative value hides the blob, removing the entity from the scene.
Restyle blobs
Each blob has a structure feature \phi and a style feature \psi. By changing \psi, we can effortlessly restyle specific entities in a scene while leaving the remainder unaltered.
Clone blobs
Our decoder G also accepts out-of-distribution blob maps as input. Here, we show the effect of copying and pasting blobs corresponding to various entities. The resultant images depict scenes with the object faithfully duplicated. Our model is also able to render empty rooms by only passing in the background feature; see below.
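
The four edits above can be viewed as simple operations on a list of blob parameters. The sketch below is illustrative only: the field names follow the symbols used on this page, while the helper functions, the `hidden_scale` default, and the choice to append clones at the end of the list are assumptions. Decoding the edited blobs with G is not shown.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Blob:
    x: Tuple[float, float]   # position
    s: float                 # scale; a negative value hides the blob
    phi: Tuple[float, ...]   # structure feature
    psi: Tuple[float, ...]   # style feature


def _edit(blobs: List[Blob], i: int, **changes) -> List[Blob]:
    # Return a new list in which only blob i is modified.
    return [replace(b, **changes) if j == i else b for j, b in enumerate(blobs)]


def move(blobs: List[Blob], i: int, new_x: Tuple[float, float]) -> List[Blob]:
    return _edit(blobs, i, x=new_x)


def resize(blobs: List[Blob], i: int, new_s: float) -> List[Blob]:
    return _edit(blobs, i, s=new_s)


def hide(blobs: List[Blob], i: int, hidden_scale: float = -5.0) -> List[Blob]:
    # hidden_scale is an assumed value; the text only says "a negative value".
    return resize(blobs, i, hidden_scale)


def restyle(blobs: List[Blob], i: int, new_psi: Tuple[float, ...]) -> List[Blob]:
    return _edit(blobs, i, psi=new_psi)


def clone(blobs: List[Blob], i: int, new_x: Tuple[float, float]) -> List[Blob]:
    # Copy blob i to a new position; appending at the end is an assumption.
    return blobs + [replace(blobs[i], x=new_x)]


scene = [Blob((0.3, 0.6), 1.5, (0.1, 0.2), (0.9, 0.4))]
edited = clone(restyle(scene, 0, (0.0, 1.0)), 0, (0.7, 0.6))
```

Because each helper returns a new list, the original scene stays available for comparison with the edited one.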

Scene auto-completion

By inverting through our layout network F, we can sample scenes that match user-specified constraints. We do this by choosing different random noise vectors z^\text{init} and optimizing F\left(z^\text{init}\right) to fit desired blob properties, leaving other blob parameters to vary. Many conditional generation problems fit under this umbrella. We show a few examples here.
Fill in empty rooms
We can fill in empty rooms (shown below on left) by taking background features \phi_0, \psi_0 from a source image and finding other noise vectors z^\text{optim} that match them.
Better style transfer
Previous work allows taking the style from one image and swapping it into another image while preserving structure. We observe that not every image's style is compatible with another's structure. Using auto-complete to find new scenes with the same layout \{x_i, s_i, a_i, \theta_i\}_{i=1}^k leads to more faithful and photorealistic results than swapping randomly.
Partial layout conditioning
Other models learn to synthesize scenes based on partial layout as specified by class-labeled bounding boxes. BlobGAN allows the same functionality despite training without supervision by conditioning auto-complete on the j^\text{th} blob's parameters x_j, s_j, a_j, \theta_j. For example, we can find different rooms with dressers or beds at a given location.
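
The auto-completion procedure described in this section amounts to gradient descent on the noise vector with a loss over only the constrained blob parameters. The toy sketch below illustrates the idea under strong assumptions: the real layout network F is replaced by a hypothetical fixed map `layout` (a small matrix followed by tanh), there are two blobs with one position and one scale value each, and the learning rate and step count are arbitrary.

```python
import math
import random
from typing import Dict, List, Sequence

# Hypothetical stand-in for the layout network F: maps a 3-D noise vector
# to four blob parameters, squashed into (-1, 1) by tanh.
W = [
    [0.5, -0.3, 0.2],   # output 0: blob 0, position
    [0.1, 0.4, -0.5],   # output 1: blob 0, scale
    [-0.4, 0.2, 0.3],   # output 2: blob 1, position
    [0.3, 0.1, 0.4],    # output 3: blob 1, scale
]


def layout(z: Sequence[float]) -> List[float]:
    return [math.tanh(sum(w * zi for w, zi in zip(row, z))) for row in W]


def autocomplete(z_init: Sequence[float], targets: Dict[int, float],
                 lr: float = 0.2, steps: int = 2000) -> List[float]:
    """Optimize z so the constrained outputs of layout(z) match targets.

    targets maps an output index to its desired value; all other outputs are free.
    """
    z = list(z_init)
    for _ in range(steps):
        y = layout(z)
        grad = [0.0] * len(z)
        for i, t in targets.items():
            # Gradient of (tanh(w . z) - t)^2 with respect to z.
            g = 2.0 * (y[i] - t) * (1.0 - y[i] ** 2)
            for j, w in enumerate(W[i]):
                grad[j] += g * w
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z


targets = {0: 0.4, 1: -0.2}  # pin blob 0's position and scale, leave blob 1 free
samples = []
for seed in range(3):
    rng = random.Random(seed)
    z_init = [rng.gauss(0.0, 1.0) for _ in range(3)]
    samples.append(layout(autocomplete(z_init, targets)))
```

Each starting point satisfies the same constraints on blob 0 but generally ends with different values for blob 1, which is how different initial noise vectors yield varied scenes that share the requested properties.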

Multi-dataset BlobGAN

In addition to models trained on images of bedrooms, we train a model on conference rooms as well as one on a multi-category dataset of kitchens, living rooms, and dining rooms. As with bedrooms, these models discover entities in the scene, even across different classes.
Move blobs
Interestingly, we find that the same blob (shown in coral moving below) corresponds to coffee tables, kitchen islands, and dining tables, depending on structure vector \phi.
Resize blobs
Customize your WFH background!
Look like you're hard at work by dragging an office desk blob into view or putting some big screens on the wall behind you.

Parsing real images

Our model is not limited to generated images and can also be applied to parse real images into component regions and edit them. Here, we show results on editing unseen images of real bedrooms.

Citation

@misc{epstein2022blobgan,
      title={BlobGAN: Spatially Disentangled Scene Representations},
      author={Dave Epstein and Taesung Park and Richard Zhang and Eli Shechtman and Alexei A. Efros},
      year={2022},
      eprint={2205.02837},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}		

Acknowledgements

We thank Allan Jabri, Assaf Shocher, Bill Peebles, Tim Brooks, and Yossi Gandelsman for endless insightful discussions and important feedback, and especially thank Vickie Ye for advice on blob compositing, splatting, and visualization. Thanks also to Georgios Pavlakos for deadline-week pixel inspection and Shiry Ginosar for post-deadline-week guidance and helpful comments. Research was supported in part by the DARPA MCS program. Dave is supported by the PD Soros Fellowship.