BlobGAN: Spatially Disentangled Scene Representations
Abstract
We propose an unsupervised, mid-level representation for a generative model of scenes. The representation is mid-level in that it is neither per-pixel nor per-image; rather, scenes are modeled as a collection of spatial, depth-ordered “blobs” of features. Blobs are differentiably placed onto a feature grid that is decoded into an image by a generative adversarial network. Due to the spatial uniformity of blobs and the locality inherent to convolution, our network learns to associate different blobs with different entities in a scene and to arrange these blobs to capture scene layout.
We demonstrate this emergent behavior by showing that, despite training without any supervision, our method enables applications such as easy manipulation of scene objects (e.g. moving, removing, and restyling furniture), creation of feasible scenes given constraints (e.g. plausible rooms with drawers at a particular location), and parsing of real-world images into component parts. On a challenging multi-category dataset of indoor scenes, BlobGAN outperforms StyleGAN2 on image quality as measured by FID.
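To make the representation concrete, below is a minimal sketch in Python of how blobs might be splatted onto a feature grid. It simplifies the model to isotropic Gaussian blobs composited back to front; the function name, shapes, and falloff are our own illustration, not the released BlobGAN code, which uses oriented, soft-edged ellipses.

# Minimal sketch of blob splatting, not the official implementation.
# Assumes isotropic Gaussian blobs for simplicity; `splat_blobs` is our name.
import torch

def splat_blobs(centers, scales, feats, bg_feat, grid_size=16):
    """Composite K depth-ordered blobs onto a (C, H, W) feature grid.

    centers: (K, 2) blob centers in [0, 1] x [0, 1]
    scales:  (K,)   blob sizes (std. dev. in grid coordinates)
    feats:   (K, C) per-blob feature vectors
    bg_feat: (C,)   background feature vector
    """
    K, C = feats.shape
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, grid_size),
        torch.linspace(0, 1, grid_size),
        indexing="ij",
    )
    grid = torch.stack([xs, ys], dim=-1)                        # (H, W, 2)

    # Per-blob opacity: Gaussian falloff from the blob center.
    d2 = ((grid[None] - centers[:, None, None]) ** 2).sum(-1)   # (K, H, W)
    alpha = torch.exp(-d2 / (2 * scales[:, None, None] ** 2))   # (K, H, W)

    # Alpha-composite back to front: later blobs occlude earlier ones.
    out = bg_feat[:, None, None].expand(C, grid_size, grid_size).clone()
    for k in range(K):
        a = alpha[k]                                             # (H, W)
        out = (1 - a) * out + a * feats[k][:, None, None]
    return out                                                   # (C, H, W)

# Example: splat 3 random blobs onto a 16x16 grid of 8-dim features.
feature_grid = splat_blobs(
    centers=torch.rand(3, 2), scales=torch.full((3,), 0.1),
    feats=torch.randn(3, 8), bg_feat=torch.zeros(8),
)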
Object editing
By changing properties of individual blobs, we can manipulate discovered entities in images of scenes.
Move blobs
Changing blob position moves the corresponding object around the scene.
Resize blobs
Changing blob scale resizes the corresponding object; shrinking a blob all the way removes the object from the scene entirely.
Restyle blobs
Each blob has a structure feature and a style feature. Swapping in a new style feature changes the appearance of the corresponding object while leaving the rest of the scene intact.
Clone blobs
Our decoder renders an additional copy of an object when its blob is duplicated, letting us populate scenes with new entities.
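These edits all reduce to simple changes of blob parameters followed by re-rendering. The sketch below illustrates this with a hypothetical parameter dictionary; the field names and shapes are our assumptions, not the released BlobGAN interface.

# Hedged sketch of how blob edits reduce to parameter changes; the dict layout
# and field names here are our assumptions, not the released BlobGAN interface.
import torch

K, D = 10, 256                           # number of blobs, feature dimension
# In BlobGAN these parameters come from the layout network; we fake them here.
blobs = {
    "xy":        torch.rand(1, K, 2),    # blob centers in [0, 1]^2
    "scale":     torch.rand(1, K),       # blob sizes
    "style":     torch.randn(1, K, D),   # per-blob style features
    "structure": torch.randn(1, K, D),   # per-blob structure features
}

k = 5
blobs["xy"][:, k, 0] += 0.2                   # move: shift one blob to the right
blobs["scale"][:, k] *= 1.5                   # resize: enlarge the object
blobs["scale"][:, 3] = 0.0                    # remove: shrink another blob away
blobs["style"][:, k] = blobs["style"][:, 7]   # restyle: borrow another blob's style

# Clone: append a shifted copy of blob k so its object appears twice.
for name in blobs:
    blobs[name] = torch.cat([blobs[name], blobs[name][:, k:k + 1]], dim=1)
blobs["xy"][:, -1, 0] -= 0.3

# The edited parameters would then be splatted and decoded back into an image.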
Scene auto-completion
By inverting through our layout network, we can generate plausible scenes that satisfy constraints on a subset of blobs (e.g. rooms with drawers at a particular location), as sketched at the end of this section.
Fill in empty rooms
We can fill in empty rooms (shown below on left) by taking background features from an empty scene and auto-completing the remaining blobs to furnish it.
Better style transfer
Previous work allows taking the style from one image and swapping it into another image while preserving structure. We observe that not every image's style is compatible with another's structure. Using auto-completion to find new scenes with the same layout lets us pair each style with a compatible structure, yielding more convincing style transfer.
Partial layout conditioning
Other models learn to synthesize scenes based on partial layouts specified by class-labeled bounding boxes. BlobGAN offers the same functionality, despite training without supervision, by conditioning auto-completion on a user-specified subset of blobs.
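One plausible way to implement such constrained auto-completion is to optimize the latent code so that the layout network places a chosen blob where the user asks, leaving the remaining blobs free to vary. The toy layout network and loss below are stand-ins for illustration, not the actual BlobGAN pipeline.

# Toy sketch of constraint-based auto-completion: optimize z so that one blob
# lands at a target position. The tiny `layout_net` is a stand-in only.
import torch
import torch.nn as nn

K = 10
layout_net = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, K * 2))

target_xy = torch.tensor([0.25, 0.75])           # desired center for blob k
k = 3

z = torch.randn(1, 512, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)
for _ in range(200):
    xy = layout_net(z).view(1, K, 2).sigmoid()   # predicted blob centers
    loss = ((xy[0, k] - target_xy) ** 2).sum()   # match the constrained blob only
    opt.zero_grad(); loss.backward(); opt.step()

# Each optimized (or re-sampled) z now yields a full scene whose blob k sits
# near the requested location; the unconstrained blobs complete the scene.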
Multi-dataset BlobGAN
In addition to models trained on images of bedrooms, we train a model on conference rooms as well as one on a multi-category dataset of kitchens, living rooms, and dining rooms. As above, our model discovers entities in the scene, even across different classes.
Move blobs
Interestingly, we find that the same blob (shown in coral moving below) corresponds to coffee tables, kitchen islands, and dining tables, depending on its structure vector.
Resize blobs
Customize your WFH background!
Look like you're hard at work by dragging an office desk blob into view or putting some big screens on the wall behind you.
Parsing real images
Our model is not limited to generated images and can also be applied to parse real images into component regions and edit them. Here, we show results on editing unseen images of real bedrooms.
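As a rough, hypothetical illustration of parsing by inversion, the sketch below optimizes blob centers and features so that a splatted-and-decoded reconstruction matches a photo. Both the one-layer decoder and the plain MSE loss are stand-ins for the full inversion procedure described in the paper.

# Toy sketch of parsing a real image by reconstruction; the 1x1-conv "decoder",
# the soft pixel-to-blob assignment, and the MSE loss are all stand-ins.
import torch
import torch.nn.functional as F

H = W = 16
K, C = 6, 8
decoder = torch.nn.Conv2d(C, 3, 1)               # stand-in for the real decoder
real_image = torch.rand(1, 3, H, W)              # stand-in for a real bedroom photo

centers = torch.rand(K, 2, requires_grad=True)
feats = torch.randn(K, C, requires_grad=True)

ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
grid = torch.stack([xs, ys], -1)                 # (H, W, 2)

opt = torch.optim.Adam([centers, feats], lr=0.05)
for _ in range(300):
    d2 = ((grid[None] - centers[:, None, None]) ** 2).sum(-1)    # (K, H, W)
    alpha = torch.softmax(-d2 / 0.01, dim=0)                     # soft assignment per pixel
    feat_grid = torch.einsum("khw,kc->chw", alpha, feats)[None]  # (1, C, H, W)
    loss = F.mse_loss(decoder(feat_grid), real_image)
    opt.zero_grad(); loss.backward(); opt.step()

# `centers` and `feats` now act as a blob-level parse of the photo and can be
# edited and re-decoded just like the parameters of generated scenes.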
Citation
@article{epstein2022blobgan,
  title={BlobGAN: Spatially Disentangled Scene Representations},
  author={Dave Epstein and Taesung Park and Richard Zhang and Eli Shechtman and Alexei A. Efros},
  journal={European Conference on Computer Vision (ECCV)},
  year={2022}
}
Acknowledgements
We thank Allan Jabri, Assaf Shocher, Bill Peebles, Tim Brooks, and Yossi Gandelsman for endless insightful discussions and important feedback, and especially thank Vickie Ye for advice on blob compositing, splatting, and visualization. Thanks also to Georgios Pavlakos for deadline-week pixel inspection and Shiry Ginosar for post-deadline-week guidance and helpful comments. Research was supported in part by the DARPA MCS program and a gift from Adobe Research. This work was started while DE was an intern at Adobe.