Diffusion Self-Guidance for Controllable Image Generation

Dave Epstein, Allan Jabri, Ben Poole, Alexei A. Efros, Aleksander Holynski
UC Berkeley · Google Research
NeurIPS 2023
Teaser (interactive): for “a giant macaron and a croissant in the seine with the eiffel tower visible” and “a meatball and a donut falling from the clouds onto a neighborhood”, original samples are shown alongside edits that move, resize, swap, or replace an object, or copy the scene's appearance or layout, without changing the rest of the scene.

TL;DR: Self-guidance is a method for controllable image generation that guides sampling using only the attention and activations of a pretrained diffusion model.

Without any extra models or training, you can move or resize objects, or even replace them with items from real images, without changing the rest of the scene. You can also borrow the appearance of another image or rearrange scenes into a desired layout.

Abstract

Large-scale generative models are capable of producing high-quality images from detailed text descriptions. However, many aspects of an image are difficult or impossible to convey through text. We introduce self-guidance, a method that provides greater control over generated images by guiding the internal representations of diffusion models. We demonstrate that properties such as the shape, location, and appearance of objects can be extracted from these representations and used to steer sampling.

Self-guidance works similarly to classifier guidance, but uses signals present in the pretrained model itself, requiring no additional models or training. We show how a simple set of properties can be composed to perform challenging image manipulations, such as modifying the position or size of objects, merging the appearance of objects in one image with the layout of another, composing objects from many images into one, and more. We also show that self-guidance can be used to edit real images.
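
To make the mechanism concrete, here is a minimal sketch of one self-guided denoising step, written in PyTorch style. All names are illustrative assumptions rather than the paper's implementation: unet is assumed to be a wrapper that returns the noise prediction together with its cross-attention maps and intermediate activations, and energy_fn is any scalar function of those internals (for example, the distance between an object's attention centroid and a target position).

# Minimal sketch of one self-guided denoising step (hypothetical names).
import torch

def self_guided_noise(unet, z_t, t, cond, uncond, energy_fn,
                      cfg_scale=7.5, sg_scale=4.0, sigma_t=1.0):
    """Classifier-free-guided noise prediction plus a self-guidance term.

    energy_fn maps the model's internal cross-attention maps and activations
    to a scalar; its gradient with respect to the noisy latent z_t steers
    sampling, just like the classifier gradient in classifier guidance.
    """
    z = z_t.detach().requires_grad_(True)
    # Assumption: this U-Net wrapper also returns cross-attention maps and
    # intermediate activations from the conditional forward pass.
    eps_cond, attn, feats = unet(z, t, cond)
    with torch.no_grad():
        eps_uncond, _, _ = unet(z, t, uncond)
    g = energy_fn(attn, feats)              # e.g. distance of an object centroid to a target
    grad = torch.autograd.grad(g, z)[0]     # gradient of the energy w.r.t. the latent
    eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)
    return eps + sg_scale * sigma_t * grad  # added like a classifier-guidance term

The only difference from classifier guidance is the source of the gradient: it comes from the model's own attention and activations rather than from an external classifier.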

Results

Move and resize objects

Using self-guidance to change only the properties of one object, we can move or resize that object without modifying the rest of the image. Pick a prompt and an edit and explore for yourself.

Interactive demo. Prompts: “a raccoon in a barrel going down a waterfall”, “distant shot of the tokyo tower with a massive sun in the sky”, “a fluffy cat sitting on a museum bench looking at an oil painting of cheese”. Edits: move, shrink, enlarge. Each result is shown as original vs. edited.
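
As a rough illustration of the properties behind these edits, the sketch below computes an object's position and size from its cross-attention map and turns them into guidance energies. It assumes attn_k is a single token's attention map of shape (H, W), averaged over heads and layers; the function names are hypothetical.

# Illustrative position and size properties computed from a token's attention map.
import torch

def centroid(attn_k):
    """Center of mass of a token's (H, W) attention map, in normalized [0, 1] coordinates."""
    H, W = attn_k.shape
    w = attn_k / (attn_k.sum() + 1e-8)
    ys = torch.linspace(0.0, 1.0, H, device=attn_k.device)
    xs = torch.linspace(0.0, 1.0, W, device=attn_k.device)
    return torch.stack([(w.sum(dim=0) * xs).sum(),   # x: per-column mass weighted by column position
                        (w.sum(dim=1) * ys).sum()])  # y: per-row mass weighted by row position

def size(attn_k):
    """Mean attention mass, a proxy for how much of the image the object covers."""
    return attn_k.mean()

def move_energy(attn_k, target_xy):
    """Pull the object's centroid toward a target (x, y) tensor of shape (2,)."""
    return (centroid(attn_k) - target_xy).abs().sum()

def resize_energy(attn_k, original_size, factor):
    """Scale the object's size relative to its value in the original sample."""
    return (size(attn_k) - factor * original_size).abs()

Keeping the other tokens' properties pinned to their original values is what leaves the rest of the image untouched.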
Appearance transfer from real images

By guiding the appearance of a generated object to match that of an object in a real image, we can create scenes depicting an object from real life, similarly to DreamBooth, but without any fine-tuning and using only a single image.

Interactive demo. Prompts: “a photo of a chow chow wearing a ... outfit”, “a DSLR photo of a teapot...”. Edits: “purple wizard”, “chef”, “superman”. Columns: real image, ours, DreamBooth.
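
One concrete way to read "appearance" here is as an attention-weighted spatial average of intermediate features, compared against the same descriptor extracted from the reference image (for example, by running the model on a noised copy of it). The sketch below assumes feats is a (C, H, W) activation map aligned with a token's (H, W) attention map; names are illustrative, not the exact implementation.

# Illustrative appearance descriptor and appearance-transfer energy.
import torch

def appearance(attn_k, feats):
    """Attention-masked spatial average of (C, H, W) features -> a (C,) descriptor."""
    mask = attn_k / (attn_k.sum() + 1e-8)
    return (feats * mask.unsqueeze(0)).sum(dim=(1, 2))

def appearance_transfer_energy(attn_k, feats, ref_attn_k, ref_feats):
    """Match a generated object's appearance descriptor to one extracted from a
    reference image, leaving its shape and position free to follow the prompt."""
    return (appearance(attn_k, feats) - appearance(ref_attn_k, ref_feats)).abs().mean()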
Real image editing

Our method also enables the spatial manipulation of objects in real images.

Interactive demo. Prompts: “an eclair and a shot of espresso”, “a hot dog, fries, and a soda on a solid background”. Edits: shrink width, reconstruct, move, enlarge, restyle. Each result is shown as original vs. edited.
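
A sketch of how such an edit can be composed: extract per-token properties from the real image, then guide sampling to reconstruct every property except the one being edited. The helper names, tensor shapes (per-token attention maps of shape (K, H, W), features of shape (C, H, W)), and the equal weighting are assumptions.

# Illustrative energy for editing a real image: reconstruct everything except the edit.
import torch

def masked_mean_feature(attn_k, feats):
    """(C,) appearance descriptor: attention-masked spatial average of (C, H, W) features."""
    mask = attn_k / (attn_k.sum() + 1e-8)
    return (feats * mask.unsqueeze(0)).sum(dim=(1, 2))

def real_edit_energy(attn, feats, ref_attn, ref_feats, edit_idx, edit_energy):
    """attn / ref_attn: (K, H, W) per-token attention maps for the sample / real image.
    Every token except edit_idx is held to its reference shape and appearance,
    while edit_energy (e.g. a move or resize term) drives the actual change."""
    total = edit_energy
    for k in range(attn.shape[0]):
        if k == edit_idx:
            continue
        total = total + (attn[k] - ref_attn[k]).abs().mean()  # keep shape and position
        total = total + (masked_mean_feature(attn[k], feats)
                         - masked_mean_feature(ref_attn[k], ref_feats)).abs().mean()  # keep appearance
    return total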
Sample new appearances

By guiding object shapes to reconstruct an image's layout, we can sample new appearances for a given scene. We compare to ControlNet v1.1-Depth and Prompt-to-Prompt. Switch between the different styles below.

Interactive demo. Prompts: “a bear wearing a suit eating his birthday cake out of the fridge in a dark kitchen”, “a parrot riding a horse down a city street”. Options: appearance 1, appearance 2, appearance 3, ControlNet, Prompt-to-Prompt. Each result is shown as original vs. edited.
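
The corresponding energy can be sketched very simply: match each token's attention map to the one extracted from the original image and leave appearance unconstrained, so repeated samples vary in texture and color. A minimal sketch, assuming (K, H, W) attention tensors and illustrative names:

# Illustrative layout-only energy: fix shapes, leave appearance free.
import torch

def layout_energy(attn, ref_attn):
    """attn / ref_attn: (K, H, W) per-token attention maps for the sample / original image.
    Matching every token's map to its reference pins the layout; because appearance
    is left unguided, repeated samples produce new colors, textures, and styles."""
    return (attn - ref_attn).abs().mean()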
Mix-and-match

By guiding samples to take object shapes from one image and appearance from another, we can rearrange images into the layouts of other scenes. We can also sample new layouts for a scene by guiding only its appearance. Find your favorite combination below.

Interactive demo for “a suitcase, a bowling ball, and a phone washed up on a beach after a shipwreck”: choose an appearance source (#1–#4) and a layout source (#1–#4, or random #1/#2); columns show appearance, layout, and the combined result.
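
A sketch of the combined energy, assuming per-token attention maps of shape (K, H, W) and feature maps of shape (C, H, W) for the sample, the "layout" image, and the "appearance" image; names and weights are illustrative.

# Illustrative mix-and-match energy: shapes from one image, appearance from another.
import torch

def masked_mean_feature(attn_k, feats):
    """(C,) appearance descriptor: attention-masked spatial average of (C, H, W) features."""
    mask = attn_k / (attn_k.sum() + 1e-8)
    return (feats * mask.unsqueeze(0)).sum(dim=(1, 2))

def mix_and_match_energy(attn, feats,
                         layout_attn,           # per-token attention maps from the "layout" image
                         app_attn, app_feats,   # attention maps and features from the "appearance" image
                         w_shape=1.0, w_app=1.0):
    total = 0.0
    for k in range(attn.shape[0]):
        total = total + w_shape * (attn[k] - layout_attn[k]).abs().mean()
        total = total + w_app * (masked_mean_feature(attn[k], feats)
                                 - masked_mean_feature(app_attn[k], app_feats)).abs().mean()
    return total

Dropping the shape term and keeping only the appearance term gives the "random layout" variants shown above.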
Compositional generation

A new scene can be created by collaging individual objects from different images (the first three columns here). Alternatively, when objects cannot be combined at their original locations because the source images' layouts are incompatible (as in the bottom row, marked *), we can borrow only their appearance and specify the layout with a new image to produce a composition (last two columns).

Interactive demo. Prompts: “a picnic blanket, a fruit tree, and a car by the lake”; “a top-down photo of a tea kettle, a bowl of fruit, and a cup of matcha”; “a dog wearing a knit sweater and a baseball cap drinking a cocktail”. For each prompt, the first three columns take individual objects from different images, followed by the collaged result, a target layout, and the final composition; the asterisk marks the bottom-row case described above.
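
This composition can be sketched as a sum of per-object terms, each referencing a different source image: either an object's shape and appearance (to collage it in place) or its appearance alone together with a separate layout target when the source layouts clash. The dictionary interface and names below are hypothetical.

# Illustrative compositional energy: each object references its own source image.
import torch

def masked_mean_feature(attn_k, feats):
    """(C,) appearance descriptor: attention-masked spatial average of (C, H, W) features."""
    mask = attn_k / (attn_k.sum() + 1e-8)
    return (feats * mask.unsqueeze(0)).sum(dim=(1, 2))

def compose_energy(attn, feats, sources, layout_attn=None):
    """sources[k] = (src_attn_k, src_feats): object k's reference attention map and features,
    each taken from a different image. If layout_attn is given, shapes follow that new
    layout; otherwise each object keeps the shape it had in its source image."""
    total = 0.0
    for k, (src_attn_k, src_feats) in sources.items():
        shape_target = layout_attn[k] if layout_attn is not None else src_attn_k
        total = total + (attn[k] - shape_target).abs().mean()
        total = total + (masked_mean_feature(attn[k], feats)
                         - masked_mean_feature(src_attn_k, src_feats)).abs().mean()
    return total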
Manipulating non-objects

The properties of any word in the input prompt can be manipulated, not only nouns. Here, we show examples of relocating adjectives and verbs. The last example shows a case in which additional self-guidance can correct improper attribute binding.

Examples (interactive): moving “laughing” to the right in “a cat and a monkey laughing on a road” (original vs. modified); placing “messy” at (0.3, 0.6) vs. (0.8, 0.8) in “a messy room”; and moving “red” to the jacket and “yellow” to the shoes in “green hat, blue book, yellow shoes, red jacket” (original vs. fixed).
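
Nothing in the properties above is specific to nouns: they operate on the cross-attention map of any prompt token, including adjectives and verbs. One plausible reading of the attribute-binding fix, sketched with hypothetical names, is to pull each adjective's attention centroid onto the centroid of the noun it should modify:

# Illustrative attribute-binding energy for non-noun tokens (hypothetical names).
import torch

def token_centroid(attn_k):
    """Center of mass of any prompt token's (H, W) attention map, in normalized coordinates."""
    H, W = attn_k.shape
    w = attn_k / (attn_k.sum() + 1e-8)
    ys = torch.linspace(0.0, 1.0, H, device=attn_k.device)
    xs = torch.linspace(0.0, 1.0, W, device=attn_k.device)
    return torch.stack([(w.sum(dim=0) * xs).sum(), (w.sum(dim=1) * ys).sum()])

def attribute_binding_energy(attn, pairs):
    """attn: (K, H, W) cross-attention maps indexed by prompt-token position.
    pairs: (adjective_idx, noun_idx) index tuples, e.g. for "red" and "jacket"; pulling
    each adjective's attention centroid onto its noun's encourages correct binding."""
    total = 0.0
    for adj_idx, noun_idx in pairs:
        total = total + (token_centroid(attn[adj_idx]) - token_centroid(attn[noun_idx])).abs().sum()
    return total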
Limitations

Setting high guidance weights for appearance terms tends to introduce unwanted leakage of object position. Similarly, while heavy guidance on a single word's shape matches that object's layout as expected, high guidance on all token shapes also leaks appearance information. Finally, in some cases objects are entangled in attention space, making it difficult to control them independently.

Examples (interactive): appearance features leak layout (“a squirrel trying to catch a lime mid-air”, unguided vs. lime-guided); multi-token layout guidance leaks appearance (“a picture of a cake”, real image vs. layout-guided); interacting objects are entangled (“a potato sitting on a couch with a bowl of popcorn watching football”, original vs. moving the potato).

Citation

@inproceedings{epstein2023selfguidance,
  title={Diffusion Self-Guidance for Controllable Image Generation},
  author={Epstein, Dave and Jabri, Allan and Poole, Ben and Efros, Alexei A. and Holynski, Aleksander},
  booktitle={Advances in Neural Information Processing Systems},
  year={2023}
}

Acknowledgements

We thank Oliver Wang, Jason Baldridge, Lucy Chai, and Minyoung Huh for their helpful comments. Dave is supported by the PD Soros Fellowship. Dave and Allan conducted part of this research at Google, with additional funding provided by DARPA MCS and ONR MURI.