Diffusion Self-Guidance for Controllable Image Generation
Abstract
Large-scale generative models are capable of producing high-quality images from detailed text descriptions. However, many aspects of an image are difficult or impossible to convey through text. We introduce self-guidance, a method that provides greater control over generated images by guiding the internal representations of diffusion models. We demonstrate that properties such as the shape, location, and appearance of objects can be extracted from these representations and used to steer sampling.
Self-guidance works similarly to classifier guidance, but uses signals present in the pretrained model itself, requiring no additional models or training. We show how a simple set of properties can be composed to perform challenging image manipulations, such as modifying the position or size of objects, merging the appearance of objects in one image with the layout of another, composing objects from many images into one, and more. We also show that self-guidance can be used to edit real images.
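To make the mechanism concrete, here is a minimal PyTorch sketch of the kind of update self-guidance performs: a scalar property is computed from an internal attention map, and its gradient with respect to the noisy latent is added to the predicted noise, analogous to classifier guidance. The toy denoiser, the centroid property, and the guidance scale below are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    # Stand-in for a text-to-image U-Net that also exposes an attention map.
    def __init__(self, channels=4):
        super().__init__()
        self.eps_head = nn.Conv2d(channels, channels, 3, padding=1)
        self.attn_head = nn.Conv2d(channels, 1, 1)

    def forward(self, x_t):
        eps = self.eps_head(x_t)                              # predicted noise
        attn = self.attn_head(x_t).flatten(1).softmax(-1)     # (B, H*W), sums to 1
        return eps, attn.view(x_t.shape[0], *x_t.shape[2:])   # (B, H, W)

def centroid(attn):
    # Center of mass of each normalized attention map, in [0, 1]^2.
    b, h, w = attn.shape
    ys = torch.linspace(0, 1, h, device=attn.device).view(1, h, 1)
    xs = torch.linspace(0, 1, w, device=attn.device).view(1, 1, w)
    return torch.stack([(attn * xs).sum((1, 2)), (attn * ys).sum((1, 2))], dim=-1)

def self_guided_eps(model, x_t, target_xy, scale=7.5):
    # eps_hat = eps + scale * d/dx_t || centroid(attn(x_t)) - target ||_1
    x_t = x_t.detach().requires_grad_(True)
    eps, attn = model(x_t)
    loss = (centroid(attn) - target_xy).abs().sum()
    grad = torch.autograd.grad(loss, x_t)[0]
    return (eps + scale * grad).detach()

model = ToyDenoiser()
x_t = torch.randn(1, 4, 16, 16)
eps_hat = self_guided_eps(model, x_t, target_xy=torch.tensor([[0.8, 0.5]]))
print(eps_hat.shape)  # torch.Size([1, 4, 16, 16])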
Results
Move and resize objects
Using self-guidance to change only the properties of one object, we can move or resize that object without modifying the rest of the image. Pick a prompt and an edit and explore for yourself.
Prompts: “a raccoon in a barrel going down a waterfall” · “distant shot of the tokyo tower with a massive sun in the sky” · “a fluffy cat sitting on a museum bench looking at an oil painting of cheese”
Edits: move ↑ · move ↓ · move ← · move → · shrink · enlarge
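As a hedged sketch of how the edits above could be expressed, an object's position can be reduced to the centroid of its attention map and its size to a soft area; moving or resizing then means guiding these quantities toward shifted or scaled targets. The definitions below are illustrative rather than the paper's exact formulas.

import torch

def object_centroid(attn):                      # attn: (H, W), nonnegative
    attn = attn / attn.sum().clamp_min(1e-8)
    h, w = attn.shape
    ys = torch.linspace(0, 1, h).view(h, 1)
    xs = torch.linspace(0, 1, w).view(1, w)
    return torch.stack([(attn * xs).sum(), (attn * ys).sum()])

def object_size(attn, threshold=0.5):
    # Soft area: mean of a sigmoid-thresholded, max-normalized attention map.
    a = attn / attn.max().clamp_min(1e-8)
    return torch.sigmoid(10.0 * (a - threshold)).mean()

def move_loss(attn, delta_xy):
    # Pull the centroid toward a copy of itself shifted by delta_xy.
    target = (object_centroid(attn).detach() + delta_xy).clamp(0, 1)
    return (object_centroid(attn) - target).abs().sum()

def resize_loss(attn, factor):
    # Pull the soft area toward a scaled copy of its current value.
    target = (object_size(attn).detach() * factor).clamp(0, 1)
    return (object_size(attn) - target).abs()

attn = torch.rand(64, 64, requires_grad=True)
print(move_loss(attn, torch.tensor([0.2, 0.0])), resize_loss(attn, 1.5))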
Appearance transfer from real images
By guiding the appearance of a generated object to match that of an object in a real image, we can create scenes depicting an object from real life, similarly to DreamBooth, but without any fine-tuning and using only a single image.
Prompts: “a photo of a chow chow wearing a ... outfit” · “a DSLR photo of a teapot...”
Appearance options: “purple wizard” · “chef” · “superman”
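One way to make “appearance” concrete, sketched below under assumed tensor shapes, is an attention-weighted average of intermediate features; guidance then matches this descriptor between the sample being generated and the real reference image. This is an illustration, not the released objective.

import torch

def appearance(features, attn):
    # features: (C, H, W) intermediate activations; attn: (H, W) object attention map.
    w = attn / attn.sum().clamp_min(1e-8)
    return (features * w.unsqueeze(0)).sum(dim=(1, 2))   # (C,) per-object descriptor

def appearance_loss(gen_features, gen_attn, ref_features, ref_attn):
    # Match the generated object's descriptor to the (frozen) real-image descriptor.
    return (appearance(gen_features, gen_attn)
            - appearance(ref_features, ref_attn).detach()).abs().mean()

gen_f, gen_a = torch.randn(320, 32, 32, requires_grad=True), torch.rand(32, 32)
ref_f, ref_a = torch.randn(320, 32, 32), torch.rand(32, 32)
print(appearance_loss(gen_f, gen_a, ref_f, ref_a))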
Real image editing
Our method also enables the spatial manipulation of objects in real images.
Prompts: “an eclair and a shot of espresso” · “a hot dog, fries, and a soda on a solid background”
Edits: shrink width · reconstruct · move · enlarge · restyle
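A hedged way to read the edits above is to pose a real-image edit as “reconstruct every extracted property except the one being changed.” The helper below assumes the property values have already been collected into dictionaries; its names and weights are illustrative.

import torch

def edit_loss(props_gen, props_ref, edited_key, edit_target, w_keep=1.0, w_edit=3.0):
    # Keep all reference properties fixed except `edited_key`, which is pushed
    # toward `edit_target` (e.g. a new centroid or size).
    keep = sum((props_gen[k] - props_ref[k].detach()).abs().mean()
               for k in props_gen if k != edited_key)
    edit = (props_gen[edited_key] - edit_target).abs().mean()
    return w_keep * keep + w_edit * edit

props_ref = {"appearance": torch.randn(320),
             "centroid": torch.tensor([0.3, 0.5]),
             "size": torch.tensor(0.2)}
props_gen = {k: v.clone().requires_grad_(True) for k, v in props_ref.items()}
print(edit_loss(props_gen, props_ref, "centroid", edit_target=torch.tensor([0.6, 0.5])))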
Sample new appearances
By guiding object shapes toward reconstruction of an image's layout, we can sample new appearances for a given scene. We compare to ControlNet v1.1-Depth and Prompt-to-Prompt. Switch between the different styles below.
Prompts: “a bear wearing a suit eating his birthday cake out of the fridge in a dark kitchen” · “a parrot riding a horse down a city street”
Options: appearance 1 · appearance 2 · appearance 3 · controlnet · prompt-to-prompt
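A minimal sketch of this idea, under assumed shapes: freeze the per-token attention maps recorded for a reference image and guide the new sample's maps toward them, leaving appearance unconstrained so it can be resampled.

import torch

def layout_loss(gen_attn, ref_attn):
    # gen_attn, ref_attn: (num_tokens, H, W) per-token attention maps.
    return (gen_attn - ref_attn.detach()).abs().mean()

gen = torch.rand(8, 32, 32, requires_grad=True)
ref = torch.rand(8, 32, 32)
print(layout_loss(gen, ref))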
Mix-and-match
By guiding samples to take object shapes from one image and appearance from another, we can rearrange images into layouts from other scenes. We can also sample new layouts of a scene by guiding only appearance. Find your favorite combination below.
Shape sources: #1 · #2 · #3 · #4
Appearance sources: #1 · #2 · #3 · #4
Random layouts: random #1 · random #2
Prompt: “a suitcase, a bowling ball, and a phone washed up on a beach after a shipwreck”
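One possible combined objective, again sketched with assumed shapes and weights: a shape term computed against the layout image plus a per-token appearance term computed against the style image.

import torch

def mix_and_match_loss(gen_attn, gen_feats, layout_attn, style_attn, style_feats,
                       w_shape=1.0, w_app=1.0):
    # gen_attn / layout_attn / style_attn: (T, H, W); gen_feats / style_feats: (C, H, W).
    shape = (gen_attn - layout_attn.detach()).abs().mean()
    def pooled(feats, attn):
        # Attention-weighted appearance descriptor per token, (T, C).
        w = attn / attn.sum(dim=(1, 2), keepdim=True).clamp_min(1e-8)
        return torch.einsum("chw,thw->tc", feats, w)
    app = (pooled(gen_feats, gen_attn) - pooled(style_feats, style_attn).detach()).abs().mean()
    return w_shape * shape + w_app * app

T, C, H, W = 8, 320, 32, 32
loss = mix_and_match_loss(torch.rand(T, H, W, requires_grad=True), torch.randn(C, H, W),
                          torch.rand(T, H, W), torch.rand(T, H, W), torch.randn(C, H, W))
print(loss)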
Compositional generation
A new scene can be created by collaging individual objects from different images (the first three columns here). Alternatively, when objects cannot be combined at their original locations because the source images' layouts are incompatible (as in the bottom row), we can borrow only their appearance and specify the layout with a new image to produce the composition (last two columns).
Prompts: “a picnic blanket, a fruit tree, and a car by the lake” · “a top-down photo of a tea kettle, a bowl of fruit, and a cup of matcha” · “a dog wearing a knit sweater and a baseball cap drinking a cocktail”
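A hedged sketch of a compositional objective: a sum of per-object terms, each matching one token's attention map and pooled appearance to its own source image. The structure and names are assumptions, not the released code.

import torch

def compose_loss(gen_attn, gen_feats, sources):
    # gen_attn: (T, H, W); gen_feats: (C, H, W);
    # sources: list of (token_idx, src_attn (H, W), src_feats (C, H, W)).
    def pooled(feats, attn):
        w = attn / attn.sum().clamp_min(1e-8)
        return (feats * w).sum(dim=(1, 2))
    total = torch.zeros(())
    for tok, src_attn, src_feats in sources:
        total = total + (gen_attn[tok] - src_attn.detach()).abs().mean()
        total = total + (pooled(gen_feats, gen_attn[tok])
                         - pooled(src_feats, src_attn).detach()).abs().mean()
    return total

gen_attn = torch.rand(8, 32, 32, requires_grad=True)
gen_feats = torch.randn(320, 32, 32)
sources = [(1, torch.rand(32, 32), torch.randn(320, 32, 32)),
           (4, torch.rand(32, 32), torch.randn(320, 32, 32))]
print(compose_loss(gen_attn, gen_feats, sources))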
Manipulating non-objects
The properties of any word in the input prompt can be manipulated, not only those of nouns. Here, we show examples of relocating adjectives and verbs. The last example shows a case in which additional self-guidance can correct improper attribute binding.
“laughing” → right, in “a cat and a monkey laughing on a road”
“messy” → location (0.3, 0.6) or (0.8, 0.8), in “a messy room”
“red” → to jacket, “yellow” → to shoes, in “green hat, blue book, yellow shoes, red jacket”
Limitations
Setting high guidance weights for appearance terms tends to introduce unwanted leakage of object position. Similarly, while heavily guiding the shape of one word matches that object’s layout as expected, high guidance on all token shapes leaks appearance information. Finally, in some cases, objects are entangled in attention space, making it difficult to control them independently.
Prompts: “a squirrel trying to catch a lime mid-air” (“lime” guided) · “a picture of a cake” · “a potato sitting on a couch with a bowl of popcorn watching football”
Citation
@inproceedings{epstein2023selfguidance,
  title={Diffusion Self-Guidance for Controllable Image Generation},
  author={Epstein, Dave and Jabri, Allan and Poole, Ben and Efros, Alexei A. and Holynski, Aleksander},
  booktitle={Advances in Neural Information Processing Systems},
  year={2023}
}
Acknowledgements
We thank Oliver Wang, Jason Baldridge, Lucy Chai, and Minyoung Huh for their helpful comments. Dave is supported by the PD Soros Fellowship. Dave and Allan conducted part of this research at Google, with additional funding provided by DARPA MCS and ONR MURI.