Copyright infringement by way of ControlNet, IP-Adapter, and refined taste
Pink Floyd’s “Back Catalogue” poster has a great composition:
1990s Sailor Moon has a great aesthetic:
What if we combined the two?
We want to recreate the Pink Floyd poster with Sailor Moon characters in a retro 1990s anime style. Stable Diffusion allows us to achieve this in a controllable way.
There are three problems we must solve:
We want our final image to have six characters sitting along an edge with poses identical to those in the reference image. To achieve this we can use ControlNets.
One ControlNet (Canny) will control the edges of our generated characters and environment, and the other (Depth) will control the relative depth of generated elements (i.e. foreground, midground, background).
To begin, we preprocess our source image to produce two control images which will later be used to guide our ControlNets:
The control image produced by the Canny Edge preprocessor effectively highlights the edges of our input image:
The control image produced by the MiDaS Depth Map preprocessor creates a reasonably accurate depth map of our input image:
These two control images get fed into their associated ControlNets:
We can now be confident that our generated images will have a similar composition to our source image.
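In `diffusers`, the two ControlNets can be chained by passing both to the pipeline. A minimal sketch, assuming the SDXL base model and publicly available Canny/Depth ControlNet checkpoints; the file paths, conditioning scales, and abbreviated prompt are placeholders to fill in:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image

# Two ControlNets: one conditioned on Canny edges, one on depth maps.
# The checkpoint names are assumptions; substitute whichever SDXL ControlNets you use.
controlnets = [
    ControlNetModel.from_pretrained("diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16),
]

pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnets,
    torch_dtype=torch.float16,
).to("cuda")

# Placeholder paths for the control images produced in the preprocessing step.
canny_image = load_image("canny.png")
depth_image = load_image("depth.png")

image = pipe(
    prompt="six sailor moon characters ...",      # the full prompt comes later
    image=[canny_image, depth_image],             # one control image per ControlNet
    controlnet_conditioning_scale=[0.7, 0.5],     # per-net strengths (tune to taste)
).images[0]
```

Passing a list of ControlNets and a matching list of control images is what lets edges and depth constrain the same generation simultaneously.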
Sailor Moon is widely known in popular culture and is thus well-represented in the base SDXL model. This means we do not need to train a model to understand what we mean by “Sailor Moon characters” (although we could if we wanted to).
Instead, we will get most of the way there through prompting:
six beautiful sailor moon characters wearing frilly dainty dresses sitting side by side at the edge of a japanese wooden onsen hot pool outdoors with their legs in the water, sailor moon, sailor neptune, sailor uranus, sailor pluto, sailor jupiter, sailor mars, sailor venus, long hair, short hair, textured hair, pigtail hair, ponytail hair, buns in hair
It is not perfect, but we clearly have Sailor-Moon-like characters in our desired composition.
We now want to apply a retro anime aesthetic to our generated images. It is simple to obtain reference screenshots from retro anime, so our goal is to use these reference screenshots to guide our generated imagery. To achieve this we can use an Image Prompt Adapter (IP-Adapter).
IP-Adapter allows us to treat input images as “visual prompts” for image generation. In addition to steering our imagery towards the desired aesthetic, this approach will also make our characters more Sailor-Moon-like.
The choice of reference imagery has a big effect on the generated imagery. We can use screenshots from different eras of animation to achieve different aesthetics:
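With `diffusers`, attaching an IP-Adapter to the pipeline is a few lines. A sketch assuming the commonly used h94/IP-Adapter SDXL weights; the scale value is an assumption to tune, and `pipe`, `canny_image`, and `depth_image` are the pipeline and control images described above:

```python
# Attach an IP-Adapter so reference screenshots act as a "visual prompt".
pipe.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name="ip-adapter_sdxl.bin",
)
pipe.set_ip_adapter_scale(0.6)  # 0 = ignore the reference, 1 = follow it closely (tune)

image = pipe(
    prompt="six sailor moon characters ...",
    image=[canny_image, depth_image],
    ip_adapter_image=retro_anime_screenshot,  # a PIL image of a 1990s anime frame
).images[0]
```

Raising the scale pushes the output harder toward the reference aesthetic at the cost of prompt adherence.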
We are now ready to generate candidate images.
As we will be generating thousands of candidate images, we can use dynamic prompts to add variety to the generated images. For each generated image, one option is chosen at random from each group of alternatives.
We will vary the clothing worn by the characters, the environment they are in, and the lighting:
wearing frilly dainty dresses|wearing modest bathing suits
japanese wooden onsen hot pool outdoors|high school swimming pool indoors|japanese garden pond outdoors with sky above|wooden dock at the oceanfront
golden hour lighting|sunset lighting|sunrise lighting
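Mechanically, dynamic prompts just pick one alternative per `|`-separated group. A minimal sketch of that expansion, assuming the `{option|option}` wildcard syntax used by the popular Dynamic Prompts extension:

```python
import random
import re


def expand_dynamic_prompt(template, rng=None):
    """Replace each {a|b|c} group with one randomly chosen option."""
    rng = rng or random.Random()
    return re.sub(
        r"\{([^{}]*)\}",
        lambda m: rng.choice(m.group(1).split("|")),
        template,
    )


template = (
    "six sailor moon characters {wearing frilly dainty dresses|wearing modest bathing suits} "
    "sitting at a {japanese wooden onsen hot pool outdoors|high school swimming pool indoors"
    "|japanese garden pond outdoors with sky above|wooden dock at the oceanfront}, "
    "{golden hour lighting|sunset lighting|sunrise lighting}"
)
print(expand_dynamic_prompt(template))  # a different concrete prompt each call
```

Calling this once per generation gives 2 × 4 × 3 = 24 distinct prompt combinations from the three groups above.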
We leave this running for a few days and go digging for our favorite image:
We have a winner, but a new problem (a common one with Stable Diffusion) has emerged: the faces and the hands of our characters are deformed:
Next, we will fix that.
We are happy with most of our generated image; we only wish to replace the faces and hands. This can be achieved with inpainting (i.e. generating new imagery only in selected areas of an existing image).
To locate the faces and hands for masking, we will use Bounding Box Detectors (BBOX). These detectors automatically find each face and hand, mask out those regions, and generate new imagery only within the masked areas:
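The masking step can be sketched as: run a detector, turn its bounding boxes into a binary mask, and inpaint only inside that mask. The detector and the inpainting checkpoint are assumptions here (BBOX detection is typically done with a dedicated face/hand detector model); the box-to-mask helper is the part shown concretely:

```python
from PIL import Image, ImageDraw


def boxes_to_mask(size, boxes, padding=8):
    """Turn detector bounding boxes into a white-on-black inpainting mask.

    size:  (width, height) of the image being repaired
    boxes: iterable of (x0, y0, x1, y1) boxes, e.g. from a face/hand detector
    """
    mask = Image.new("L", size, 0)  # black = keep the original pixels
    draw = ImageDraw.Draw(mask)
    for x0, y0, x1, y1 in boxes:
        # Pad each box slightly so the inpainted region blends at its border.
        draw.rectangle((x0 - padding, y0 - padding, x1 + padding, y1 + padding), fill=255)
    return mask


def inpaint_regions(pipe, image, boxes, prompt):
    """Regenerate only the masked regions.

    `pipe` is assumed to be a diffusers inpainting pipeline (an assumption,
    e.g. one created with AutoPipelineForInpainting.from_pretrained).
    """
    mask = boxes_to_mask(image.size, boxes)
    return pipe(prompt=prompt, image=image, mask_image=mask).images[0]
```

White pixels in the mask mark where new imagery is generated; everything else is left untouched, which is why only the faces and hands change.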
Putting it all together, we get our final poster: