Japanese Citypop Landscapes via Style Transfer

Creating imagery in genres that do not exist

The Idea

I was scrolling through my Instagram feed one morning and was presented with a series of Japanese citypop images:

Beautiful images, but they are all of urban environments. What would landscape images look like in this style?

The Tools

The Process

We want to take an established painting style and create images with content that is not typically found in that style. To achieve this, we will do the following:

  1. Figure out a prompting style which recreates our reference images
  2. Train a Stable Diffusion LoRA on the Japanese citypop style
  3. Modify the content of our prompt to give us landscape scenes in our trained style

1. Prompt Design

We will start by getting as close as we can to our desired style using the base SDXL model. The base model was trained (at great expense) on a wide variety of styles and techniques, and we should take advantage of this if we can. This approach eases the demands on our LoRA training, which is then only needed to “finish the job”.

After some experimentation, we find a prompting style which works reasonably well:

an acrylic painting on canvas of a small silhouetted private jet flying over a large city from high altitude at sunset, buildings lit up, moody, dramatic, metropolis, masterpiece, flat application, minimal brushwork, flat color blocks, sharp clean lines, vibrant pastels, retro 80s aesthetic, acrylic look, no brush marks, graphic quality, stylized architecture, exaggerated shadows, pop art influence, art deco elements, clear composition, minimalistic, minimalism, simple, japanese citypop style

With the prompt figured out, we can move on to figuring out which sampler and CFG value work best.

We create a grid to investigate our options:

Based purely on aesthetics, we will go with DPM++ 2S Ancestral as the sampler with a CFG value of 7.5. The choice of an ancestral sampler (which injects fresh noise at each step, so the image keeps changing from one step to the next instead of settling) is unusual, but will work fine for our purposes.
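A grid like this is just every combination of sampler and CFG value, rendered with everything else held constant. The sketch below only enumerates the settings for each grid cell; it does not generate images. The sampler names, CFG values, and seed are hypothetical candidates chosen for illustration (the original grid's exact axes are not listed here), and holding the seed fixed is an assumption about how to keep the cells comparable.

```python
from itertools import product

# Hypothetical candidates for illustration; not the original grid's axes.
SAMPLERS = ["Euler a", "DPM++ 2M Karras", "DPM++ 2S a"]
CFG_VALUES = [5.0, 7.5, 10.0]


def build_grid(samplers, cfg_values, seed=1234):
    """Return one settings dict per (sampler, CFG) cell, all sharing one seed."""
    return [
        {"sampler": sampler, "cfg_scale": cfg, "seed": seed}
        for sampler, cfg in product(samplers, cfg_values)
    ]


grid = build_grid(SAMPLERS, CFG_VALUES)
for cell in grid:
    print(cell)
```

Each dict would then be passed to whatever image generation tool is in use, with the prompt unchanged, so that differences between cells come from the sampler and CFG value alone.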

2. Style Training

The success of LoRA training depends on the quality of the training data. Images must be high resolution and representative of what you are training for, and their captions must be highly detailed and reflect the prompting style you plan to use with the fine-tuned model.

The style we are interested in is primarily the work of Japanese artist Hiroshi Nagai. It is simple enough to gather a collection of high-resolution reference images of his work:

We then create detailed captions for each image in the following format (referring to our desired style as ‘japanese citypop style’):

an acrylic painting of a curved pool with still water, ladder entering the pool, pool chairs and a table surrounding the pool, spherical lamp in foreground with rose bush, green bushes with yellow flowers with tall tropical trees in the midground, ocean in the background with breaking waves, gradient blue sky, motel to the left with a fence in front of it, pastel colors, calm, serene, luxury, japanese citypop style
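As a rough sketch of how captions in this format could be stored, the snippet below writes each caption to a text file next to its image and appends the style tag automatically. It assumes the trainer is configured to read captions from a .txt file with the same base name as the image; the file name, the shortened description, and the write_caption helper are all hypothetical.

```python
from pathlib import Path
import tempfile

STYLE_TAG = "japanese citypop style"


def write_caption(image_path: Path, description: str, style_tag: str = STYLE_TAG) -> Path:
    """Write '<description>, <style_tag>' to a .txt file beside the image."""
    caption_path = image_path.with_suffix(".txt")
    caption_path.write_text(f"{description}, {style_tag}", encoding="utf-8")
    return caption_path


# Demonstrate in a temporary folder so no real training data is touched.
with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "pool.png"  # hypothetical image file
    image.touch()
    written = write_caption(
        image,
        "an acrylic painting of a curved pool with still water, pastel colors, calm",
    )
    caption_name = written.name
    caption_text = written.read_text(encoding="utf-8")

print(caption_name)
print(caption_text)
```

Appending the tag in code rather than by hand keeps it spelled identically across all captions, which matters because the same phrase is later used in the prompt.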

With our training data prepared, we move to Kohya’s GUI to perform the LoRA training. After considerable experimentation, the following settings are found to work best for our needs:

With 40 training images, 20 repeats, and a batch size of 1, each epoch is 800 steps (40 × 20). Over 15 epochs, saving a checkpoint at each epoch, we will train for a total of 12,000 steps, with a checkpoint saved every 800 steps. This will give us a wide cross-section of checkpoints with varying degrees of training. More training does not necessarily produce a better result; overtraining is always a risk.
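The step count above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming every image is seen once per repeat and that a partial final batch still counts as a step; the training_steps helper is hypothetical and not part of any training tool.

```python
import math


def training_steps(num_images, repeats, epochs, batch_size=1):
    """Return (steps per epoch, total steps) for a simple repeat-based schedule."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


per_epoch, total = training_steps(num_images=40, repeats=20, epochs=15, batch_size=1)
print(per_epoch, total)  # 800 12000
```

With one checkpoint per epoch, the saved checkpoints fall at steps 800, 1,600, and so on up to 12,000.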

Once we have our 15 checkpoints (one per epoch), we need to decide which one works best. To do this we generate a grid of test images using the prompting style we developed earlier:

an acrylic painting on canvas of an alaskan winter wilderness scene, frozen river in a forest with snow covered evergreens, denali in background, bright sunny day, cold, masterpiece, flat application, minimal brushwork, Flat color blocks, sharp clean lines, vibrant pastels, retro 80s aesthetic, acrylic look, no brush marks, graphic quality, stylized architecture, exaggerated shadows, pop art influence, art deco elements, clear composition, japanese citypop style

Overtraining artifacts first become visible at epoch 7. Epoch 4 produces the best results.

We can already start to see the answer to our question (what would landscapes look like in a Japanese citypop style?), but let’s explore it further now that we have arrived at our final model.

The Result

With our trained LoRA, tuned parameters, and custom prompt style we can now reliably generate images in our desired style. We can generate different landscape environments by changing only the content portion of our prompt:

an alaskan winter wilderness scene, frozen river in a forest with snow covered evergreens, denali in background, bright sunny day, cold

a vast empty desert landscape, steep cliffs, mesa, bright sunny day, hot

a thick jungle in africa, hot
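One way to keep the style portion fixed while swapping the content is to build each prompt from three pieces. In the sketch below, the prefix and style suffix are copied from the test prompt used during checkpoint selection; the build_prompt helper and the variable names are hypothetical.

```python
PREFIX = "an acrylic painting on canvas of"

# Style portion copied verbatim from the checkpoint test prompt.
STYLE_SUFFIX = (
    "masterpiece, flat application, minimal brushwork, Flat color blocks, "
    "sharp clean lines, vibrant pastels, retro 80s aesthetic, acrylic look, "
    "no brush marks, graphic quality, stylized architecture, exaggerated shadows, "
    "pop art influence, art deco elements, clear composition, japanese citypop style"
)

CONTENTS = [
    "an alaskan winter wilderness scene, frozen river in a forest with snow "
    "covered evergreens, denali in background, bright sunny day, cold",
    "a vast empty desert landscape, steep cliffs, mesa, bright sunny day, hot",
    "a thick jungle in africa, hot",
]


def build_prompt(content):
    """Wrap a content description in the fixed prefix and style suffix."""
    return f"{PREFIX} {content}, {STYLE_SUFFIX}"


prompts = [build_prompt(content) for content in CONTENTS]
for prompt in prompts:
    print(prompt)
    print()
```

Only the entries in CONTENTS change between images, so any difference in the output can be attributed to the subject matter and not to drift in the style keywords.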

BONUS: Validation

It turns out that Hiroshi Nagai did in fact paint landscapes, so we can verify our work:

Our results seem to be a reasonable approximation!