Something I’ve noticed is that the resolution you give to Stable Diffusion doesn’t just change the output image size; it can also change the character of what comes back, and how much it changes depends on the type of content you’re generating. I want to start by revisiting a past prompt, regenerating it from scratch with text-to-image at a larger resolution than before.
A High-Level City Comparison
realistic photo of a city, real lighting, subtle lens flares, awe inspiring, majestic, gritty whimsical city skyline of post apocalypse made of bright sugar sweet candy, high quality, good production value, epic landscape, photography
Here are 512×512 images.
These are 1024×512 images.
Both sets are consistent, good-looking images. But it’s easy to notice that the wider, larger images have a greater sense of scale: they contain more object detail, and the camera is pulled farther back and higher off the ground.
This is something I also noticed with my generated Robot Chaos images: I needed to generate large images to hint to Stable Diffusion that I wanted a grander scale than what it was initially giving me. And intuitively, we can imagine why when we think about the kinds of images that were scraped from the internet and put into the training data.
I can also imagine that the sense of scale is tied to Stable Diffusion’s noising/denoising process, but I don’t know for sure.
But the conclusion is that both the aspect ratio and the resolution matter, not only for these images but in general. For some types of content (portraits, close-ups, etc.), a higher resolution may simply produce a bigger version of the same image, but chances are it will affect the composition’s scale and the level of detail generated.
From Urban Skylines To Favelas
Let’s continue probing around, making it do things and making observations. To be a little more rigorous, let’s do some science: we’ll lock down the seed and all other parameters (our constants) except for the resolution (our single variable) and see how the output varies. Here is the .yaml with everything except the resolution (a scripted sketch of the same sweep follows below):
batch_size: 1
cfg_scale: 7.5
ddim_eta: 0.0
ddim_steps: 30
model_name: Stable Diffusion v1.4
n_iter: 1
normalize_prompt_weights: true
prompt: realistic photo of a city, real lighting, subtle lens flares, awe inspiring, majestic, gritty whimsical city skyline of post apocalypse made of bright sugar sweet candy, high quality, good production value, epic landscape, photography, favela
sampler_name: k_euler
seed: 0
target: txt2img
toggles:
- 1
- 2
- 3
- 5
And besides an urban area and skylines, let’s add another term that will really test how it fits things together, stressing its ability to compose coherent but noisy, dense elements in an image: favelas.
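If you’d rather script this sweep than drag the WebUI’s sliders, here is a minimal sketch of the same idea using Hugging Face’s diffusers library. To be clear, this is not what I actually used for the post: the model repo ID, the list of resolutions, and EulerDiscreteScheduler standing in for the k_euler sampler are all my approximations of the config above.

```python
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

# Stable Diffusion v1.4, matching model_name in the config (assumed repo ID).
MODEL_ID = "CompVis/stable-diffusion-v1-4"

pipe = StableDiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)  # roughly k_euler
pipe = pipe.to("cuda")

prompt = (
    "realistic photo of a city, real lighting, subtle lens flares, awe inspiring, "
    "majestic, gritty whimsical city skyline of post apocalypse made of bright sugar "
    "sweet candy, high quality, good production value, epic landscape, photography, favela"
)

# Constants: seed 0, 30 steps, cfg 7.5. The single variable: resolution.
for width, height in [(256, 256), (320, 320), (512, 512), (640, 640), (768, 768), (1024, 1024)]:
    generator = torch.Generator(device="cuda").manual_seed(0)
    image = pipe(
        prompt,
        width=width,
        height=height,
        num_inference_steps=30,
        guidance_scale=7.5,
        generator=generator,
    ).images[0]
    image.save(f"candy_favela_{width}x{height}.png")
```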
The Smaller Resolutions
For the very smallest settings that the WebUI’s sliders will allow me to set, we see that it struggles to keep contrast levels under control. The results are also pretty nonsensical, and any kind of landscape you may see in them is more your imagination than its output.
At about 320×320, things start to take shape as a zoomed-in photo focusing on a building, with a skyline in the background.
It should also be noted that keeping the same seed does not mean the content will be similar (in any way) when the resolution is changed. It just means that when a seed is reused, with all other parameters kept the same as a previous run (including the model binary), the exact same image can be deterministically regenerated.
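One way to see why: the seed only fixes the random number stream, and the starting latent noise is drawn at a shape that depends on the requested resolution, so a different resolution means a completely different starting tensor. Here is a rough sketch of that draw, assuming Stable Diffusion’s usual 8× latent downsampling and 4 latent channels:

```python
import torch

def initial_latent(width, height, seed=0):
    # The VAE downsamples by 8x and the latent has 4 channels, so the seeded
    # starting noise has shape (1, 4, height/8, width/8).
    generator = torch.Generator().manual_seed(seed)
    return torch.randn((1, 4, height // 8, width // 8), generator=generator)

a = initial_latent(512, 512)   # shape (1, 4, 64, 64)
b = initial_latent(1024, 512)  # shape (1, 4, 64, 128): same seed, entirely different noise
print(a.shape, b.shape)
```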
Now that the resolutions are large enough to produce coherent images, let’s check out the mid-range resolutions.
The Medium Resolutions
Note that the previous images don’t really have much to do with favelas. Only when generating (at least) medium-sized images do we start to see homes stacking together.
Starting at 640×640, the basic composition from here on out will be:
- some sky at the top
- a flat, horizontal skyline
- a dense favela filling the bottom two-thirds
Before this point, more interesting angles were shown. I’m not here to declare what’s right or wrong, just that different patterns of behavior emerge at different ranges of resolution.
The High Resolutions
And as we keep raising the resolution, things basically stay the same, but instead of scaling up something similar to the previous images, it just adds more of the same to increase the density of content. At a certain point, it’s just an endless, pre-generated mix of cookie-cutter favela imagery.
Depending on what camera angle, narrative, and level of detail you want, at a certain point there are diminishing returns to bumping up the resolution. I used a square resolution for this study, but at these sizes the results are basically all the same, even with different aspect ratios.
Notes And Conclusion
As mentioned before, your mileage on these observations may vary with different prompts and subject matter, but they seem to hold up in general. I’ve also tried this with different seeds, and while the mid-range images differ in interesting ways, they all eventually converge to the same composition and mush at higher resolutions.
I’ve heard from various sources that 512×512 is the best size to generate at because that’s what Stable Diffusion was trained on. But in my experience, I simply have not seen issues with other resolutions. Yes, they produce something different, but nothing I would consider inferior to what it generates at 512×512.
So here are my takeaways:
- The seed, along with all other parameters, can regenerate an image, but the seed alone does not preserve the concept.
- Smaller generated images will probably not simply be thumbnails of larger generated images (with the same prompt).
- Any resolution (except tiny ones) can produce usable results but may generate different types of concepts.
- Test different resolutions for different types of generated detail and scale.
- Image resolution and aspect ratios should be used as hints to Stable Diffusion.
- Bigger resolutions do not always mean more detailed and interesting concepts; maybe initially, but there can be diminishing returns.
- When considering the resolution and aspect ratio of an image you have in mind, think about what other, similar images on the internet use.