So I’ve been playing around with Stable Diffusion recently. In the past, I’ve followed miscellaneous Two Minute Papers videos, here and there, that have shown the progression of these technologies over the years. But I’ve never really taken the time to get hands-on with what’s out there. Then a cheeky Bob Ross parody by Mental Outlaw covering Stable Diffusion piqued my interest, especially with all the recent buzz around Stable Diffusion, DALL-E 2, and Midjourney.

It’s astonishing how low the barrier to entry is for running Stable Diffusion locally. Not only is it free and open source, but it also has multiple web-UI front-ends – where it runs a local server, and then you interface with the system and modify parameters through the web browser.
Although I constantly find myself wishing all the features of the two front-ends were simply merged into a single omni-package.
It’s as simple as installing updated GPU drivers, making sure you have Miniconda installed, pulling the repo from Git, signing up to the Hugging Face website, and double-clicking a *.bat file (or running a *.sh file for you Linux folks). A handful of steps, but no great legal, monetary, or technical barriers.
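For reference, you don’t even strictly need the web UI – the Hugging Face diffusers library gets you a bare-bones text-to-image loop in a handful of lines of Python. This is just a sketch of that alternative route; the model ID, precision, and output file name are my own assumptions, not part of the web-UI setup above.

```python
# Minimal text-to-image sketch using Hugging Face diffusers (alternative to the web UI).
# Assumes a CUDA GPU with enough VRAM and that the model license was accepted on Hugging Face.
import torch
from diffusers import StableDiffusionPipeline

# Model ID is an assumption -- any Stable Diffusion v1.x checkpoint works similarly.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,  # half precision to fit in less VRAM
)
pipe = pipe.to("cuda")

prompt = "gritty whimsical city skyline of post apocalypse made of bright candy"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("candy_city.png")
```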

I didn’t have enough GPU memory with my previous GTX 1060 3GB that I got half a lifetime ago for gaming. So I got a GeForce RTX 3060 with 12 GB. It put a small hole in my wallet, but at least I wasn’t purchasing it 1 or 2 years ago when GPU prices were insane. Plus, with any luck, may this GPU last me another half a lifetime.
When purchasing this, I thought it was pretty nice how GPU prices were back down. But then I remembered I’ve lost many times the GPU’s purchase price to the devaluation of Bitcoin. So, in the big picture of things, as always, my feelings are a mixed bag.

The Current Public State of the Art

There’s a lot of discussion and notes about “what makes the perfect text prompt.” And there’s a lot of stunning AI-generated imagery on the information super-highweb – most of which I arguably won’t be able to compete with. But it’s been a fun journey, and I’ve found some interesting results.

And with these text-to-image prompts, make no mistake, it is possible to crank out absolutely unusable garbage. Getting good images can be very hit-or-miss. It’s also dependent on the text prompt, which is something of a black art. An unspoken truth is that what we see displayed on the internet (regarding AI-generated images) is the best-filtered version of what’s generated. It’s a numbers game. But given the rate and ability with which these systems can create content, the numbers are on their side.

And we should recognize there will be limitations on its creativity – some built into the system, some inherited from the data it was trained on, and some inherent to the method itself. But its limitations usually matter very little – because it’s very useful and very impressive.


Generating Sketched Cities

So we’ll start with the prompt, and I was in the mood to go HAM on eccentric concepts.

gritty whimsical city skyline of post apocalypse made of bright candy

And we get this, erm… piece of work.

So, I told Stable Diffusion it did a nice job, gave it a pat on the head, and stuck its work on the fridge with a magnet. E for effort! (In the Latin alphabet, that’s one grade below a D)

They say more details in the text prompt help. And a strategy I’ve seen others often employ is to add artist names. And commas, I see a lot of commas being used.

gritty whimsical city skyline of post apocalypse made of bright sugar sweet candy, high quality artwork, good production value, stjepan sejic, ashley wood, concept art, epic landscape

I ran 5 batches of 5 images – which created a nice preview grid.
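For the curious, here’s roughly what that batch-and-grid run could look like outside the web UI, again sketched with the diffusers library. The model ID, parameter values, and the PIL grid helper are my own assumptions – the web UI builds the preview grid for you.

```python
# Sketch: 5 batches of 5 images, pasted into a 5x5 preview grid with PIL.
# Model ID and the grid helper are assumptions; the web UI does this grid for you.
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = ("gritty whimsical city skyline of post apocalypse made of bright sugar sweet candy, "
          "high quality artwork, good production value, stjepan sejic, ashley wood, "
          "concept art, epic landscape")

rows, cols, size = 5, 5, 512  # 5 batches of 5 images at the default 512x512
images = []
for _ in range(rows):
    # One batch of `cols` images per call.
    images.extend(pipe(prompt, num_images_per_prompt=cols).images)

# Paste the 25 images into a single preview grid.
grid = Image.new("RGB", (cols * size, rows * size))
for i, img in enumerate(images):
    grid.paste(img, ((i % cols) * size, (i // cols) * size))
grid.save("candy_city_grid.png")
```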

Well, nothing in the generated images is explicitly made of candy, but there are definitely some great colors and compositions going on. The linear perspective isn’t perfect, the edges consist of sketch lines, and sometimes there are random things going on – and one of them seems to have what’s reminiscent of nooses swinging from above (two across, three down). These over-saturated watercolor aesthetics, though, they’re really adding a lot ❤️.

As a side note, the sketchiness and perspective wonkiness are not necessarily issues, especially from a (general) art and concept-art perspective. It primes the brain to focus on high-level concepts and can be a good aesthetic. I’ve even heard architects are trained to use squiggly lines in their renderings when they need to convey to their customers that ideas are still in flux.

If you look at very specific things and analyze them, they don’t quite look amazing. But if you just take in the aesthetics as a whole, they feel pretty decent – subjectively, one might even say it feels like professional-level work.

The aesthetics are probably hurt by the fact that I left the image generation output resolution at the default 512×512. I haven’t started toying with resolution sizes yet.

Randomly Cranking Independent Variations

The Stable Diffusion WebUI comes with a few tools. The flagship is undoubtedly the Text-to-Image tool. But there’s also an Image-to-Image tool. It’s similar to Text-to-Image in that it takes a text prompt and spits out an image, but it also takes in a starting image and has controls to mask which parts the algorithm is allowed to modify.

Through a clumsy accident, I figured out you don’t even need a text prompt for Image-to-Image. If none is provided, it essentially figures out the concepts and contents of the image and creates a similar reinterpretation. This may not sound like a big deal, but for creating many variations on a concept image with a well-solidified idea, the implications for automating the exploration of an idea space are crazy.

You’ll still need to figure out good Sampling Steps and Denoising Strength parameters, but that’s all the work needed. If no prompt is given, then the CFG scale (how strictly the image generation algorithm should follow the prompt) doesn’t seem to do anything, which makes sense. For the image that has a female silhouette on the right side, these are some of the promptless variations I got. Keep in mind I can create several dozen a minute, and they’re all decent.
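Sketched with the diffusers library, that promptless variation loop might look something like the snippet below. The file names and parameter values are placeholders, with `strength` standing in for the web UI’s Denoising Strength and `num_inference_steps` for its Sampling Steps; the CFG scale is moot with an empty prompt.

```python
# Sketch: promptless image-to-image variations with diffusers.
# Model ID, file names, and parameter values are placeholders, not the post's exact settings.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init = Image.open("silhouette_concept.png").convert("RGB")

for i in range(10):
    out = pipe(
        prompt="",               # no prompt: just reinterpret the input image
        image=init,
        strength=0.6,            # ~Denoising Strength in the web UI
        num_inference_steps=50,  # ~Sampling Steps in the web UI
    )
    out.images[0].save(f"variation_{i:02d}.png")
```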

Depending on the image resolution and other generation settings, the generation time varies. These were relatively cheap to iterate.

Same style and basic idea, mostly the same placement of graphical elements. The silhouette has been toyed with, and in these cases, the choices don’t make sense as a concrete idea. But again, it’s dirt cheap and is good for exploring the idea space – to stir up the creative juices of the viewer. Think of it like staring at clouds and getting ideas, only you get to author the ballpark of the type of clouds you’re looking at.

Style Exploration and Curation

It’s been years since the emergence of deep-learning style transfer. And it’s now to the point where we can just prompt a style with text instead of providing a sample image.
Although that’s not strictly true – the people at Stability AI and others trained the model on our behalf. But from the accessibility standpoint of putting easy-to-use technology in the hands of the masses, same-same.

Since the images were very sketchy, I took a few of the previously generated images and used a prompt asking Image-to-Image to make them realistic. The output is similar to a style transfer (or arguably, since I’m aiming for realism, an un-style transfer?). Elements are still being shifted, although the ideas and image composition are mostly maintained.

realistic photo of a city, real lighting, subtle lens flares, awe inspiring, majestic

I also threw in some lens flares to give it that JJ Abrams POW! that he’s always bringing to his movies.
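Mechanically this is the same Image-to-Image call as in the earlier sketch, just with a prompt, so the CFG scale matters again. A rough diffusers sketch, with assumed file names and parameter values:

```python
# Sketch: "un-style transfer" -- re-render a sketchy concept image as a realistic photo.
# Model ID, file names, and parameter values are assumptions, not the post's exact settings.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

out = pipe(
    prompt="realistic photo of a city, real lighting, subtle lens flares, "
           "awe inspiring, majestic",
    image=Image.open("sketchy_city.png").convert("RGB"),
    strength=0.55,           # keep the composition, change the rendering
    guidance_scale=7.5,      # CFG matters again now that there's a prompt
    num_inference_steps=50,
)
out.images[0].save("realistic_city.png")
```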

It does a good job placing the camera at ground level and applying perspective. The composition of buildings on the left and right to visually frame the teal building in the center is also a consistent and well-executed theme.

I like this use-case in that it’s repurposed noise from the bottom into something that can create a narrative. The first two also give me sci-fi concept art vibes.

For the first generated image, maybe a horizontal sky beam, or super powerful beings fighting in the sky so fast that it’s just a blur (could be this Goku SS vs Saitama battle I keep hearing about on YouTube). And people stop on the shoulder of the road to see what’s going on; you know, those overachieving rubberneckers. And for some reason, on the right side, they have a miniature Eiffel tower replica.
Or, maybe the city planning was so shoddy that at a certain time of day, architectural death rays actually lens through several buildings across the skyline. Those poor office workers inside…

The second image seems like one of those clichés where a huge alien mothership is just above a city. And its scale is so large that its lights cut through the atmosphere and leave a distribution of bright dots across the sky. Or maybe the planet of another civilization is crashing into Earth, and those are the city nightlights of their continent.

In the original, we can see how the blue building, in a sea of red, draws the viewer’s attention and is a resting focal point of the image. In the generated images, the AI has been keen on maintaining that compositional decision.

Claimed When Generated

There’s one more thing that I think needs pointing out, which is AI-generated ownership claims.

Above, there are two images that were previously shown, and another image from the same set on the far right. That last one is not usable enough to show (one of those that gets thrown out in that numbers game). But can we just take a moment to point out what look like signatures (in their top-right corners) and a commercial stock-photo watermark?

The signatures are interesting but not an issue. It does tickle my brain a bit, though. There was artwork in the training data, and almost all artwork is signed. Hence the model sometimes has the gut reaction to add something resembling an artist’s signature when an artsy “pocket” of it is being tapped. That’s logical enough.

Watermarks, on the other hand… The entire purpose of a watermark is to make it well known that an image, in its current form, is not to be used or distributed – except for exhibition purposes by the owner. Especially the kind that repeat in an array across an entire image. This also tickles my brain, but in a more alarming way. I can’t say anything on this one way or another, but we programmers have this term called a “smell.” And this smells… um… smelly.