Detailed Modern CG Robot Chaos, Stable Diffusion

Table of Contents

The Images
Issues & Technicals
- Too Many Tokens
- Image Resolution
The Selection Process
Misc

The Images

I have quite a bit to say, but I won’t dawdle getting to the eye candy – this is the prompt word salad. The strategy was to give it a text wall that shows no mercy!

real robot fight, pacific rim, epic science fiction, two ornate greebled giant robot, high tech, choreographed battle, futuristic,ornate design, clock escapement, segmented armor, cinematic, akira, scott Robertson, feng zhu, aeon flux, very detailed, dooms day, hell on earth, panic in the streets, massive destruction, action, excitement, cinematic, tense epic large scale, storyboard, property damage, intense candy color grading, fast speed motion lines, agile robots, violence, melee combat, hope is lost, towering scale, action everywhere, space opera, heroic poses, dramatic poses, contrapposto poses, motion blur, art station, saturated color grading, overwhelming scenery, epic scene, death destruction, end of civilization, last stand, mecha design, mech warrior, zone of enders, evangelion, escaflowne, dramatic lighting, realistic lighting, beautiful intense landscape, star wars, capcom, luis royo, penthon, blur studios, MCU, Skipio, Miette, cool fight poses, large weapons, energy beams, futuristic sci fi alien city in background

You can click the images to open them at full resolution in a new tab. To really get into the mindset, imagine these are the opening scene to a movie trailer, and make BWAAAAM mouth noises each time you open an image.

Well, I asked for a lot, I asked for giant robots, and I asked for intense overwhelming chaos – and I definitely got all of that. To make these images proper, I’ll probably need to touch them up or ask Stable Diffusion to retry certain masked parts. But WOW, it’s an understatement to say that it knows how to make a scene where a lot is going on.

Issues & Technicals

AI-generated images tend to include body horror elements (body parts drawn at the wrong orientation and scale, in places they shouldn’t be), and robots are no exception. With robots, though, the discombobulation comes across as confusing rather than repulsive, which is how it reads in images of living things.

A lot of images also have absolutely amazing portions to them, but as a whole, they either have general hiccups, or other portions of the image will have glaring non-starter issues.

Case in point: a propelled explosive ordnance either went for the middle robot’s crotch with extreme prejudice, or that robot is shooting out an explosive blast. Regardless of the blast direction, in or out, I can’t be publishing that as a finalist. There are other issues, but they’re mostly forgivable (which is a common theme across all the generated images) – but, like, how does that peen blast not completely take this image off the board for consideration, when by every other criterion it’s a valid contender?

In a loophole way, the only reason it’s on this blog post is to say how it shouldn’t have made it on this blog post.

Too Many Tokens

I also found out I wasn’t paying close attention to the warnings the web UI was giving me. There’s a limit on how long the prompt can be, and I was WAY over the limit.

I looked it up, and a Reddit post says the limit is 77 tokens (roughly words, though a token can also be a piece of a word or a punctuation mark). I don’t know the technical specifics, but it makes sense – I imagine the prompt eventually needs to be converted into a tensor of a constant size. More words would require a larger array, so a constant array size imposes a max length cap. I believe this comes from the decision to use frozen CLIP weights as the text-to-visual-concepts component.

The actual part of the prompt that was used was

real robot fight, pacific rim, epic science fiction, two ornate greebled giant robot, high tech, choreographed battle, futuristic,ornate design, clock escapement, segmented armor, cinematic, akira, scott Robertson, feng zhu, aeon flux, very detailed, dooms day, hell on earth, panic in the streets, massive destruction, action

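Although this doesn’t quite add up at first glance, because even if commas count as tokens, I count only 66. The likely explanation is that tokens aren’t whole words: CLIP’s tokenizer splits less common words (“greebled” and “escapement” are good candidates) into several pieces, and two of the 77 slots are taken by start and end markers. Regardless, I maxed out the prompt limit, and the rest of my magnificent manic-pixie-dream-prompt was ignored.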
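To make the cutoff concrete, here is a minimal sketch of how a fixed token budget chops a prompt in two. It is not the web UI's code. The `rough_tokenize` function is a hypothetical stand-in that only splits on words and punctuation, so its counts will come out lower than the real CLIP tokenizer, which also breaks uncommon words into several pieces. The 77-slot context with two slots reserved for start and end markers is the only part taken from CLIP itself, and `split_at_limit` is a made-up helper name.

```python
import re

CONTEXT_LENGTH = 77   # CLIP text encoder context size, in tokens
SPECIAL_TOKENS = 2    # start-of-text and end-of-text markers
USABLE_TOKENS = CONTEXT_LENGTH - SPECIAL_TOKENS


def rough_tokenize(prompt):
    """Stand-in tokenizer (an assumption, NOT CLIP's BPE tokenizer).

    Splits the prompt into runs of letters/digits and single punctuation
    marks, so each word and each comma counts as one token.
    """
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", prompt)


def split_at_limit(prompt, tokenize=rough_tokenize, budget=USABLE_TOKENS):
    """Return (kept, dropped): tokens inside the budget and tokens past it."""
    tokens = tokenize(prompt)
    return tokens[:budget], tokens[budget:]


if __name__ == "__main__":
    prompt = "real robot fight, pacific rim, epic science fiction"
    # A tiny budget is used here only so the cutoff is easy to see.
    kept, dropped = split_at_limit(prompt, budget=4)
    print("kept:   ", kept)
    print("dropped:", dropped)
```

Passing a real tokenizer in place of `rough_tokenize` would give the true count. The point of the sketch is only that everything past the budget never reaches the model.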

Image Resolution

I’m also finding out how important resolution and aspect ratio are for hinting to the system what type of imagery to generate. Which makes sense: for stuff like portraits, it’s probably drawing from a large well of portrait selfies on the internet, and landscape scenery on the web tends to be wider than it is tall.
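As a rough illustration of picking a resolution for a given aspect ratio, here is a small sketch that keeps the total pixel count near 512×512 and snaps each side to a multiple of 64. The helper name `dimensions_for_aspect`, the 512×512 pixel budget, and the 64-pixel step are all assumptions for the example, not settings from this post.

```python
import math


def dimensions_for_aspect(aspect_w, aspect_h, pixel_budget=512 * 512, step=64):
    """Return (width, height) near pixel_budget for the requested aspect ratio.

    Each side is rounded to the nearest multiple of `step` (assumed 64 here),
    and never allowed to drop below one step.
    """
    ratio = aspect_w / aspect_h
    height = math.sqrt(pixel_budget / ratio)
    width = height * ratio

    def snap(value):
        return max(step, int(round(value / step)) * step)

    return snap(width), snap(height)


if __name__ == "__main__":
    for name, (w, h) in {"square": (1, 1),
                         "widescreen": (16, 9),
                         "portrait": (2, 3)}.items():
        print(name, dimensions_for_aspect(w, h))
```

Because of the snapping, the result only approximates the requested ratio, which seems like a fair trade for staying on sizes the model handles well.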

I have more notes on this, but they’re going to be collected for a future post.

The Selection Process

In the end, I had about 130 of these images generated. How much effort did that take? Not much. After around the 10th, I went to bed and checked the results in the morning. How long did they take to generate? Don’t know, don’t care, too busy sleeping.

I did the majority of these in one large batch, which also produced a preview grid of all the images. But it’s a 44 MB PNG file, and I decided to pass on uploading that to WordPress and embedding it in the webpage for the ~~victims~~ viewers to download.

Some were good but simply didn’t make the cut, since my goal was to whittle the selection pool down to 10. I first narrowed down which images were contenders and moved them into their own folder, and then did another pass to arrive at the final 10.

There’s a lot of talk about how this technology has the potential to automate away concept artists. I don’t know if I agree with that, though it’s definitely an undeniable opportunity to augment them. Anyway, I wanted to judge the images on the same factors as if they were being selected as concept art.

First, they need to have good aesthetics and look cool – because, in the end, rule of cool. Also because concept art is used to facilitate the production of movies and video games, which are visual entertainment products. Key word is entertainment.
Second, it needs to have or imply some kind of narrative. What’s the story that the image is giving us? Similar to music, the message doesn’t have to be explicit, but it should be clear and consistent on what kind of tone and emotion it’s trying to evoke.
Third, it should help facilitate generating understandable ideas that can be executed. Hence the term “concept” art. In order for entertainment products to have cool ideas, they need to be thought up in some kind of real form. These images are real, not just some unexplainable thing in a dude’s head. The portrayal doesn’t need to be perfectly clear – there can be room for a viewer to read between the lines – details don’t have to be incredibly sharp, it doesn’t need to look photographic – but it should be clear enough.
And fourth, if we imagine these as some project’s pre-production artifacts, it should enable communication amongst the production team and help unify the team to a cohesive shared vision. If an image was given to the team, would the concept and image readability be clear enough to facilitate collaboration?

And obviously, these factors aren’t mutually exclusive. If they were put on a Venn diagram, there would be a huge overlap between them.

These factors are not list items from any official authority on concept art – just my general sense of the subject in list form, and what was going through my mind while I was making decisions.

Misc

An honorable mention. It used a slightly different prompt, but I think it’s neat.

Now all I need is to figure out how to make these images more coherent, perhaps through the prompt and image generation settings, or through a touch-up pass.

Can we also just take a moment to appreciate how robots (the Stable Diffusion algorithm) know how to design, assemble and render terrifying robots so well? When our robot overlords enslave humanity, it will be through robots that they designed themselves. And then they will discombobulate us and other organic life out of sick morbid interest.