AI Image Generation Explained: From Diffusion Models to Stable Diffusion

Introduction

AI image generation has been blowing up lately. From Midjourney to Stable Diffusion, the quality of generated images has reached a point where you can’t help but be amazed. But I’ve always been curious: what’s actually going on under the hood? How can an AI “paint” a picture from nothing? I spent some time digging into diffusion models, UNet, CLIP, and the other core components, and the whole tech stack turned out to be more elegant than I expected. This post is my attempt to lay it all out, from the underlying principles to practical concerns like style control and image guidance.

Diffusion Models: The Ink-in-Water Story

The foundational technique behind AI image generation is the diffusion model. The core idea can be explained with a simple ink metaphor¹.

Imagine a glass of clear water, representing a sharp, original image.

Step one: keep dripping ink into the water. One drop, two drops, three drops… the water gets murkier, and the image gets blurrier. After enough drops, you’re left with pure ink, and the image has become a screen full of noise. This process is called forward diffusion (adding noise).

Step two: reverse it, and extract the ink from the water. Pull it out little by little. The noise on the screen fades away, the blurry image gradually sharpens, and eventually the ink is gone, and a brand-new clear image is born. This is called reverse diffusion (denoising).
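To make the “adding ink” step concrete, here is a minimal sketch of forward diffusion in plain Python. It assumes the linear noise schedule from the DDPM paper (1,000 steps, with the per-step noise amount rising from 0.0001 to 0.02); the four-value “image” and the function names are made up for illustration.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after each step (linear schedule)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta  # each step keeps (1 - beta) of the signal variance
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(ab_t) * x_0 + sqrt(1 - ab_t) * eps."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [signal_scale * p + noise_scale * e for p, e in zip(pixels, noise)]
    return noisy, noise

rng = random.Random(0)
alpha_bars = make_alpha_bars()
image = [0.8, -0.2, 0.5, 1.0]  # hypothetical 4-"pixel" image scaled to [-1, 1]

slightly_noisy, _ = add_noise(image, 10, alpha_bars, rng)      # a few drops of ink
almost_pure_noise, _ = add_noise(image, 999, alpha_bars, rng)  # pure ink
```

Early steps leave the image almost untouched, while by the last step nearly none of the original signal survives. That is the “glass of pure ink” end state.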

The core logic of diffusion models boils down to these two steps:

Training phase: Ink is added to real images in known amounts, and the model learns to predict exactly what noise was added at a given step. Over many examples, it builds up a knowledge of noise patterns.
Generation phase: The reverse operation. Starting from a pure noise image, the model pulls the “ink” out step by step until a brand-new image emerges.
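The generation phase can be sketched as a loop, too. This is a toy under heavy assumptions: the “dataset” is a single one-pixel image with value `TARGET`, so the ideal noise prediction has a closed form and stands in for the trained network. The update rule is the DDPM sampling step; the names `TARGET`, `predict_noise`, and `sample` are mine.

```python
import math
import random

NUM_STEPS = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (NUM_STEPS - 1) for t in range(NUM_STEPS)]
alphas = [1.0 - b for b in betas]
alpha_bars, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

TARGET = 0.7  # hypothetical: the only "image" (one pixel) the model has ever seen

def predict_noise(x_t, t):
    """Stand-in for the trained network: exact when every training image equals TARGET."""
    return (x_t - math.sqrt(alpha_bars[t]) * TARGET) / math.sqrt(1.0 - alpha_bars[t])

def sample(rng):
    x = rng.gauss(0.0, 1.0)  # start from pure noise
    for t in reversed(range(NUM_STEPS)):
        eps = predict_noise(x, t)
        # Pull out the predicted "ink" for this step (DDPM posterior mean).
        x = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps) / math.sqrt(alphas[t])
        if t > 0:
            # Every step except the last re-injects a little fresh noise.
            x += math.sqrt(betas[t]) * rng.gauss(0.0, 1.0)
    return x

generated = sample(random.Random(0))
```

Whatever noise the loop starts from, it ends at `TARGET`, because that is the only image this toy “model” knows. A real model is trained on many images by minimizing the error between predicted and actual noise, so different starting noise leads to different pictures.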

So the AI isn’t painting from a blank canvas. It’s “recovering” a painting from a mess of noise. This idea originally comes from non-equilibrium thermodynamics in physics, and was later introduced to generative modeling by Sohl-Dickstein and colleagues².

UNet and CLIP: The Lead Engineer and the Translator

Now that we understand the core idea of diffusion, let’s meet two key players.

UNet: The Lead Engineer on the Assembly Line

UNet is a U-shaped image processing network. In a diffusion model, its job is to predict the noise in the current image, in other words “how much ink to extract,” at each step[^3].

It works like an assembly line: the encoder compresses the image and captures its key features, while the decoder gradually restores a clear image. Skip connections run between matching encoder and decoder stages, acting as memory channels that ensure fine detail isn’t lost during compression and restoration.

Without UNet, the diffusion model wouldn’t know how much to denoise at each step, and the whole generation process would grind to a halt.
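Here is a deliberately tiny sketch of the encoder, decoder, and skip-connection idea. It is not a real UNet: there are no learned convolutions, the “image” is a 1-D list of even length, and all function names are made up. It only shows what information each path carries.

```python
def downsample(features):
    """Encoder step: halve the resolution by averaging neighbouring pairs."""
    return [(features[i] + features[i + 1]) / 2 for i in range(0, len(features), 2)]

def upsample(features):
    """Decoder step: double the resolution by repeating each value."""
    return [value for value in features for _ in range(2)]

def toy_unet_pass(signal):
    skip = list(signal)              # encoder features saved before compression
    bottleneck = downsample(signal)  # compressed summary; fine detail is gone
    restored = upsample(bottleneck)  # back to full resolution, but blurry
    # Skip connection: the decoder sees the blurry context AND the saved detail
    # side by side (a real UNet concatenates feature maps like this).
    decoder_input = [(coarse, fine) for coarse, fine in zip(restored, skip)]
    return bottleneck, decoder_input

bottleneck, decoder_input = toy_unet_pass([1.0, 3.0, 5.0, 9.0])
```

The bottleneck alone cannot tell 1.0 and 3.0 apart (both became 2.0). The skip path is what hands that detail back to the decoder.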

CLIP: The Translator That Turns Human Language into Machine Language

UNet can do the heavy lifting, but it only understands numbers, not text like “a dog wearing a hat.”

That’s where CLIP comes in. Trained on massive amounts of image-text pairs, CLIP aligns text and image meanings into the same “language”[^4]. After encoding, the text “a dog wearing a hat” ends up as a feature vector that’s very close to the encoding of an actual photo of a hat-wearing dog.
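“Very close” here usually means high cosine similarity between vectors. The sketch below uses made-up 4-dimensional embeddings just to show the comparison; real CLIP embeddings have hundreds of dimensions and come from the trained text and image encoders.

```python
import math

def cosine_similarity(a, b):
    """1.0 means same direction, 0.0 means unrelated (orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration only.
text_dog_hat = [0.9, 0.8, 0.1, 0.0]      # "a dog wearing a hat"
image_dog_hat = [0.8, 0.9, 0.2, 0.1]     # photo of a hat-wearing dog
image_city_night = [0.0, 0.1, 0.9, 0.8]  # photo of a city at night

match = cosine_similarity(text_dog_hat, image_dog_hat)
mismatch = cosine_similarity(text_dog_hat, image_city_night)
```

With these toy numbers the matching pair scores far higher than the mismatched one, which is the property CLIP’s training is designed to produce.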

CLIP’s text encoder translates your prompt into feature vectors that UNet can work with, and UNet uses those features, passed in through its attention layers, to guide denoising. These two components work together to make “draw what I say” possible.
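In Stable Diffusion, the text features reach UNet through cross-attention: each image region asks “which prompt tokens matter to me?” and blends the token features accordingly. Below is a minimal single-query sketch in plain Python. The 2-dimensional keys, values, and query are invented; in a real model they come from learned projections.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, token_keys, token_values):
    """One image region (query) attends over all text tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in token_keys]
    weights = softmax(scores)
    dim = len(token_values[0])
    mixed = [sum(w * value[i] for w, value in zip(weights, token_values))
             for i in range(dim)]
    return mixed, weights

# Hypothetical features for two prompt tokens: "dog" and "hat".
token_keys = [[4.0, 0.0], [0.0, 4.0]]
token_values = [[1.0, 0.0], [0.0, 1.0]]
hat_region_query = [0.0, 2.0]  # an image region that "looks for" hat-like text

mixed, weights = cross_attention(hat_region_query, token_keys, token_values)
```

The weights always sum to 1, and here almost all of the weight lands on the “hat” token, so that region’s denoising is steered mainly by the hat part of the prompt.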

Stable Diffusion’s Brilliant Optimization: Latent Space Diffusion

Early diffusion models had a fatal flaw: performing noise addition and removal directly in pixel space required enormous computation and was painfully slow. A high-resolution image can easily have millions of pixels, and processing all that data at every step meant waiting a long time for a single generated image.

Stable Diffusion’s solution is remarkably clever[^5]:

First, train a variational autoencoder (VAE) that compresses high-resolution images into small, low-dimensional latent features. Think of it as condensing a bucket of raw material into a small vial of concentrate.
All diffusion processes (adding and removing noise) happen in this tiny “concentrate” space.
The “image” that UNet processes becomes many times smaller, drastically reducing computation.
In the final step, the VAE’s decoder “decompresses” the processed concentrate back into a high-resolution image.
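A quick back-of-the-envelope calculation shows how much smaller the “concentrate” is. The numbers assume the Stable Diffusion v1 setup, where the VAE shrinks each side by a factor of 8 and the latent has 4 channels; the helper names are mine.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (channels, height, width) of the latent for a given image size."""
    return (latent_channels, height // downsample, width // downsample)

def element_count(shape):
    count = 1
    for dim in shape:
        count *= dim
    return count

pixel_shape = (3, 512, 512)          # RGB image
latent = latent_shape(512, 512)      # (4, 64, 64)
ratio = element_count(pixel_shape) / element_count(latent)  # 48.0
```

Under these assumptions, UNet handles 48 times fewer values at every denoising step than it would in pixel space.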

The result: image generation speed increases dramatically. This is a big part of why Stable Diffusion has become the dominant open-source solution for AI image generation.

Here’s a table that ties the whole text-to-image tech stack together:

Component	Analogy	Function
Diffusion Model	Core production line	The “mess it up, then restore it” learning logic
UNet	Smart robot on the assembly line	Purifying the concentrate step by step
CLIP	Translator	Turning user prompts into codes the machine understands
Attention Mechanism	Walkie-talkie	Letting the translator tell the robot what to focus on at each step
Latent Space Diffusion	Compressing raw materials into concentrate before processing	Saving time and effort

Style Specification: Are Keywords Enough?

With the principles covered, let’s look at a practical concern.

Almost every AI image generation model supports adding style keywords to prompts: cinematic, cyberpunk, anime, ink wash, oil painting, and so on. Some products even turn common styles into dropdown menus so users can pick directly.

The problem is: even when you specify a style, different models interpret the same keyword very differently. Write “cyberpunk” and one model generates something manga-inspired, while another leans toward photorealism. Plus, details you describe in the prompt (facial features, expressions, clothing, poses) can interfere with the final style.

If your use case demands a consistent art style across images (brand assets, financial product marketing images, etc.), relying on keywords alone won’t cut it.

Fine-Tuning: Locking the Model into a Specific Style

The solution is to fine-tune the base model, anchoring it to a particular style.

On AIGC content platforms like LibLib AI, you can find plenty of openly shared fine-tuned models (typically LoRA). Search for a specific style (cyberpunk, for instance) and you’ll find a range of models fine-tuned on top of different base models (FLUX.1, SDXL, etc.).
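For context on why LoRA files are so small and easy to share: instead of retraining a full weight matrix W, LoRA learns two thin matrices B and A and uses W + scale × (B·A). Here is a rough sketch in plain Python. The layer size (768 × 768) and rank (8) are illustrative assumptions, not taken from any specific model.

```python
def lora_param_counts(d_out, d_in, rank):
    """Parameters in the full matrix vs. in the two low-rank LoRA factors."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora

def apply_lora(weight, b_matrix, a_matrix, scale=1.0):
    """Return weight + scale * (B @ A); B is d_out x r, A is r x d_in."""
    rank = len(a_matrix)
    return [
        [
            weight[i][j] + scale * sum(b_matrix[i][k] * a_matrix[k][j] for k in range(rank))
            for j in range(len(weight[0]))
        ]
        for i in range(len(weight))
    ]

full, lora = lora_param_counts(d_out=768, d_in=768, rank=8)  # hypothetical layer

# Tiny rank-1 example: a 2x2 base weight nudged by B (2x1) times A (1x2).
base = [[0.0, 0.0], [0.0, 0.0]]
tuned = apply_lora(base, b_matrix=[[1.0], [2.0]], a_matrix=[[3.0, 4.0]])
```

For this hypothetical layer, the LoRA factors hold 12,288 numbers versus 589,824 for the full matrix. The base model stays untouched, which is why the same base can be paired with many different style LoRAs.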

A few tips when using them:

Check the version info and the author’s notes on the model page.
Include the fine-tuned model’s “trigger” keyword in your prompt to ensure it generates in the fine-tuned style.
If you’re not confident in publicly available models, you can prepare your own training dataset with images that match your exact style requirements, and fine-tune a model yourself.
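Forgetting the trigger keyword is an easy mistake, so it can help to build prompts through a small helper. This is only a sketch: `build_prompt` and the trigger word `cybrpnk_style` are hypothetical, and the real trigger word is whatever the model page specifies.

```python
def build_prompt(subject, trigger_word, style_keywords=()):
    """Put the LoRA trigger word first, then the subject, then optional style keywords."""
    parts = [trigger_word, subject, *style_keywords]
    return ", ".join(part.strip() for part in parts if part and part.strip())

def has_trigger(prompt, trigger_word):
    """Cheap sanity check before sending a prompt to the model."""
    return trigger_word.lower() in prompt.lower()

prompt = build_prompt(
    "a dog wearing a hat",
    trigger_word="cybrpnk_style",  # hypothetical; copy the real one from the model page
    style_keywords=["neon lighting", "rainy street"],
)
```

Whether the trigger word works best at the start of the prompt depends on the model, so treat the ordering here as a default rather than a rule.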

With a properly fine-tuned model, the consistency of generated results is generally reliable.

The Limits of Style Transfer

Style transfer is another common use case: turning a photo into a cartoon-style avatar, or changing the art style of an already-generated image.

The approach is straightforward: provide the original image along with a style transfer prompt to a multimodal image generation model.

But be realistic about its limitations:

Results are decent for portrait-style images.
Complex, large-scale scenes with many elements don’t transfer well. The more complex the composition, the harder it is to preserve fine details.
Obvious artifacts and detail breakdowns are common.

Style transfer is best treated as an assistive tool, not a reliable production pipeline. I’d recommend testing it across different art styles, composition complexities, and subject types to understand where it breaks before committing it to any production workflow.

Image Control: The Core Challenge from a Product Perspective

Three Pain Points for Users

Pain point one: the prompt barrier is too high. When users describe what they want in natural language (“cute anime girl, big eyes, twin tails, smiling, soft lighting”), the results are often disappointing. Current visual generation models simply don’t understand and execute natural language with enough precision. Users end up having to adapt to the model, stuffing prompts with keyword after keyword. There are also hidden tricks most users don’t know about, like the fact that “4K quality” is more likely to be recognized by models than “high-definition quality.”

Pain point two: the vicious cycle. Users keep piling on keywords, but the output still feels like opening a mystery box. They end up repeatedly “pulling the lever” and hoping for a better result. The experience is frustrating.

Pain point three: some things can’t be described in text at all. You see a landscape photo with an incredible atmosphere and want to generate a portrait in that style, but nebulous qualities like “atmosphere” and “vibe” are almost impossible to put into words. In design-heavy scenarios like financial product marketing images, this is even more true. Users need reference images; pure text generation simply won’t work.

The Challenge for Product Thinking

All three pain points point to the same challenge: how do we make the AI image generation process more controllable?

Specifically, product managers need to think about:

How to lower the barrier to writing high-quality prompts? (Smart prompt suggestions, style preset templates, etc.)
How to help users turn “a feeling they can’t quite describe” into effective generation instructions? (Image-to-image, style reference uploads, etc.)
How to increase controllability without sacrificing creative freedom?

Image control is the area that product managers working on generation tools must focus on. Whoever solves this problem well will have a serious edge in the AI image generation product space.

AI image generation has evolved from “looks like something” to “looks good.” The next frontier is “looks exactly how I want it.” And in that direction, there are still plenty of opportunities.

References

1. The foundational diffusion model paper was published by Ho et al. in 2020, titled “Denoising Diffusion Probabilistic Models,” which established the theoretical basis for modern AI image generation.
2. Sohl-Dickstein et al.’s 2015 paper “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” first introduced nonequilibrium thermodynamics to generative modeling.
