From the course: Open Source Development with the Stable Diffusion 3 API

How Stable Diffusion works

- Image generation is part of a family of products called generative AI, which learn by being fed massive amounts of data in order to achieve specific goals. But how are these models even trained to do something like generate images in multiple styles? The key to generative AI is to train the models to recognize patterns in large amounts of data, and then learn how to recreate those patterns to achieve certain results. For example, you can feed a model billions of faces and first ask it to learn whether a picture you're showing it has a face. Once a model understands the pixels that make up a face, you can eventually ask it to either generate or critique pictures of new faces based on what it knows. This is how a tool like This Person Does Not Exist works.

But how do we get original image generation? These models use a couple of techniques. One of them is called generative adversarial networks, or GANs. With this technique, one version of the model generates images from its pattern training and another critiques the generated images. Another thing you can do is add reinforcement learning, which is the key to improving the quality of these outputs. Now, this feedback can come from adversarial models, but also from humans, since they're the ultimate audience for the images the tools generate. You'll often see that tools like Midjourney use ranking pages to help the model learn when humans like certain styles more than others. The technical name for this is RLHF, or reinforcement learning from human feedback. This process can be used to train and retrain models over time to get better results.

But how do they align images to prompts? Models use a process called diffusion to help them deliver images that respond to prompts. And although this is what Stable Diffusion is clearly named after, it's just one of the approaches that the model uses to generate images. Diffusion is the process of learning how to clarify a noisy image. Computers are trained on how adding noise to an image changes it, and then the model learns to reverse the process and turn noise back into images. It aligns the noise to an image based on what the person wrote in the prompt. Think about it this way: if you've ever been out on a foggy day, you know that your brain sometimes can't tell what's on the other side of the fog. As the fog dissipates, you start seeing shapes that your brain can interpret as objects. Models are doing the same thing with noise, but in reverse.

Now, this prompt-to-image alignment was pioneered by OpenAI in a process called CLIP, which stands for contrastive language-image pre-training. The clarifying steps are assigned a score that tells the model how likely it is that the image is following a specific prompt. Stable Diffusion uses another approach called conditional flow matching. Flow matching simplifies the math by mapping the diffusion transformation onto simpler, more direct paths between noise and images. This allows the model to avoid solving more complex differential equations, making the transformation faster and simpler.

So what's so special about this particular version of Stable Diffusion? The key updates have been the ability to create images with improved photorealism; better prompt adherence, so that your prompt instructions are better aligned to the final image; better rendering of hands and faces, which are often a problem when a computer generates images; and more accurate text rendering when Stable Diffusion is instructed to add text to an image. Now, it's better if I show you instead of just telling you.
So let's take a look at these in the next video.
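To make the GAN idea above concrete, here is a minimal sketch of the generator-versus-critic training loop. It assumes PyTorch and uses random tensors in place of a real face dataset; the network sizes and learning rates are arbitrary illustrations, not anything Stable Diffusion actually uses.

```python
# Toy GAN training loop (a sketch of the technique, not Stable Diffusion's architecture).
import torch
import torch.nn as nn

IMG_DIM, NOISE_DIM = 64 * 64, 128

generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 256), nn.ReLU(),
    nn.Linear(256, IMG_DIM), nn.Tanh(),      # outputs a flattened fake "image"
)
discriminator = nn.Sequential(
    nn.Linear(IMG_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),                       # real-vs-fake logit
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    real = torch.rand(32, IMG_DIM) * 2 - 1   # placeholder for a batch of real images
    fake = generator(torch.randn(32, NOISE_DIM))

    # The critic learns to tell real images from generated ones.
    d_loss = bce(discriminator(real), torch.ones(32, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # The generator learns to produce images the critic scores as real.
    g_loss = bce(discriminator(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```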
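The diffusion loop itself can be sketched the same way: add noise to a clean image according to a schedule, then have a model predict that noise so it can be removed step by step. The denoiser below is a stub, and the DDPM-style schedule is only an illustration of the idea, not SD3's actual formulation.

```python
# Conceptual diffusion sketch: corrupt an image with noise, then recover it one step at a time.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # how much noise each step adds
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def add_noise(x0, t):
    """Forward process: blend the clean image with Gaussian noise at step t."""
    noise = torch.randn_like(x0)
    xt = alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * noise
    return xt, noise

def noise_predictor(xt, t, prompt_embedding):
    """Stub for the trained network that estimates the noise in xt,
    conditioned on the text prompt. A real model returns its prediction."""
    return torch.zeros_like(xt)

# One reverse step: remove the predicted noise contribution from xt.
x0 = torch.rand(1, 3, 64, 64)                # stand-in for a clean training image
xt, _ = add_noise(x0, t=500)
eps = noise_predictor(xt, 500, prompt_embedding=None)
x_prev = (xt - betas[500] / (1 - alpha_bars[500]).sqrt() * eps) / alphas[500].sqrt()
```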
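For the CLIP-style scoring described above, the Hugging Face transformers library exposes OpenAI's released CLIP checkpoints. This sketch assumes that library is installed and uses blank placeholder images where you would pass real generated candidates.

```python
# Scoring how well candidate images match a prompt with a CLIP checkpoint.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Blank placeholder images; in practice these would be generated candidates.
images = [Image.new("RGB", (224, 224), "white"), Image.new("RGB", (224, 224), "black")]
prompt = "a photorealistic portrait of an astronaut"

inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher scores mean an image aligns more closely with the prompt.
print(outputs.logits_per_image.squeeze())
```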
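And here is the core of the conditional flow matching objective mentioned above: with straight-line paths between noise and data, the regression target is simply the difference between the two endpoints, so no differential-equation solver is needed during training. This is a toy sketch of the general technique, not SD3's implementation.

```python
# Minimal conditional flow-matching training step on toy data.
import torch
import torch.nn as nn

# Toy velocity network: takes a point on the path plus the time t.
model = nn.Sequential(nn.Linear(16 + 1, 64), nn.SiLU(), nn.Linear(64, 16))

x1 = torch.randn(8, 16)                  # stand-in for real data samples
x0 = torch.randn(8, 16)                  # pure noise
t = torch.rand(8, 1)                     # random positions along the path

xt = (1 - t) * x0 + t * x1               # straight-line path from noise to data
target_velocity = x1 - x0                # direction the model should learn to predict

pred_velocity = model(torch.cat([xt, t], dim=1))
loss = nn.functional.mse_loss(pred_velocity, target_velocity)
loss.backward()
```

At sampling time, you would start from pure noise and repeatedly step along the predicted velocity until you arrive at an image.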
