CS180: Project 5A - Fun with Diffusion Models!

Adnan Aman

Project Overview

In this project, we explore the capabilities of diffusion models through various implementations including sampling loops, inpainting, visual anagrams, and hybrid image generation. Using the DeepFloyd IF model, we demonstrate advanced image manipulation techniques and creative applications.

Part 0: Setup and Initial Generation

Using the DeepFloyd IF model with different prompts and numbers of inference steps.

Random seed used: 180. Comparing generation quality between 20 and 30 inference steps.

Man Wearing Hat - Inference Step Comparison

Prompt: "a man wearing a hat" (20 inference steps)
Same prompt with 30 inference steps - Notice improved detail quality

Snowy Mountain Village - Inference Step Comparison

Prompt: "an oil painting of a snowy mountain village" (20 inference steps)
Same prompt with 30 inference steps - Notice improved detail quality

Rocket Ship - Inference Step Comparison

Prompt: "a rocket ship" (20 inference steps)
Same prompt with 30 inference steps - Notice improved detail quality

Part 1.1: Forward Process

Implementation of the forward process adding noise to images at different timesteps.

Original Campanile
Noise Level t=250
Noise Level t=500
Noise Level t=750
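The forward process computes x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps with eps ~ N(0, I), where abar_t is the cumulative product of the alphas. A minimal NumPy sketch (the schedule below is a toy stand-in, not DeepFloyd's actual schedule):

```python
import numpy as np

def forward(x0, t, alphas_cumprod, rng):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,
    with eps ~ N(0, I) and abar_t = alphas_cumprod[t]."""
    abar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps

# toy schedule: abar decays from ~1 (nearly clean) to ~0 (pure noise)
alphas_cumprod = np.linspace(0.9999, 0.0001, 1000)
```

As t grows, abar_t shrinks and the sample is dominated by the noise term, matching the t=250/500/750 images above.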

Part 1.2: Classical Denoising

Using Gaussian blur filtering for denoising at different noise levels.

Original Noisy (t=250)
Gaussian Denoised (t=250)
Original Noisy (t=500)
Gaussian Denoised (t=500)
Original Noisy (t=750)
Gaussian Denoised (t=750)
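The classical baseline simply low-pass filters the noisy image. A self-contained NumPy sketch of a separable Gaussian blur (in practice a library routine such as a Gaussian filter would be used; this is illustrative):

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, truncated at 3 sigma and normalized."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_denoise(img, sigma=2.0):
    """Separable Gaussian blur: convolve each row, then each column."""
    k = gaussian_kernel(sigma)
    out = np.apply_along_axis(np.convolve, 1, img, k, mode="same")
    out = np.apply_along_axis(np.convolve, 0, out, k, mode="same")
    return out
```

Blurring suppresses the high-frequency noise but also destroys image detail, which is why the results above remain unsatisfying at high noise levels.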

Part 1.3: One-Step Denoising

Single-step denoising using UNet predictions at different noise levels.

Original Image
Noisy Image (t=250)
One-Step Denoised (t=250)
Original Image
Noisy Image (t=500)
One-Step Denoised (t=500)
Original Image
Noisy Image (t=750)
One-Step Denoised (t=750)
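Given the UNet's noise estimate eps_hat, the forward equation can be inverted in a single step. A sketch of that algebra (eps_hat would come from the pretrained UNet; here it is just an argument):

```python
import numpy as np

def one_step_denoise(xt, eps_hat, abar_t):
    """Invert the forward process in a single step using the predicted
    noise: x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)."""
    return (xt - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)
```

With a perfect noise estimate this recovers x_0 exactly; with the network's imperfect estimate at large t, the one-step result is blurry, as seen above.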

Part 1.4: Iterative Denoising

Progressive denoising showing all steps and comparison with other methods.

Denoising Steps

Step 10
Step 15
Step 20
Step 25
Step 30

256x256 Resolution Comparison

Original (256x256)
Noisy (256x256)
Iterative Denoised (256x256)
One-Step Denoised (256x256)
Gaussian Denoised (256x256)
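Iterative denoising walks a strided list of timesteps, at each stride estimating x_0 and blending it with the current noisy image. A sketch of the deterministic part of one update (the added variance term is omitted for clarity):

```python
import numpy as np

def iterative_denoise_step(xt, eps_hat, abar_t, abar_tp):
    """One strided update t -> t' (t' < t): estimate x0 from the
    predicted noise, then interpolate toward it."""
    x0_hat = (xt - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)
    alpha = abar_t / abar_tp           # effective per-stride alpha
    beta = 1.0 - alpha
    return (np.sqrt(abar_tp) * beta / (1.0 - abar_t)) * x0_hat \
         + (np.sqrt(alpha) * (1.0 - abar_tp) / (1.0 - abar_t)) * xt
```

Repeating this from high t down to t = 0 gives the progressively cleaner steps shown above, outperforming both one-step and Gaussian denoising.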

Part 1.5: Diffusion Model Sampling

Generating images from scratch using the DeepFloyd model.

Generated Sample 1
Generated Sample 2
Generated Sample 3
Generated Sample 4
Generated Sample 5

Part 1.6: Classifier-Free Guidance

Image generation with a CFG scale of γ = 7.

CFG Generated Image 1
CFG Generated Image 2
CFG Generated Image 3
CFG Generated Image 4
CFG Generated Image 5
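Classifier-free guidance combines an unconditional and a conditional noise estimate, extrapolating past the conditional one. The update is a one-liner:

```python
import numpy as np

def cfg_noise_estimate(eps_uncond, eps_cond, gamma=7.0):
    """Classifier-free guidance:
    eps = eps_uncond + gamma * (eps_cond - eps_uncond).
    gamma = 0 is unconditional, gamma = 1 is plain conditional,
    gamma > 1 pushes further in the direction of the condition."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```

With γ = 7 the samples adhere much more strongly to the prompt, at the cost of some diversity.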

Part 1.7: Image-to-image Translation

We use the SDEdit algorithm to project images back onto the natural image manifold at different noise levels, with the prompt "a high quality photo" for the base projections.
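The "noise level" labels below are starting indices into the strided timestep list: SDEdit noises the input up to that timestep and then resumes the ordinary iterative denoising loop. A sketch of the starting point:

```python
import numpy as np

def sdedit_start(x_orig, i_start, timesteps, alphas_cumprod, rng):
    """SDEdit: noise the input to timesteps[i_start] with the forward
    process, then resume iterative denoising from that index. A larger
    i_start (less noise) keeps the result closer to the input; a smaller
    one lets the model stray further toward the prompt."""
    t = timesteps[i_start]
    abar = alphas_cumprod[t]
    eps = rng.standard_normal(x_orig.shape)
    return np.sqrt(abar) * x_orig + np.sqrt(1.0 - abar) * eps
```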

Test Image (Campanile)

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Original Campanile

Web Image (Dog)

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Original Image (Web - Dog)

Web Image (Man)

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Original Image (Web - Man)

Part 1.7.1: Editing Hand-Drawn and Web Images

Projecting non-realistic images onto the natural image manifold using different noise levels.

Web Image (Optimus Prime)

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Original Web Image (Optimus Prime)

Hand-Drawn Image 1 (Smiley Face)

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Original Hand Drawing (Smiley)

Hand-Drawn Image 2 (Optimus Prime)

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Original Hand Drawing (Optimus)

Part 1.7.2: Inpainting

Using the RePaint algorithm to fill in masked regions of images while preserving the surrounding context.
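The core of the inpainting loop: after every denoising step, pixels outside the mask are forced back to the original image, re-noised to the current timestep so they stay consistent with the rest. A sketch:

```python
import numpy as np

def repaint_step(xt, x_orig, mask, t, alphas_cumprod, rng):
    """Constrain a denoising step to the mask (mask = 1 inside the
    region being filled): x_t <- m * x_t + (1 - m) * forward(x_orig, t)."""
    abar = alphas_cumprod[t]
    eps = rng.standard_normal(x_orig.shape)
    x_orig_t = np.sqrt(abar) * x_orig + np.sqrt(1.0 - abar) * eps
    return mask * xt + (1.0 - mask) * x_orig_t
```

Only the masked region is synthesized; everything else is pinned to the source image at every step.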

Test Image Inpainting (Campanile)

Original Campanile
Inpainting Mask
Inpainted Result

Golden Gate Bridge Inpainting

Original Golden Gate Bridge
Inpainting Mask
Inpainted Result

Cow Inpainting

Original Cow Image
Inpainting Mask
Inpainted Result

Part 1.7.3: Text-Conditional Image-to-image Translation

Using text prompts to guide the image projection process.

Campanile to Rocket Ship

Prompt: "a rocket ship"

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Original Campanile

Woman to Man

Prompt: "a photo of a man"

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Original Woman Image

Forest to Snowy Mountain Village

Prompt: "an oil painting of a snowy mountain village"

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Original Forest Image

Part 1.8: Visual Anagrams

Creating optical illusions that change appearance when flipped upside down using paired prompts.
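Each denoising step averages two noise estimates: one for the first prompt on the image as-is, and one for the second prompt on the vertically flipped image (flipped back before averaging). A sketch, with the prompt-conditioned denoisers passed in as plain functions:

```python
import numpy as np

def anagram_noise_estimate(xt, eps_fn_a, eps_fn_b):
    """Visual anagram step:
    eps = (eps_a(x) + flip(eps_b(flip(x)))) / 2,
    so the image denoises toward prompt A upright and prompt B flipped."""
    eps_a = eps_fn_a(xt)
    eps_b = np.flipud(eps_fn_b(np.flipud(xt)))
    return 0.5 * (eps_a + eps_b)
```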

Oil Painting Old Man / Campfire

Normal Orientation ("an oil painting of an old man")
Flipped ("an oil painting of people around a campfire")

Lithograph Waterfall / Skull

Normal Orientation ("a lithograph of waterfalls")
Flipped ("a lithograph of a skull")

Mountain Village / Amalfi Coast

Normal Orientation ("an oil painting of a snowy mountain village")
Flipped ("a photo of the amalfi coast")

Part 1.10: Hybrid Images

Creating images that appear different at varying distances using frequency separation.
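Here the two noise estimates are combined in frequency space rather than by flipping: the low frequencies come from one prompt's estimate and the high frequencies from the other's. A sketch, with the low-pass filter passed in as a function:

```python
import numpy as np

def hybrid_noise_estimate(xt, eps_fn_low, eps_fn_high, lowpass):
    """Hybrid-image step:
    eps = lowpass(eps_1) + (eps_2 - lowpass(eps_2)),
    so prompt 1 dominates at a distance and prompt 2 up close."""
    eps1 = eps_fn_low(xt)
    eps2 = eps_fn_high(xt)
    return lowpass(eps1) + (eps2 - lowpass(eps2))
```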

Skull and Waterfall

Low Frequency: "a lithograph of a skull" (visible from far)
High Frequency: "a lithograph of waterfalls" (visible up close)

Mountain Village and Amalfi Coast

Low Frequency: "an oil painting of a snowy mountain village"
High Frequency: "a photo of the amalfi coast"

Rocket Ship and Amalfi Coast

Low Frequency: "a rocket ship"
High Frequency: "a photo of the amalfi coast"

Conclusion

Through this project, we've explored various applications of diffusion models, including sampling loops, image-to-image translation with SDEdit, inpainting, text-conditioned editing, visual anagrams, and hybrid image generation.

The results demonstrate the versatility and power of diffusion models in various creative applications and image manipulation tasks.

Part B: Training Your Own Diffusion Model

Following our exploration of DeepFloyd IF, we now implement our own diffusion models through three progressive stages: unconditioned, time-conditioned, and class-conditioned UNet architectures.

Part 1: Single-Step Denoising U-Net

We begin with a simple UNet architecture that performs one-step denoising.
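The denoiser is trained to undo Gaussian corruption in one shot: corrupt z = x + σ·ε and regress the clean image under an L2 loss. A sketch of the objective (the real training loop optimizes a UNet over MNIST batches; the denoiser here is just a function argument):

```python
import numpy as np

def denoising_loss(denoiser, x_clean, sigma, rng):
    """Single-step denoiser objective: corrupt z = x + sigma * eps with
    eps ~ N(0, I) and regress the clean image, L = ||D(z) - x||^2."""
    eps = rng.standard_normal(x_clean.shape)
    z = x_clean + sigma * eps
    return np.mean((denoiser(z) - x_clean) ** 2)
```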

Visualization of different noise levels on MNIST digits
Training loss curve for unconditioned UNet

Training Results

Results after first epoch
Results after fifth epoch

Out-of-Distribution Testing

Performance on different noise levels (σ = 0.0 to 1.0)

Part 2: Time-Conditioned UNet

We enhance our model by adding time conditioning, allowing the network to handle different noise levels more effectively. This is achieved by injecting timestep information through FCBlocks at key points in the UNet architecture.
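A minimal sketch of the conditioning mechanism, with hypothetical weight shapes: an FCBlock (Linear → GELU → Linear) maps the normalized timestep t/T to a per-channel vector that scales a feature map inside the UNet.

```python
import numpy as np

def fc_block(t_norm, w1, b1, w2, b2):
    """Tiny FCBlock sketch: Linear -> GELU -> Linear, mapping the scalar
    t/T to a per-channel conditioning vector. w1, b1: (D,); w2: (D, C)."""
    h = t_norm * w1 + b1                                      # (D,)
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                 * (h + 0.044715 * h ** 3)))  # tanh-GELU
    return h @ w2 + b2                                        # (C,)

def apply_time_conditioning(features, t_embed):
    """Scale each channel of a [C, H, W] feature map by the embedding."""
    return features * t_embed[:, None, None]
```

In the actual model this modulation is applied at the unflatten and first up-sampling blocks, letting one network denoise at every noise level.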

Training algorithm for time-conditioned UNet
Sampling algorithm for time-conditioned UNet
Training loss curve for time-conditioned UNet

Sampling Results

Generated samples after 5 epochs
Generated samples after 20 epochs

Part 2.5: Class-Conditioned UNet with Classifier-Free Guidance

The final UNet adds class conditioning with classifier-free guidance (γ = 5.0), allowing us to generate specific digits with improved quality. For each digit (0-9), we generate four different instances to demonstrate the model's ability to produce clear digits resembling the MNIST dataset.
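During training, the class label enters as a one-hot vector that is dropped (zeroed) with some probability, so the same network also learns the unconditional model that classifier-free guidance needs at sampling time (eps = eps_uncond + γ·(eps_cond - eps_uncond)). A sketch of the conditioning vector:

```python
import numpy as np

def class_vector(label, num_classes, p_uncond, rng):
    """One-hot class conditioning with dropout: with probability p_uncond
    the vector is zeroed, training the unconditional branch used by CFG."""
    c = np.zeros(num_classes)
    c[label] = 1.0
    if rng.random() < p_uncond:
        c[:] = 0.0
    return c
```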

Training loss curve for class-conditioned UNet

Class-Conditioned Sampling Results

Each row shows four different generations of the same digit (0-9), demonstrating both consistency in digit identity and variation in style.

Epoch 5: Four samples of each digit (0-9).
Each row represents a digit, each column is a different sample.
Epoch 20: Four samples of each digit (0-9).
Note the improved clarity and consistency compared to epoch 5.

Conclusion

Through this project, we've explored the evolution of diffusion models from basic denoising to conditional generation: a single-step denoising UNet, a time-conditioned UNet trained and sampled with the DDPM procedure, and finally a class-conditioned UNet with classifier-free guidance.