Project Overview
Links: Original Paper | NVIDIA Code

Exploring DDGAN’s capability for fast generation and discovering semantic directions in latent space for controlled facial attribute manipulation on CelebA-HQ.
This project explores Denoising Diffusion GANs (DDGAN), a hybrid approach that combines the sample quality of diffusion models with the speed of GANs. While standard diffusion models require 1000+ denoising steps, DDGAN achieves comparable quality in just 2-4 steps, enabling near-real-time generation.
I extended the original work by building a complete latent space editing pipeline for facial attributes, training on the CelebA-HQ dataset and discovering semantic directions for controllable image manipulation.
The Generative Learning Trilemma
For generative models to see widespread real-world use, they should ideally meet three key requirements:
- High-quality sampling — The outputs must look and sound natural, especially for direct user-facing tasks (e.g., speech clarity, photorealistic images).
- Mode coverage and sample diversity — Successful models capture the full variety of training data, including rare examples, thus avoiding bias and missing important cases.
- Fast, inexpensive sampling — Real-time and interactive applications need speedy, low-compute generation; this also reduces the environmental impact of running large neural networks.
Most existing methods excel in only one or two aspects, making trade-offs between them. Capturing all three is difficult—this is known as the generative learning trilemma. Failing on diversity leads to biased, brittle models; failing on speed limits real-time use.

Illustration: The generative learning trilemma—quality, diversity, and speed.
What is DDGAN?
Denoising Diffusion GAN (DDGAN)
DDGAN is designed as a solution to the trilemma. It produces images that are visually realistic (high-quality), covers diverse modes of the data (diversity), and does so with ultra-fast generation (speed). By combining the stable denoising processes of diffusion models with adversarial GAN training, DDGAN delivers high-fidelity results in just a few steps, instead of thousands—unlocking potential for real-time editing and deployment.
How DDGAN Works
The Core Concept:
DDGAN solves a fundamental challenge: how to generate high-quality, diverse images quickly. Traditional diffusion models produce great results but are extremely slow (requiring 1000+ steps). GANs are fast but can lack diversity. DDGAN combines both approaches.
The Process:
1. Forward Diffusion: Start with a clean image and gradually add noise over several timesteps (in this project, either 2 or 4 steps)
2. Conditional Generation: The generator takes a noisy image and tries to denoise it, conditioned on the current timestep and a random latent vector
3. Adversarial Training: A discriminator network distinguishes between real and generated denoising transitions, forcing the generator to learn realistic outputs
4. Reverse Sampling: To create new images, start from pure noise and apply the generator repeatedly to denoise step-by-step
Key Innovation: Adversarial training lets the model handle the complex, multimodal denoising distribution in just 2-4 steps instead of 1000+, yielding a roughly 600× speedup.
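To make the reverse sampling concrete, here is a minimal PyTorch-style sketch. The interfaces `generator(x_t, t, z)` and `q_posterior_sample(...)` are illustrative assumptions; the official implementation organizes the same idea (predict the clean image, then sample the posterior for the previous step) with its own function names.

```python
import torch

@torch.no_grad()
def ddgan_sample(generator, q_posterior_sample, num_steps=4,
                 shape=(1, 3, 128, 128), z_dim=128, device="cuda"):
    """Minimal sketch of DDGAN reverse sampling (interfaces are illustrative).

    generator(x_t, t, z)            -> predicted clean image x_0
    q_posterior_sample(x0, x_t, t)  -> sample of x_{t-1} given the prediction
    """
    x_t = torch.randn(shape, device=device)                 # start from pure noise
    for t in reversed(range(num_steps)):                     # only 2 or 4 steps total
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        z = torch.randn(shape[0], z_dim, device=device)      # fresh latent per step
        x0_pred = generator(x_t, t_batch, z)                 # denoise in one shot
        x_t = q_posterior_sample(x0_pred, x_t, t_batch) if t > 0 else x0_pred
    return x_t
```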
The training process: Generator learns to denoise images while Discriminator ensures realistic transitions
What I Accomplished
Phase 1: Training the Face Generator
- Trained DDGAN on 5,000 celebrity faces at 128×128 resolution
- Ran training for 1,200 epochs over 48 hours
- Used 3 NVIDIA A30 GPUs on ASU’s SOL Supercomputer
- Generated 30,000 high-quality synthetic faces
- Achieved generation speed of 0.025 seconds per image
- Quality metric (FID): 17.16
Phase 2: Building the Attribute Classifier
- Created a ResNet-18 neural network to recognize 40 facial attributes
- Trained on the full CelebA dataset (202,599 images)
- Attributes include: age, smile, beard, glasses, hair color, gender, etc.
- Achieved over 90% accuracy on most attributes
- This classifier became the foundation for controllable editing
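As a rough illustration (not the exact training script), a 40-attribute multi-label classifier like this one can be set up in a few lines of PyTorch; the learning rate and the ImageNet-pretrained backbone are assumptions here.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_ATTRS = 40  # CelebA attribute count

# ResNet-18 backbone with a 40-way multi-label head
classifier = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
classifier.fc = nn.Linear(classifier.fc.in_features, NUM_ATTRS)

# One independent binary decision per attribute
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)

def train_step(images, labels):
    """images: (B, 3, H, W) tensor; labels: (B, 40) tensor with 0/1 entries."""
    logits = classifier(images)
    loss = criterion(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```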
Phase 3: Discovering Latent Directions
- Generated 30,000 images and saved their latent codes
- Used the classifier to label each generated image
- Applied logistic regression to find “direction vectors” for each attribute
- Each direction represents how to manipulate one facial feature
- Created 40 different editing directions (one per attribute)
Phase 4: Interactive Image Editing
- Built a system to edit faces by moving along discovered directions
- Can control edit strength with a “gain” parameter
- Successfully manipulated: age, smile, beard, glasses, and more
- Documented limitations and entanglement issues
Technical Setup
Frameworks: PyTorch and torchvision, with multi-GPU distributed training
Metrics: FID (Fréchet Inception Distance), Inception Score
Hardware: 3× NVIDIA A30 GPUs (24GB each)
Supercomputer: ASU SOL
Training Time: ~48 hours per configuration
Dataset: CelebA-HQ (5K subset for generation, 202K full for classifier)
Results: Generated Faces
Synthetic Celebrity Faces
30,000 high-fidelity celebrity faces generated with DDGAN
Quality Characteristics:
- Photorealistic facial features and skin textures
- Diverse ages, genders, and ethnicities
- Variety in hairstyles, expressions, and accessories
- Minimal visual artifacts or distortions
- Fast generation: 0.025 seconds per image
Metrics:
- FID Score: 17.16 (indicates good quality for 128×128 resolution)
- Speed: 600× faster than standard diffusion models
- Diversity: High variance across all 40 facial attributes
The Latent Editing System
How It Works
My editing framework follows four steps:
Step 1: Generate Images with Latent Codes
- Created 30,000 synthetic faces
- Saved the random latent vector used to generate each image
- Each latent code is a 128-dimensional vector that encodes the entire face
Step 2: Label Images with Attributes
- Used the ResNet-18 classifier to analyze each generated image
- Classified presence/absence of 40 attributes per image
- Example: Image #5432 has “Young: Yes, Smiling: Yes, Beard: No, Glasses: Yes”
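In code, this labeling step amounts to thresholding the classifier's sigmoid outputs; a minimal sketch, with batching and image preprocessing omitted:

```python
import torch

@torch.no_grad()
def label_images(classifier, images, threshold=0.5):
    """images: (N, 3, H, W) tensor of generated faces.
    Returns an (N, 40) boolean matrix of attribute labels."""
    classifier.eval()
    probs = torch.sigmoid(classifier(images))   # per-attribute probabilities
    return probs > threshold                     # presence/absence per attribute
```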
Step 3: Find Semantic Directions
- For each attribute (e.g., “Smiling”), separated images into two groups: those with the attribute and those without
- Used logistic regression to find the direction in latent space that best separates these groups
- This direction becomes the “smile direction” that can be applied to any latent code
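A minimal sketch of this step with scikit-learn, where `latents` (N × 128) and `attr_labels` come from Steps 1-2; normalizing the weight vector to unit length is a convention assumed here, not something mandated by the method:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def find_direction(latents, attr_labels):
    """latents: (N, 128) saved latent codes; attr_labels: (N,) 0/1 for one attribute.
    Returns a unit-norm latent-space direction separating the two groups."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(latents, attr_labels)
    direction = clf.coef_.ravel()            # normal of the separating hyperplane
    return direction / np.linalg.norm(direction)

# e.g. directions["Smiling"] = find_direction(latents, labels[:, SMILING_IDX])
```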
Step 4: Edit Images
- Take any original latent code
- Add the desired direction vector (e.g., “smile direction”)
- Control strength with a multiplier (gain)
- Generate new image from the edited latent code
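Put together, the edit itself is a single vector operation followed by re-generation; a minimal sketch with illustrative stand-in values (the real pipeline reuses the latent codes and directions from Steps 1-3):

```python
import numpy as np

def edit_latent(z, direction, gain):
    """z: saved 128-dim latent code; direction: unit vector; gain: edit strength."""
    return z + gain * direction

# Illustrative stand-ins; in the real pipeline these come from Steps 1-3
z_original = np.random.randn(128)
smile_direction = np.random.randn(128)
smile_direction /= np.linalg.norm(smile_direction)

# Gain sweep, e.g. to produce an editing strip
for gain in [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]:
    z_edit = edit_latent(z_original, smile_direction, gain)
    # image = generate_from_latent(z_edit)   # regenerate from the edited code
```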
Complete pipeline: Generation → Classification → Direction Finding → Editing
Editing Results
Successful Manipulations
Age Progression:
Smooth age transition from older to younger appearance
The progression shows how increasing the “young” direction smoothly transforms the face:
- Original (Gain 0.0): Older appearance
- Gain 1.0-2.0: Gradually younger features
- Gain 3.0: Young adult appearance
- Gain 4.0-5.0: Very young face
Adding a Beard:
Gradual beard growth from clean-shaven to full beard
Other Successful Edits:
- Smiling: Transition from neutral expression to happy smile
- Eyeglasses: Add or remove glasses smoothly
- Hair Color: Some success changing brown to blond
- Gender Features: Male to female transformations (with side effects)
Challenges: Attribute Entanglement
The Problem
Editing one attribute often unintentionally changes others. This is called “entanglement.”
Example: When Beard Editing Goes Wrong
Adding a beard also changes gender, adds a hat, and alters face shape
Why This Happens
1. No Guidance During Training
- DDGAN was trained only on images, without attribute labels
- The model organized its latent space based on natural patterns in the data
- No explicit instruction to keep attributes separate
2. Dataset Correlations
- In CelebA, beards appear almost exclusively on men (99%)
- Blond hair correlates strongly with young, female celebrities
- Certain accessories (like hats) appear more often with specific age groups
3. Post-Training Direction Discovery
- Directions were found after training was complete
- No guarantee that different directions are independent
- Directions may overlap when attributes naturally co-occur
4. Limited Latent Space
- Only 128 dimensions to encode an entire face
- Must represent 40+ different attributes
- Inevitable sharing and overlap between features
Entanglement Analysis
Measured Correlations:
| Edit Target | Unintended Change | Correlation | Root Cause |
|---|---|---|---|
| Blond Hair | Younger appearance | +0.72 | Dataset bias: young celebrities often blond |
| Blond Hair | Female gender | -0.68 | Blond more common in women in dataset |
| Beard | Male gender | +0.81 | 99% of beards in dataset are male |
| Beard | Wearing hat | +0.43 | Rare co-occurrence creates confusion |
| Eyeglasses | Older age | +0.35 | Glasses more common on older faces |
A high absolute correlation (|r| > 0.5) indicates significant entanglement
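One way to estimate correlations like the ones above is to sweep the edit gain for a target attribute and correlate it with the classifier's probability for an unrelated attribute on the re-generated images; the sketch below is a rough illustration of that idea, and the numbers in it are placeholders, not measured results.

```python
import numpy as np

def entanglement_correlation(gains, off_target_scores):
    """gains: edit strengths applied to the target attribute.
    off_target_scores: classifier probabilities for an unrelated attribute,
    measured on the re-generated image at each gain.
    Returns the Pearson correlation between gain and the off-target score."""
    return float(np.corrcoef(np.asarray(gains), np.asarray(off_target_scores))[0, 1])

# Placeholder example: beard edits that also raise the "Wearing_Hat" probability
gains = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
hat_probs = [0.05, 0.08, 0.15, 0.22, 0.30, 0.35]   # illustrative values only
print(entanglement_correlation(gains, hat_probs))
```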
Key Findings
What Worked Well
✅ Extreme Speed: 600× faster than standard diffusion models
✅ High Quality: Competitive image quality (FID 17.16 in this project at 128×128; the original paper reports lower FIDs on other datasets)
✅ Diversity: Wide variety in generated faces across all attributes
✅ Controllable Editing: Successfully manipulated multiple facial features
✅ Smooth Transitions: Gradual changes when adjusting edit strength
Limitations Discovered
❌ Attribute Entanglement: Cannot isolate single attributes perfectly
❌ Dataset Biases: Inherited demographic skews from CelebA
❌ Side Effects: Beard edits add hats, blond hair changes gender
❌ Classifier Dependency: Editing quality limited by classifier accuracy
❌ Resolution Constraints: Trained at 128×128 due to computational limits
Trade-offs Identified
2 Steps vs 4 Steps:
- T=2: Faster (3× speed), good for demos, but overfits with long training
- T=4: Best quality, more stable, but slower generation
DDGAN vs StyleGAN:
- DDGAN: Faster generation, better diversity, less precise editing control
- StyleGAN: Slower, more disentangled latents, better editing control
Future Improvements
For Better Quality
- Train with 4 steps beyond 1,400 epochs (target: FID < 3.0)
- Scale to higher resolutions (256×256 or 512×512)
- Use larger network architecture with self-attention
For Better Control
- Add attribute conditioning during training (not after the fact)
- Orthogonalize discovered directions to reduce entanglement (see the sketch after this list)
- Use disentanglement losses to encourage independence
- Hierarchical latent structure for coarse-to-fine control
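As an example of the orthogonalization idea mentioned above, one common post-hoc fix is to project unwanted components out of a direction, Gram-Schmidt style. This is a sketch of that idea under those assumptions, not something taken from the original DDGAN code:

```python
import numpy as np

def orthogonalize(target, others):
    """Remove the components of `others` from `target` (Gram-Schmidt projection).
    target: (D,) direction to clean up; others: list of (D,) directions to exclude."""
    d = target.copy()
    for other in others:
        u = other / np.linalg.norm(other)
        d = d - np.dot(d, u) * u          # subtract the projection onto `other`
    return d / np.linalg.norm(d)

# e.g. beard_clean = orthogonalize(beard_dir, [gender_dir, hat_dir])
```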
Applications
Synthetic Data Generation:
- Create privacy-preserving face datasets
- Balance datasets lacking certain demographics
- Generate training data augmentation
Image Editing:
- Age progression for forensic applications
- Virtual try-on for glasses, hairstyles, accessories
- Expression modification for animation
Research:
- Detect and measure algorithmic biases
- Test fairness of face recognition systems
- Study model interpretability
Conclusion
This project successfully demonstrates DDGAN’s capability for ultra-fast, high-quality image generation and explores the controllability of its latent space.
Key Takeaways:
- Speed Advantage: DDGAN achieves excellent quality 600× faster than traditional diffusion models, making real-time generation feasible
- Quality-Speed Trade-off: Using 2 steps is fastest but may overfit; 4 steps provides better final quality at a modest speed cost
- Latent Editing is Possible: Despite no explicit conditioning, the latent space naturally encodes semantic attributes that can be manipulated
- Entanglement Limits Control: Without attribute conditioning during training, correlated features interfere with precise editing
- Future Potential: Adding explicit attribute guidance during training could dramatically improve editing precision while maintaining speed
Overall: DDGAN excels at its core purpose—fast, diverse, high-quality generation. While latent editing works, achieving precise control comparable to StyleGAN would require architectural modifications or explicit conditioning during training.
References
Primary Paper:
- Xiao, Z., Kreis, K., & Vahdat, A. (2022). Tackling the Generative Learning Trilemma with Denoising Diffusion GANs. ICLR.
Code:
- Official DDGAN implementation by NVIDIA (NVlabs/denoising-diffusion-gan on GitHub)
Datasets:
- CelebA: Liu, Z., et al. (2015). Deep Learning Face Attributes in the Wild. ICCV.
Infrastructure:
- ASU SOL Supercomputer
My Contributions
What I Built:
- Trained DDGAN on CelebA-HQ (128×128, 1,200 epochs across 3 GPUs)
- Built ResNet-18 classifier for 40 attributes on 202K images
- Engineered complete latent editing pipeline from scratch
- Generated and analyzed 30,000 synthetic faces
- Performed comprehensive entanglement analysis
- Documented limitations and future improvements
- Created automated evaluation metrics