Project Overview

Exploring DDGAN’s capability for fast generation and discovering semantic directions in latent space for controlled facial attribute manipulation on CelebA-HQ.

This project explores Denoising Diffusion GANs (DDGAN), a hybrid approach that combines the sample quality of diffusion models with the speed of GANs. While standard diffusion models require 1000+ denoising steps, DDGAN achieves comparable quality in just 2-4 steps, enabling near-real-time generation.

I extended the original work by building a complete latent space editing pipeline for facial attributes, training on the CelebA-HQ dataset and discovering semantic directions for controllable image manipulation.


The Generative Learning Trilemma

For generative models to see widespread real-world use, they should ideally meet three key requirements:

  • High-quality sampling — The outputs must look and sound natural, especially for direct user-facing tasks (e.g., speech clarity, photorealistic images).
  • Mode coverage and sample diversity — Successful models capture the full variety of training data, including rare examples, thus avoiding bias and missing important cases.
  • Fast, inexpensive sampling — Real-time and interactive applications need speedy, low-compute generation; this also reduces the environmental impact of running large neural networks.

Most existing methods excel in only one or two aspects, making trade-offs between them. Capturing all three is difficult—this is known as the generative learning trilemma. Failing on diversity leads to biased, brittle models; failing on speed limits real-time use.

Illustration: The generative learning trilemma—quality, diversity, and speed.


What is DDGAN?

Denoising Diffusion GAN (DDGAN)
DDGAN is designed as a solution to the trilemma. It produces images that are visually realistic (high-quality), covers diverse modes of the data (diversity), and does so with ultra-fast generation (speed). By combining the stable denoising processes of diffusion models with adversarial GAN training, DDGAN delivers high-fidelity results in just a few steps, instead of thousands—unlocking potential for real-time editing and deployment.

How DDGAN Works

The Core Concept:

DDGAN solves a fundamental challenge: how to generate high-quality, diverse images quickly. Traditional diffusion models produce great results but are extremely slow (requiring 1000+ steps). GANs are fast but can lack diversity. DDGAN combines both approaches.

The Process:

  1. Forward Diffusion: Start with a clean image and gradually add noise over several timesteps (in this project, either 2 or 4 steps)

  2. Conditional Generation: The generator takes a noisy image and tries to denoise it, conditioned on the current timestep and a random latent vector

  3. Adversarial Training: A discriminator network distinguishes between real and generated denoising transitions, forcing the generator to learn realistic outputs

  4. Reverse Sampling: To create new images, start from pure noise and apply the generator repeatedly to denoise step-by-step

Key Innovation: Adversarial training lets the generator model the complex, multimodal denoising distribution at each large step, so the full reverse process needs only 2-4 steps instead of 1,000+, yielding a roughly 600× speedup (a minimal sampling sketch follows the figure below).

Illustration: DDGAN architecture and training process. The generator learns to denoise images while the discriminator ensures realistic denoising transitions.
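To make the reverse process concrete, below is a minimal PyTorch sketch of the few-step sampling loop. The `ToyGenerator` is only a stand-in for the actual DDGAN generator (a U-Net style network conditioned on the noisy image, the timestep, and a latent code), and the noise schedule values are illustrative rather than the ones used in training.

```python
import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    """Stand-in for the DDGAN generator: maps (x_t, z, t) to a prediction of x_0."""
    def __init__(self, img_ch=3, z_dim=128, res=128):
        super().__init__()
        self.res = res
        self.z_proj = nn.Linear(z_dim, img_ch * res * res)
        self.net = nn.Conv2d(img_ch * 2, img_ch, kernel_size=3, padding=1)

    def forward(self, x_t, z, t):  # the toy ignores t; the real model embeds it
        z_map = self.z_proj(z).view(x_t.size(0), x_t.size(1), self.res, self.res)
        return torch.tanh(self.net(torch.cat([x_t, z_map], dim=1)))

@torch.no_grad()
def ddgan_sample(generator, num_steps=4, batch=8, z_dim=128, device="cpu"):
    """Few-step reverse sampling: predict x_0, then sample the diffusion posterior."""
    betas = torch.linspace(0.1, 0.2, num_steps, device=device)  # illustrative schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x_t = torch.randn(batch, 3, 128, 128, device=device)  # start from pure noise
    for t in reversed(range(num_steps)):
        z = torch.randn(batch, z_dim, device=device)       # fresh latent at each step
        x0_pred = generator(x_t, z, torch.full((batch,), t, device=device))
        if t == 0:
            return x0_pred
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        # Posterior q(x_{t-1} | x_t, x0_pred) of the forward diffusion process
        mean = (ab_prev.sqrt() * betas[t] / (1 - ab_t)) * x0_pred \
             + (alphas[t].sqrt() * (1 - ab_prev) / (1 - ab_t)) * x_t
        var = betas[t] * (1 - ab_prev) / (1 - ab_t)
        x_t = mean + var.sqrt() * torch.randn_like(x_t)

samples = ddgan_sample(ToyGenerator(), num_steps=4)  # shape: (8, 3, 128, 128)
```

The point the sketch captures is that each step draws a fresh latent `z`, so a single large denoising step can cover a multimodal distribution instead of a single Gaussian.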


What I Accomplished

Phase 1: Training the Face Generator

  • Trained DDGAN on 5,000 celebrity faces at 128×128 resolution
  • Ran training for 1,200 epochs over 48 hours
  • Used 3 NVIDIA A30 GPUs on ASU’s SOL Supercomputer
  • Generated 30,000 high-quality synthetic faces
  • Achieved generation speed of 0.025 seconds per image
  • Quality metric (FID): 17.16

Phase 2: Building the Attribute Classifier

  • Created a ResNet-18 neural network to recognize 40 facial attributes (see the sketch after this list)
  • Trained on the full CelebA dataset (202,599 images)
  • Attributes include: age, smile, beard, glasses, hair color, gender, etc.
  • Achieved over 90% accuracy on most attributes
  • This classifier became the foundation for controllable editing
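Below is a minimal sketch of how a multi-label attribute classifier like this can be set up in PyTorch. The ImageNet-pretrained initialization and the optimizer settings are my assumptions here, not necessarily the exact recipe used in this project.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_ATTRS = 40  # CelebA annotates 40 binary attributes per face

# ResNet-18 backbone with a 40-logit head (one logit per attribute).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_ATTRS)

criterion = nn.BCEWithLogitsLoss()  # each attribute is an independent binary label
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, attr_labels):
    """images: (B, 3, H, W) float tensor; attr_labels: (B, 40) tensor of 0/1 values."""
    logits = model(images)
    loss = criterion(logits, attr_labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```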

Phase 3: Discovering Latent Directions

  • Generated 30,000 images and saved their latent codes
  • Used the classifier to label each generated image
  • Applied logistic regression to find “direction vectors” for each attribute
  • Each direction represents how to manipulate one facial feature
  • Created 40 different editing directions (one per attribute)

Phase 4: Interactive Image Editing

  • Built a system to edit faces by moving along discovered directions
  • Can control edit strength with a “gain” parameter
  • Successfully manipulated: age, smile, beard, glasses, and more
  • Documented limitations and entanglement issues

Technical Setup

Frameworks: PyTorch, torchvision, multi-GPU distributed training (setup sketched below)
Metrics: FID (Fréchet Inception Distance), Inception Score
Hardware: 3× NVIDIA A30 GPUs (24GB each)
Supercomputer: ASU SOL
Training Time: ~48 hours per configuration
Dataset: CelebA-HQ (5K subset for generation, 202K full for classifier)
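Since training ran across three GPUs, here is a generic sketch of the PyTorch DistributedDataParallel setup involved, launched with something like `torchrun --nproc_per_node=3 train.py`. This is an assumption-level illustration of the framework, not the project's actual training script.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model):
    """Wrap a model for multi-GPU training; assumes the process was launched by torchrun."""
    dist.init_process_group(backend="nccl")      # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)
    return DDP(model.to(local_rank), device_ids=[local_rank])
```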


Results: Generated Faces

Synthetic Celebrity Faces

Illustration: 30,000 high-fidelity celebrity faces generated with DDGAN.

Quality Characteristics:

  • Photorealistic facial features and skin textures
  • Diverse ages, genders, and ethnicities
  • Variety in hairstyles, expressions, and accessories
  • Minimal visual artifacts or distortions
  • Fast generation: 0.025 seconds per image

Metrics:

  • FID Score: 17.16 (indicates good quality for 128×128 resolution; see the measurement sketch below)
  • Speed: 600× faster than standard diffusion models
  • Diversity: High variance across all 40 facial attributes
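The exact FID tooling is not detailed above; one common way to compute it in PyTorch is with torchmetrics, roughly as below. The random tensors are placeholders for batches of real and generated faces, and a reliable estimate needs far more samples (and typically the 2048-dimensional Inception features).

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance  # pip install "torchmetrics[image]"

# FID compares Inception feature statistics of real vs. generated images.
fid = FrechetInceptionDistance(feature=64, normalize=True)  # normalize=True: float images in [0, 1]

real_batch = torch.rand(128, 3, 128, 128)  # placeholder for real CelebA-HQ images
fake_batch = torch.rand(128, 3, 128, 128)  # placeholder for DDGAN samples

fid.update(real_batch, real=True)
fid.update(fake_batch, real=False)
print(float(fid.compute()))
```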

The Latent Editing System

How It Works

My editing framework follows four steps:

Step 1: Generate Images with Latent Codes

  • Created 30,000 synthetic faces
  • Saved the random latent vector used to generate each image
  • Each latent vector is 128 numbers that encode the entire face

Step 2: Label Images with Attributes

  • Used the ResNet-18 classifier to analyze each generated image
  • Classified the presence/absence of 40 attributes per image (see the labeling sketch below)
  • Example: Image #5432 has “Young: Yes, Smiling: Yes, Beard: No, Glasses: Yes”
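A small sketch of this labeling step: run the trained classifier over each batch of generated images and threshold the per-attribute probabilities (the 0.5 threshold is an assumption).

```python
import torch

@torch.no_grad()
def label_batch(classifier, images, threshold=0.5):
    """Return (B, 40) binary attribute labels for a batch of generated images."""
    classifier.eval()
    probs = torch.sigmoid(classifier(images))  # per-attribute probabilities
    return (probs > threshold).long()
```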

Step 3: Find Semantic Directions

  • For each attribute (e.g., “Smiling”), separated images into two groups: those with the attribute and those without
  • Used logistic regression to find the direction in latent space that best separates these groups (sketched below)
  • This direction becomes the “smile direction” that can be applied to any latent code
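A minimal sketch of this direction-finding step with scikit-learn; `latents` and `labels` stand for the saved 128-dimensional latent codes and the classifier's binary labels for one attribute.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def find_direction(latents, labels):
    """latents: (N, 128) latent codes; labels: (N,) binary labels for one attribute.

    The unit-normalized weight vector of a linear classifier fit in latent space
    serves as the semantic direction for that attribute.
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(latents, labels)
    direction = clf.coef_.ravel()  # shape (128,)
    return direction / np.linalg.norm(direction)

# e.g. smile_direction = find_direction(all_latents, smile_labels)
```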

Step 4: Edit Images

  • Take any original latent code
  • Add the desired direction vector (e.g., “smile direction”)
  • Control strength with a multiplier (gain)
  • Generate new image from the edited latent code
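The edit itself is a single vector operation, sketched below; `generate_from_latent` is a hypothetical wrapper around the DDGAN sampler that re-generates an image from a (possibly edited) latent code.

```python
import torch

def edit_latent(z, direction, gain):
    """Shift a latent code along a discovered semantic direction.

    z: (128,) original latent code; direction: (128,) unit direction; gain: edit strength.
    """
    d = torch.as_tensor(direction, dtype=z.dtype, device=z.device)
    return z + gain * d

# Example gain sweep for one face:
# frames = [generate_from_latent(edit_latent(z0, smile_direction, g))
#           for g in (0.0, 1.0, 2.0, 3.0, 4.0, 5.0)]
```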

Illustration: The complete latent editing pipeline (Generation → Classification → Direction Finding → Editing).


Editing Results

Successful Manipulations

Age Progression:

Illustration: Smooth age transition from older to younger appearance.

The progression shows how increasing the “young” direction smoothly transforms the face:

  • Original (Gain 0.0): Older appearance
  • Gain 1.0-2.0: Gradually younger features
  • Gain 3.0: Young adult appearance
  • Gain 4.0-5.0: Very young face

Adding a Beard:

Illustration: Gradual beard growth from clean-shaven to full beard.

Other Successful Edits:

  • Smiling: Transition from neutral expression to happy smile
  • Eyeglasses: Add or remove glasses smoothly
  • Hair Color: Some success changing brown to blond
  • Gender Features: Male to female transformations (with side effects)

Challenges: Attribute Entanglement

The Problem

Editing one attribute often unintentionally changes others. This is called “entanglement.”

Example: When Beard Editing Goes Wrong

Illustration: Adding a beard also changes gender, adds a hat, and alters face shape.

Why This Happens

1. No Guidance During Training

  • DDGAN was trained only on images, without attribute labels
  • The model organized its latent space based on natural patterns in the data
  • No explicit instruction to keep attributes separate

2. Dataset Correlations

  • In CelebA, beards appear almost exclusively on men (99%)
  • Blond hair correlates strongly with young, female celebrities
  • Certain accessories (like hats) appear more often with specific age groups

3. Post-Training Direction Discovery

  • Directions were found after training was complete
  • No guarantee that different directions are independent
  • Directions may overlap when attributes naturally co-occur

4. Limited Latent Space

  • Only 128 dimensions to encode an entire face
  • Must represent 40+ different attributes
  • Inevitable sharing and overlap between features

Entanglement Analysis

Measured Correlations:

Edit Target | Unintended Change  | Correlation | Root Cause
Blond Hair  | Younger appearance | +0.72       | Dataset bias: young celebrities often blond
Blond Hair  | Female gender      | -0.68       | Blond more common in women in dataset
Beard       | Male gender        | +0.81       | 99% of beards in dataset are male
Beard       | Wearing hat        | +0.43       | Rare co-occurrence creates confusion
Eyeglasses  | Older age          | +0.35       | Glasses more common on older faces

A high correlation magnitude (|r| > 0.5) indicates significant entanglement; the sketch below shows one plausible way such correlations can be measured.
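The exact measurement protocol for these numbers is not spelled out above; one plausible approach is to correlate, across many faces, the change in the edited attribute's classifier score with the change in another attribute's score.

```python
import numpy as np

def entanglement_correlation(scores_before, scores_after, edited_idx, other_idx):
    """Correlation between the change in the edited attribute and in another attribute.

    scores_before / scores_after: (N, 40) classifier probabilities for the same N faces,
    before and after applying one edit direction at a fixed gain.
    """
    delta = scores_after - scores_before
    return float(np.corrcoef(delta[:, edited_idx], delta[:, other_idx])[0, 1])
```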


Key Findings

What Worked Well

  • Extreme Speed: 600× faster sampling than standard diffusion models
  • High Quality: Competitive image quality (FID 17.16 at 128×128 in this project; the original DDGAN paper reports lower FID on other benchmarks)
  • Diversity: Wide variety in generated faces across all 40 attributes
  • Controllable Editing: Successfully manipulated multiple facial features
  • Smooth Transitions: Gradual changes when adjusting edit strength

Limitations Discovered

  • Attribute Entanglement: Cannot isolate single attributes perfectly
  • Dataset Biases: Inherited demographic skews from CelebA
  • Side Effects: Beard edits can add hats, and blond-hair edits can shift apparent gender
  • Classifier Dependency: Editing quality is limited by classifier accuracy
  • Resolution Constraints: Trained at 128×128 due to computational limits

Trade-offs Identified

2 Steps vs 4 Steps:

  • T=2: Faster (3× speed), good for demos, but overfits with long training
  • T=4: Best quality, more stable, but slower generation

DDGAN vs StyleGAN:

  • DDGAN: Faster generation, better diversity, less precise editing control
  • StyleGAN: Slower, more disentangled latents, better editing control

Future Improvements

For Better Quality

  • Train with 4 steps beyond 1,400 epochs (target: FID < 3.0)
  • Scale to higher resolutions (256×256 or 512×512)
  • Use larger network architecture with self-attention

For Better Control

  • Add attribute conditioning during training (not after the fact)
  • Orthogonalize directions to reduce entanglement (a Gram-Schmidt sketch follows this list)
  • Use disentanglement losses to encourage independence
  • Hierarchical latent structure for coarse-to-fine control
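For the orthogonalization idea above, a simple Gram-Schmidt pass over the 40 direction vectors would look roughly like this; the ordering of the directions (which attributes take priority) is a design choice left open here.

```python
import numpy as np

def orthogonalize(directions):
    """Gram-Schmidt: make each direction orthogonal to the ones processed before it.

    directions: (K, 128) array of unit direction vectors, ordered by priority.
    """
    ortho = []
    for d in directions:
        d = d.astype(np.float64)
        for u in ortho:
            d -= np.dot(d, u) * u      # remove the component along u
        norm = np.linalg.norm(d)
        if norm > 1e-8:                # drop directions that become degenerate
            ortho.append(d / norm)
    return np.stack(ortho)
```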

Applications

Synthetic Data Generation:

  • Create privacy-preserving face datasets
  • Balance datasets lacking certain demographics
  • Generate training data augmentation

Image Editing:

  • Age progression for forensic applications
  • Virtual try-on for glasses, hairstyles, accessories
  • Expression modification for animation

Research:

  • Detect and measure algorithmic biases
  • Test fairness of face recognition systems
  • Study model interpretability

Conclusion

This project successfully demonstrates DDGAN’s capability for ultra-fast, high-quality image generation and explores the controllability of its latent space.

Key Takeaways:

  1. Speed Advantage: DDGAN achieves excellent quality 600× faster than traditional diffusion models, making real-time generation feasible

  2. Quality-Speed Trade-off: Using 2 steps is fastest but may overfit; 4 steps provides better final quality at modest speed cost

  3. Latent Editing is Possible: Despite no explicit conditioning, the latent space naturally encodes semantic attributes that can be manipulated

  4. Entanglement Limits Control: Without attribute conditioning during training, correlated features interfere with precise editing

  5. Future Potential: Adding explicit attribute guidance during training could dramatically improve editing precision while maintaining speed

Overall: DDGAN excels at its core purpose—fast, diverse, high-quality generation. While latent editing works, achieving precise control comparable to StyleGAN would require architectural modifications or explicit conditioning during training.


References

Primary Paper:

  • Xiao, Z., Kreis, K., & Vahdat, A. (2022). Tackling the Generative Learning Trilemma with Denoising Diffusion GANs. ICLR.
Code:

Datasets:

  • CelebA: Liu, Z., et al. (2015). Deep Learning Face Attributes in the Wild. ICCV.
  • CelebA-HQ: Karras, T., et al. (2018). Progressive Growing of GANs for Improved Quality, Stability, and Variation. ICLR.

Infrastructure:

  • ASU SOL Supercomputer

My Contributions

What I Built:

  • Trained DDGAN on CelebA-HQ (128×128, 1,200 epochs across 3 GPUs)
  • Built ResNet-18 classifier for 40 attributes on 202K images
  • Engineered complete latent editing pipeline from scratch
  • Generated and analyzed 30,000 synthetic faces
  • Performed comprehensive entanglement analysis
  • Documented limitations and future improvements
  • Created automated evaluation metrics
