DC-GAN Study | Sia Khorsand

Abstract

Generative Adversarial Networks (GANs) have emerged as a significant method in unsupervised learning, demonstrating remarkable capabilities in generating realistic synthetic data. This study presents a comprehensive implementation and analysis of Deep Convolutional Generative Adversarial Networks (DC-GANs) applied to CIFAR-10 and CelebA datasets.

I conduct an extensive empirical investigation examining the impact of different activation functions, optimization strategies, and hyper-parameter configurations on model performance and training stability. Through systematic comparisons of ReLU and ELU activations across varied learning-rate configurations, I demonstrate DC-GAN effectiveness in generating high-quality synthetic images while providing insights into training dynamics and output quality.

The results contribute to understanding GAN training processes and offer practical guidelines for implementing DC-GANs across different image-generation tasks. My findings indicate that activation-function choice and hyper-parameter tuning significantly impact both training stability and sample quality, with notable differences observed between natural-object datasets and human-face datasets.

1 Introduction

Generative modeling has advanced rapidly since Goodfellow et al. introduced Generative Adversarial Networks in 2014. These networks reshaped unsupervised learning by introducing an adversarial training paradigm that pits two neural networks against each other in a minimax game.

The generator network learns to create realistic data samples from random noise with the goal of fooling the discriminator, while the discriminator network learns to distinguish between real and generated samples. This adversarial process drives both networks to improve iteratively, resulting in generators capable of producing highly realistic synthetic data that can deceive even well-trained discriminators. In short, a GAN is an arm-wrestling match between two competing neural networks.

The evolution from basic GANs to Deep Convolutional GANs (DC-GANs) was a crucial advancement, addressing many of the training instabilities and mode-collapse issues that plagued early implementations. DC-GANs introduced convolutional architecture guidelines that significantly improved training stability and output quality, making them particularly effective for image-generation tasks.

Despite these advancements, training GANs remains challenging because it depends on a delicate balance between generator and discriminator. This sensitivity to hyper-parameter choices, architectural decisions, and optimization strategies calls for empirical investigation of which configurations suit which datasets and applications.

Research Objectives

Implementing robust DC-GAN architectures capable of generating high-quality samples on both datasets

Conducting hyper-parameter and architectural optimization to identify optimal configurations for different scenarios

Analyzing the impact of various architectural choices on training dynamics and output quality

Providing comparative analysis between dataset-specific behaviors and requirements

2 Methodology

2.1 Architecture Design

My DC-GAN implementation loosely follows the architectural guidelines established by Radford et al., with systematic variations to explore the impact of different design choices. The architecture consists of two competing networks working in an adversarial framework.

Generator Network

Begins with a dense layer that reshapes the noise vector into a small spatial feature map, then uses a series of transposed-convolution layers to up-sample it progressively into a full-resolution image.

Transposed convolutions for upsampling

Batch normalization for stability

ReLU and ELU activations (varied across experiments)

Tanh output activation
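
As a rough illustration of how a transposed-convolution stack reaches full resolution, the sketch below applies the standard output-size formula for a transposed convolution. The 4×4 starting feature map and the kernel 4 / stride 2 / padding 1 settings are assumptions borrowed from common DC-GAN practice, not values stated in this study.

```python
def transposed_conv_size(size, kernel=4, stride=2, padding=1):
    """Output width of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def upsampling_path(start, target):
    """List the feature-map widths from the reshaped dense output to the image."""
    sizes = [start]
    while sizes[-1] < target:
        sizes.append(transposed_conv_size(sizes[-1]))
    return sizes


# Assumed 4x4 starting map; targets are the image sizes used in this study.
cifar_path = upsampling_path(4, 32)
celeba_path = upsampling_path(4, 64)
print(cifar_path, celeba_path)
```

Under these assumed settings each layer doubles the spatial size, so the 64×64 CelebA generator needs one more up-sampling layer than the 32×32 CIFAR-10 generator.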

Discriminator Network

Progressively down-samples input images to a single real/fake prediction, using convolutions and LeakyReLU activations, with spectral normalization on the CIFAR-10 runs.

Convolutional layers for feature extraction

LeakyReLU (α = 0.2) activations

Spectral normalization (CIFAR-10)

Binary classification output
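
Spectral normalization rescales a weight matrix by an estimate of its largest singular value, usually obtained by power iteration. The following is a minimal pure-Python sketch of that idea on a toy matrix. The function names are illustrative, and an actual training setup would normally rely on the framework's own implementation, which typically runs a single iteration per training step and reuses the vector between steps.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def spectral_norm_estimate(weight, iterations=50):
    """Estimate the largest singular value of a non-zero matrix by power iteration."""
    rows, cols = len(weight), len(weight[0])
    u = _normalize([1.0] * rows)
    v = [0.0] * cols
    for _ in range(iterations):
        # v <- W^T u, then u <- W v, normalizing after each product.
        v = _normalize([sum(weight[i][j] * u[i] for i in range(rows)) for j in range(cols)])
        u = _normalize([sum(weight[i][j] * v[j] for j in range(cols)) for i in range(rows)])
    # sigma = u^T W v
    return sum(u[i] * weight[i][j] * v[j] for i in range(rows) for j in range(cols))


def spectrally_normalize(weight, iterations=50):
    """Divide every entry by the estimated spectral norm."""
    sigma = spectral_norm_estimate(weight, iterations)
    return [[w / sigma for w in row] for row in weight]


example = [[3.0, 0.0], [0.0, 1.0]]  # toy matrix with singular values 3 and 1
sigma = spectral_norm_estimate(example)
normalized = spectrally_normalize(example)
```

After normalization the largest singular value of the matrix is approximately 1, which bounds how sharply the layer can amplify its input.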

2.2 Training Strategy

The training process implements the standard GAN minimax objective, updating discriminator and generator in alternating steps. Multiple stabilization techniques were employed to ensure robust training.
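
For reference, the standard minimax objective from Goodfellow et al. is written below, with G the generator, D the discriminator, p_data the data distribution, and p_z the noise prior. The discriminator step increases V with respect to D, and the generator step decreases it with respect to G.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```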

Stabilization Techniques

Spectral normalization (CIFAR-10)

Exponential-moving-average (EMA) weight tracking

Instance-noise decay

Label smoothing

Careful weight initialization

Mixed-precision with gradient scaling

Adam optimizer (β₁ = 0.5, β₂ = 0.999)
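
Three of the techniques listed above are simple enough to sketch without a deep-learning framework. The block below is an illustration only: the EMA decay, the linear noise schedule with a starting standard deviation of 0.1, and the smoothed real label of 0.9 are assumed values, since the study does not state them, and all function names are hypothetical.

```python
import math


def ema_update(ema_weights, live_weights, decay=0.999):
    """Blend the live generator weights into the running average, in place."""
    for name, value in live_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * value
    return ema_weights


def instance_noise_std(step, total_steps, start_std=0.1):
    """Linearly anneal the std of Gaussian noise added to discriminator inputs."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return start_std * remaining


def bce(prediction, target, eps=1e-7):
    """Binary cross-entropy for one prediction, clipped for numerical safety."""
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def discriminator_loss(d_real, d_fake, real_label=0.9):
    """One-sided label smoothing: real targets are 0.9 instead of 1.0."""
    return bce(d_real, real_label) + bce(d_fake, 0.0)


ema = ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9)
sigma_mid = instance_noise_std(step=50, total_steps=100)
loss_at_target = discriminator_loss(d_real=0.9, d_fake=0.1)
loss_overconfident = discriminator_loss(d_real=0.99, d_fake=0.1)
```

With a smoothed target, the discriminator is penalized for pushing its output on real images past 0.9, which is why `loss_overconfident` comes out larger than `loss_at_target`.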

Learning Rate Schedules

Balanced:
G_LR = D_LR = 1 × 10⁻⁴

Asymmetric (CIFAR-10):
G_LR = 5 × 10⁻⁵, D_LR = 1 × 10⁻⁴

Face-tuned (CelebA):
G_LR = 2 × 10⁻⁴, D_LR = 3 × 10⁻⁴
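
The three settings above can be written down as a small configuration table. This is a sketch with hypothetical key names; the learning rates and Adam betas are the ones listed in this section. Section 4.1 quotes G: 1e-4, D: 2e-4 for the best CIFAR-10 run, which has the same 2:1 ratio as the asymmetric entry here.

```python
ADAM_BETAS = (0.5, 0.999)  # beta1, beta2 from the stabilization list

LEARNING_RATES = {
    "balanced": {"generator": 1e-4, "discriminator": 1e-4},
    "asymmetric_cifar10": {"generator": 5e-5, "discriminator": 1e-4},
    "face_tuned_celeba": {"generator": 2e-4, "discriminator": 3e-4},
}


def discriminator_to_generator_ratio(name):
    """How much faster the discriminator learns than the generator."""
    rates = LEARNING_RATES[name]
    return rates["discriminator"] / rates["generator"]


ratios = {name: discriminator_to_generator_ratio(name) for name in LEARNING_RATES}
```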

2.3 Experimental Design

For each activation × learning-rate setting I train with three random seeds, log losses, save a checkpoint every five epochs, and compute the Inception Score (IS; 50k samples, 10 splits). CIFAR-10 runs for 100 epochs; CelebA converges by epoch 25.
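
The Inception Score is the exponential of the average KL divergence between each image's predicted class distribution p(y|x) and the marginal p(y), computed per split and then summarized as mean ± standard deviation. The sketch below assumes the per-image class probabilities have already been produced by a pretrained Inception classifier, which is not shown; the function name is hypothetical.

```python
import math
from statistics import mean, pstdev


def inception_score(probabilities, n_splits=10):
    """Return (mean, std) over splits of exp(E_x[KL(p(y|x) || p(y))]).

    `probabilities` is a list of per-image class-probability rows.
    """
    split_size = len(probabilities) // n_splits
    scores = []
    for index in range(n_splits):
        chunk = probabilities[index * split_size:(index + 1) * split_size]
        n_classes = len(chunk[0])
        # Marginal class distribution p(y) within this split.
        marginal = [sum(row[c] for row in chunk) / len(chunk) for c in range(n_classes)]
        kl_terms = [
            sum(p * math.log(p / q) for p, q in zip(row, marginal) if p > 0.0)
            for row in chunk
        ]
        scores.append(math.exp(mean(kl_terms)))
    return mean(scores), pstdev(scores)


# Toy two-class check: confident, balanced predictions give a score of about 2,
# while identical uniform predictions give a score of 1.
confident = [[1.0, 0.0], [0.0, 1.0]] * 10
uniform = [[0.5, 0.5]] * 20
confident_mean, confident_std = inception_score(confident, n_splits=2)
uniform_mean, uniform_std = inception_score(uniform, n_splits=2)
```

The score is bounded above by the number of classes, so higher values indicate predictions that are both confident per image and varied across images.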

3 Datasets

Two fundamentally different image datasets were selected to evaluate DC-GAN performance across varied domains: CIFAR-10 for diverse object categories and CelebA for high-fidelity human faces.

CIFAR-10

60,000 images • 32×32 pixels • 10 classes

Contains 60,000 32×32 colour images across ten classes: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks.

Low resolution ideal for initial GAN testing

Challenging due to object diversity

Pixels normalized to [-1, 1]

Random flips and small rotations applied

Challenge: Diverse textures and object categories require robust feature learning

CelebA

~50,000 images • 64×64 pixels • Celebrity faces

Comprises >200,000 aligned celebrity faces annotated with 40 binary attributes. I use ≈50,000 high-quality images, centre-cropped and resized to 64×64.

Higher resolution for detailed features

Structured domain (human faces)

Centre-cropped and aligned faces

Corrupted images removed during preprocessing

Challenge: Fine-grained detail in skin texture, symmetry, and facial features

Preprocessing Pipeline

CIFAR-10:
• Pixel normalization [-1, 1]
• Random horizontal flips
• Small rotation augmentation

CelebA:
• Quality filtering
• Centre crop faces
• Resize to 64×64
• Pixel normalization [-1, 1]
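
The deterministic parts of this pipeline can be sketched in plain Python. This is an illustration under stated assumptions: images are treated as lists of rows of 8-bit values, the helper names are hypothetical, and the 178-pixel crop size is an assumption (the study states only that faces are centre-cropped and resized to 64×64). Random rotations and resizing are omitted because they need an image library.

```python
def normalize_pixel(value):
    """Map an 8-bit intensity in [0, 255] to [-1, 1], matching a tanh output."""
    return value / 127.5 - 1.0


def denormalize_pixel(value):
    """Map a generator output in [-1, 1] back to an 8-bit intensity."""
    return round((value + 1.0) * 127.5)


def horizontal_flip(image):
    """Mirror an image stored as a list of rows."""
    return [list(reversed(row)) for row in image]


def centre_crop_box(width, height, size):
    """Return (left, top, right, bottom) for a centred square crop."""
    left = (width - size) // 2
    top = (height - size) // 2
    return left, top, left + size, top + size


row = [normalize_pixel(v) for v in (0, 255)]
flipped = horizontal_flip([[1, 2, 3], [4, 5, 6]])
box = centre_crop_box(178, 218, 178)  # 178x218 is the aligned CelebA image size
```

Normalizing to [-1, 1] keeps real images in the same range as the generator's tanh output, so the discriminator sees both on the same scale.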

4 CIFAR-10 Results & Discussion

4.1 Training Dynamics Analysis

The CIFAR-10 experiments revealed significant differences in training stability and output quality between activation functions and learning rate configurations. Three key scenarios emerged from the systematic evaluation.

ReLU Activation

Best Performance: IS 5.49 ± 1.8

ReLU's sparse activations preserve the high-frequency detail essential for diverse object generation. It performs best with asymmetric learning rates.

Sharp, colorful object generation

Requires asymmetric LR (D = 2 × G)

Healthy adversarial dynamics

Nearly double the ELU score

ELU Activation

Inception Score: 2.87 ± 0.98

ELU's smooth negative region led to oversmoothing on CIFAR-10's diverse textures, resulting in mode collapse and poor sample quality.

Desaturated, blurry outputs

Dominant discriminator dynamics

Mode collapse evident

Consistent underperformance

CIFAR-10 Training Results

ELU + Balanced LR: Diverging losses, mode collapse, blurry outputs (IS: 2.87 ± 0.98)

ReLU + Asymmetric LR: Healthy dynamics, vivid objects, sharp details (IS: 5.49 ± 1.8)

ReLU + Balanced LR: Flat losses, generator complacency, uniform grey patches

Key Training Scenarios

Best: ReLU + Asymmetric LR

G: 1e-4, D: 2e-4 → Healthy dynamics, vivid objects, IS 5.49

Worst: ELU + Balanced LR

Diverging losses, mode collapse, blurry blobs, IS 2.87

Problematic: ReLU + Balanced LR

Flat losses, generator complacency, uniform grey patches

CIFAR-10 Performance

IS: 5.49 ± 1.8

Optimal Configuration:
ReLU + Asymmetric Learning Rate

Key Insights

Activation choice crucial for object diversity

Asymmetric LR prevents discriminator dominance

Early loss divergence indicates mode collapse

ReLU preserves high-frequency textures

4.2 Architecture Impact Assessment

ELU's smooth negative region oversmooths outputs on CIFAR-10, while ReLU's sparse activations preserve high-frequency detail. However, activation alone is insufficient: ReLU needs an asymmetric LR (higher D) to excel on heterogeneous objects.
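
The difference between the activations is easiest to see on negative inputs. The sketch below uses the textbook definitions; the ELU scale α = 1.0 is the common default and an assumption here, while the LeakyReLU slope of 0.2 is the value given in Section 2.1.

```python
import math


def relu(x):
    """Zero for negative inputs, identity otherwise (sparse activations)."""
    return max(0.0, x)


def elu(x, alpha=1.0):
    """Smooth negative region that saturates towards -alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def leaky_relu(x, slope=0.2):
    """Small linear slope for negative inputs, as used in the discriminator."""
    return x if x > 0 else slope * x


inputs = [-2.0, -0.5, 0.0, 1.5]
relu_out = [relu(x) for x in inputs]
elu_out = [elu(x) for x in inputs]
leaky_out = [leaky_relu(x) for x in inputs]
```

ReLU maps every negative input to exactly zero, whereas ELU passes a smooth, bounded negative value through to the next layer.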

5 CelebA Results & Discussion

5.1 Architecture Impact Assessment

CelebA's facial geometry stabilizes training for both activations, but distinct differences emerge in output quality and fine-grained detail preservation. The structured nature of faces allows both ReLU and ELU to train stably, so the activation choice shows up mainly in how well fine detail is rendered.

CelebA Training Results Comparison

Training Dynamics Comparison: ReLU (solid lines) vs ELU (dashed lines) showing generator and discriminator losses. ReLU demonstrates superior performance with better loss convergence.

Best ReLU Run: Photorealistic faces with sharp details, varied demographics (IS: 6.82 ± 1.4)

ReLU Reproducibility: More "face melting" and less detail visible

ReLU: Photorealistic Detail

Inception Score: 6.82 ± 1.4

ReLU better preserves hair strands and skin pores, generating photorealistic faces with sharp detail and accurate anatomy across diverse demographics.

Sharp skin texture and pores

Detailed hair strand rendering

Accurate facial anatomy

Varied lighting and demographics

Consistent reproducibility across seeds

ELU: Airbrushed Softness

Best Score: 4.91 ± 1.2

ELU yields softer, airbrushed faces with less texture detail. While aesthetically pleasing, they lack the fine-grained realism achieved by ReLU.

Smoother facial features

Airbrushed skin appearance

Less hair detail

Good overall structure

Lower inception scores consistently

5.2 Hyperparameter Optimization

Unlike CIFAR-10, CelebA proved more robust to learning rate variations. The structured domain of human faces allows for more balanced training dynamics, though ReLU still demonstrated clear superiority.

CelebA Performance

IS: 6.82 ± 1.4

Optimal Configuration:
ReLU + Face-tuned LR
(G: 2e-4, D: 3e-4)

CelebA Insights

Activation choice outweighs LR tuning

Facial structure stabilizes training

ReLU preserves fine details better

Higher resolution shows clear differences

Consistent quality across demographics

5.3 Sample Quality Evaluation

Quality Comparison Summary

ReLU Characteristics:
• Photorealistic skin texture
• Sharp hair definition
• Detailed facial features
• Natural lighting effects

ELU Characteristics:
• Smooth, airbrushed skin
• Softer hair rendering
• Less textural detail
• Pleasant but less realistic

ReLU renders high-frequency detail (skin, hair) convincingly, while ELU yields softer features and lower Inception Scores, confirming ReLU's superiority for high-fidelity faces. The reproducibility across different seeds demonstrates the stability of the optimal configuration.

6 Analysis & Conclusion

6.1 Comparative Analysis

CIFAR-10 requires aggressive discriminator learning (2 × G) plus ReLU to conquer class diversity; CelebA is learning-rate robust but still favours ReLU for sharp detail. The fundamental difference lies in dataset complexity and domain structure.

CIFAR-10 Requirements

Asymmetric learning rates essential

ReLU critical for texture preservation

High sensitivity to hyperparameters

Diverse object categories challenging

CelebA Characteristics

Learning rate robust

ReLU still superior for detail

Structured domain stabilizes training

Fine-grained texture differences

6.2 Training Insights

Early Warning Signs

Discriminator loss > 1.5 and generator loss pinned at ≈ 0.7 within 20 epochs signal collapse. These indicators proved consistent across all failed experiments.
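
These two indicators can be turned into a simple check over logged losses. The sketch below assumes one averaged loss value per epoch and reads "pinned at ≈ 0.7" as within ±0.05 of 0.7; that tolerance, the function name, and the toy loss values are all assumptions for illustration.

```python
def collapse_warning(d_losses, g_losses, window=20,
                     d_threshold=1.5, g_pinned=0.7, g_tolerance=0.05):
    """Return the first 1-based epoch within `window` showing both signs, else None."""
    pairs = zip(d_losses[:window], g_losses[:window])
    for epoch, (d_loss, g_loss) in enumerate(pairs, start=1):
        if d_loss > d_threshold and abs(g_loss - g_pinned) <= g_tolerance:
            return epoch
    return None


# Made-up loss curves, not measurements from this study.
healthy = collapse_warning([0.9, 1.1, 1.2], [1.4, 1.2, 1.1])
collapsing = collapse_warning([1.2, 1.6, 1.8], [0.9, 0.71, 0.70])
```

A check like this could run at the end of each epoch so that a collapsing run is stopped early rather than trained to completion.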

Activation choice was the dominant stability factor; learning rate asymmetry mattered chiefly for CIFAR-10. Monitoring training dynamics early is essential, because most failure modes manifest within the first 20 epochs.

6.3 Practical Implementation Guidelines

Recommended Best Practices

General Guidelines

Use ReLU activations in DC-GAN generators

Monitor loss dynamics in first 20 epochs

Implement multiple stabilization techniques

Train with multiple random seeds

Diverse Datasets (CIFAR-10-like)

Set D ≈ 2 × G learning rate

Monitor loss divergence carefully

Use spectral normalization

Expect longer convergence times

Structured Datasets (Faces)

Balanced LRs 1×10⁻⁴–5×10⁻⁴ suffice

Focus on activation function choice

Higher resolution reveals differences

Quality assessment via fine details

Conclusion & Future Work

This study demonstrates that while specific architectural choices matter significantly, the optimal configuration depends heavily on dataset characteristics and practical constraints. In these experiments, ReLU generators outperformed ELU generators on both datasets when paired with a suitable learning-rate configuration.

The most important takeaway is that thorough hyperparameter tuning and systematic evaluation are often more critical than complex architectural modifications. The CIFAR-10 and CelebA models both produced their best results (IS 5.49 and 6.82, respectively) once a suitable configuration was identified through careful experimentation.

Future Research Directions

Extension to Progressive-GAN and StyleGAN architectures

Exploration of transfer learning strategies

Development of more generalizable generative models

Investigation of attention mechanisms in GANs

Cross-domain style transfer applications

Future work will extend these experiments to Progressive-GAN and StyleGAN architectures and explore transfer-learning strategies to build more generalizable generative models capable of producing high-quality synthetic data across diverse domains.

Full Research Paper

"Deep Convolutional Generative Adversarial Networks: A Comprehensive Study on CIFAR-10 and CelebA Datasets"
