From U-Net to Transformer: A New Approach to Diffusion Models on MNIST and CIFAR-10

Yuichiro Minato

2025/08/11 05:34

In diffusion models, the U-Net architecture is the go-to choice for the denoising network. I've mostly relied on U-Nets myself, but in this experiment I deliberately removed the U-Net and used a Transformer-based denoiser instead, specifically Transformer2DModel (DiT: Diffusion Transformer).

Experimental Setup

  • Datasets: MNIST / CIFAR-10
  • Generation Model: Transformer2DModel (no U-Net)
  • Schedulers: DDPM + DPMSolver++
  • Latent-space version: VAE + Transformer
  • Memory optimization: AMP (Automatic Mixed Precision) + Gradient Checkpointing
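To make the DDPM part of this setup concrete, here is a minimal sketch of the forward (noising) process that the denoiser is trained to undo. It uses only the Python standard library. The linear schedule, its endpoints (1e-4 to 0.02), and the 1000 steps are assumptions taken from common DDPM defaults, not settings confirmed in this post, and the helper names are hypothetical.

```python
import math
import random

def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule. The defaults are assumed, not taken from the post."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta_s) for s up to t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0); also return the noise that was mixed in."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise

alpha_bars = cumulative_alphas(linear_betas())
rng = random.Random(0)
x0 = [0.5, -1.0, 1.0, 0.0]  # toy "image": four pixels scaled to [-1, 1]
xt, noise = add_noise(x0, t=999, alpha_bars=alpha_bars, rng=rng)
```

Under this assumed schedule, `alpha_bars[-1]` is close to zero, so the sample at the last timestep is essentially pure Gaussian noise. The network (U-Net or Transformer) is typically trained to predict `noise` from `xt` and `t`.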

For MNIST, I trained the Transformer directly in pixel space.
For CIFAR-10, I tried two variants:

  1. Transformer in pixel space
  2. Latent diffusion version: VAE compressing to an 8×8 latent space + Transformer
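One way to compare these variants is by the number of tokens the Transformer has to attend over. The sketch below counts patch tokens for each input size. The patch sizes are hypothetical, since the post does not state them, and `num_tokens` is an illustrative helper rather than part of any library.

```python
def num_tokens(height, width, patch_size):
    """Number of patch tokens produced by a ViT-style patch embedding."""
    if height % patch_size or width % patch_size:
        raise ValueError("patch_size must divide both height and width")
    return (height // patch_size) * (width // patch_size)

# Patch sizes below are hypothetical; the post does not state them.
configs = {
    "MNIST pixel space (28x28, patch 4)": (28, 28, 4),
    "CIFAR-10 pixel space (32x32, patch 4)": (32, 32, 4),
    "CIFAR-10 latent space (8x8, patch 1)": (8, 8, 1),
}

for name, (h, w, p) in configs.items():
    n = num_tokens(h, w, p)
    # Self-attention compares every token with every other token.
    print(f"{name}: {n} tokens, {n * n} attention pairs per head")
```

With these assumed patch sizes, the 8×8 latent grid yields the same 64 tokens as 32×32 pixels at patch size 4, so the latent variant changes what each token represents more than how many tokens there are.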

Results

The initial samples were almost pure noise.

Compared to U-Net, training a diffusion model with only a Transformer seems more challenging: early in training, shapes barely emerged.
On MNIST, the noise decreased somewhat, but the digits were still not recognizable.

CIFAR-10 required even more training time and careful parameter tuning.

Later, I tried the latent-space version (VAE + Transformer), but the results were still a complete failure.

Attempt at Video Generation

I even attempted video generation, but that was also a complete flop.

https://www.youtube.com/shorts/Kwo5Q3gAKaQ

Conclusion

This was a total failure, so my next goal is simply to get DiT to produce recognizable images first. I also want to try U-Net on larger images for comparison.

© 2025, blueqat Inc. All rights reserved