From U-Net to Transformer: A New Approach to Diffusion Models on MNIST and CIFAR-10
When it comes to diffusion models, the U-Net architecture is the go-to choice for the denoising network. I’ve mostly relied on U-Nets myself, but in this experiment, I deliberately removed the U-Net and tried a Transformer-based model, specifically the Transformer2DModel (DiT: Diffusion Transformer).
Experimental Setup
- Datasets: MNIST / CIFAR-10
- Generation model: Transformer2DModel (no U-Net)
- Schedulers: DDPM + DPMSolver++
- Latent-space version: VAE + Transformer
- Memory optimization: AMP (Automatic Mixed Precision) + Gradient Checkpointing (see the sketch after this list)
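To make the setup concrete, here is a minimal training-step sketch of how these pieces fit together in diffusers, using MNIST pixel space as the running example. The layer/head counts, patch size, learning rate, and the class-conditioning via `ada_norm_zero` are illustrative assumptions, not my exact configuration:

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, Transformer2DModel

device = "cuda"

# DiT-style Transformer2DModel operating directly on 28x28 MNIST pixels.
# The layer/head counts and patch size below are placeholders, not tuned values.
model = Transformer2DModel(
    in_channels=1,                 # grayscale input
    out_channels=1,                # predict epsilon with the same shape
    sample_size=28,                # pixel-space resolution
    patch_size=2,                  # 14x14 grid of patch tokens
    num_layers=8,
    num_attention_heads=8,
    attention_head_dim=64,
    norm_type="ada_norm_zero",     # DiT-style adaptive LayerNorm
    num_embeds_ada_norm=10,        # class conditioning on the 10 digits (assumed)
).to(device)
model.enable_gradient_checkpointing()      # trade compute for activation memory

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()       # AMP loss scaling

def train_step(images, labels):
    # images: (B, 1, 28, 28) scaled to [-1, 1]; labels: (B,) digit classes
    noise = torch.randn_like(images)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (images.shape[0],), device=device)
    noisy = noise_scheduler.add_noise(images, noise, t)

    with torch.autocast("cuda", dtype=torch.float16):   # mixed-precision forward
        pred = model(noisy, timestep=t, class_labels=labels,
                     return_dict=False)[0]
        loss = F.mse_loss(pred, noise)                   # standard epsilon MSE

    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```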
For MNIST, I trained the Transformer directly in pixel space.
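At inference time, the idea behind pairing DDPM with DPMSolver++ is to train against the DDPM schedule but sample with the faster solver. A rough sketch, reusing `model` from the snippet above; the step count and settings are illustrative, not necessarily what produced the images described later:

```python
import torch
from diffusers import DPMSolverMultistepScheduler

@torch.no_grad()
def sample_digits(model, labels, num_inference_steps=25, device="cuda"):
    # labels: (B,) digit classes to condition on
    sampler = DPMSolverMultistepScheduler(
        num_train_timesteps=1000,          # must match the training schedule
        algorithm_type="dpmsolver++",
    )
    sampler.set_timesteps(num_inference_steps, device=device)

    x = torch.randn(labels.shape[0], 1, 28, 28, device=device)
    for t in sampler.timesteps:
        # predict noise at this timestep, then take one DPM-Solver++ step
        eps = model(x, timestep=t.expand(labels.shape[0]),
                    class_labels=labels, return_dict=False)[0]
        x = sampler.step(eps, t, x).prev_sample
    return x  # (B, 1, 28, 28), roughly in [-1, 1]
```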
For CIFAR-10, I tried two variants:
- Transformer in pixel space
- Latent diffusion version: VAE compressing to an 8×8 latent space + Transformer (sketched below)
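For the latent variant, the flow is: encode each 32×32 CIFAR-10 image with the VAE into an 8×8 latent, then train the diffusion Transformer on those latents. Below is a rough sketch under assumed settings (a small from-scratch AutoencoderKL with a 4× downsampling factor and 4 latent channels); the channel counts and layer sizes I actually used may differ:

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, Transformer2DModel

device = "cuda"

# Small from-scratch VAE: three blocks give two downsampling steps, 32 -> 8.
vae = AutoencoderKL(
    in_channels=3,
    out_channels=3,
    down_block_types=("DownEncoderBlock2D",) * 3,
    up_block_types=("UpDecoderBlock2D",) * 3,
    block_out_channels=(64, 128, 256),
    latent_channels=4,
).to(device)

# Transformer denoiser on the 8x8x4 latents (sizes here are placeholders).
transformer = Transformer2DModel(
    in_channels=4,                 # matches the VAE latent channels
    out_channels=4,
    sample_size=8,                 # 8x8 latent grid
    patch_size=1,                  # one token per latent pixel
    num_layers=12,
    num_attention_heads=8,
    attention_head_dim=64,
    norm_type="ada_norm_zero",
    num_embeds_ada_norm=10,        # CIFAR-10 classes (conditioning assumed)
).to(device)

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

@torch.no_grad()
def encode_to_latents(images):
    # images: (B, 3, 32, 32) in [-1, 1] -> latents: (B, 4, 8, 8)
    # For a VAE trained from scratch you would derive a scaling factor from the
    # latent statistics; the config default is used here only as a placeholder.
    return vae.encode(images).latent_dist.sample() * vae.config.scaling_factor

def latent_train_step(images, labels, optimizer):
    latents = encode_to_latents(images)
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=device)
    noisy = noise_scheduler.add_noise(latents, noise, t)
    pred = transformer(noisy, timestep=t, class_labels=labels,
                       return_dict=False)[0]
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()
```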
Results
The initial samples were almost pure noise.
Compared to a U-Net, training a diffusion model with only a Transformer seems more challenging; early in training, shapes barely emerged.
On MNIST, noise was somewhat reduced but digits were still not recognizable.
CIFAR-10 required even more training time and careful parameter tuning.
Later, I tried the latent-space version (VAE + Transformer), but the results were still a complete failure.
Attempt at Video Generation
I even attempted video generation, but it was also a complete flop.
https://www.youtube.com/shorts/Kwo5Q3gAKaQ
Conclusion
This was a total failure, so my next goal is simply to get DiT to produce recognizable images first. I also want to try U-Net on larger images for comparison.