From U-Net to Transformer: A New Approach to Diffusion Models on MNIST and CIFAR-10
When it comes to diffusion models, the U-Net architecture is the go-to choice for the denoising network. I’ve mostly relied on U-Nets myself, but in this experiment, I deliberately removed the U-Net and tried a Transformer-based model, specifically the Transformer2DModel (DiT: Diffusion Transformer).
Experimental Setup
- Datasets: MNIST / CIFAR-10
- Generation model: Transformer2DModel (no U-Net)
- Schedulers: DDPM + DPMSolver++
- Latent-space version: VAE + Transformer
- Memory optimization: AMP (Automatic Mixed Precision) + Gradient Checkpointing (see the sketch after this list)
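To make the setup concrete, here is a minimal training-step sketch of how these pieces fit together in diffusers, using MNIST pixel space as the running example. The layer/head counts, patch size, learning rate, and the class-conditioning via `ada_norm_zero` are illustrative assumptions, not my exact configuration:

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, Transformer2DModel

device = "cuda"

# DiT-style Transformer2DModel operating directly on 28x28 MNIST pixels.
# The layer/head counts and patch size below are placeholders, not tuned values.
model = Transformer2DModel(
    in_channels=1,                 # grayscale input
    out_channels=1,                # predict epsilon with the same shape
    sample_size=28,                # pixel-space resolution
    patch_size=2,                  # 14x14 grid of patch tokens
    num_layers=8,
    num_attention_heads=8,
    attention_head_dim=64,
    norm_type="ada_norm_zero",     # DiT-style adaptive LayerNorm
    num_embeds_ada_norm=10,        # class conditioning on the 10 digits (assumed)
).to(device)
model.enable_gradient_checkpointing()      # trade compute for activation memory

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()       # AMP loss scaling

def train_step(images, labels):
    # images: (B, 1, 28, 28) scaled to [-1, 1]; labels: (B,) digit classes
    noise = torch.randn_like(images)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (images.shape[0],), device=device)
    noisy = noise_scheduler.add_noise(images, noise, t)

    with torch.autocast("cuda", dtype=torch.float16):   # mixed-precision forward
        pred = model(noisy, timestep=t, class_labels=labels,
                     return_dict=False)[0]
        loss = F.mse_loss(pred, noise)                   # standard epsilon MSE

    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```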
For MNIST, I trained the Transformer directly in pixel space.
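At inference time, the idea behind pairing DDPM with DPMSolver++ is to train against the DDPM schedule but sample with the faster solver. A rough sketch, reusing `model` from the snippet above; the step count and settings are illustrative, not necessarily what produced the images described later:

```python
import torch
from diffusers import DPMSolverMultistepScheduler

@torch.no_grad()
def sample_digits(model, labels, num_inference_steps=25, device="cuda"):
    # labels: (B,) digit classes to condition on
    sampler = DPMSolverMultistepScheduler(
        num_train_timesteps=1000,          # must match the training schedule
        algorithm_type="dpmsolver++",
    )
    sampler.set_timesteps(num_inference_steps, device=device)

    x = torch.randn(labels.shape[0], 1, 28, 28, device=device)
    for t in sampler.timesteps:
        # predict noise at this timestep, then take one DPM-Solver++ step
        eps = model(x, timestep=t.expand(labels.shape[0]),
                    class_labels=labels, return_dict=False)[0]
        x = sampler.step(eps, t, x).prev_sample
    return x  # (B, 1, 28, 28), roughly in [-1, 1]
```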
For CIFAR-10, I tried two variants:
- Transformer in pixel space
- Latent diffusion version: VAE compressing to an 8×8 latent space + Transformer (sketched below)
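For the latent variant, the flow is: encode each 32×32 CIFAR-10 image with the VAE into an 8×8 latent, then train the diffusion Transformer on those latents. Below is a rough sketch under assumed settings (a small from-scratch AutoencoderKL with a 4× downsampling factor and 4 latent channels); the channel counts and layer sizes I actually used may differ:

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, Transformer2DModel

device = "cuda"

# Small from-scratch VAE: three blocks give two downsampling steps, 32 -> 8.
vae = AutoencoderKL(
    in_channels=3,
    out_channels=3,
    down_block_types=("DownEncoderBlock2D",) * 3,
    up_block_types=("UpDecoderBlock2D",) * 3,
    block_out_channels=(64, 128, 256),
    latent_channels=4,
).to(device)

# Transformer denoiser on the 8x8x4 latents (sizes here are placeholders).
transformer = Transformer2DModel(
    in_channels=4,                 # matches the VAE latent channels
    out_channels=4,
    sample_size=8,                 # 8x8 latent grid
    patch_size=1,                  # one token per latent pixel
    num_layers=12,
    num_attention_heads=8,
    attention_head_dim=64,
    norm_type="ada_norm_zero",
    num_embeds_ada_norm=10,        # CIFAR-10 classes (conditioning assumed)
).to(device)

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

@torch.no_grad()
def encode_to_latents(images):
    # images: (B, 3, 32, 32) in [-1, 1] -> latents: (B, 4, 8, 8)
    # For a VAE trained from scratch you would derive a scaling factor from the
    # latent statistics; the config default is used here only as a placeholder.
    return vae.encode(images).latent_dist.sample() * vae.config.scaling_factor

def latent_train_step(images, labels, optimizer):
    latents = encode_to_latents(images)
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=device)
    noisy = noise_scheduler.add_noise(latents, noise, t)
    pred = transformer(noisy, timestep=t, class_labels=labels,
                       return_dict=False)[0]
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()
```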
Results
The initial samples were almost pure noise.
Compared to a U-Net, training a diffusion model with only a Transformer seems more challenging; early in training, shapes barely emerged.
On MNIST, noise was somewhat reduced but digits were still not recognizable.
CIFAR-10 required even more training time and careful parameter tuning.
Later, I tried the latent-space version (VAE + Transformer), but the results were still a complete failure.
Attempt at Video Generation
I even attempted video generation, but it was also a complete flop.
https://www.youtube.com/shorts/Kwo5Q3gAKaQ
Conclusion
This was a total failure, so my next goal is simply to get DiT to produce recognizable images first. I also want to try U-Net on larger images for comparison.