🚀 GPU Performance Benchmark with CUDA-Q: Comparing 5090 / 4090 / 3090

GPU performance plays a critical role in accelerating quantum algorithm simulation.
In this article, we benchmark three generations of GPUs (RTX 5090, 4090, and 3090) using NVIDIA’s quantum development platform, CUDA-Q, and compare execution times across qubit counts from 5 to 31.

⚙️ Experiment Overview

Library: CUDA-Q (GHZ state simulation using cudaq.sample())
Comparison targets:
- GPUs: RTX 5090 / 4090 / 3090
- CPUs: Corresponding CPUs for each GPU (3090: unspecified, 4090: EPYC 7B12, 5090: Ryzen 9 9900X)
Qubit ranges:
- 5–25 qubits: CPU and GPU performance comparison
- 26–31 qubits: GPU-only evaluation (CPU execution becomes impractical)
Measurement: Execution time (in seconds) for a single call to cudaq.sample()
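
As a rough illustration of the setup described above, here is a minimal sketch of a GHZ benchmark. It assumes the CUDA-Q Python package is installed and uses the kernel-builder API with the `qpp-cpu` and `nvidia` targets. The helper names (`ghz_gates`, `build_ghz_kernel`, `time_single_sample`) are made up for this example, and the exact script used for the measurements in this article may differ.

```python
import time

try:
    import cudaq  # assumes the CUDA-Q Python package is installed
except ImportError:
    cudaq = None  # the pure-Python helper below still works without CUDA-Q


def ghz_gates(n_qubits):
    """Gate list for an n-qubit GHZ circuit: H on qubit 0, then a CNOT chain."""
    return [("h", 0)] + [("cx", i, i + 1) for i in range(n_qubits - 1)]


def build_ghz_kernel(n_qubits):
    """Build the GHZ circuit with the CUDA-Q kernel builder and measure all qubits."""
    kernel = cudaq.make_kernel()
    qubits = kernel.qalloc(n_qubits)
    for gate in ghz_gates(n_qubits):
        if gate[0] == "h":
            kernel.h(qubits[gate[1]])
        else:
            kernel.cx(qubits[gate[1]], qubits[gate[2]])
    kernel.mz(qubits)
    return kernel


def time_single_sample(n_qubits, target):
    """Time one cudaq.sample() call on the given target, in seconds."""
    cudaq.set_target(target)
    kernel = build_ghz_kernel(n_qubits)
    start = time.perf_counter()
    cudaq.sample(kernel)
    return time.perf_counter() - start


if __name__ == "__main__" and cudaq is not None:
    # Target names are assumptions: "qpp-cpu" for the CPU, "nvidia" for the GPU.
    for target in ("qpp-cpu", "nvidia"):
        for n in (5, 10, 15, 20):
            elapsed = time_single_sample(n, target)
            print(f"{target:8s} {n:2d} qubits: {elapsed:.4f} s")
```

Switching between CPU and GPU is a single `cudaq.set_target()` call here, so the same kernel-building code runs unchanged on both.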

📈 Execution Time Graph

The left graph shows CPU vs GPU performance for 5–25 qubits, while the right shows GPU-only execution for 26–31 qubits.

🔍 Key Observations

✅ Generational Differences in GPUs Are Clear

RTX 5090 delivered consistently outstanding performance.
- Achieved sub-millisecond execution (0.0006s) for circuits under 10 qubits.
- Even at 25 qubits, execution took only about 0.004 seconds, roughly 6x faster than the 4090 (0.0243s) and 8x faster than the 3090 (0.0347s).
RTX 3090 was stable but slower than the 5090 throughout, with the absolute gap growing sharply after 20 qubits.

✅ GPUs Remain Effective Beyond 25 Qubits

CPU execution times increase dramatically with qubit count (up to 77 seconds), while GPUs completed even 30-qubit circuits in under a second.
CUDA-Q’s backend efficiently maps quantum circuits to GPU architectures for fast simulation.

📊 Highlighted Results

Qubits	5090 GPU	4090 GPU	3090 GPU
10	0.0006s	0.0911s	0.0030s
20	0.0014s	0.0043s	0.0032s
25	0.0042s	0.0243s	0.0347s
31	0.2328s	0.6469s	0.8427s

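Note: All times are in seconds. Lower is faster. The 4090 figure at 10 qubits (0.0911s) is higher than its 20-qubit figure, so it may reflect one-time startup overhead rather than steady-state performance.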
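
To make the gaps easier to read, the short sketch below turns the table into ratios relative to the 5090. The numbers are copied from the table above; the function name `slowdown_vs_5090` is just an illustrative choice.

```python
# Execution times in seconds, copied from the results table.
times = {
    10: {"5090": 0.0006, "4090": 0.0911, "3090": 0.0030},
    20: {"5090": 0.0014, "4090": 0.0043, "3090": 0.0032},
    25: {"5090": 0.0042, "4090": 0.0243, "3090": 0.0347},
    31: {"5090": 0.2328, "4090": 0.6469, "3090": 0.8427},
}


def slowdown_vs_5090(row):
    """How many times longer each older GPU took compared with the 5090."""
    return {gpu: t / row["5090"] for gpu, t in row.items() if gpu != "5090"}


for n_qubits, row in times.items():
    ratios = slowdown_vs_5090(row)
    summary = ", ".join(f"{gpu}: {r:.1f}x" for gpu, r in ratios.items())
    print(f"{n_qubits} qubits -> {summary}")
```

Apart from the 10-qubit value for the 4090, the older cards take roughly 2x to 8x as long as the 5090 at these sizes.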

🧠 Summary

Category	Conclusion
Fastest GPU	RTX 5090 (fastest at every qubit count in the results table)
Previous Gen	3090 is still usable, but approaches its limits beyond 25 qubits
CUDA-Q Strength	Easy to benchmark across GPUs with simple backend switching
Practicality	Even circuits beyond 25 qubits run in under a second on a GPU, making GPU simulation viable for research and industrial use

Although the 5090 has 32GB of VRAM, the largest state vector simulation it could run was still 31 qubits, the same as the 4090 and 3090.
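
A back-of-the-envelope calculation suggests why the extra VRAM does not buy another qubit. The sketch below assumes single-precision complex amplitudes (8 bytes each); the precision actually used in these runs is not stated here, so treat that as an assumption. A state vector holds 2^n amplitudes, so each added qubit doubles the memory needed.

```python
GIB = 2 ** 30


def statevector_bytes(n_qubits, bytes_per_amplitude=8):
    """Memory for a full state vector: 2^n amplitudes.

    bytes_per_amplitude=8 assumes single-precision complex numbers
    (an assumption; double precision would be 16 bytes).
    """
    return (2 ** n_qubits) * bytes_per_amplitude


for n in (30, 31, 32):
    print(f"{n} qubits: {statevector_bytes(n) / GIB:.0f} GiB")
```

Under that assumption, 31 qubits need 16 GiB, which fits on all three cards, while 32 qubits need 32 GiB for the state vector alone. That leaves no room for anything else on a 32GB card, which is consistent with all three GPUs topping out at 31 qubits.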

🧪 Future Directions

Evaluating performance on even larger quantum circuits (35+ qubits)
Testing CUDA-Q with quantum machine learning (QML) algorithms
Benchmarking against other parts of the NVIDIA quantum stack, like cuQuantum

✅ System Environment (Reference)

Item	Detail
OS	Ubuntu 22.04
CUDA	12.1
CUDA-Q	Latest version
GPU VRAM	5090: 32GB / 4090: 24GB / 3090: 24GB
Python	3.10
Measurement Method	Execution time of `cudaq.sample(kernel)`
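
Because the measurement is a single call, any one-time setup cost is included in the reported time. The sketch below is a hypothetical helper (not the code used for this article) that times the first call separately from repeated calls, which is one way to check how much of a single-call figure is startup overhead. The lambda is a stand-in workload; in practice it would wrap `cudaq.sample(kernel)`.

```python
import statistics
import time


def time_calls(fn, repeats=5):
    """Return (first_call_seconds, median_seconds_of_later_calls)."""
    start = time.perf_counter()
    fn()
    first = time.perf_counter() - start

    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        later.append(time.perf_counter() - start)
    return first, statistics.median(later)


# Stand-in workload for demonstration only.
first, steady = time_calls(lambda: sum(range(100_000)))
print(f"first call: {first:.6f} s, median of later calls: {steady:.6f} s")
```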

Yuichiro Minato
