🚀 GPU Performance Benchmark with CUDA-Q: Comparing 5090 / 4090 / 3090
In the world of quantum algorithm acceleration, GPU performance plays a critical role.
In this article, we benchmark three generations of GPUs (RTX 5090, 4090, and 3090) using NVIDIA’s quantum development platform, CUDA-Q, and compare execution times across qubit counts.
⚙️ Experiment Overview
- Library: CUDA-Q (GHZ state simulation using `cudaq.sample()`)
- Comparison targets:
  - GPUs: RTX 5090 / 4090 / 3090
  - CPUs: Corresponding CPUs for each GPU (3090: unspecified, 4090: EPYC 7B12, 5090: Ryzen 9 9900X)
- Qubit ranges:
  - 5–25 qubits: CPU and GPU performance comparison
  - 26–31 qubits: GPU-only evaluation (CPU execution becomes impractical)
- Measurement: Execution time (in seconds) for a single call to `cudaq.sample()`
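The setup above can be sketched roughly as follows. This is a minimal illustration, not the script used for the measurements: the helper names `build_ghz`, `time_call`, and `run_benchmark`, the use of `time.perf_counter()`, the shot count, and the target names `nvidia` (GPU) and `qpp-cpu` (CPU) are all assumptions, since the article does not state them.

```python
import time


def build_ghz(n_qubits):
    """Build an n-qubit GHZ circuit with CUDA-Q's kernel builder."""
    import cudaq  # imported lazily so time_call() is usable without CUDA-Q

    kernel = cudaq.make_kernel()
    qubits = kernel.qalloc(n_qubits)
    kernel.h(qubits[0])                      # superposition on the first qubit
    for i in range(n_qubits - 1):
        kernel.cx(qubits[i], qubits[i + 1])  # chain of CNOTs entangles the rest
    kernel.mz(qubits)                        # measure every qubit
    return kernel


def time_call(fn, *args, **kwargs):
    """Return (elapsed_seconds, result) for a single call to fn."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return time.perf_counter() - start, result


def run_benchmark(target, qubit_counts, shots=1000):
    """Time one cudaq.sample() call per qubit count on the given target.

    `target` is assumed to be "nvidia" for the GPU simulator or "qpp-cpu"
    for the CPU simulator; `shots` is a hypothetical default.
    """
    import cudaq

    cudaq.set_target(target)
    timings = {}
    for n in qubit_counts:
        kernel = build_ghz(n)
        elapsed, _ = time_call(cudaq.sample, kernel, shots_count=shots)
        timings[n] = elapsed
    return timings


# Example (requires CUDA-Q and a supported GPU):
#   cpu_times = run_benchmark("qpp-cpu", range(5, 26))
#   gpu_times = run_benchmark("nvidia", range(5, 32))
```

Because each timing covers a single call, the first measurement on a target may include one-time setup cost; adding an untimed warm-up call before the loop is one way to separate that out.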
📈 Execution Time Graph
The left graph shows CPU vs GPU performance for 5–25 qubits, while the right shows GPU-only execution for 26–31 qubits.
🔍 Key Observations
✅ Generational Differences in GPUs Are Clear
- RTX 5090 was the fastest card at every qubit count shown in the results table.
  - Achieved sub-millisecond execution for small circuits (0.0006s at 10 qubits).
  - Even at 25 qubits, execution took only about 0.004 seconds, roughly 6x faster than the 4090 (0.0243s) and 8x faster than the 3090 (0.0347s).
- RTX 3090 was stable but slower overall, with the gap becoming significant beyond 20 qubits.
✅ GPUs Remain Effective Beyond 25 Qubits
- CPU execution times increase dramatically as qubits are added (up to 77 seconds), while all three GPUs completed even the 31-qubit circuit in under a second (0.23–0.84s).
- CUDA-Q’s backend efficiently maps quantum circuits to GPU architectures for fast simulation.
📊 Highlighted Results
Qubits | 5090 GPU | 4090 GPU | 3090 GPU |
---|---|---|---|
10 | 0.0006s | 0.0911s | 0.0030s |
20 | 0.0014s | 0.0043s | 0.0032s |
25 | 0.0042s | 0.0243s | 0.0347s |
31 | 0.2328s | 0.6469s | 0.8427s |
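Note: All values are in seconds; lower is faster. The 4090's 10-qubit time (0.0911s) is higher than its 20- and 25-qubit times, which may reflect one-time initialization overhead captured by the single-call measurement rather than steady-state performance.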
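To make the generational gap concrete, the ratios can be computed directly from the table above. The snippet below is a small sketch; the function name `speedup_vs` is just an illustrative choice, and the numbers are copied from the table rather than re-measured.

```python
# Execution times in seconds, copied from the highlighted results table.
timings = {
    10: {"5090": 0.0006, "4090": 0.0911, "3090": 0.0030},
    20: {"5090": 0.0014, "4090": 0.0043, "3090": 0.0032},
    25: {"5090": 0.0042, "4090": 0.0243, "3090": 0.0347},
    31: {"5090": 0.2328, "4090": 0.6469, "3090": 0.8427},
}


def speedup_vs(baseline, row):
    """Return how many times faster `baseline` is than each other GPU."""
    return {gpu: t / row[baseline] for gpu, t in row.items() if gpu != baseline}


for qubits, row in timings.items():
    ratios = speedup_vs("5090", row)
    print(qubits, {gpu: round(r, 1) for gpu, r in ratios.items()})
```

At 25 qubits this gives about 5.8x over the 4090 and 8.3x over the 3090; at 31 qubits the gap narrows to about 2.8x and 3.6x.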
🧠 Summary
Category | Conclusion |
---|---|
Fastest GPU | RTX 5090 (fastest at every qubit count tested) |
Previous Gen | 3090 is still usable, but approaches its limits beyond 25 qubits |
CUDA-Q Strength | Easy to benchmark across GPUs with simple backend switching |
Practicality | Even circuits beyond 25 qubits can be executed in under a second on a GPU, making GPU simulation viable for research and industrial use |
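Although the 5090 has 32GB of VRAM, the maximum size for state vector simulations remained 31 qubits, the same as on the 4090 and 3090. This is consistent with how state vector memory scales: each additional qubit doubles the number of amplitudes, so assuming 8-byte single-precision complex amplitudes, a 31-qubit state needs 16 GiB and a 32-qubit state would need 32 GiB, leaving no headroom on a 32GB card.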
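The memory arithmetic behind this limit can be checked with a short calculation. The sketch below rests on several assumptions that the article does not state: amplitudes are single-precision complex (8 bytes each), "GB" is treated as GiB, and a hypothetical 1 GiB is reserved for the runtime and workspace. The function names are illustrative only.

```python
GIB = 2**30


def state_vector_bytes(n_qubits, bytes_per_amplitude=8):
    """Memory for a full state vector: 2**n amplitudes of the given size."""
    return (2**n_qubits) * bytes_per_amplitude


def max_qubits(vram_bytes, bytes_per_amplitude=8, reserve_bytes=1 * GIB):
    """Largest n whose state vector fits in VRAM after a reserved margin.

    reserve_bytes is an assumed allowance for the runtime, not a measured value.
    """
    usable = vram_bytes - reserve_bytes
    if usable < bytes_per_amplitude:
        raise ValueError("not enough memory for even a single amplitude")
    # Largest n with 2**n <= usable // bytes_per_amplitude.
    return (usable // bytes_per_amplitude).bit_length() - 1


for name, vram_gib in (("RTX 5090", 32), ("RTX 4090", 24), ("RTX 3090", 24)):
    n = max_qubits(vram_gib * GIB)
    print(f"{name}: {n} qubits ({state_vector_bytes(n) // GIB} GiB state vector)")
```

Under these assumptions all three cards top out at 31 qubits: the 16 GiB state fits in both 24 GiB and 32 GiB, while the next step up needs the full 32 GiB. With double-precision amplitudes (16 bytes) the same calculation gives 30 qubits for every card.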
🧪 Future Directions
- Evaluating performance on even larger quantum circuits (35+ qubits)
- Testing CUDA-Q with quantum machine learning (QML) algorithms
- Benchmarking against other parts of the NVIDIA quantum stack, like cuQuantum
✅ System Environment (Reference)
Item | Detail |
---|---|
OS | Ubuntu 22.04 |
CUDA | 12.1 |
CUDA-Q | Latest version |
GPU VRAM | 5090: 32GB / 4090: 24GB / 3090: 24GB |
Python | 3.10 |
Measurement Method | Execution time of `cudaq.sample(kernel)` |