🚀 GPU Performance Benchmark with CUDA-Q: Comparing 5090 / 4090 / 3090

Yuichiro Minato

2025/03/28 14:58

In the world of quantum algorithm acceleration, GPU performance plays a critical role.
In this article, we benchmark three generations of GPUs (RTX 5090, 4090, and 3090) using NVIDIA's quantum development platform, CUDA-Q, and compare execution times across different qubit counts.

⚙️ Experiment Overview

  • Library: CUDA-Q (GHZ state simulation using cudaq.sample())
  • Comparison targets:
    • GPUs: RTX 5090 / 4090 / 3090
    • CPUs: Corresponding CPUs for each GPU (3090: unspecified, 4090: EPYC 7B12, 5090: Ryzen 9 9900X)
  • Qubit ranges:
    • 5–25 qubits: CPU and GPU performance comparison
    • 26–31 qubits: GPU-only evaluation (CPU execution becomes impractical)
  • Measurement: Execution time (in seconds) for a single call to cudaq.sample()
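
The measurement described above can be sketched roughly as follows. This is a minimal illustration, not the script used for the article: the helper names `time_single_call` and `benchmark_ghz` are hypothetical, and the target names `"nvidia"` (GPU) and `"qpp-cpu"` (CPU) are assumptions about which CUDA-Q backends were selected. The import is guarded so the timing helper can be tried even without CUDA-Q installed.

```python
import time

try:
    import cudaq  # NVIDIA CUDA-Q Python package
except ImportError:
    cudaq = None  # lets the timing helper below run without CUDA-Q installed


def time_single_call(fn, *args, **kwargs):
    """Return (elapsed_seconds, result) for one call to fn."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return time.perf_counter() - start, result


if cudaq is not None:

    @cudaq.kernel
    def ghz(qubit_count: int):
        # Prepare (|00...0> + |11...1>) / sqrt(2), then measure every qubit.
        qubits = cudaq.qvector(qubit_count)
        h(qubits[0])
        for i in range(1, qubit_count):
            x.ctrl(qubits[0], qubits[i])
        mz(qubits)

    def benchmark_ghz(target, qubit_counts):
        """Time one cudaq.sample() call per qubit count on the given target."""
        # Assumed target names: "nvidia" for GPU, "qpp-cpu" for CPU.
        cudaq.set_target(target)
        timings = {}
        for n in qubit_counts:
            elapsed, _ = time_single_call(cudaq.sample, ghz, n)
            timings[n] = elapsed
        return timings
```

With CUDA-Q and a supported GPU available, `benchmark_ghz("nvidia", range(5, 32))` and `benchmark_ghz("qpp-cpu", range(5, 26))` would mirror the two qubit ranges listed above. The first call on a fresh process may include compilation and initialization time, so a warm-up call before timing is worth considering.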

📈 Execution Time Graph

The left graph shows CPU vs GPU performance for 5–25 qubits, while the right shows GPU-only execution for 26–31 qubits.

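[Figure: execution time versus qubit count. Left: CPU vs GPU, 5–25 qubits. Right: GPU only, 26–31 qubits.]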

🔍 Key Observations

✅ Generational Differences in GPUs Are Clear

  • RTX 5090 delivered the fastest time at every highlighted qubit count.
    • Achieved sub-millisecond execution for small circuits (0.0006s at 10 qubits).
    • Even at 25 qubits, execution took only about 0.004 seconds, roughly 6x faster than the 4090 (0.0243s) and 8x faster than the 3090 (0.0347s).
  • RTX 3090 was stable but slower overall, with performance differences becoming significant after 20 qubits.

✅ GPUs Remain Effective Beyond 25 Qubits

  • CPU execution times increased dramatically with qubit count (up to 77 seconds), while GPUs completed even 30-qubit circuits in under a second.
  • CUDA-Q’s backend efficiently maps quantum circuits to GPU architectures for fast simulation.

📊 Highlighted Results

| Qubits | 5090 GPU | 4090 GPU | 3090 GPU |
|---|---|---|---|
| 10 | 0.0006s | 0.0911s | 0.0030s |
| 20 | 0.0014s | 0.0043s | 0.0032s |
| 25 | 0.0042s | 0.0243s | 0.0347s |
| 31 | 0.2328s | 0.6469s | 0.8427s |

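Note: All times are in seconds. Lower is faster. The 4090's 10-qubit time (0.0911s) is far above its own 20-qubit time (0.0043s), so it may include one-time overhead rather than steady-state performance.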
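
To make the relative differences easier to read, the ratios implied by the highlighted results can be computed directly. The snippet below is only a convenience sketch: the dictionary copies the four rows listed above, and the function name `speedup_over` is made up for illustration.

```python
# Execution times in seconds, copied from the highlighted results.
results = {
    10: {"5090": 0.0006, "4090": 0.0911, "3090": 0.0030},
    20: {"5090": 0.0014, "4090": 0.0043, "3090": 0.0032},
    25: {"5090": 0.0042, "4090": 0.0243, "3090": 0.0347},
    31: {"5090": 0.2328, "4090": 0.6469, "3090": 0.8427},
}


def speedup_over(results, baseline, reference="5090"):
    """Return {qubits: baseline_time / reference_time} for each row."""
    return {n: row[baseline] / row[reference] for n, row in results.items()}


for gpu in ("4090", "3090"):
    for qubits, ratio in speedup_over(results, gpu).items():
        print(f"{qubits} qubits: 5090 is {ratio:.1f}x faster than {gpu}")
```

At 25 qubits this gives about 5.8x over the 4090 and 8.3x over the 3090. At 31 qubits the gap narrows to about 2.8x and 3.6x.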

🧠 Summary

| Category | Conclusion |
|---|---|
| Fastest GPU | RTX 5090 (fastest across all qubit ranges) |
| Previous Gen | 3090 is still usable, but approaches its limits beyond 25 qubits |
| CUDA-Q Strength | Easy to benchmark across GPUs with simple backend switching |
| Practicality | Even circuits beyond 25 qubits can be executed in under a second, making it viable for research and industrial use |

Although the 5090 has 32GB of VRAM, the maximum size for state vector simulations remained 31 qubits, the same as on the 4090 and 3090 (24GB each).
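
A back-of-the-envelope calculation suggests why the extra VRAM does not add a qubit. The sketch below assumes each amplitude is stored as a single-precision complex number (8 bytes), treats the VRAM figures as GiB, and ignores any workspace the simulator needs beyond the state vector itself. These are assumptions rather than details confirmed by the benchmark, and the function names are illustrative.

```python
GIB = 2 ** 30


def statevector_bytes(num_qubits, bytes_per_amplitude=8):
    """Memory for 2**n amplitudes; 8 bytes assumes single-precision complex."""
    return (2 ** num_qubits) * bytes_per_amplitude


def max_qubits(vram_bytes, bytes_per_amplitude=8):
    """Largest n whose state vector is strictly smaller than the VRAM.

    The strict comparison leaves at least some room for other allocations.
    """
    n = 0
    while statevector_bytes(n + 1, bytes_per_amplitude) < vram_bytes:
        n += 1
    return n


for name, vram_gib in (("5090", 32), ("4090", 24), ("3090", 24)):
    print(name, max_qubits(vram_gib * GIB))
```

Under these assumptions, a 31-qubit state vector needs 16 GiB and a 32-qubit one needs 32 GiB, so both the 24GB and 32GB cards stop at 31 qubits. Each added qubit doubles the memory, so going from 24GB to 32GB is not enough for the next step.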

🧪 Future Directions

  • Evaluating performance on even larger quantum circuits (35+ qubits)
  • Testing CUDA-Q with quantum machine learning (QML) algorithms
  • Benchmarking against other parts of the NVIDIA quantum stack, like cuQuantum

✅ System Environment (Reference)

| Item | Detail |
|---|---|
| OS | Ubuntu 22.04 |
| CUDA | 12.1 |
| CUDA-Q | Latest version |
| GPU VRAM | 5090: 32GB / 4090: 24GB / 3090: 24GB |
| Python | 3.10 |
| Measurement Method | Execution time of cudaq.sample(kernel) |

© 2025, blueqat Inc. All rights reserved