
Quantum Computer Simulation: Quantum Fourier Transform with CUDA-Q Using Multi-Node H200 GPUs

Yuichiro Minato

2025/04/01 07:32

High-Speed Simulation of Quantum Fourier Transform Using CUDA-Q with Single and Multi-GPU Nodes

As quantum computing moves closer to practical application, the importance of GPU-accelerated simulation continues to grow. With the introduction of CUDA-Q by NVIDIA, the GPU-based execution environment for quantum algorithms has been significantly enhanced, enabling faster research and development.

In this article, we focus on the Quantum Fourier Transform (QFT), one of the most fundamental and important quantum algorithms, and benchmark its simulation speed with CUDA-Q on single-GPU and multi-GPU (up to 8 GPUs) environments. The GPUs used are the NVIDIA RTX 4090 and the NVIDIA H200.

What is CUDA-Q?

CUDA-Q is a quantum programming and execution framework developed by NVIDIA. It enables high-speed quantum circuit simulation using GPUs by leveraging libraries such as cuStateVec and cuTensorNet included in the cuQuantum SDK. These libraries efficiently handle quantum state vectors and tensor network calculations.

CUDA-Q provides a Python interface, so users can write quantum circuits in Python and execute them on GPUs. The massive parallelism of GPUs keeps simulation efficient even as circuit complexity grows with the number of qubits.

Benchmark Conditions

Algorithm:

  • Quantum Fourier Transform (QFT)
  • Measured simulation time for varying input qubit counts
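
As a rough illustration of this kind of measurement, the sketch below builds the textbook QFT circuit (a Hadamard on each qubit followed by controlled phase rotations) as a CUDA-Q kernel and times one simulation. It is not the benchmark code used for this article. The names `qft_gate_counts` and `time_qft` are hypothetical, and timing a single-shot `cudaq.sample` call is an assumption about how one might measure the run.

```python
import time

import numpy as np


def qft_gate_counts(num_qubits: int) -> tuple[int, int]:
    """Return (Hadamard count, controlled-phase count) for the textbook QFT
    without the final qubit-reversal swaps."""
    return num_qubits, num_qubits * (num_qubits - 1) // 2


def time_qft(num_qubits: int, target: str = "nvidia", option: str = "") -> float:
    """Simulate one QFT on |0...0> and return the wall-clock time in seconds.

    Requires CUDA-Q and a supported GPU. The import is kept local so that
    qft_gate_counts() can be used on machines without CUDA-Q.
    """
    import cudaq

    if option:
        cudaq.set_target(target, option=option)
    else:
        cudaq.set_target(target)

    @cudaq.kernel
    def qft(n: int):
        qubits = cudaq.qvector(n)
        for i in range(n):
            h(qubits[i])
            # Controlled phase rotations from every later qubit onto qubit i.
            for j in range(i + 1, n):
                angle = (2 * np.pi) / (2 ** (j - i + 1))
                cr1(angle, [qubits[j]], qubits[i])

    # Note: the first call also includes kernel compilation time.
    start = time.perf_counter()
    cudaq.sample(qft, num_qubits, shots_count=1)
    return time.perf_counter() - start


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, "qubits ->", qft_gate_counts(n), "(H gates, controlled phases)")
```

The gate count grows only quadratically with the number of qubits, while the state vector being updated doubles with every added qubit, so memory rather than circuit depth sets the practical limit. On a machine with CUDA-Q and a GPU, calling `time_qft(28)` would time one single-GPU run. For a multi-GPU run, one would pass `option="mgpu"` and launch the script under MPI, assuming the MPI environment is configured.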

Hardware Setup:

Environment          GPU Configuration    Notes
RTX 4090 Single      1 GPU                -
RTX 4090 Multi x2    2 GPUs               MPI
H200 Single          1 GPU                -
H200 Multi x8        8 GPUs               MPI

Results: QFT Simulation Time Comparison

The following graph shows the QFT simulation time (in seconds) across different environments:

Note: The Y-axis is in logarithmic scale.

[Figure: QFT simulation time in seconds (log scale) versus number of qubits for each environment]

Observations:

  • RTX 4090 Single performs very efficiently, handling up to 31 qubits.
  • RTX 4090 Multi x2 introduces some overhead but extends the maximum qubit count to 32.
  • H200 Single shows minimal performance degradation as qubit count increases, reaching up to 34 qubits.
  • H200 Multi x8 demonstrates excellent scalability and makes simulation of up to 36 qubits feasible. (A 37-qubit run seemed possible but resulted in an error.)
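
These limits line up with a simple memory estimate. A state vector of n qubits holds 2^n complex amplitudes, so the sketch below computes the largest n whose state vector fits in the combined GPU memory of each environment. It assumes 8 bytes per amplitude (single-precision complex), which the article does not state, and uses the published memory sizes of 24 GB for the RTX 4090 and 141 GB for the H200. The function names are hypothetical, and workspace and communication buffers are ignored, so the result is an upper bound.

```python
GB = 10**9  # vendor memory sizes are quoted in decimal gigabytes


def statevector_bytes(num_qubits: int, bytes_per_amplitude: int = 8) -> int:
    """Memory needed to store 2**num_qubits complex amplitudes."""
    return (1 << num_qubits) * bytes_per_amplitude


def max_qubits(total_memory_bytes: int, bytes_per_amplitude: int = 8) -> int:
    """Largest qubit count whose state vector fits in the given memory."""
    n = 0
    while statevector_bytes(n + 1, bytes_per_amplitude) <= total_memory_bytes:
        n += 1
    return n


# Combined GPU memory per environment (assumed: 24 GB RTX 4090, 141 GB H200).
configs = {
    "RTX 4090 Single": 1 * 24 * GB,
    "RTX 4090 Multi x2": 2 * 24 * GB,
    "H200 Single": 1 * 141 * GB,
    "H200 Multi x8": 8 * 141 * GB,
}

limits = {name: max_qubits(memory) for name, memory in configs.items()}
for name, n in limits.items():
    print(f"{name}: up to {n} qubits by memory alone")
```

Under these assumptions the bound is 31, 32, 34, and 37 qubits respectively. The first three match the observed limits, and the last matches the remark that 37 qubits seemed possible on H200 Multi x8. With 16 bytes per amplitude (double-precision complex), each bound would drop by one qubit.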

Analysis: Differences by GPU Configuration

  • RTX 4090 offers excellent single-GPU performance, ideal for small to medium-scale quantum circuits. With two GPUs, QFT circuits with up to 32 qubits become feasible.
  • H200 excels in memory bandwidth and interconnect performance. For large circuits (30+ qubits), multi-node configurations offer significant performance benefits.
  • CUDA-Q's multi-GPU backend uses MPI for process distribution, so the performance is influenced by MPI setup and GPU interconnect optimization.
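
To see why the interconnect matters, consider the common scheme for distributed state-vector simulation, in which the state is split evenly across a power-of-two number of GPUs. The highest-order ("global") qubits select the GPU and the remaining ("local") qubits index amplitudes within it. The sketch below computes that split. It is a simplified model with hypothetical names and an assumed 8 bytes per amplitude, and the article does not say whether CUDA-Q partitions or schedules work exactly this way.

```python
def partition(num_qubits: int, num_gpus: int, bytes_per_amplitude: int = 8):
    """Return (local_qubits, global_qubits, bytes_per_gpu) for an even split
    of the state vector across a power-of-two number of GPUs."""
    global_qubits = num_gpus.bit_length() - 1
    if num_gpus < 1 or (1 << global_qubits) != num_gpus:
        raise ValueError("this sketch assumes a power-of-two number of GPUs")
    local_qubits = num_qubits - global_qubits
    bytes_per_gpu = (1 << local_qubits) * bytes_per_amplitude
    return local_qubits, global_qubits, bytes_per_gpu


if __name__ == "__main__":
    for n in (36, 37):
        local, glob, size = partition(n, 8)
        print(f"{n} qubits on 8 GPUs: {local} local + {glob} global qubits, "
              f"{size / 10**9:.1f} GB of amplitudes per GPU")
```

In this model, a non-diagonal gate such as a Hadamard acting on a global qubit combines amplitudes stored on different GPUs, so data must cross the interconnect. Diagonal gates such as the controlled phase rotations can be applied in place. The model also gives about 68.7 GB of amplitudes per GPU for 36 qubits on 8 GPUs and about 137.4 GB for 37 qubits, which would leave little headroom on a 141 GB H200. The article does not state the cause of the 37-qubit error.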

Conclusion

By using CUDA-Q with some of the most powerful GPUs available today, it is now feasible to simulate Quantum Fourier Transform circuits with 30 to 36 qubits. This underlines the critical role that GPU-based parallel simulation will play in testing large-scale quantum algorithms with error correction.

Setting up a multi-GPU MPI environment was surprisingly easy using conda install, making it straightforward to get started with distributed CUDA-Q simulations.

For more information, refer to the CUDA-Q project page on PyPI, which covers multi-GPU setup using conda:

👉 https://pypi.org/project/cudaq/

© 2025, blueqat Inc. All rights reserved