
cuQuantum Appliance / cuTensorNet on multi-GPU benchmarks

Yuichiro Minato

2024/02/08 07:33

Hello. Among GPU-based quantum computing tools, cuQuantum is somewhat unique: to use multiple GPUs or multiple machines, you need to install a package called cuQuantum Appliance. I usually focus on cuStateVec, but after receiving some questions I would like to cover cuTensorNet this time.

Multi-GPU / Multi-node version of cuQuantum:

https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuquantum-appliance

cuStateVec is the traditional state vector simulator and is considered the standard. cuTensorNet, on the other hand, is a simulator that is still under active development.

cuTensorNet is a simulator implemented using a method called tensor networks. Quantum circuits from Google Cirq, IBM Qiskit, and others are first converted into the tensor network format, and then fed into cuTensorNet for computation.
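As a toy illustration of that conversion (plain NumPy here, not cuTensorNet's actual machinery): each gate becomes a tensor, each qubit wire an index, and the whole circuit a single einsum contraction. The sketch below builds a Bell state from H and CNOT this way.

```python
import numpy as np

# One-qubit |0> input states as rank-1 tensors
q0 = np.array([1.0, 0.0])
q1 = np.array([1.0, 0.0])

# Gates as tensors: H is rank-2, CNOT is reshaped to rank-4
# with index layout C[out0, out1, in0, in1]
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float).reshape(2, 2, 2, 2)

# The whole circuit is one contraction: H on qubit 0, then CNOT
state = np.einsum('a,b,ia,jkib->jk', q0, q1, H, CNOT)

print(state.reshape(4))  # Bell state (|00> + |11>)/sqrt(2)
```

CircuitToEinsum does the analogous translation for a real Cirq or Qiskit circuit, emitting the einsum expression and the operand tensors automatically.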

While cuStateVec is used by calling it from within Qiskit or Cirq, cuTensorNet typically involves converting the quantum circuit first and then running the computation on the converted form.

import cirq
from cirq.testing import random_circuit
import cupy as cp
import numpy as np
from cuquantum import contract, CircuitToEinsum

# Build a random circuit over a fixed gate set
num_qubits = 10
n_moments = 6
op_density = 0.9
gate_domain = {cirq.H: 1,
               cirq.S: 1,
               cirq.T: 1,
               cirq.CNOT: 2,
               cirq.CZ: 2}
circuit = random_circuit(num_qubits, n_moments, op_density=op_density,
                         gate_domain=gate_domain, random_state=6)
print(circuit)

# Convert the circuit to an einsum expression plus operand tensors,
# then contract them to obtain the full state vector
myconverter = CircuitToEinsum(circuit, dtype='complex128', backend=cp)
expression, operands = myconverter.state_vector()
sv = contract(expression, *operands)
sv = sv.reshape(1, 2**num_qubits)

I've fed in a random circuit. The conversion step is called CircuitToEinsum. Once the quantum circuit has been converted, the computation is fully specified. Here, since we want the state vector, an additional conversion specific to state vectors is performed. When a quantity cannot be computed directly from the circuit, cuTensorNet transforms the network further, depending on the objective, so that the desired values can be derived.

The final computation is carried out in a step called contract, which takes an einsum expression together with its operands. The expression describes the contraction: which indices are summed, in what order, and which operand tensors they belong to.
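The contraction order matters enormously for cost, which is why cuTensorNet searches for a cheap path before contracting. A minimal numbers-only sketch with a matrix chain (the shapes are illustrative, not taken from the benchmark): the two groupings give the same result but differ in multiplication count by orders of magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 1000))
B = rng.standard_normal((1000, 2))
C = rng.standard_normal((2, 1000))

# An (m, k) @ (k, n) product costs m*k*n scalar multiplications.
# Grouping (A @ B) @ C:  2*1000*2 + 2*2*1000 multiplications
left_cost = 2 * 1000 * 2 + 2 * 2 * 1000      # 8,000
# Grouping A @ (B @ C):  1000*2*1000 + 2*1000*1000 multiplications
right_cost = 1000 * 2 * 1000 + 2 * 1000 * 1000  # 4,000,000

# Both orders give the same tensor, only the cost differs
same = np.allclose((A @ B) @ C, A @ (B @ C))
print(left_cost, right_cost, same)
```

For large circuits this gap grows exponentially, so the path-finding step is a core part of any tensor network contraction.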

This time, we'll be benchmarking cuTensorNet on a single-node multi-GPU setup.

singularity exec --nv cuquantum-appliance.img mpiexec -n 8 python3 tn.py

The quantum circuit is:

┌──┐   ┌───┐   ┌───┐   ┌────┐   ┌──┐   ┌────┐
0: ───────────H───────H────────────────T──────X───────
                                              │
1: ─────@───────@──────@───────@────────@─────┼──@────
        │       │      │       │        │     │  │
2: ─────┼───────┼─────T┼───────┼@──────T┼─────┼@─┼────
        │       │      │       ││       │     ││ │
3: ────@┼─────X─┼──────┼@──────┼┼@─────T┼─────┼┼@┼────
       ││     │ │      ││      │││      │     ││││
4: ────┼┼─────┼S┼─────T┼┼─────S┼┼┼─────@┼─────@┼┼┼────
       ││     │ │      ││      │││     ││      │││
5: ────@┼─────┼T┼──────┼@──────┼┼@─────@┼──────@┼┼────
        │     │ │      │       ││       │       ││
6: ────H┼─────@─┼─────X┼──────@┼┼──────T┼─────@─┼┼────
        │       │     ││      │││       │     │ ││
7: ────T┼───────@─────┼┼──────┼X┼──────H┼─────┼─┼X────
        │             ││      │ │       │     │ │
8: ────H┼─────T───────┼X──────@─┼───────@─────┼─@─────
        │             │         │             │
9: ─────@─────────────@─────────@─────────────X───────
      └──┘   └───┘   └───┘   └────┘   └──┘   └────┘

The speed was as follows:

For 1 GPU (NVIDIA V100 × 1) with 8 parallel processes, the time was 0.99 seconds. This used the -n 8 option, so apparently the number of MPI processes does not need to match the number of GPUs. The computation completed without running out of memory.

It's impressively fast.

For 4 GPUs (NVIDIA V100 × 4) with 8 parallel processes, the time was 0.83 seconds.

With this size, it's fast even with 1 GPU. I will try increasing the size a bit more.

num_qubits = 20
n_moments = 10

After adjusting the settings, the results were as follows:

For 1 GPU (NVIDIA V100 × 1) with 8 parallel processes, the time was 1.22 seconds.

For 4 GPUs (NVIDIA V100 × 4) with 8 parallel processes, the time was 1.00 seconds.

In any case, it feels fast enough.

With tensor networks, the cost depends not only on the number of qubits but also on the circuit depth, so it is worth estimating the overall circuit size before computing. That's all.
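For reference, a rough bit of arithmetic (plain Python, not cuTensorNet output) on why full state vectors stop scaling while tensor networks can keep going: a complex128 state vector doubles in memory with every added qubit, whereas a tensor network contraction's cost is governed by depth and connectivity rather than qubit count alone.

```python
# Memory needed to hold a full complex128 state vector of n qubits
def statevector_bytes(n_qubits: int) -> int:
    return (2 ** n_qubits) * 16  # complex128 = 16 bytes per amplitude

for n in (10, 20, 30, 40):
    print(f"{n} qubits: {statevector_bytes(n) / 2**30:.6g} GiB")
```

At 30 qubits the state vector already needs 16 GiB, roughly a full V100's memory, and at 40 qubits it needs 16 TiB, which is where the tensor network approach becomes the only practical option for circuits of suitable depth.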

© 2025, blueqat Inc. All rights reserved