GPUs are seeing more and more use lately, so I'm going to collect benchmarks a little at a time. Below is the speed difference between the T4 and the V100. For the V100, I used a node with 4 GPUs. I previously posted the CPU vs. T4 speed difference on my blog; depending on the machine, a noticeable difference showed up from around 15 qubits.
It's tough to go much further with the CPU alone, so for 20 to 29 qubits I'd like to compare speeds between GPUs. Below I compare a node with a single T4 against an NVLink-capable V100 node with 4 GPUs. In a one-to-one comparison of a single V100 and a T4, the T4 held its own quite well. This time I ran the simulation on a single V100 node with 4 GPUs, using the cuQuantum Appliance, which supports multi-GPU runs.
From around 22 qubits, there is a significant difference in speed. For the T4, the times from 20 to 29 qubits (one value per qubit count) were:
[0.048171281814575195, 0.0784001350402832, 0.14328503608703613, 0.2778923511505127, 0.5529191493988037, 1.1201775074005127, 2.2969729900360107, 4.7476561069488525, 9.820889472961426, 20.347250938415527] (sec)
For the V100 node with 4 GPUs, the times from 20 to 29 qubits were:
[0.01749444007873535, 0.035036563873291016, 0.019758224487304688, 0.02976536750793457, 0.0515437126159668, 0.09291481971740723, 0.1865699291229248, 0.4053304195404053, 0.7974536418914795, 1.6286036968231201] (sec)
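To make the gap easier to read, here is a minimal sketch that divides the T4 times by the 4-GPU V100 times for each qubit count. The numbers are copied from the two lists above; the variable names (`t4_sec`, `v100x4_sec`, `speedups`) are just my own labels for illustration. One thing visible in the raw data: the V100 time at 21 qubits is larger than at 22 qubits, so the ratios for the smallest sizes should be read with some caution.

```python
# Timings in seconds for 20..29 qubits, copied from the lists above.
t4_sec = [
    0.048171281814575195, 0.0784001350402832, 0.14328503608703613,
    0.2778923511505127, 0.5529191493988037, 1.1201775074005127,
    2.2969729900360107, 4.7476561069488525, 9.820889472961426,
    20.347250938415527,
]
v100x4_sec = [
    0.01749444007873535, 0.035036563873291016, 0.019758224487304688,
    0.02976536750793457, 0.0515437126159668, 0.09291481971740723,
    0.1865699291229248, 0.4053304195404053, 0.7974536418914795,
    1.6286036968231201,
]
qubits = list(range(20, 30))

# Speedup of the 4-GPU V100 node over the single T4 (T4 time / V100 time).
speedups = [t4 / v100 for t4, v100 in zip(t4_sec, v100x4_sec)]

for n, t4, v100, s in zip(qubits, t4_sec, v100x4_sec, speedups):
    print(f"{n} qubits: T4 {t4:8.4f} s | V100 x4 {v100:8.4f} s | {s:5.1f}x")
```

Keep in mind this compares one T4 against four V100s, so it is a per-node ratio, not a per-GPU one.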
From around 22 qubits, the 4-GPU V100 node is about an order of magnitude faster: roughly 7x at 22 qubits and about 12x from 25 qubits on. Going beyond 29 qubits on the T4 looks challenging for now. Still, with 16 GB of VRAM the T4 is approachable and easy for beginners to use. With NVIDIA cuQuantum, it looks like fast simulation is within reach for both FTQC and NISQ workloads. Since I used multiple GPUs this time, I ran everything through the cuQuantum Appliance. That's all.