NVIDIA CUDA-Q Tutorial & H100 Benchmark Part 1: Bernstein-Vazirani (BV) Algorithm
Starting from this post, I’ll be introducing how to use CUDA-Q by explaining the basic algorithms.
For more specific implementation details, please refer to the official CUDA-Q documentation and its Bernstein-Vazirani tutorial.
The installation process is extremely simple:
pip install cudaq
Now, let’s dive in.
Overview of the Bernstein-Vazirani Algorithm
The BV algorithm is a quantum algorithm that can identify a hidden bitstring s
with just one function call (query).
Problem Setup
You are given access to a function ( f(x) ), but the internal logic is a black box.
The function is defined based on a hidden bitstring ( s ) as follows:
f(x) = s · x mod 2 = (s₀x₀ + s₁x₁ + … + sₙ₋₁xₙ₋₁) mod 2
This is the bitwise dot product of ( s ) and ( x ), reduced modulo 2.
For example, if ( s = 101 ) and ( x = 110 ), then f(x) = (1·1 + 0·1 + 1·0) mod 2 = 1.
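The dot-product function can be sketched in plain Python (a small illustrative helper, not part of CUDA-Q):

```python
def f(s: list, x: list) -> int:
    """Bitwise dot product of s and x, reduced mod 2."""
    return sum(si * xi for si, xi in zip(s, x)) % 2

# s = 101, x = 110: 1*1 + 0*1 + 1*0 = 1, so f(x) = 1
print(f([1, 0, 1], [1, 1, 0]))  # → 1
```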
Goal
The objective is to find the hidden bitstring ( s ) using the function.
Classical Approach
Let’s consider an example with ( n = 3 ) and a hidden string ( s = s₀s₁s₂ ).
Steps
Query the function with these values of ( x ):
- ( x = 100 \Rightarrow f(100) = s₀ )
- ( x = 010 \Rightarrow f(010) = s₁ )
- ( x = 001 \Rightarrow f(001) = s₂ )
→ You can extract each bit individually, so it takes 3 queries to determine ( s ).
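The classical procedure above can be sketched as follows (the oracle is treated as a black box; the helper names are illustrative, not from the tutorial):

```python
def make_oracle(s):
    """Black-box f(x) = s . x mod 2 -- callers may query it but not inspect s."""
    return lambda x: sum(si * xi for si, xi in zip(s, x)) % 2

def classical_recover(f, n):
    """Recover s with n queries: querying the one-hot input e_i returns s_i."""
    s = []
    for i in range(n):
        e_i = [1 if j == i else 0 for j in range(n)]
        s.append(f(e_i))  # f(e_i) = s_i
    return s

f = make_oracle([1, 0, 1])
print(classical_recover(f, 3))  # → [1, 0, 1], using 3 queries
```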
Quantum Approach (Bernstein-Vazirani Algorithm)
In the quantum version, you can identify ( s ) with just a single query.
It leverages quantum phenomena such as superposition, interference, and phase kickback.
In this setup, a quantum oracle is built using the hidden bitstring ( s ). Although it may seem like cheating that we "know" ( s ) inside the oracle, this is a common and accepted practice for constructing such algorithms.
The goal is still to compute ( f(x) = s \cdot x \mod 2 ), but now we encode the information into a quantum circuit based on ( s ), which allows users to infer it through measurement.
Classical computers must query each bit separately, since they cannot exploit superposition the way quantum computers can.
The BV algorithm uses a set of quantum bits corresponding to input ( x ) and an auxiliary qubit for phase kickback. The main qubits are placed into a superposition using Hadamard gates, and the auxiliary qubit is flipped to (|1\rangle) and then transformed into a superposition using a Hadamard gate.
The oracle is implemented such that if a bit in ( s ) is 1, a CX (controlled-NOT) gate is applied between the corresponding qubit and the auxiliary qubit. Because the auxiliary qubit is in the (|-\rangle) state, this kicks a phase of −1 back onto the control qubit (phase kickback).
Finally, another round of Hadamard gates is applied before measurement, allowing the hidden bitstring ( s ) to be reconstructed.
The key point: even with just one auxiliary qubit, the phase kickback mechanism allows the effects of multiple control qubits to be captured effectively.
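One way to see why a single query suffices: after the oracle, each basis state (|x\rangle) of the register carries the phase ((-1)^{s·x}), and the final layer of Hadamard gates (a Walsh-Hadamard transform) maps this state exactly onto (|s\rangle). A small pure-Python check of that claim (my own sketch, not CUDA-Q):

```python
from itertools import product

def bv_statevector(s):
    """Amplitudes of the BV register just before measurement."""
    n = len(s)
    # After H^n and the oracle: uniform magnitudes with phase (-1)^(s.x).
    amps = [
        (-1) ** (sum(si * xi for si, xi in zip(s, x)) % 2) / 2 ** (n / 2)
        for x in product([0, 1], repeat=n)
    ]
    # Final H^n layer: Walsh-Hadamard transform of the amplitude vector.
    out = []
    for y in product([0, 1], repeat=n):
        acc = 0.0
        for amp, x in zip(amps, product([0, 1], repeat=n)):
            dot = sum(yi * xi for yi, xi in zip(y, x)) % 2
            acc += (-1) ** dot * amp
        out.append(acc / 2 ** (n / 2))
    return out  # all amplitude concentrates on the index encoding s

amps = bv_statevector([1, 0, 1])
peak = max(range(len(amps)), key=lambda i: abs(amps[i]))
print(format(peak, "03b"))  # → 101
```

The interference is fully destructive everywhere except at (|s\rangle), which is why one query determines every bit at once.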
For the detailed math, refer to the CUDA-Q tutorial linked above.
CUDA-Q Code Example
Let’s look at the code from the tutorial:
import cudaq
from typing import List
cudaq.set_target('qpp-cpu')
# cudaq.set_target('nvidia') # Use this for GPU acceleration
This sets up the simulation target. Use nvidia if you're running on a system with NVIDIA GPUs.
qubit_count = 5
secret_string = [1, 1, 0, 1, 0]
assert qubit_count == len(secret_string)
Define the number of qubits and the hidden bitstring ( s ).
@cudaq.kernel
def oracle(register: cudaq.qview, auxiliary_qubit: cudaq.qubit,
           secret_string: List[int]):
    for index, bit in enumerate(secret_string):
        if bit == 1:
            x.ctrl(register[index], auxiliary_qubit)
The oracle applies a CX gate for each bit in ( s ) that is 1.
@cudaq.kernel
def bernstein_vazirani(secret_string: List[int]):
    qubits = cudaq.qvector(len(secret_string))
    auxiliary_qubit = cudaq.qubit()
    x(auxiliary_qubit)
    h(auxiliary_qubit)
    h(qubits)
    oracle(qubits, auxiliary_qubit, secret_string)
    h(qubits)
    mz(qubits)
This is the main kernel that sets up the quantum circuit, applies the oracle, and performs measurement.
print(cudaq.draw(bernstein_vazirani, secret_string))
result = cudaq.sample(bernstein_vazirani, secret_string)
print(f"secret bitstring = {secret_string}")
print(f"measured state = {result.most_probable()}")
print(f"Were we successful? {''.join([str(i) for i in secret_string]) == result.most_probable()}")
It prints the circuit diagram, the correct answer ( s ), the measured result, and whether the result matches.
Benchmarking
We benchmarked the BV algorithm using both CPU and GPU with the above code.
- CPU: AMD EPYC 9654 (96 cores)
- GPU: NVIDIA H100 SXM (80GB VRAM)
Both are high-end parts in their respective categories.
As shown in the graph, the CPU starts to slow down with increasing qubit count.
The GPU handles 30+ qubits with minimal slowdown, and its advantage over the CPU grows as the qubit count increases.
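The benchmark loop can be reproduced with a small timing harness along these lines (my own sketch; the qubit counts shown are illustrative, not the exact ones behind the graph):

```python
import time

def time_run(run, repeats=3):
    """Return the best-of-N wall-clock time for a zero-argument callable."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - t0)
    return best

# Example usage with the kernel from this post (requires cudaq and the
# bernstein_vazirani kernel defined above; switch targets to compare):
#   for n in (10, 20, 25, 28):
#       secret = [1] * n
#       t = time_run(lambda: cudaq.sample(bernstein_vazirani, secret))
#       print(f"{n} qubits: {t:.3f} s")
```

Taking the best of several repeats reduces noise from one-time costs such as JIT compilation, which would otherwise dominate the first measurement.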