NVIDIA CUDA-Q チュートリアル＆H100ベンチマーク８：コアセットによる分割型クラスタリング

このチュートリアルは、エディンバラ大学とNVIDIAが共同開発したもので、量子シミュレーションによる分割型クラスタリング手法を紹介します。

参考論文：
https://arxiv.org/pdf/2402.01529.pdf

コード：

従来の量子クラスタリングは量子ビット数が多く必要でしたが、本手法ではコアセットを活用することで、少ない量子ビットでも実行可能に。CUDA-Qによりスケーラブルな量子シミュレーションが可能になっています。

分割型クラスタリングは、全データを1つのクラスタから出発し、繰り返し2分割していく方法で、データの類似性を構造的に捉えるのに有効です。

計算には

CPU: AMD EPYC9654
GPU: NVIDIA H100

を利用しました。ところどころ元のチュートリアルにはない計算時間の表記を実際のベンチマークをとって入れました。

インストールやディレクトリやサンプルファイルに指定があります。少しmpi4yのインストールでの依存環境で時間を取られました。

!mkdir divisive_clustering_src
!wget -P divisive_clustering_src https://raw.githubusercontent.com/NVIDIA/cuda-quantum/main/docs/sphinx/applications/python/divisive_clustering_src/divisive_clustering.py
!wget -P divisive_clustering_src https://raw.githubusercontent.com/NVIDIA/cuda-quantum/main/docs/sphinx/applications/python/divisive_clustering_src/main_divisive_clustering.py

!pip install mpi4py==3.1.6 networkx==2.8.8 pandas==2.2.2 scikit-learn==1.4.2 tqdm==4.66.2 -q

準備ができたら始めます。
最初にツールを読み込みます。

import cudaq
from cudaq import spin

# Auxillary Imports
import os
import numpy as np
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
import warnings
from typing import Tuple
from divisive_clustering_src.divisive_clustering import Coreset, DivisiveClustering, Dendrogram, Voironi_Tessalation

warnings.filterwarnings("ignore")

下記はパラメータ類でユーザーが自由に変更できます。

circuit_depth = 1
max_iterations = 75
max_shots = 1000
np.random.seed(10)

次にちょっと方針について確認です。

あるデータセットに対して、**コアセット（coreset）**とは、そのデータセット全体を十分に代表できるように選ばれた、重み付きでサイズのはるかに小さいサブセットのことです（サイズ 𝑚≪𝑛）。コアセットを使って分析することで、元のデータセット全体に対する近似的で妥当な結論を導くことができます。

コアセット構築とは「ある許容誤差内で最適なサイズと重みのコアセットを見つける」という問題です。

量子コンピュータの制約を考慮して、本研究ではコアセットサイズ 𝑚をある固定値に設定し、各モデルに対してその誤差を評価しています。

以下は、1000個のデータポイントから構成される元データセットから構築された、20点のコアセットの例です（pandasのデータフレームに読み込み済み）。下の図では、コアセットの点が黒い星で表されており、星の大きさが重みに対応しています。

raw_data = Coreset.create_dataset(1000)
coreset = Coreset(
    raw_data=raw_data,
    number_of_sampling_for_centroids=10,
    coreset_size=10,
    number_of_coresets_to_evaluate=4,
    coreset_method="BFL2",
)

coreset_vectors, coreset_weights = coreset.get_best_coresets()

coreset_df = pd.DataFrame({
    "X": coreset_vectors[:, 0],
    "Y": coreset_vectors[:, 1],
    "weights": coreset_weights
})
coreset_df["Name"] = [chr(i + 65) for i in coreset_df.index]
print(coreset_df)

plt.scatter(raw_data[:, 0], raw_data[:, 1], label="Raw Data", c="#7eba00")
plt.scatter(
    coreset_df["X"],
    coreset_df["Y"],
    s=coreset_df["weights"],
    label="Coreset",
    color="black",
    marker="*",
)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Raw data and its best 10 coreset using BFL2")
plt.legend()
plt.show()

チュートリアルとは少し違う感じで出ました。

実装

データ前処理

量子コンピュータ上でクラスタリングを行うには、タスクをバイナリ最適化問題の形式に変換する必要があります。
各量子ビット（qubit）はコアセットの1点を表し、量子アルゴリズムは分割クラスタリングの各ステップでコアセット点をどのように二分割するかを決定します。

def get_K2_Hamiltonian(G: nx.Graph) -> cudaq.SpinOperator:
    """Returns the K2 Hamiltonian for the given graph G

    Args:
        G (nx.Graph): Weighted graph
    """
    H = 0

    for i, j in G.edges():
        weight = G[i][j]["weight"]
        H += weight * (spin.z(i) * spin.z(j))

    return H

最初のステップでは、コアセットの各点を**完全グラフ（fully connected graph）**に変換します。
各エッジの重みは、以下のように計算されます：

エッジの重み = 点 i と j のユークリッド距離（Euclidean distance）

この処理は Coreset.coreset_to_graph() によって行われ、重み付き完全グラフ𝐺が返されます。

量子での解き方

最終的には組合せ最適化問題をVQE(VQA)で解くようです。

def get_VQE_circuit(number_of_qubits: int, circuit_depth: int) -> cudaq.Kernel:
    """Returns the VQE circuit for the given number of qubits and circuit depth

    Args:
        number_of_qubits (int): Number of qubits
        circuit_depth (int): Circuit depth

    Returns:
        cudaq.Kernel: VQE Circuit
    """

    @cudaq.kernel
    def kernel(thetas: list[float], number_of_qubits: int, circuit_depth: int):
        """VQE Circuit

        Args:
            thetas (list[float]): List of parameters
            number_of_qubits (int): Number of qubits
            circuit_depth (int): Circuit depth
        """
        qubits = cudaq.qvector(number_of_qubits)

        theta_position = 0

        for i in range(circuit_depth):
            for j in range(number_of_qubits):
                ry(thetas[theta_position], qubits[j])
                rz(thetas[theta_position + 1], qubits[j])

                theta_position += 2

            for j in range(number_of_qubits - 1):
                cx(qubits[j], qubits[j + 1])

            for j in range(number_of_qubits):
                ry(thetas[theta_position], qubits[j])
                rz(thetas[theta_position + 1], qubits[j])

                theta_position += 2

    return kernel

次に5量子ビット回路をずししてみます。

parameter_count = 4 * circuit_depth * 5
parameters = np.random.rand(parameter_count)

circuit = get_VQE_circuit(5, circuit_depth)
print(cudaq.draw(circuit, parameters, 5, circuit_depth))

次のステップは、**古典的なオプティマイザ（最適化アルゴリズム）**を選択することです。
CUDA-Qには複数の組み込みオプティマイザが用意されており、選択が可能です。

以下のコードは、初期パラメータの数に応じて適切なオプティマイザを返す処理を行います。

def get_optimizer(optimizer: cudaq.optimizers.optimizer, max_iterations,
                  **kwargs) -> Tuple[cudaq.optimizers.optimizer, int]:
    """Returns the optimizer with the given parameters

    Args:
        optimizer (cudaq.optimizers.optimizer): Optimizer
        max_iterations (int): Maximum number of iterations
        **kwargs: Additional arguments

    Returns:
        tuple(cudaq.optimizers.optimizer, int): Optimizer and parameter count
    """
    parameter_count = 4 * kwargs["circuit_depth"] * kwargs["qubits"]
    initial_params = np.random.uniform(-np.pi / 8.0, np.pi / 8.0,
                                       parameter_count)
    optimizer.initial_parameters = initial_params

    optimizer.max_iterations = max_iterations
    return optimizer, parameter_count

分割型クラスタリング関数（Divisive Clustering Function）

DivisiveClusteringVQA クラスは、コアセットの各点がそれぞれ1つのクラスタになるまで、繰り返し二分割（バイパーティション）を行う処理を実装したものです。

`run_divisive_clustering` 関数

この関数は、現在のイテレーションで分割対象となるコアセット点を受け取り、対応する重みを抽出してグラフを構築します。
このグラフは、次に紹介する get_counts_from_simulation 関数へ渡されます。

`get_counts_from_simulation` 関数

この関数は、量子シミュレーションの準備と実行を担当します。

入力として与えられたグラフからスピンハミルトニアンを構築
コスト関数を定義（この場合、パラメータ化された量子回路とハミルトニアンの期待値を返すラムダ関数）
→ この期待値は CUDA-Q の observe コマンドで計算され、GPUによって高速化されます

期待値が最小化された後は、最適なパラメータで構築された量子回路を CUDA-Q の sample 関数を使ってサンプリングします。
このとき得られた**ビット列（bitstring）とその出現回数（カウント）**が返されます。

得られたビット列の一部を使って正確なコストを評価し、最も良いビット列が選ばれます。
この最良のビット列を使って、コアセット点を2つのクラスタのいずれかに割り当てます。

class DivisiveClusteringVQA(DivisiveClustering):

    def __init__(
        self,
        **kwargs,
    ):
        super().__init__(**kwargs)

    def run_divisive_clustering(
        self,
        coreset_vectors_df_for_iteration: pd.DataFrame,
    ):
        """Runs the Divisive Clustering algorithm

        Args:
            coreset_vectors_df_for_iteration (pd.DataFrame): Coreset vectors for the iteration

        Returns:
            str: Best bitstring

        """
        coreset_vectors_for_iteration_np, coreset_weights_for_iteration_np = (
            self._get_iteration_coreset_vectors_and_weights(
                coreset_vectors_df_for_iteration))

        G = Coreset.coreset_to_graph(
            coreset_vectors_for_iteration_np,
            coreset_weights_for_iteration_np,
            metric=self.coreset_to_graph_metric,
        )

        counts = self.get_counts_from_simulation(
            G,
            self.circuit_depth,
            self.max_iterations,
            self.max_shots,
        )

        return self._get_best_bitstring(counts, G)

    def get_counts_from_simulation(self, G: nx.graph, circuit_depth: int,
                                   max_iterations: int,
                                   max_shots: int) -> cudaq.SampleResult:
        """
        Runs the VQA simulation

        Args:
            G (nx.graph): Graph
            circuit_depth (int): Circuit depth
            max_iterations (int): Maximum number of iterations
            max_shots (int): Maximum number of shots

        Returns:
            cudaq.SampleResult: Measurement from the experiment
        """

        qubits = len(G.nodes)
        Hamiltonian = self.create_Hamiltonian(G)
        optimizer, parameter_count = self.optimizer_function(
            self.optimizer,
            max_iterations,
            qubits=qubits,
            circuit_depth=circuit_depth)

        kernel = self.create_circuit(qubits, circuit_depth)

        def objective_function(
            parameter_vector: list[float],
            hamiltonian: cudaq.SpinOperator = Hamiltonian,
            kernel: cudaq.Kernel = kernel,
        ) -> float:
            """

            Objective function that returns the cost of the simulation

            Args:
                parameter_vector (List[float]):
                hamiltonian (cudaq.SpinOperator): Circuit parameter values as a vector
                kernel (cudaq.Kernel) : Circuit configuration

            Returns:
                float: Expectation value of the circuit

            """

            get_result = lambda parameter_vector: cudaq.observe(
                kernel, hamiltonian, parameter_vector, qubits, circuit_depth
            ).expectation()

            cost = get_result(parameter_vector)

            return cost

        energy, optimal_parameters = optimizer.optimize(
            dimensions=parameter_count, function=objective_function)

        counts = cudaq.sample(kernel,
                              optimal_parameters,
                              qubits,
                              circuit_depth,
                              shots_count=max_shots)

        return counts

DivisiveClusteringVQA クラスのインスタンスは、これまでに説明してきたハミルトニアンや量子回路を構築する関数などの変数を使って構成されます。
また、量子シミュレーションに関するパラメータもここで指定できます（例：circuit_depth や max_shots など）。

threshold_for_max_cut パラメータは、量子コンピュータから得られたサンプル結果のうち、何パーセントを「最良のビット列」探索に使用するかを指定します。

その他のオプションとして、データが正規化されているかどうかや、グラフの重みの計算方法など、高度な設定も指定可能です。

最後に、get_divisive_sequence メソッドを呼び出すことでクラスタリング処理のイテレーションが実行され、結果が出力されます。

optimizer = cudaq.optimizers.COBYLA()

divisive_clustering = DivisiveClusteringVQA(
    circuit_depth=circuit_depth,
    max_iterations=max_iterations,
    max_shots=max_shots,
    threshold_for_max_cut=0.75,
    create_Hamiltonian=get_K2_Hamiltonian,
    optimizer=optimizer,
    optimizer_function=get_optimizer,
    create_circuit=get_VQE_circuit,
    normalize_vectors=True,
    sort_by_descending=True,
    coreset_to_graph_metric="dist",
)

hierarchial_clustering_sequence = divisive_clustering.get_divisive_sequence(
    coreset_df)

実行結果です。

100%|██████████| 114/114 [00:00<00:00, 21090.85it/s]
100%|██████████| 19/19 [00:00<00:00, 66576.25it/s]
100%|██████████| 15/15 [00:00<00:00, 8217.68it/s]
100%|██████████| 3/3 [00:00<00:00, 38956.38it/s]
100%|██████████| 2/2 [00:00<00:00, 29537.35it/s]

実行時間：
0.6150634288787842秒

データの可視化もできます。

dendo = Dendrogram(coreset_df, hierarchial_clustering_sequence)
dendo.plot_dendrogram(plot_title="Dendrogram of Coreset using VQE")

デンドログラムの各分岐はクラスタリングの各ステップを表しており、最初の分割ほど複雑で、後半は単純な2点の分割になります。

一見おかしなクラスタ分けに見える場合もありますが、主な理由は以下の2つです：

量子サンプリングの誤差や確率性（ショット数が不十分な可能性）

コアセットの重みの影響（重みが小さい点は除外されることがある）

違和感のある点は、重みが小さいことが多いです。

Dendrogram.plot_hierarchial_split(hierarchial_clustering_sequence, coreset_df)

階層型クラスタリングは、枝に垂直な線を引くことでフラットクラスタリングに変換できます。
線と交差するデータ点は同じクラスタに属するとみなされます。

以下の関数は、高さ1.5をしきい値としてクラスタリングを実行します。
クラスタ数を指定してクラスタリングしたい場合は、dendo.get_clusters_using_k() メソッドを使い、希望のクラスタ数を引数に渡します。

図では、高さ1.5で形成されたクラスタが示されています。

threshold_height = 1
clusters = dendo.get_clusters_using_height(threshold_height)
colors = ["red", "blue", "green", "black", "purple", "orange", "yellow"]
dendo.plot_dendrogram(
    plot_title="Dendrogram of Coreset using VQE",
    colors=colors,
    clusters=clusters,
    color_threshold=threshold_height,
)

フラットクラスタは、dendo.plot_clusters() メソッドを使って可視化できます。
この関数には、クラスタ情報と色のリストを引数として渡します。
各クラスタは異なる色で表示されます。

dendo.plot_clusters(clusters,
                    colors,
                    plot_title="Clusters of Coreset using VQE",
                    show_annotation=True)

dendo.get_voronoi_tessalation() メソッドを使うと、クラスタを**領域（ボロノイ分割）**として可視化できます。
この関数には、coreset_df、clusters、colors を引数として渡します。

各コアセット点ごとに領域を作成し、クラスタごとに指定された色で塗り分けられます。
クラスタの重心（centroid）を使って領域を作成したい場合は、tesslation_by_cluster=True を指定します。

領域の作成が完了したら、plot_voronoi() 関数を使って領域を描画できます（引数はクラスタと色）。

vt = Voironi_Tessalation(coreset_df,
                         clusters,
                         colors,
                         tesslation_by_cluster=False)
vt.plot_voironi(plot_title="Voironi Tessalation of Coreset using VQE",
                show_annotation=True)

今回の環境では、numpyのバージョンの影響で描画ができませんでした。変更するには元のファイルを書き換える必要があり、今回は割愛させていただきます。

QAOA

同じ問題に対して QAOA（量子近似最適化アルゴリズム）を用いた解析に切り替えることも可能です。

この場合、変更が必要なのは量子カーネル（kernel）のみで、他の部分はそのまま使えます。
以下のようにカーネルを差し替えることで実装できます。

def get_QAOA_circuit(number_of_qubits, circuit_depth) -> cudaq.Kernel:
    """Returns the QAOA circuit for the given number of qubits and circuit depth


    Args:
        number_of_qubits (int): Number of qubits
        circuit_depth (int): Circuit depth

    Returns:
        cudaq.Kernel: QAOA Circuit
    """

    @cudaq.kernel
    def kernel(thetas: list[float], number_of_qubits: int, circuit_depth: int):
        qubits = cudaq.qvector(number_of_qubits)

        layers = circuit_depth

        for layer in range(layers):
            for qubit in range(number_of_qubits):
                cx(qubits[qubit], qubits[(qubit + 1) % number_of_qubits])
                rz(2.0 * thetas[layer], qubits[(qubit + 1) % number_of_qubits])
                cx(qubits[qubit], qubits[(qubit + 1) % number_of_qubits])

            rx(2.0 * thetas[layer + layers], qubits)

    return kernel


circuit = get_QAOA_circuit(5, circuit_depth)

print(cudaq.draw(circuit, np.random.rand(2 * circuit_depth), 5, circuit_depth))

def get_optimizer(optimizer: cudaq.optimizers.optimizer, max_iterations,
                  **kwargs) -> Tuple[cudaq.optimizers.optimizer, int]:
    """
    Returns the optimizer with the given parameters

    Args:
        optimizer (cudaq.optimizers.optimizer): Optimizer
        max_iterations (int): Maximum number of iterations
        **kwargs: Additional arguments

    Returns:
        tuple(cudaq.optimizers.optimizer, int): Optimizer and parameter count
    """

    parameter_count = 2 * kwargs["circuit_depth"]
    optimizer.initial_parameters = np.random.uniform(-np.pi / 8.0, np.pi / 8.0,
                                                     parameter_count)
    optimizer.max_iterations = max_iterations
    return optimizer, parameter_count

実行します。

optimizer = cudaq.optimizers.COBYLA()

divisive_clustering = DivisiveClusteringVQA(
    circuit_depth=circuit_depth,
    max_iterations=max_iterations,
    max_shots=max_shots,
    threshold_for_max_cut=0.75,
    create_Hamiltonian=get_K2_Hamiltonian,
    optimizer=optimizer,
    optimizer_function=get_optimizer,
    create_circuit=get_QAOA_circuit,
    normalize_vectors=True,
    sort_by_descending=True,
    coreset_to_graph_metric="dist",
)

hierarchial_clustering_sequence = divisive_clustering.get_divisive_sequence(
    coreset_df)

100%|██████████| 481/481 [00:00<00:00, 21411.09it/s]
100%|██████████| 48/48 [00:00<00:00, 56033.01it/s]
100%|██████████| 12/12 [00:00<00:00, 76842.21it/s]
100%|██████████| 4/4 [00:00<00:00, 50231.19it/s]
100%|██████████| 4/4 [00:00<00:00, 49932.19it/s]
100%|██████████| 4/4 [00:00<00:00, 44620.26it/s]

実行時間：
0.7091012001037598秒

NVIDIA CUDA-Q チュートリアル＆H100ベンチマーク８： コアセットによる分割型クラスタリング

Yuichiro Minato

実装