ML Series Part 1: Boltzmann Machines

In this series of blog posts, we shall be looking into different areas of classical and quantum machine learning, and talk about their applications. The aim of this series is to be a quick introduction to both ML and QML for people interested in these areas. Knowledge of some basics (stochastic gradient descent, MLP, RNN etc.) are assumed, but is not required!

Part 1: The Boltzmann Machine

Through this part, we'll discuss the classical model in the paper of Amin et. al., Phys. Rev. X, 2018. The quantum part will be discussed in Part 2 of this series.

A Boltzmann machine can be thought of as a probabilistic recurrent neural network, and has been studied in the context of classical as well as quantum machine learning. Its basic architecture is a graph $G=(V_{\textit{in}} \sqcup V_{\textit{out}}\sqcup V_{\textit{hidden}}, E)$, where the vertex set is partitioned into three parts corresponding to input, output and hidden nodes respectively. For notational convenience, let's write $V(G)={1,2,\ldots, n}$. Each node $v$ can contain the values $\alpha(v)\in {\pm 1}$, and the network parameters are weights ${w_{ij}: 1\le i<j\le n}$.

Say we fix a node $v$, then the probability that $v$ contains $1$ depends on the values of its neighbors:

$$\mathbb P(\alpha(v) = 1 | N(v)) = \frac 1{1+\exp(-\sum_{u\in N(v)} w_{ij}\cdot \alpha(v)\alpha(u))}.\qquad\qquad (1)$$

To further simplify the training and probabilities, an additional assumption on the weights and edges is made: it is assumed that $G$ is a bipartite graph with parts $A=V_{\textit{in}}\cup V_{\textit{out}}$ and $B=V_{\textit{hidden}}$. In other words, the only parameters in the network now are the weights $W = {w_{ij}: 1\le i < j \le n, i\in A, j\in B}$. All other weights are fixed to $0$. Now, the probability distribution of the different states of the network is given as follows.

Fix a ${\pm 1}$-valued binary string $s$ of length $n$. Let $E(s, W) = -\sum_{i<j} w_{ij} s_is_j -\sum_{i\in A}a_is_i -\sum_{j\in B}b_js_j$ denote the Ising energy of the system. We then get a probability distribution:

$$\mathbb P(\alpha(v)=s | W) = \exp(-E(s,W))/ \left(\sum_{s\in {\pm 1}^n} \exp(-E(s,W))\right).$$

This distribution is known as the Boltzmann distribution in statistical physics, which is where the name of this ML model comes from! We denote the denominator with $Z$ for notational convenience.

Note that here, only the vertices in $A$ are visible, and $B$ contains vertices of the hidden layer. Therefore, we can marginalize the PDF further:

$$\mathbb P(\alpha(A)=s | W) = \sum_{h\in {\pm 1}^{|B|}} \mathbb P(\alpha(v) = s+h | W)$$

Where $s+h$ denotes concatenation of strings.

Training the network?

We start with random weights $W$, and start training by minimizing the loss function

$$L(W) = \sum_{m=1}^M \log \mathbb P (\alpha(v^m) | W),$$

where $v^m$ is the $m$'th point in the training set. We shall use stochastic gradient descent to decrease $L(W)$. To that end, computing the gradient $\frac{\partial L(W)}{\partial w_{ij}}$ leads us to the following final expression:

$$\frac{\partial L(W)}{\partial w_{ij}}= \sum_{m=1}^M\sum_{h\in {\pm 1}^{|B|}}\frac{\exp(-E(v^m, W))}{Z_D}\alpha(v_i^m)\alpha(h_j) - \sum_{s\in{\pm 1}^n} \frac{\exp(-E(s;W))}{Z}\alpha(v_i)\alpha(h_j).$$

Here, $Z_D$ denotes the partition function $\sum_{h,m}\exp(-E(v^m+h, W))$ computed only over the dataset. This means that the first term in the partial derivative is easy to compute given the dataset. The second term can be thought of as $\langle v_ih_j\rangle$: a mean of $v_ih_j$ over the model. Therefore, it is sampled using Markov Chain Monte Carlo (MCMC).

Sampling algorithm

  • Set visible nodes $A$ to a randomly picked training sample. Let this initial set be $v^{(0)}$.
  • Sample the hidden nodes $h\in B$ using the probability density of $(1)$. Let this be $h^{(0)}$
  • Fixing the values of the nodes $B$, again sample nodes $A$ using the probability density of $(1)$. Let this be $v^{(1)}$.
  • Continue this process for $T$ steps. Suppose the final state is $v^{(T)}$ for visible nodes and $h^{(T)}$ for hidden nodes.
  • The gradient is then given by $\frac{\partial L(W)}{\partial w_{ij}} \approx \alpha(v^{(0)}_i)\alpha(h^{(0)}_j) - \alpha(v^{(T)}_i)\alpha(h^{(T)}_j)$.

The gradient is used to perform a gradient descent on the weights $W$. This algorithm is known as Contrastive Divergence, proposed by Hinton, and has been used extensively to train deep neural networks successfully.

In the next post, we shall see the different quantum models for the Boltzmann machine!

Sayan Mukherjee
Sayan Mukherjee
Related posts

blueqat Inc.

Shibuya Scramble Square 39F 2-24-12, Shibuya, Shibuya-ku, Tokyo