
ML Series Part 1: Boltzmann Machines

Sayan Mukherjee

2021/09/06 07:46

#algorithm #machine learning


In this series of blog posts, we shall look into different areas of classical and quantum machine learning and talk about their applications. The aim of this series is to be a quick introduction to both ML and QML for people interested in these areas. Familiarity with some basics (stochastic gradient descent, MLPs, RNNs, etc.) is helpful, but not required!

Part 1: The Boltzmann Machine

In this part, we'll discuss the classical model from the paper of Amin et al., Phys. Rev. X, 2018. The quantum part will be discussed in Part 2 of this series.

A Boltzmann machine can be thought of as a probabilistic recurrent neural network, and has been studied in the context of classical as well as quantum machine learning. Its basic architecture is a graph $G=(V_{\textit{in}} \sqcup V_{\textit{out}}\sqcup V_{\textit{hidden}}, E)$, where the vertex set is partitioned into three parts corresponding to input, output and hidden nodes respectively. For notational convenience, let's write $V(G)=\{1,2,\ldots, n\}$. Each node $v$ takes a value $\alpha(v)\in \{\pm 1\}$, and the network parameters are the weights $\{w_{ij}: 1\le i<j\le n\}$.

Say we fix a node $v$; then the probability that $v$ takes the value $1$ depends on the values of its neighbors:

$$\mathbb P(\alpha(v) = 1 \mid N(v)) = \frac 1{1+\exp\!\left(-\sum_{u\in N(v)} w_{vu}\,\alpha(u)\right)}.\qquad\qquad (1)$$
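To make $(1)$ concrete, here is a minimal numpy sketch of the conditional update for a single node; the layout (a vector `w_v` holding the weights $w_{vu}$ on the edges incident to $v$) and the specific numbers are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_plus_one(w_v, neighbor_values):
    """Probability that node v takes the value +1, given the +/-1 values of
    its neighbors, following Eq. (1): a sigmoid of the weighted input."""
    activation = np.dot(w_v, neighbor_values)
    return 1.0 / (1.0 + np.exp(-activation))

def sample_node(w_v, neighbor_values):
    """Draw a +/-1 value for node v from this conditional distribution."""
    return 1 if rng.random() < prob_plus_one(w_v, neighbor_values) else -1

# Toy example: a node with three neighbors.
w_v = np.array([0.5, -1.0, 0.2])   # weights w_{vu} on the incident edges
nbrs = np.array([1, -1, 1])        # current +/-1 values of the neighbors
print(prob_plus_one(w_v, nbrs))    # ~0.85
print(sample_node(w_v, nbrs))
```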

To further simplify the training and the probabilities, an additional assumption on the weights and edges is made: it is assumed that $G$ is a bipartite graph with parts $A=V_{\textit{in}}\cup V_{\textit{out}}$ and $B=V_{\textit{hidden}}$. In other words, the only parameters in the network now are the weights $W = \{w_{ij}: 1\le i < j \le n,\ i\in A,\ j\in B\}$. All other weights are fixed to $0$. Now, the probability distribution of the different states of the network is given as follows.

Fix a $\{\pm 1\}$-valued string $s$ of length $n$. Let $E(s, W) = -\sum_{i<j} w_{ij} s_i s_j - \sum_{i\in A} a_i s_i - \sum_{j\in B} b_j s_j$ denote the Ising energy of the system, where $a_i$ and $b_j$ are bias terms on the visible and hidden nodes respectively. We then get a probability distribution:

$$\mathbb P(\alpha(V)=s \mid W) = \frac{\exp(-E(s,W))}{\sum_{s'\in \{\pm 1\}^n} \exp(-E(s',W))},$$

where $\alpha(V)$ denotes the string of values taken by all $n$ nodes of the network.

This distribution is known as the Boltzmann distribution in statistical physics, which is where the name of this ML model comes from! We denote the denominator by $Z$ for notational convenience.

Note that here, only the vertices in $A$ are visible, while $B$ contains the vertices of the hidden layer. Therefore, we can marginalize the distribution further:

$$\mathbb P(\alpha(A)=s \mid W) = \sum_{h\in \{\pm 1\}^{|B|}} \mathbb P(\alpha(V) = s+h \mid W),$$

where $s+h$ denotes the concatenation of the strings $s$ and $h$.
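To see the Boltzmann distribution, the partition function $Z$ and the marginal over the visible nodes in action, here is a brute-force Python sketch for a tiny bipartite network; the $2+2$ layer sizes, the weight matrix `W` and the biases `a`, `b` are made-up illustrative values, and enumerating all $2^n$ states is of course only feasible for very small $n$.

```python
import numpy as np
from itertools import product

# Tiny bipartite network: |A| = 2 visible nodes, |B| = 2 hidden nodes.
W = np.array([[0.3, -0.5],
              [0.8,  0.1]])   # w_{ij} for i in A, j in B (all other weights are 0)
a = np.array([0.1, -0.2])     # biases a_i on the visible nodes
b = np.array([0.0,  0.4])     # biases b_j on the hidden nodes

def energy(v, h):
    """Ising energy E(s, W) of the joint state s = v + h."""
    return -(v @ W @ h) - a @ v - b @ h

def all_states(k):
    """All +/-1 strings of length k."""
    return [np.array(s) for s in product([1, -1], repeat=k)]

# Partition function Z: sum over all 2^n joint states.
Z = sum(np.exp(-energy(v, h)) for v in all_states(2) for h in all_states(2))

def joint_prob(v, h):
    """Boltzmann probability of the joint state (v, h)."""
    return np.exp(-energy(v, h)) / Z

def marginal_prob(v):
    """P(alpha(A) = v | W): sum the joint probability over all hidden states."""
    return sum(joint_prob(v, h) for h in all_states(2))

print(marginal_prob(np.array([1, -1])))
# Sanity check: the marginal over all visible states sums to 1.
print(sum(marginal_prob(v) for v in all_states(2)))   # ~1.0
```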

Training the network

We start with random weights $W$ and train the network by maximizing the log-likelihood

$$L(W) = \sum_{m=1}^M \log \mathbb P (\alpha(A) = v^m \mid W),$$

where $v^m$ is the $m$-th point in the training set. We shall use stochastic gradient ascent to increase $L(W)$. To that end, computing the gradient $\frac{\partial L(W)}{\partial w_{ij}}$ leads us to the following final expression:

$$\frac{\partial L(W)}{\partial w_{ij}}= \sum_{m=1}^M\sum_{h\in \{\pm 1\}^{|B|}}\frac{\exp(-E(v^m+h, W))}{Z_D}\, v_i^m h_j \;-\; \sum_{s\in\{\pm 1\}^n} \frac{\exp(-E(s,W))}{Z}\, s_i s_j.$$

Here, $Z_D$ denotes the partition function $\sum_{h,m}\exp(-E(v^m+h, W))$ computed only over the dataset. This means that the first term in the partial derivative is easy to compute given the dataset. The second term can be thought of as $\langle v_i h_j\rangle$: the mean of $v_i h_j$ over the model distribution. Therefore, it is estimated by sampling with Markov Chain Monte Carlo (MCMC).
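For intuition, here is a brute-force Python sketch of the two terms of this gradient on the same kind of tiny network; the toy dataset and the parameter values are invented for illustration, and the second (model) term is exactly the quantity that MCMC sampling replaces in practice.

```python
import numpy as np
from itertools import product

# Tiny 2-visible / 2-hidden network with illustrative parameters.
W = np.array([[0.3, -0.5],
              [0.8,  0.1]])
a = np.array([0.1, -0.2])
b = np.array([0.0,  0.4])

def energy(v, h):
    return -(v @ W @ h) - a @ v - b @ h

def all_states(k):
    return [np.array(s) for s in product([1, -1], repeat=k)]

data = [np.array([1, 1]), np.array([1, -1]), np.array([1, 1])]  # toy training set

# First term: <v_i h_j> with the visible nodes clamped to the data,
# weighted by exp(-E) and normalised by Z_D.
Z_D = sum(np.exp(-energy(v, h)) for v in data for h in all_states(2))
positive = sum(np.exp(-energy(v, h)) / Z_D * np.outer(v, h)
               for v in data for h in all_states(2))

# Second term: <v_i h_j> under the full model distribution, normalised by Z.
Z = sum(np.exp(-energy(v, h)) for v in all_states(2) for h in all_states(2))
negative = sum(np.exp(-energy(v, h)) / Z * np.outer(v, h)
               for v in all_states(2) for h in all_states(2))

grad_W = positive - negative   # same shape as W; move along this to increase L(W)
print(grad_W)
```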

Sampling algorithm

    • Set the visible nodes $A$ to a randomly picked training sample. Let this initial state be $v^{(0)}$.
    • Sample the hidden nodes in $B$ using the conditional probability $(1)$. Let this be $h^{(0)}$.
    • Fixing the values of the nodes in $B$, sample the nodes in $A$ again using $(1)$. Let this be $v^{(1)}$.
    • Continue this process for $T$ steps. Suppose the final state is $v^{(T)}$ for the visible nodes and $h^{(T)}$ for the hidden nodes.
    • The gradient is then approximated by $\frac{\partial L(W)}{\partial w_{ij}} \approx v^{(0)}_i h^{(0)}_j - v^{(T)}_i h^{(T)}_j$.

This gradient estimate is then used to perform gradient ascent on the weights $W$. This algorithm is known as Contrastive Divergence, proposed by Hinton, and has been used extensively to train deep neural networks successfully.
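As a rough sketch of how the sampling loop and the weight update fit together, here is a minimal CD-style training step for a small bipartite network, assuming the layers are stored as numpy arrays and using the sigmoid conditional of $(1)$ with the bias terms of the energy included; the layer sizes, learning rate and training vector are placeholders, not part of the original algorithm description.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_layer(activation):
    """Sample a +/-1 layer; each unit is +1 with the sigmoid probability of (1)."""
    return np.where(rng.random(activation.shape) < sigmoid(activation), 1, -1)

def cd_step(W, a, b, v_data, T=1, lr=0.05):
    """One contrastive-divergence update: T rounds of alternating Gibbs sampling."""
    v0 = v_data
    h0 = sample_layer(v0 @ W + b)        # sample hidden given clamped visible
    v, h = v0, h0
    for _ in range(T):
        v = sample_layer(W @ h + a)      # re-sample visible given hidden
        h = sample_layer(v @ W + b)      # re-sample hidden given visible
    grad_W = np.outer(v0, h0) - np.outer(v, h)
    return W + lr * grad_W               # gradient ascent on the log-likelihood

# Toy usage: 4 visible units, 3 hidden units, one +/-1 training vector.
W = rng.normal(scale=0.1, size=(4, 3))
a = np.zeros(4)
b = np.zeros(3)
v_sample = np.array([1, -1, 1, 1])
W = cd_step(W, a, b, v_sample)
```

With $T=1$ this is the commonly used CD-1 variant; larger $T$ gives a better (but slower) approximation of the model expectation.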

In the next post, we shall see the different quantum models for the Boltzmann machine!
