At the end of 2023, there is a trend toward consistency models, which can generate images even faster than diffusion models. I would like to give a quick overview of the model.
Diffusion Models
First, let's have a quick review of the diffusion model.
Denoising Diffusion Probabilistic Models
Jonathan Ho, Ajay Jain, Pieter Abbeel https://arxiv.org/abs/2006.11239
The diffusion model is a model that gradually removes noise from an image to restore it, starting from an initial state x_T and working back to x_0. A neural network called a U-Net is commonly used for this restoration. Here, however, we want to take a closer look at an extension of the diffusion model known as the consistency model.
The diffusion model begins from an initial noisy state, which is typically drawn from the following Gaussian distribution:
p(x_T) = \mathcal{N}(x_T; \mathbf{0}, \mathbf{I})
Each intermediate step of this process is defined as follows:
p_\theta(x_{t-1} \mid x_t) := \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))
The final restored state is obtained by chaining these steps together; this is often referred to as the reverse diffusion process:
p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)
As the formula shows, the image is restored sequentially from the initial state, with the intermediate states sampled probabilistically along the way.
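As an illustration only, here is a minimal NumPy sketch of this reverse process; `predict_mean` and `sigma` are hypothetical stand-ins for the learned mean μ_θ and a fixed variance schedule, not anything defined in the paper:

```python
import numpy as np

def reverse_diffusion(predict_mean, sigma, T, shape):
    """Minimal sketch of the reverse diffusion (ancestral sampling) loop.

    predict_mean(x, t) stands in for the learned mean mu_theta(x_t, t)
    (a U-Net in practice); sigma[t] is a fixed per-step standard deviation.
    """
    x = np.random.randn(*shape)              # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        noise = np.random.randn(*shape) if t > 1 else 0.0
        # sample x_{t-1} ~ N(mu_theta(x_t, t), sigma_t^2 I)
        x = predict_mean(x, t) + sigma[t] * noise
    return x                                  # restored sample x_0
```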
Consistency Models
This commonly seen description of the diffusion model is discretized in time steps. The starting point for the consistency model, a new kind of model, is the continuous-time diffusion model, which captures the behavior over time continuously. Building on it, a theoretical foundation has been established for generating images rapidly, at roughly one second per image in recent image-generation tasks.
Score-Based Generative Modeling through Stochastic Differential Equations
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole https://arxiv.org/abs/2011.13456
The continuous-time diffusion model
The continuous-time diffusion model can be expressed in the form of the following stochastic differential equation:
dx_t = \mu(x_t, t)\, dt + \sigma(t)\, dw_t
A stochastic differential equation is a differential equation in which one or more terms are stochastic processes, leading to the result that the solution itself becomes a stochastic process. (From Wikipedia)
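To build intuition (this is not part of the paper), such an SDE can be simulated with the Euler-Maruyama method; a minimal sketch with placeholder drift and diffusion coefficients:

```python
import numpy as np

def euler_maruyama(mu, sigma, x0, T, n_steps):
    """Simulate dx_t = mu(x_t, t) dt + sigma(t) dw_t with Euler-Maruyama."""
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    t = 0.0
    for _ in range(n_steps):
        dw = np.sqrt(dt) * np.random.randn(*x.shape)  # Wiener increment ~ N(0, dt)
        x = x + mu(x, t) * dt + sigma(t) * dw
        t += dt
    return x

# Example: zero drift and diffusion sqrt(2t), the setting used later in this post.
sample = euler_maruyama(lambda x, t: 0.0, lambda t: np.sqrt(2 * t),
                        x0=np.zeros(3), T=1.0, n_steps=1000)
```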
Brownian motion and Wiener process
The dw_t term in the above equation is a Wiener process, a mathematical model of Brownian motion.
Brownian motion is a phenomenon in which small particles suspended in a liquid or gas (e.g., colloids) exhibit irregular (random) motion. The Wiener process is a stochastic process that arises as the limit of a discrete random walk. (From Wikipedia)
X_t = \mu t + \sigma W_t
This equation is referred to as a Wiener process with drift μ and infinitesimal variance σ^2 (apparently).
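For intuition, a path of this drifted Wiener process can be simulated directly; a minimal sketch with arbitrary parameter values chosen only for illustration:

```python
import numpy as np

# Simulate one path of X_t = mu * t + sigma * W_t on [0, 1].
mu, sigma, n_steps = 0.5, 1.0, 1000
dt = 1.0 / n_steps
dW = np.sqrt(dt) * np.random.randn(n_steps)   # independent N(0, dt) increments
W = np.cumsum(dW)                              # discretized Wiener process W_t
t = np.linspace(dt, 1.0, n_steps)
X = mu * t + sigma * W                         # drifted Wiener process path
```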
Fokker-Planck equation and ordinary differential equation
This stochastic differential equation (SDE) can be rewritten in the form of an ordinary differential equation (ODE).
An ordinary differential equation (ODE) is a type of differential equation in mathematics defined by an equation involving an unknown function and its derivatives. It specifically applies to differential equations where the unknown function essentially depends on only one variable.
In the paper, a derivation (?) via the Fokker-Planck equation is presented.
In statistical mechanics, the Fokker-Planck equation is the Kramers-Moyal expansion with all terms of order n ≥ 3 dropped. For the SDE above, it takes the standard form:

\frac{\partial p_t(x)}{\partial t} = -\nabla \cdot \left( \mu(x, t)\, p_t(x) \right) + \frac{1}{2} \sigma(t)^2 \Delta p_t(x)
Probability Flow ordinary differential equation
Putting all of this together, we arrive at the following:
dx_t = \left[ \mu(x_t, t) - \frac{1}{2} \sigma(t)^2 \nabla \log p_t(x_t) \right] dt
In this equation, ∇ log p_t(x) is referred to as the score function, and the equation itself is known as the Probability Flow ODE (Ordinary Differential Equation).
The score function is the gradient of the natural logarithm of the likelihood function (from Wikipedia).
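Given a score model, this ODE can in principle be integrated deterministically; a minimal Euler sketch, where `score(x, t)` is a hypothetical stand-in for ∇ log p_t(x):

```python
import numpy as np

def pf_ode_euler(mu, sigma, score, x_T, T, eps, n_steps):
    """Integrate the PF ODE dx/dt = mu(x, t) - 0.5 * sigma(t)**2 * score(x, t)
    backward in time from t = T down to t = eps with plain Euler steps."""
    dt = (eps - T) / n_steps                   # negative step size
    x = np.array(x_T, dtype=float)
    t = T
    for _ in range(n_steps):
        x = x + dt * (mu(x, t) - 0.5 * sigma(t) ** 2 * score(x, t))
        t += dt
    return x                                   # approximate sample at t = eps
```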
Empirical PF ODE
Transforming the above equation further, we use the fact that p_T(x) is close to a tractable Gaussian distribution π(x). By substituting μ(x, t) = 0 and σ(t) = √(2t), and writing the learned score function as s_ϕ(x, t)...
\frac{dx_t}{dt} = -t\, s_\phi(x_t, t)
and the equation can be expressed as above. The noise distribution at the final time is π = \mathcal{N}(0, T^2 I), from which sampling is initialized.
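Concretely, this empirical PF ODE can be stepped numerically; a minimal sketch, where `s_phi` is a placeholder for the trained score network and the values of T and ϵ are common choices in the literature, assumed here rather than taken from the post:

```python
import numpy as np

def solve_empirical_pf_ode(s_phi, shape, T=80.0, eps=0.002, n_steps=100):
    """Euler integration of dx/dt = -t * s_phi(x, t) from t = T down to t = eps.

    s_phi(x, t) is a stand-in for the trained score network; T = 80 and
    eps = 0.002 are assumed defaults, not values stated in this post.
    """
    x = T * np.random.randn(*shape)            # x_T ~ N(0, T^2 I)
    ts = np.linspace(T, eps, n_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * (-t_cur * s_phi(x, t_cur))
    return x
```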
Solving the PF ODE yields the answer, but restoring images with existing ODE solvers or the original reverse diffusion process is time-consuming because of the number of steps involved. To address this, the paper proposes a model called the consistency model.
Consistency Models
In solving the PF ODE, we now assume the existence of a function called the consistency function, defined on a solution trajectory {x_t}_{t ∈ [ϵ, T]} as follows:
f : (x_t, t) \mapsto x_\epsilon
This consistency function is characterized by self-consistency: along the trajectory of a single PF ODE, no matter which point (x_t, t) on that trajectory is chosen, the function always maps it to the same final answer. In other words, it reaches the same endpoint (x_ϵ, ϵ) from any point on the trajectory. A well-known (?) schematic diagram in the paper depicts this concept: choosing a trajectory determines the endpoint.
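In code, self-consistency just means a learned f_θ should return (approximately) the same x_ϵ at every point on one trajectory; a conceptual check with a hypothetical model `f_theta`, not anything from the paper:

```python
import numpy as np

def self_consistency_gap(f_theta, trajectory):
    """Measure how far f_theta is from self-consistency on one PF ODE
    trajectory, given as a list of (x_t, t) pairs with t in [eps, T].
    A perfect consistency function would give a gap of zero."""
    outputs = [f_theta(x_t, t) for x_t, t in trajectory]
    ref = outputs[0]
    return max(np.linalg.norm(out - ref) for out in outputs[1:])
```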
As a result, models such as the Latent Consistency Model have emerged, enabling rapid image generation in a single step, a departure from the sequential step-by-step process of earlier diffusion models. I'd like to cover training and other aspects in a future post.