7

The background of this question is a generative process called the reverse diffusion process, where one starts with a data distribution $x_0\sim p_{\rm data}(x_0)$ (each sample lies in $\mathbb{R}^D$) and defines a Markov chain (called the diffusion process) $x_0,x_1,\cdots,x_T$ with $T$ sufficiently large, where the transitions are $$p(x_t|x_{t-1})=\mathcal N(\sqrt{1-\beta_t}x_{t-1},\beta_tI),\quad\beta_t\in(0,1).$$ The generative process learns to reverse the diffusion process in order to model $p_{\rm data}(x_0)$. An assumption is made that $p(x_T)=\mathcal N(0,I)$, so that the reverse process can start from $\mathcal N(0,I)$, from which numerical sampling is rather easy.
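For concreteness, here is a minimal NumPy sketch of the forward chain just described (the linear `betas` schedule and the toy data distribution are placeholders chosen for illustration, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative beta schedule, each beta_t in (0, 1)

def forward_diffusion(x0, betas, rng):
    """Run the chain x_0 -> x_1 -> ... -> x_T with
    p(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    x = x0
    for beta in betas:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
    return x

# toy "data" distribution, deliberately far from N(0, I)
x0 = rng.uniform(5.0, 6.0, size=(10_000, 2))
xT = forward_diffusion(x0, betas, rng)
print(xT.mean(axis=0), xT.std(axis=0))   # close to [0, 0] and [1, 1]
```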

My question is whether this assumption is mathematically valid: does $p(x_T)$ tend to $\mathcal N(0,I)$ as $T\to\infty$?

Intuitively this makes sense because:

  • Each transition adds Gaussian noise to the previous state; it makes sense for the limiting distribution (if one exists) to be Gaussian.
  • $\mathcal N(0,I)$ is invariant under transitions of the form $p(x'|x)=\mathcal N(\sqrt{1-\beta}x,\beta I)$: $$p(x')=\int p(x'|x)p(x){\rm d}x=\int\frac{1}{(2\pi\beta)^{D/2}}e^{-|x'-\sqrt{1-\beta}x|^2/(2\beta)}\frac{1}{(2\pi)^{D/2}}e^{-|x|^2/2}{\rm d}x=\frac{1}{(2\pi)^{D/2}}e^{-|x'|^2/2}$$ $$\implies x'\sim\mathcal N(0,I).$$ (A quick numerical check of this invariance is sketched just below this list.)
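A quick Monte Carlo check of the invariance property (a sketch only; the value of $\beta$, the dimension, and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
beta, D, n = 0.1, 3, 200_000

x = rng.standard_normal((n, D))            # x ~ N(0, I)
x_next = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal((n, D))

# one transition step should leave N(0, I) unchanged
print(x_next.mean(axis=0))                 # ~ [0, 0, 0]
print(np.cov(x_next, rowvar=False))        # ~ identity matrix
```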

However, I cannot prove that the limiting distribution is indeed $\mathcal N(0,I)$. Any help is appreciated.

trisct
  • Really interested in this question! (I have the same question too) Can we use some "contraction" property of the transition probability to show it's getting closer and closer to a Gaussian? – Binxu Wang 王彬旭 Jun 08 '22 at 04:18
  • @BinxuWang王彬旭 I had the same thought. Unfortunately I don't know enough theorems from random process theory. – trisct Jun 08 '22 at 04:19

3 Answers

4

Let $y_t = (x_t-\sqrt{1-\beta_t}x_{t-1})/\sqrt{\beta_t}$ for $t \ge 1$. By construction, the conditional law of $y_t$ given $(x_0,\ldots,x_{t-1})$ is $\mathcal{N}(0,I)$. Hence $y_t$ is independent of $(x_0,\ldots,x_{t-1})$ (and therefore independent of $(x_0,y_1,\ldots,y_{t-1})$), and the distribution of $y_t$ is $\mathcal{N}(0,I)$. By recursion, $x_0,y_1,\ldots,y_{t-1},y_t$ are independent.

For every $t \ge 1$, $x_t = \sqrt{1-\beta_t}x_{t-1} + \sqrt{\beta_t}y_t$. By recursion, $$x_t = \prod_{k=1}^t\sqrt{1-\beta_k} x_0 + \sum_{k=1}^t \Big(\prod_{\ell = k+1}^t \sqrt{1-\beta_\ell} \Big) \sqrt{\beta_k}y_k,$$ with the convention that the empty product $\prod_{\ell=t+1}^t$ equals $1$. Hence the conditional law of $x_t$ given $x_0$ is Gaussian with expectation $\prod_{k=1}^t\sqrt{1-\beta_k} x_0$ and covariance matrix \begin{eqnarray*} \sum_{k=1}^t \Big(\prod_{\ell = k+1}^t (1-\beta_\ell) \Big)\beta_k I &=& \sum_{k=1}^t \Big(\prod_{\ell = k+1}^t (1-\beta_\ell) -\prod_{\ell = k}^t (1-\beta_\ell) \Big) I \\ &=& \Big(1 - \prod_{\ell = 1}^t (1-\beta_\ell) \Big) I. \end{eqnarray*} Here, we used the equality $\beta_k=1-(1-\beta_k)$ to get a telescoping sum.
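As a numerical sanity check of this closed form (a sketch with an arbitrary schedule and starting point), one can simulate the chain from a fixed $x_0$ many times and compare the empirical mean and variance with $\prod_k\sqrt{1-\beta_k}\,x_0$ and $1-\prod_k(1-\beta_k)$:

```python
import numpy as np

rng = np.random.default_rng(2)
betas = np.linspace(1e-4, 0.02, 50)   # arbitrary short schedule
x0 = np.array([3.0, -2.0])            # a fixed starting point

n = 100_000
x = np.tile(x0, (n, 1))               # n independent chains, all started at the same x0
for beta in betas:
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

prod = np.prod(1 - betas)
print(x.mean(axis=0), np.sqrt(prod) * x0)   # empirical vs. closed-form conditional mean
print(x.var(axis=0), 1 - prod)              # empirical vs. closed-form conditional variance
```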

If the series $\sum_k \beta_k$ diverges, then $\prod_{\ell = 1}^t (1-\beta_\ell) \le \exp\big(-\sum_{\ell=1}^t \beta_\ell\big) \to 0$ as $t \to +\infty$, so $\mathcal{L}(x_t|x_0) \to \mathcal{N}(0,I)$ as $t \to +\infty$. Since this holds for every $x_0$, the unconditional law of $x_t$ also converges to $\mathcal{N}(0,I)$.
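For example, with a linear schedule of the kind commonly used in DDPM-style implementations (my assumption here, not part of the answer), the product is already negligible at $t=1000$:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # illustrative linear schedule, t = 1000
prod = np.prod(1 - betas)
print(prod)            # roughly 4e-5: Cov(x_t | x_0) is essentially I
print(np.sqrt(prod))   # roughly 6e-3: the coefficient multiplying x_0 in E[x_t | x_0]
```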

  • Addition: as an intermediate step for the first equation in the covariance derivation, a $1$ was added: $\beta_k = 1-(1-\beta_k)$. The telescoping sum was then exploited in the second step, where $\beta_{t+1}=1$ w.l.o.g. – Butters Feb 14 '23 at 14:53
2

For those who struggle as much as I did to understand the derivation of the covariance matrix, here's some supplementary good ol' fashioned mathematical rigour:

Start by letting $y_t = (x_t - \sqrt{1-\beta_t}x_{t-1})/\sqrt{\beta_t}$, as in Christophe's answer. Solving this for $x_t$ and expanding it recursively, we get \begin{align} x_t &= \sqrt{\beta_t} y_t + \sqrt{1-\beta_{t}} x_{t-1} \\ &= \sqrt{\beta_t} y_t + \sqrt{1-\beta_{t}} \left( \sqrt{\beta_{t-1}} y_{t-1} + \sqrt{1-\beta_{t-1}} \left(... + \sqrt{1-\beta_1}x_0\right)...\right)\\ &= \sqrt{\beta_t} y_t + \sqrt{1-\beta_{t}} \sqrt{\beta_{t-1}} y_{t-1} + ... + \sqrt{1-\beta_t}\cdot...\cdot\sqrt{1-\beta_1}x_0 \end{align} We can see that in the $k$th term of the sum, we have a $y_k$ and a coefficient consisting of $\sqrt{\beta_k}$ times the product of all $\sqrt{1-\beta_l}$ for $l = k+1, \dots, t$.

Note that for the term where $k=t$, this product is empty ($l$ would have to run from $t+1$ to $t$). With the usual convention that an empty product equals $1$, this term simply carries the coefficient $\sqrt{\beta_t}$, and no undefined $\beta_{t+1}$ ever appears.

This results in the sum \begin{align} x_t &= \prod_{k=1}^t\sqrt{1-\beta_k}\,x_0 + \sum_{k=1}^t \sqrt{\beta_k}\, y_k \prod_{l=k+1}^t \sqrt{1-\beta_l}. \end{align}

The conditional distribution of $x_t$ given $x_0$ then has a covariance matrix with contributions only from the noise part, $$ \sum_{k=1}^t \sqrt{\beta_k}\, y_k \prod_{l=k+1}^t \sqrt{1-\beta_l}. $$

As explained in Christophe's answer, the $y_k$ are i.i.d. with identity covariance. This gives \begin{align} Cov(x_t|x_0) &= I \sum_{k=1}^t \left( {\beta_k} \prod_{l=k+1}^t (1-\beta_l)\right) \end{align} Adding and subtracting $\prod_{l=k+1}^t (1-\beta_l)$ inside the parentheses gives \begin{align} Cov(x_t|x_0) &= I \sum_{k=1}^t \left( \prod_{l=k+1}^t (1-\beta_l) -\prod_{l=k+1}^t (1-\beta_l) + {\beta_k} \prod_{l=k+1}^t (1-\beta_l) \right) \\ &= I \sum_{k=1}^t \left( \prod_{l=k+1}^t (1-\beta_l) - (1-\beta_k)\prod_{l=k+1}^t (1-\beta_l) \right) \\ &= I \sum_{k=1}^t \left( \prod_{l=k+1}^t (1-\beta_l) - \prod_{l=k}^t (1-\beta_l) \right) \end{align} Notice that each $\prod_{l=k+1}^t (1-\beta_l)$ cancels against the $- \prod_{l=k}^t (1-\beta_l)$ of the next term in the sum. We are therefore only left with $- \prod_{l=k}^t (1-\beta_l)$ for $k=1$ and $\prod_{l=k+1}^t (1-\beta_l)$ for $k=t$, which is the empty product and equals $1$. Note that this cancellation trick cannot blindly be applied to infinite series, but here it is fine, as $t$ is finite and we only look at what happens as $t$ increases. This results in \begin{align} Cov(x_t|x_0) &= I \left(1 - \prod_{l=1}^t (1-\beta_l) \right) \end{align} From this, we see that the only requirement on the $\beta$-schedule is that $$ \lim_{t\to\infty} \prod_{l=1}^t (1-\beta_l) = 0, $$ which is a fairly soft requirement, as 'most' infinite products with factors less than $1$ are $0$.

And so, if this is satisfied, the conditional mean $\prod_{l=1}^t\sqrt{1-\beta_l}\,x_0$ vanishes as well, and $x_t \to \mathcal{N}(0,I)$ in distribution as $t \to \infty$.
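A quick numerical check of the telescoping identity $\sum_{k=1}^t \beta_k \prod_{l=k+1}^t(1-\beta_l) = 1-\prod_{l=1}^t(1-\beta_l)$ used above (a sketch; the random schedule is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
betas = rng.uniform(0.01, 0.5, size=20)   # arbitrary beta_k in (0, 1)

lhs = sum(b * np.prod(1 - betas[k + 1:]) for k, b in enumerate(betas))
rhs = 1 - np.prod(1 - betas)
print(lhs, rhs)   # agree up to floating-point error
```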

1

I want to give an easier (compared to the previous explanations) and more intuitive argument, though it is not intended to be mathematically rigorous.

Given the transitions,

$$p(x_t|x_{t-1})=\mathcal N(\sqrt{1-\beta_t}x_{t-1},\beta_tI),\quad\beta_t\in(0,1)$$

we can write a sample $x_t$ of the chain, conditioned on the starting point $x_0$, directly as follows:

$$x_t = \sqrt{\prod_{s=1}^t \alpha_s }\, x_0 + \sqrt{1-\prod_{s=1}^t \alpha_s }\, \varepsilon$$

where $x_0 \sim p_{\rm{data}}(x_0)$, $\alpha_s := 1-\beta_s$, and $\varepsilon \sim \mathcal{N}(0,I)$. This follows from recursively applying the reparametrization trick, and can be verified in the original publication (see question).

Now, assuming that $\lim_{t\to \infty} \sqrt{\prod_{s=1}^t \alpha_s } = 0 $, we have indeed that

$$ \lim_{t\to \infty} x_t = \varepsilon.$$
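A minimal sketch of this one-shot sampling formula, checked against running the chain step by step (the schedule and the toy $x_0$ distribution are my own placeholders, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(4)
betas = np.linspace(1e-4, 0.02, 200)
alpha_bar = np.prod(1 - betas)                  # \prod_{s=1}^t alpha_s

x0 = rng.uniform(2.0, 3.0, size=(50_000, 2))    # toy data, far from N(0, I)

# one-shot sample of x_t given x_0, via the closed form above
xt_direct = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * rng.standard_normal(x0.shape)

# iterative sampling through the chain, for comparison
xt_iter = x0.copy()
for beta in betas:
    xt_iter = np.sqrt(1 - beta) * xt_iter + np.sqrt(beta) * rng.standard_normal(xt_iter.shape)

print(xt_direct.mean(axis=0), xt_iter.mean(axis=0))   # means agree
print(xt_direct.std(axis=0), xt_iter.std(axis=0))     # standard deviations agree
```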

Butters