7

The background of this question is a generative process called the reverse diffusion process, where one starts with a data distribution $x_0\sim p_{\rm data}(x_0)$ (each sample lies in $\mathbb{R}^D$) and defines a Markov chain (called the diffusion process) $x_0,x_1,\cdots,x_T$ with $T$ sufficiently large, where the transitions are $$p(x_t|x_{t-1})=\mathcal N(\sqrt{1-\beta_t}x_{t-1},\beta_tI),\quad\beta_t\in(0,1).$$ The generative process learns to reverse the diffusion process in order to model $p_{\rm data}(x_0)$. An assumption is made that $p(x_T)=\mathcal N(0,I)$, so that the reverse process can start from $\mathcal N(0,I)$, from which numerical sampling is rather easy.
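For concreteness, here is a minimal NumPy sketch of the forward chain just described (the linear `betas` schedule and the toy data distribution are placeholders chosen for illustration, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative beta schedule, each beta_t in (0, 1)

def forward_diffusion(x0, betas, rng):
    """Run the chain x_0 -> x_1 -> ... -> x_T with
    p(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    x = x0
    for beta in betas:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
    return x

# toy "data" distribution, deliberately far from N(0, I)
x0 = rng.uniform(5.0, 6.0, size=(10_000, 2))
xT = forward_diffusion(x0, betas, rng)
print(xT.mean(axis=0), xT.std(axis=0))   # close to [0, 0] and [1, 1]
```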

My question is whether this assumption is mathematically valid: does $p(x_T)$ tend to $\mathcal N(0,I)$ as $T\to\infty$?

Intuitively this makes sense because:

  • Each transition adds Gaussian noise to the previous state; it makes sense for the limiting distribution (if one exists) to be Gaussian.
  • $\mathcal N(0,I)$ is invariant under transitions of the form $p(x'|x)=\mathcal N(\sqrt{1-\beta}x,\beta I)$: $$p(x')=\int p(x'|x)p(x){\rm d}x=\int\frac{1}{(2\pi\beta)^{D/2}}e^{-|x'-\sqrt{1-\beta}x|^2/(2\beta)}\frac{1}{(2\pi)^{D/2}}e^{-|x|^2/2}{\rm d}x=\frac{1}{(2\pi)^{D/2}}e^{-|x'|^2/2}$$ $$\implies x'\sim\mathcal N(0,I).$$ (A quick numerical check of this invariance is sketched just below this list.)
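A quick Monte Carlo check of the invariance property (a sketch only; the value of $\beta$, the dimension, and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
beta, D, n = 0.1, 3, 200_000

x = rng.standard_normal((n, D))            # x ~ N(0, I)
x_next = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal((n, D))

# one transition step should leave N(0, I) unchanged
print(x_next.mean(axis=0))                 # ~ [0, 0, 0]
print(np.cov(x_next, rowvar=False))        # ~ identity matrix
```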

However, I cannot prove that the limiting distribution is indeed $\mathcal N(0,I)$. Any help is appreciated.

trisct
  • Really interested in this question! (I have the same question too) Can we use some "contraction" property of the transition probability to show it's getting closer and closer to a Gaussian? – Binxu Wang 王彬旭 Jun 08 '22 at 04:18
  • @BinxuWang王彬旭 I had the same thought. Unfortunately I don't know enough theorems from random process theory. – trisct Jun 08 '22 at 04:19

3 Answers

4

Let $y_t = (x_t-\sqrt{1-\beta_t}x_{t-1})/\sqrt{\beta_t}$ for $t \ge 1$. By construction, the conditional law of $y_t$ given $(x_0,\ldots,x_{t-1})$ is $\mathcal{N}(0,I)$. Hence $y_t$ is independent of $(x_0,\ldots,x_{t-1})$ (and therefore independent of $(x_0,y_1,\ldots,y_{t-1})$), and the distribution of $y_t$ is $\mathcal{N}(0,I)$. By recursion, $x_0,y_1,\ldots,y_{t-1},y_t$ are independent.

For every $t \ge 1$, $x_t = \sqrt{1-\beta_t}x_{t-1} + \sqrt{\beta_t}y_t$. By recursion, $$x_t = \prod_{k=1}^t\sqrt{1-\beta_k} x_0 + \sum_{k=1}^t \Big(\prod_{\ell = k+1}^t \sqrt{1-\beta_\ell} \Big) \sqrt{\beta_k}y_k,$$ with the convention that the empty product $\prod_{\ell=t+1}^t$ equals $1$. Hence the conditional law of $x_t$ given $x_0$ is Gaussian with expectation $\prod_{k=1}^t\sqrt{1-\beta_k} x_0$ and covariance matrix \begin{eqnarray*} \sum_{k=1}^t \Big(\prod_{\ell = k+1}^t (1-\beta_\ell) \Big)\beta_k I &=& \sum_{k=1}^t \Big(\prod_{\ell = k+1}^t (1-\beta_\ell) -\prod_{\ell = k}^t (1-\beta_\ell) \Big) I \\ &=& \Big(1 - \prod_{\ell = 1}^t (1-\beta_\ell) \Big) I. \end{eqnarray*} Here, we used the equality $\beta_k=1-(1-\beta_k)$ to get a telescoping sum.
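As a numerical sanity check of this closed form (a sketch with an arbitrary schedule and starting point), one can simulate the chain from a fixed $x_0$ many times and compare the empirical mean and variance with $\prod_k\sqrt{1-\beta_k}\,x_0$ and $1-\prod_k(1-\beta_k)$:

```python
import numpy as np

rng = np.random.default_rng(2)
betas = np.linspace(1e-4, 0.02, 50)   # arbitrary short schedule
x0 = np.array([3.0, -2.0])            # a fixed starting point

n = 100_000
x = np.tile(x0, (n, 1))               # n independent chains, all started at the same x0
for beta in betas:
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

prod = np.prod(1 - betas)
print(x.mean(axis=0), np.sqrt(prod) * x0)   # empirical vs. closed-form conditional mean
print(x.var(axis=0), 1 - prod)              # empirical vs. closed-form conditional variance
```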

If the series $\sum_k \beta_k$ diverges, then $\prod_{\ell = 1}^t (1-\beta_\ell) \le \exp\big(-\sum_{\ell=1}^t \beta_\ell\big) \to 0$ as $t \to +\infty$, so $\mathcal{L}(x_t|x_0) \to \mathcal{N}(0,I)$ as $t \to +\infty$. Since this holds for every $x_0$, the unconditional law of $x_t$ also converges to $\mathcal{N}(0,I)$.
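For example, with a linear schedule of the kind commonly used in DDPM-style implementations (my assumption here, not part of the answer), the product is already negligible at $t=1000$:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # illustrative linear schedule, t = 1000
prod = np.prod(1 - betas)
print(prod)            # roughly 4e-5: Cov(x_t | x_0) is essentially I
print(np.sqrt(prod))   # roughly 6e-3: the coefficient multiplying x_0 in E[x_t | x_0]
```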

  • Addition: as an intermediate step for the first equation in the covariance derivation, a $1$ was added: $\beta_k = 1-(1-\beta_k)$. The telescoping sum was then exploited in the second step, where $\beta_{t+1}=1$ w.l.o.g. – Butters Feb 14 '23 at 14:53
2

For those who struggle as much as I did to understand the derivation of the covariance matrix, here's some supplementary good ol' fashioned mathematical rigour:

Start by letting $y_t = (x_t - \sqrt{1-\beta_t}x_{t-1})/\sqrt{\beta_t}$, as in Christophe's answer. Solving this for $x_t$ and expanding it recursively, we get \begin{align} x_t &= \sqrt{\beta_t} y_t + \sqrt{1-\beta_{t}} x_{t-1} \\ &= \sqrt{\beta_t} y_t + \sqrt{1-\beta_{t}} \left( \sqrt{\beta_{t-1}} y_{t-1} + \sqrt{1-\beta_{t-1}} \left(... + \sqrt{1-\beta_1}x_0\right)...\right)\\ &= \sqrt{\beta_t} y_t + \sqrt{1-\beta_{t}} \sqrt{\beta_{t-1}} y_{t-1} + ... + \sqrt{1-\beta_t}\cdot...\cdot\sqrt{1-\beta_1}x_0 \end{align} We can see that in the $k$th term of the sum, we have a $y_k$ and a coefficient consisting of $\sqrt{\beta_k}$ times the product of all $\sqrt{1-\beta_l}$ for $l = k+1, \dots, t$.

Note that for the term where $k=t$, this product is empty ($l$ would have to run from $t+1$ to $t$). With the usual convention that an empty product equals $1$, this term simply carries the coefficient $\sqrt{\beta_t}$, and no undefined $\beta_{t+1}$ ever appears.

This results in the sum \begin{align} x_t &= \prod_{k=1}^t\sqrt{1-\beta_k}\,x_0 + \sum_{k=1}^t \sqrt{\beta_k}\, y_k \prod_{l=k+1}^t \sqrt{1-\beta_l}. \end{align}

The conditional distribution of $x_t$ given $x_0$ then has a covariance matrix with contributions only from the noise part, $$ \sum_{k=1}^t \sqrt{\beta_k}\, y_k \prod_{l=k+1}^t \sqrt{1-\beta_l}. $$

As explained in Christophe's answer, the $y_k$ are i.i.d. with identity covariance. This gives \begin{align} Cov(x_t|x_0) &= I \sum_{k=1}^t \left( {\beta_k} \prod_{l=k+1}^t (1-\beta_l)\right) \end{align} Adding and subtracting $\prod_{l=k+1}^t (1-\beta_l)$ inside the parentheses gives \begin{align} Cov(x_t|x_0) &= I \sum_{k=1}^t \left( \prod_{l=k+1}^t (1-\beta_l) -\prod_{l=k+1}^t (1-\beta_l) + {\beta_k} \prod_{l=k+1}^t (1-\beta_l) \right) \\ &= I \sum_{k=1}^t \left( \prod_{l=k+1}^t (1-\beta_l) - (1-\beta_k)\prod_{l=k+1}^t (1-\beta_l) \right) \\ &= I \sum_{k=1}^t \left( \prod_{l=k+1}^t (1-\beta_l) - \prod_{l=k}^t (1-\beta_l) \right) \end{align} Notice that each $\prod_{l=k+1}^t (1-\beta_l)$ cancels against the $- \prod_{l=k}^t (1-\beta_l)$ of the next term in the sum. We are therefore only left with $- \prod_{l=k}^t (1-\beta_l)$ for $k=1$ and $\prod_{l=k+1}^t (1-\beta_l)$ for $k=t$, which is the empty product and equals $1$. Note that this cancellation trick cannot blindly be applied to infinite series, but here it is fine, as $t$ is finite and we only look at what happens as $t$ increases. This results in \begin{align} Cov(x_t|x_0) &= I \left(1 - \prod_{l=1}^t (1-\beta_l) \right) \end{align} From this, we see that the only requirement on the $\beta$-schedule is that $$ \lim_{t\to\infty} \prod_{l=1}^t (1-\beta_l) = 0, $$ which is a fairly soft requirement, as 'most' infinite products with factors less than $1$ are $0$.

And so, if this is satisfied, the conditional mean $\prod_{l=1}^t\sqrt{1-\beta_l}\,x_0$ vanishes as well, and $x_t \to \mathcal{N}(0,I)$ in distribution as $t \to \infty$.
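A quick numerical check of the telescoping identity $\sum_{k=1}^t \beta_k \prod_{l=k+1}^t(1-\beta_l) = 1-\prod_{l=1}^t(1-\beta_l)$ used above (a sketch; the random schedule is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
betas = rng.uniform(0.01, 0.5, size=20)   # arbitrary beta_k in (0, 1)

lhs = sum(b * np.prod(1 - betas[k + 1:]) for k, b in enumerate(betas))
rhs = 1 - np.prod(1 - betas)
print(lhs, rhs)   # agree up to floating-point error
```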

1

I want to give an easier (compared to the previous explanations) and more intuitive argument, though it is not intended to be mathematically rigorous.

Given the transitions,

$$p(x_t|x_{t-1})=\mathcal N(\sqrt{1-\beta_t}x_{t-1},\beta_tI),\quad\beta_t\in(0,1)$$

we can write a sample $x_t$ of the chain, conditioned on the starting point $x_0$, directly as follows:

$$x_t = \sqrt{\prod_{s=1}^t \alpha_s }\, x_0 + \sqrt{1-\prod_{s=1}^t \alpha_s }\, \varepsilon$$

where $x_0 \sim p_{\rm{data}}(x_0)$, $\alpha_s := 1-\beta_s$, and $\varepsilon \sim \mathcal{N}(0,I)$. This follows from recursively applying the reparametrization trick, and can be verified in the original publication (see question).

Now, assuming that $\lim_{t\to \infty} \sqrt{\prod_{s=1}^t \alpha_s } = 0 $, we have indeed that

$$ \lim_{t\to \infty} x_t = \varepsilon.$$
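A minimal sketch of this one-shot sampling formula, checked against running the chain step by step (the schedule and the toy $x_0$ distribution are my own placeholders, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(4)
betas = np.linspace(1e-4, 0.02, 200)
alpha_bar = np.prod(1 - betas)                  # \prod_{s=1}^t alpha_s

x0 = rng.uniform(2.0, 3.0, size=(50_000, 2))    # toy data, far from N(0, I)

# one-shot sample of x_t given x_0, via the closed form above
xt_direct = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * rng.standard_normal(x0.shape)

# iterative sampling through the chain, for comparison
xt_iter = x0.copy()
for beta in betas:
    xt_iter = np.sqrt(1 - beta) * xt_iter + np.sqrt(beta) * rng.standard_normal(xt_iter.shape)

print(xt_direct.mean(axis=0), xt_iter.mean(axis=0))   # means agree
print(xt_direct.std(axis=0), xt_iter.std(axis=0))     # standard deviations agree
```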

Butters