32

I am trying to understand the proof for the Kullback-Leibler divergence between two multivariate normal distributions. Along the way, a kind of trace trick is applied to the expectation of the quadratic form: $$E[ (x-\mu)^T \Sigma^{-1} (x-\mu) ]= \operatorname{trace}(E[(x-\mu)(x-\mu)^T] \Sigma^{-1}),$$

where $x$ is MV-normal with mean $\mu$ and covariance matrix $\Sigma$. The expectation is taken over $x$.

I would like to understand why this identity holds. I think more than one step is taken at once. I believe that $\operatorname{trace}(E[(x-\mu)(x-\mu)^T] \Sigma^{-1}) = \operatorname{trace}(E[(x-\mu) \Sigma^{-1} (x-\mu)^T])$, but where does the trace come from?

amWhy
  • 210,739
tomka
  • 968
  • 5
    The "trace trick" is the cyclic property of the trace operator: If the matrix products make sense, then $\operatorname{tr} (ABC) = \operatorname{tr}(CAB) = \operatorname{tr}(BCA)$. – grand_chat Apr 10 '17 at 23:30
  • Remember that a quadratic form is always a scalar (1x1 matrix). – igorkf Mar 10 '24 at 02:09

2 Answers

45

Where does the trace come from?

A real number can be thought of as a $1 \times 1$ matrix, and its trace is the number itself. The quadratic form is a scalar, so $$(x-\mu)^\top \Sigma^{-1} (x-\mu) = \operatorname{tr}\left((x-\mu)^\top \Sigma^{-1} (x-\mu)\right).$$

More than one step is taken at once.

After applying the above step, use the cyclic property of the trace, $\operatorname{tr}(ABC) = \operatorname{tr}(CAB)$, to obtain $$\operatorname{tr}\left((x-\mu)^\top \Sigma^{-1} (x-\mu)\right) = \operatorname{tr}\left((x-\mu)(x-\mu)^\top \Sigma^{-1} \right).$$ The trace is linear, so it commutes with the expectation, and $\Sigma^{-1}$ is a constant matrix that can be pulled out of the expectation: $$E\left[\operatorname{tr}\left((x-\mu)(x-\mu)^\top \Sigma^{-1} \right)\right] = \operatorname{tr}\left(E\left[(x-\mu)(x-\mu)^\top\right] \Sigma^{-1} \right).$$
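As a rough numerical sanity check of the steps above, here is a minimal NumPy sketch. The dimension `d`, sample size `n`, mean `mu`, and the way `Sigma` is built are arbitrary choices made for illustration, not anything from the question. It compares the sample average of the quadratic form with the trace of the empirical second moment times $\Sigma^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200_000  # illustrative dimension and sample size (assumptions)

# Arbitrary mean and a symmetric positive definite covariance for the demo.
mu = np.array([1.0, -2.0, 0.5])
B = rng.standard_normal((d, d))
Sigma = B @ B.T + d * np.eye(d)
Sigma_inv = np.linalg.inv(Sigma)

# Draw samples x ~ N(mu, Sigma) and center them with the known mean.
x = rng.multivariate_normal(mu, Sigma, size=n)
centered = x - mu

# Left-hand side: sample average of (x - mu)^T Sigma^{-1} (x - mu).
quad_mean = np.mean(np.einsum("ni,ij,nj->n", centered, Sigma_inv, centered))

# Right-hand side: trace of (empirical E[(x - mu)(x - mu)^T]) times Sigma^{-1}.
second_moment = centered.T @ centered / n
trace_form = np.trace(second_moment @ Sigma_inv)

print(quad_mean, trace_form, d)
```

The two sample quantities agree up to floating-point error, because the cyclic property holds sample by sample. Both should also be close to `d`, since $E[(x-\mu)(x-\mu)^\top] = \Sigma$ gives $\operatorname{tr}(\Sigma \Sigma^{-1}) = \operatorname{tr}(I_d) = d$.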

angryavian
  • 93,534
20

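To add to @angryavian's answer, you can swap expectation and trace because, for an $N \times N$ random matrix $A$ with entries $a_{ij}$, $$ \operatorname{tr}(E[A]) = \operatorname{tr}\begin{bmatrix} E[a_{11}] & E[a_{12}] & \cdots \\ E[a_{21}] & E[a_{22}] & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} = \sum\limits_{i=1}^N E[a_{ii}] = E\left[\sum\limits_{i=1}^N a_{ii}\right] = E[\operatorname{tr}(A)]. $$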
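The swap of trace and expectation can also be checked on sample averages. The sketch below uses arbitrary random matrices; the sizes `N` and `n_samples` and the Gaussian entries are assumptions made only for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_samples = 4, 10_000  # illustrative sizes (assumptions)

# A stack of random N x N matrices standing in for draws of A.
A = rng.standard_normal((n_samples, N, N))

# Trace of the averaged matrix: the sample analogue of tr(E[A]).
trace_of_mean = np.trace(A.mean(axis=0))

# Average of the per-sample traces: the sample analogue of E[tr(A)].
mean_of_trace = np.trace(A, axis1=1, axis2=2).mean()

print(trace_of_mean, mean_of_trace)
```

Both numbers match up to floating-point error, since averaging and summing the diagonal are both linear operations and can be done in either order.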

drerD
  • 619