
I'm learning PCA and I found the following optimization problem in pages 9 and 13 of Afonso Bandeira's lecture notes.

$$\begin{array}{ll} \underset{V \in \mathbb{R}^{n \times d}}{\text{maximize}} & \mbox{tr} \left(V^T \Sigma V \right)\\ \text{subject to} & V^T V = I_{d}\end{array}$$

where $\Sigma$ is the covariance matrix.

The solution is the first $d$ eigenvectors of $\Sigma$, but I don't know how to derive it. I tried using Lagrange multipliers but could not get to the solution.

wz0919
  • https://researchweb.iiit.ac.in/~anurag.ghosh/notes/Tutorial-Understanding-PCA-As-Optimization – Dhanvi Sreenivasan Jun 27 '20 at 03:36
  • I think the typical way to do this is via induction and a version of the max-min theorem that characterizes the $d$th largest eigenvalue as the max over all $d$ dimensional subspaces of the minimum Rayleigh quotient over that subspace (which is attained on the span of the top $d$ eigenvectors). – Jason Gaitonde Jun 27 '20 at 03:37
  • I believe this is a duplicate of https://math.stackexchange.com/q/3637453/27978 – copper.hat Jun 27 '20 at 04:01

2 Answers


Write $\Sigma=XDX^t$ where $X$ is orthonormal and $D$ is diagonal with non-negative entries.

We want to maximize $tr(V^tXDX^tV)$. Consider the transformation $W=X^tV$ and observe that $W^tW=V^tXX^tV=V^tV=I$. Since $X^t$ is an invertible matrix, this defines an invertible transformation on the space of allowable $V$s, so the original optimization problem is equivalent to

$$\max_{W^tW=I_d} Tr(W^tDW).$$

On the other hand, $Tr(W^tDW)=Tr(DWW^T)=\sum_i d_i (WW^T)_{ii}$.
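For a quick numerical sanity check of this change of variables, here is a minimal numpy sketch (not part of the argument; the names just mirror the notation above, with `evals` playing the role of the diagonal of $D$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 2

# Random covariance matrix and a random feasible V (orthonormal columns)
A = rng.standard_normal((n, n))
Sigma = A @ A.T
V, _ = np.linalg.qr(rng.standard_normal((n, d)))      # V^t V = I_d

# Eigendecomposition Sigma = X D X^t, eigenvalues sorted in decreasing order
evals, X = np.linalg.eigh(Sigma)
evals, X = evals[::-1], X[:, ::-1]

W = X.T @ V
lhs = np.trace(V.T @ Sigma @ V)                       # original objective
mid = np.trace(W.T @ np.diag(evals) @ W)              # objective after W = X^t V
rhs = np.sum(evals * np.diag(W @ W.T))                # sum_i d_i (W W^t)_{ii}
print(np.isclose(lhs, mid), np.isclose(mid, rhs))     # True True
```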

Lemma

$0\leq (WW^T)_{ii}\leq 1$.

Proof of lemma

The first inequality is clear, because $(WW^T)_{ii}$ is the squared norm of the $i$th row of $W$. To establish the second, observe that for any matrix $M$, the norm of any column of $M$ is bounded by the largest singular value of $M$. This follows immediately from the characterization $\sigma_1(M)=\sup_{|v|=1} |Mv|$, and noting that the $i$th column is given by $Me_i$, where $e_i$ is a standard basis vector. Furthermore, it is a general fact that the singular values of $M$ are the square roots of the eigenvalues of $MM^T$. In particular, since $W^tW=I$, we conclude that all singular values of $W^t$ are equal to 1, and consequently the norm of each column of $W^t$ is bounded by 1.

(end proof of lemma)
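A small numerical illustration of the lemma, together with the fact (used in the next paragraph) that the diagonal entries of $WW^T$ sum to $d$; this is just a sketch with a random feasible $W$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 7, 3
W, _ = np.linalg.qr(rng.standard_normal((n, d)))   # W^t W = I_d
diag = np.diag(W @ W.T)

print(diag.min() >= 0, diag.max() <= 1 + 1e-12)    # both True: 0 <= (W W^t)_{ii} <= 1
print(np.isclose(diag.sum(), d))                   # True: tr(W W^t) = tr(W^t W) = d
```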

Given the constraints on $(WW^T)_{ii}$, and noting that $\sum_i (WW^T)_{ii} = Tr(WW^T) = Tr(W^tW) = d$, it is clear that $\sum_i d_i (WW^T)_{ii}$ is maximized when $(WW^T)_{ii}=1$ for $i\leq d$ and $0$ otherwise (we assume WLOG that the entries of $D$ are ordered from largest to smallest). This can be attained by setting the $i$th column of $W$ to be $e_i$ for $i\leq d$, i.e. $W=[e_1,\ldots,e_d]$. Finally, remembering that $W=X^tV$ where $X$ is the matrix of eigenvectors of $\Sigma$, we see that $V=XW$ consists precisely of the top $d$ eigenvectors of $\Sigma$.
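If you want to check the conclusion numerically, here is a minimal sketch (illustrative sizes only) comparing the objective at the top-$d$ eigenvectors against random feasible points:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 3
A = rng.standard_normal((n, n))
Sigma = A @ A.T                                    # plays the role of the covariance matrix

evals, X = np.linalg.eigh(Sigma)
evals, X = evals[::-1], X[:, ::-1]                 # decreasing eigenvalues
V_top = X[:, :d]                                   # top-d eigenvectors

best = np.trace(V_top.T @ Sigma @ V_top)
print(np.isclose(best, evals[:d].sum()))           # True: objective equals sum of top-d eigenvalues

# no random feasible point should beat it (up to numerical noise)
for _ in range(1000):
    V, _ = np.linalg.qr(rng.standard_normal((n, d)))
    assert np.trace(V.T @ Sigma @ V) <= best + 1e-9
print("no random feasible point exceeded the top-d value")
```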

Simon Segert

References:

  • "Generalized principal component analysis" by Vidal, Ma, Sastry.
  • "Calculus on normed vector spaces" by Coleman.

Consider a symmetric positive definite matrix ${ \Sigma \in \mathbb{R} ^{n \times n} .}$ Consider the optimization problem

$${ {\begin{align} &\, \underset{V \in \mathbb{R} ^{n \times d}}{\text{maximize}} \, \, \text{tr}(V ^T \Sigma V) \\ &\, \text{subject to } \, \, V ^T V = I _d . \end{align}} }$$

Since ${ \Sigma }$ is symmetric positive definite, we can write

$${ \Sigma P = P D }$$

with ${ P = [P _1, \ldots, P _n] }$ orthonormal and ${ D = \text{diag}(\lambda _1, \ldots, \lambda _n) }$ with ${ \lambda _1 \geq \ldots \geq \lambda _n > 0 . }$

Setting ${ X := P ^T V , }$ the optimization is

$${ {\begin{align} &\, \underset{X \in \mathbb{R} ^{n \times d}}{\text{maximize}} \, \, \text{tr}(X ^T D X) \\ &\, \text{subject to } \, \, X ^T X = I _d . \end{align}} }$$

Recall Lagrange multipliers for complete normed spaces. This is from "Calculus on normed vector spaces" by Coleman.

Thm [Lagrange multipliers]:
Let ${ E , F }$ be complete normed spaces, ${ O }$ an open subset of ${ E ,}$ and ${ f : O \longrightarrow \mathbb{R} }$ and ${ g : O \longrightarrow F }$ be ${ C ^1 }$ maps. Suppose that ${ a \in A = g ^{-1} (0) }$ and that ${ f }$ has a relative extremum (minimum or maximum) at ${ a . }$ If ${ g ^{'} (a) }$ is surjective and ${ \text{ker} \, (g ^{'} (a)) }$ has a closed complement ${ L , }$ then there is a unique ${ \lambda \in F ^{*} }$ such that ${ (f - \lambda \circ g ) ^{'} (a) = 0 . }$

Consider the maps

$${ f : \mathbb{R} ^{n \times d} \longrightarrow \mathbb{R}, \quad f(X) = \text{tr}(X ^T D X) }$$

and

$${ g : \mathbb{R} ^{n \times d} \longrightarrow \mathbb{R} ^{d \times d} _{\text{sym}}, \quad g(X) = X ^T X - I _d. }$$

Note that ${ 0 }$ is a regular value of ${ g . }$

We have ${ (X + H) ^T (X + H) - X ^T X }$ ${ = X ^T H + H ^T X + o(\lVert H \rVert) }$ hence

$${ Dg (X) \, H = X ^T H + H ^T X . }$$
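A quick finite-difference check of this derivative formula (just a sketch; `g` below is the constraint map defined above):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 2
X = rng.standard_normal((n, d))
H = rng.standard_normal((n, d))

g = lambda X: X.T @ X - np.eye(d)                  # the constraint map g(X) = X^T X - I_d
eps = 1e-6
fd = (g(X + eps * H) - g(X)) / eps                 # finite-difference directional derivative
analytic = X.T @ H + H.T @ X                       # Dg(X) H
print(np.allclose(fd, analytic, atol=1e-4))        # True
```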

Let ${ A \in g ^{-1} (0) . }$ We are to show

$${\text{To show:} \quad Dg(A) : \mathbb{R} ^{n \times d} \longrightarrow \mathbb{R} ^{d \times d} _{\text{sym}} \, \, \text{ is surjective}. }$$

Let ${ S \in \mathbb{R} ^{d \times d} _{\text{sym}} . }$ For ${ H = AS / 2 }$ we have

$${ {\begin{align} &\, Dg(A) \, H \\ = &\, A ^T H + H ^T A \\ = &\, \frac{1}{2} A ^T A S + \frac{1}{2} S ^T A ^T A \\ = &\, \frac{1}{2} S + \frac{1}{2} S ^T \\ = &\, S, \end{align}} }$$

as needed.
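Numerically, the construction ${ H = AS/2 }$ can be checked as follows (a sketch; ${ A }$ is an arbitrary feasible point generated via a QR factorization):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 5, 2
A, _ = np.linalg.qr(rng.standard_normal((n, d)))   # A^T A = I_d, so A is in g^{-1}(0)
S = rng.standard_normal((d, d))
S = (S + S.T) / 2                                  # arbitrary symmetric target
H = A @ S / 2
print(np.allclose(A.T @ H + H.T @ A, S))           # True: Dg(A) H = S, so Dg(A) is onto
```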

Since ${ g ^{-1} (0) }$ is compact, ${ f }$ has a relative maximum at some ${ U \in g ^{-1} (0) . }$ Now by Lagrange multipliers, there is a ${ \Lambda \in \mathbb{R} ^{d \times d} }$ (unique up to its symmetric part) such that ${ U }$ is a critical point of

$${ {\begin{align} \mathcal{L}(X) = &\, f(X) - \text{tr}(\Lambda ^T g(X)) \\ = &\, \text{tr}(X ^T D X ) - \text{tr}(\Lambda ^T (X ^T X - I _d)) . \end{align}} }$$

Note the derivatives

$${ {\begin{align} \frac{\partial \mathcal{L}}{\partial x _{\alpha \beta} } = &\, \frac{\partial}{\partial x _{\alpha \beta}} \left( \sum _{i = 1} ^d X _i ^T D X _i - \sum _{i = 1} ^d \lambda _{ii} (\lVert X _i \rVert ^2 - 1) - \sum _{i \neq j} \lambda _{ij} X _i ^T X _j \right) \\ = &\, \frac{\partial}{\partial x _{\alpha \beta}} (X _{\beta} ^T D X _{\beta}) - \lambda _{\beta \beta} \frac{\partial}{\partial x _{\alpha \beta}} (\lVert X _{\beta} \rVert ^2 - 1) \\ &\, - \sum _{j \neq \beta} \lambda _{\beta j} \frac{\partial}{\partial x _{\alpha \beta}} (X _{\beta} ^T X _j) - \sum _{i \neq \beta} \lambda _{i \beta} \frac{\partial}{\partial x _{\alpha \beta}} (X _i ^T X _{\beta}) \\ = &\, 2 \lambda _{\alpha} x _{\alpha \beta} - \lambda _{\beta \beta} (2 x _{\alpha \beta}) - \sum _{j \neq \beta} \lambda _{\beta j} x _{\alpha j} - \sum _{i \neq \beta} \lambda _{i \beta} x _{\alpha i} \\ = &\, 2 \lambda _{\alpha} x _{\alpha \beta} - \sum _{j} \lambda _{\beta j} x _{\alpha j} - \sum _{i} \lambda _{i \beta} x _{\alpha i} \\ = &\, (2 D X) _{\alpha, \beta} - \sum _{j} x _{\alpha j} (\Lambda ^T) _{j, \beta} - \sum _{i} x _{\alpha i} \lambda _{i \beta} . \end{align}} }$$

Hence the matrix of derivatives

$${ \left[ \frac{\partial \mathcal{L}}{\partial x _{\alpha \beta}} \right] = 2 D X - X \Lambda ^T - X \Lambda . }$$
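As a sanity check on this gradient computation, one can compare it with finite differences of ${ \mathcal{L} }$ (a sketch with an arbitrary ${ D }$ and ${ \Lambda }$):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 5, 2
D = np.diag(np.sort(rng.random(n))[::-1])          # diagonal with decreasing entries
Lam = rng.standard_normal((d, d))
X = rng.standard_normal((n, d))

L = lambda X: np.trace(X.T @ D @ X) - np.trace(Lam.T @ (X.T @ X - np.eye(d)))
grad = 2 * D @ X - X @ Lam.T - X @ Lam             # claimed matrix of partial derivatives

eps = 1e-6
fd = np.zeros((n, d))
for a in range(n):
    for b in range(d):
        E = np.zeros((n, d)); E[a, b] = eps
        fd[a, b] = (L(X + E) - L(X)) / eps         # forward difference in x_{ab}
print(np.allclose(fd, grad, atol=1e-4))            # True
```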

Hence at the maximizer ${ U }$ we have

$${ {\begin{cases} 2DU - U \Lambda ^T - U \Lambda = O \\ U ^T U = I _d \end{cases}} }$$

that is

$${ {\begin{cases} DU = U \left( \frac{\Lambda ^T + \Lambda }{2}\right) \\ U ^T U = I _d \end{cases}} }$$
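For intuition, one can verify directly that ${ U = [e_1, \ldots, e_d] }$ together with ${ \Lambda_{\text{sym}} = \text{diag}(\lambda_1, \ldots, \lambda_d) }$ solves this system (a small sketch with concrete numbers of my own choosing):

```python
import numpy as np

n, d = 6, 2
lams = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])    # lambda_1 >= ... >= lambda_n > 0
D = np.diag(lams)
U = np.eye(n)[:, :d]                               # columns e_1, ..., e_d
Lam_sym = np.diag(lams[:d])

print(np.allclose(D @ U, U @ Lam_sym))             # True: D U = U Lambda_sym
print(np.allclose(U.T @ U, np.eye(d)))             # True: U^T U = I_d
```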

Recall the optimization is

$${ {\begin{align} &\, \underset{X \in \mathbb{R} ^{n \times d}}{\text{maximize}} \, \, \text{tr}(X ^T D X) \\ &\, \text{subject to } \, \, X ^T X = I _d . \end{align}} }$$

Since ${ \Lambda _{\text{sym}} = \frac{\Lambda ^T + \Lambda }{2} }$ is symmetric, we can write

$${ \Lambda _{\text{sym}} Q = Q D ^{'} }$$

with ${ Q = [Q _1, \ldots, Q _d] }$ orthonormal and ${ D ^{'} = \text{diag}(\lambda ^{'} _1, \ldots, \lambda ^{'} _d) }$ with ${ \lambda _1 ^{'} \geq \ldots \geq \lambda ^{'} _d . }$

Setting ${ \tilde{X} = X Q , }$ the optimization is

$${ {\begin{align} &\, \underset{\tilde{X} \in \mathbb{R} ^{n \times d}}{\text{maximize}} \, \, \text{tr}(\tilde{X} ^T D \tilde{X}) \\ &\, \text{subject to } \, \, \tilde{X} ^T \tilde{X} = I _d . \end{align}} }$$

At the maximizer ${ \tilde{U} = U Q , }$ we have

$${ {\begin{cases} D\tilde{U} Q ^T = \tilde{U} Q ^T \Lambda _{\text{sym}} \\ \tilde{U} ^T \tilde{U} = I _d \end{cases}} }$$

that is

$${ {\begin{cases} D\tilde{U} = \tilde{U} D ^{'} \\ \tilde{U} ^T \tilde{U} = I _d . \end{cases}} }$$

Hence each column of ${ \tilde{U} }$ is an eigenvector of ${ D ; }$ when the diagonal entries of ${ D }$ are distinct, these are (up to sign) ${ d }$ elements of ${ \lbrace e _1, \ldots, e _n \rbrace . }$ Amongst these choices, picking ${ \tilde{U} = [e _1, \ldots, e _d] }$ maximises the objective function ${ \text{tr}(\tilde{U} ^T D \tilde{U}) . }$

Hence the maximizer for

$${ {\begin{align} &\, \underset{\tilde{X} \in \mathbb{R} ^{n \times d}}{\text{maximize}} \, \, \text{tr}(\tilde{X} ^T D \tilde{X}) \\ &\, \text{subject to } \, \, \tilde{X} ^T \tilde{X} = I _d . \end{align}} }$$

is ${ \tilde{U} = [e _1, \ldots, e _d] , }$ the top ${ d }$ eigenvectors of ${ D . }$

Rewriting this in terms of the original optimization, the maximizer for

$${ {\begin{align} &\, \underset{V \in \mathbb{R} ^{n \times d}}{\text{maximize}} \, \, \text{tr}(V ^T \Sigma V) \\ &\, \text{subject to } \, \, V ^T V = I _d . \end{align}} }$$

is ${ P \tilde{U} = [P _1, \ldots, P _d], }$ the top ${ d }$ eigenvectors of ${ \Sigma ,}$ as needed.