References:
- "Generalized principal component analysis" by Vidal, Ma, Sastry.
- "Calculus on normed vector spaces" by Coleman.
Let ${ \Sigma \in \mathbb{R} ^{n \times n} }$ be a symmetric positive definite matrix, and consider the optimization problem
$${ {\begin{align} &\, \underset{V \in \mathbb{R} ^{n \times d}}{\text{maximize}} \, \, \text{tr}(V ^T \Sigma V) \\ &\, \text{subject to } \, \, V ^T V = I _d . \end{align}} }$$
Since ${ \Sigma }$ is symmetric positive definite, we can write
$${ \Sigma P = P D }$$
with ${ P = [P _1, \ldots, P _n] }$ orthonormal and ${ D = \text{diag}(\lambda _1, \ldots, \lambda _n) }$ with ${ \lambda _1 \geq \ldots \geq \lambda _n > 0 . }$
Setting ${ X := P ^T V }$ (so ${ V = P X }$), we have ${ \text{tr}(V ^T \Sigma V) = \text{tr}(X ^T P ^T \Sigma P X) = \text{tr}(X ^T D X) }$ and ${ X ^T X = V ^T P P ^T V = V ^T V , }$ so the optimization becomes
$${ {\begin{align} &\, \underset{X \in \mathbb{R} ^{n \times d}}{\text{maximize}} \, \, \text{tr}(X ^T D X) \\ &\, \text{subject to } \, \, X ^T X = I _d . \end{align}} }$$
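As a quick numerical sanity check of this change of variables (a minimal sketch using numpy; the dimensions and random seed are arbitrary choices), the objective and constraint are preserved under ${ X = P ^T V }$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 2

# A random symmetric positive definite Sigma and its eigendecomposition Sigma P = P D.
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)
lam, P = np.linalg.eigh(Sigma)      # eigh returns eigenvalues in ascending order
lam, P = lam[::-1], P[:, ::-1]      # reorder so that lambda_1 >= ... >= lambda_n
D = np.diag(lam)

# A random feasible V (orthonormal columns) via QR, and the transformed variable X.
V, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = P.T @ V

print(np.isclose(np.trace(V.T @ Sigma @ V), np.trace(X.T @ D @ X)))  # True
print(np.allclose(X.T @ X, np.eye(d)))                               # True
```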
Recall Lagrange multipliers for complete normed spaces. This is from "Calculus on normed vector spaces" by Coleman.
Thm [Lagrange multipliers]:
Let ${ E , F }$ be complete normed spaces, ${ O }$ an open subset of ${ E ,}$ and ${ f : O \longrightarrow \mathbb{R} }$ and ${ g : O \longrightarrow F }$ be ${ C ^1 }$ maps. Suppose that ${ a \in A = g ^{-1} (0) }$ and that ${ f }$ restricted to ${ A }$ has a relative extremum (minimum or maximum) at ${ a . }$ If ${ g ^{'} (a) }$ is surjective and ${ \text{ker} \, (g ^{'} (a)) }$ has a closed complement ${ L , }$ then there is a unique ${ \lambda \in F ^{*} }$ such that ${ (f - \lambda \circ g ) ^{'} (a) = 0 . }$
Consider the maps
$${ f : \mathbb{R} ^{n \times d} \longrightarrow \mathbb{R}, \quad f(X) = \text{tr}(X ^T D X) }$$
and
$${ g : \mathbb{R} ^{n \times d} \longrightarrow \mathbb{R} ^{d \times d} _{\text{sym}}, \quad g(X) = X ^T X - I _d. }$$
We claim that ${ 0 }$ is a regular value of ${ g . }$
We have ${ (X + H) ^T (X + H) - X ^T X }$ ${ = X ^T H + H ^T X + o(\lVert H \rVert) }$ hence
$${ Dg (X) \, H = X ^T H + H ^T X . }$$
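This derivative can be checked against a forward finite difference (again a small numpy sketch with arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 2
X = rng.standard_normal((n, d))
H = rng.standard_normal((n, d))

g = lambda Y: Y.T @ Y - np.eye(d)
t = 1e-6
finite_diff = (g(X + t * H) - g(X)) / t
derivative = X.T @ H + H.T @ X      # Dg(X) H

print(np.allclose(finite_diff, derivative, atol=1e-4))  # True
```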
Let ${ A \in g ^{-1} (0) . }$ We are to show
$${\text{To show:} \quad Dg(A) : \mathbb{R} ^{n \times d} \longrightarrow \mathbb{R} ^{d \times d} _{\text{sym}} \, \, \text{ is surjective}. }$$
Let ${ S \in \mathbb{R} ^{d \times d} _{\text{sym}} . }$ For ${ H = AS / 2 }$ we have
$${ {\begin{align} &\, Dg(A) \, H \\ = &\, A ^T H + H ^T A \\ = &\, \frac{1}{2} A ^T A S + \frac{1}{2} S ^T A ^T A \\ = &\, \frac{1}{2} S + \frac{1}{2} S ^T \\ = &\, S, \end{align}} }$$
as needed.
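Numerically (a sketch: a random ${ A }$ with orthonormal columns and a random symmetric ${ S }$), the preimage ${ H = AS/2 }$ indeed maps to ${ S }$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 2

A, _ = np.linalg.qr(rng.standard_normal((n, d)))  # A^T A = I_d, so g(A) = 0
M = rng.standard_normal((d, d))
S = (M + M.T) / 2                                  # arbitrary symmetric target

H = A @ S / 2
print(np.allclose(A.T @ H + H.T @ A, S))           # Dg(A) H == S : True
```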
Since ${ g ^{-1} (0) }$ is compact and ${ f }$ is continuous, ${ f }$ restricted to ${ g ^{-1} (0) }$ attains a maximum at some ${ U \in g ^{-1} (0) . }$ Now by Lagrange multipliers (writing the multiplier ${ \lambda \in (\mathbb{R} ^{d \times d} _{\text{sym}}) ^{*} }$ as ${ \lambda (S) = \text{tr}(\Lambda ^T S) }$ for some ${ \Lambda \in \mathbb{R} ^{d \times d} }$), ${ U }$ is a critical point of
$${ {\begin{align} \mathcal{L}(X) = &\, f(X) - \text{tr}(\Lambda ^T g(X)) \\ = &\, \text{tr}(X ^T D X ) - \text{tr}(\Lambda ^T (X ^T X - I _d)) . \end{align}} }$$
Writing ${ X = [X _1, \ldots, X _d] }$ with entries ${ x _{\alpha \beta} , }$ note the derivatives
$${ {\begin{align} \frac{\partial \mathcal{L}}{\partial x _{\alpha \beta} } = &\, \frac{\partial}{\partial x _{\alpha \beta}} \left( \sum _{i = 1} ^d X _i ^T D X _i - \sum _{i = 1} ^d \lambda _{ii} (\lVert X _i \rVert ^2 - 1) - \sum _{i \neq j} \lambda _{ij} X _i ^T X _j \right) \\ = &\, \frac{\partial}{\partial x _{\alpha \beta}} (X _{\beta} ^T D X _{\beta}) - \lambda _{\beta \beta} \frac{\partial}{\partial x _{\alpha \beta}} (\lVert X _{\beta} \rVert ^2 - 1) \\ &\, - \sum _{j \neq \beta} \lambda _{\beta j} \frac{\partial}{\partial x _{\alpha \beta}} (X _{\beta} ^T X _j) - \sum _{i \neq \beta} \lambda _{i \beta} \frac{\partial}{\partial x _{\alpha \beta}} (X _i ^T X _{\beta}) \\ = &\, 2 \lambda _{\alpha} x _{\alpha \beta} - \lambda _{\beta \beta} (2 x _{\alpha \beta}) - \sum _{j \neq \beta} \lambda _{\beta j} x _{\alpha j} - \sum _{i \neq \beta} \lambda _{i \beta} x _{\alpha i} \\ = &\, 2 \lambda _{\alpha} x _{\alpha \beta} - \sum _{j} \lambda _{\beta j} x _{\alpha j} - \sum _{i} \lambda _{i \beta} x _{\alpha i} \\ = &\, (2 D X) _{\alpha, \beta} - \sum _{j} x _{\alpha j} (\Lambda ^T) _{j, \beta} - \sum _{i} x _{\alpha i} \lambda _{i \beta} . \end{align}} }$$
Hence the matrix of derivatives
$${ \left[ \frac{\partial \mathcal{L}}{\partial x _{\alpha \beta}} \right] = 2 D X - X \Lambda ^T - X \Lambda . }$$
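The matrix form of the gradient can also be verified entrywise against finite differences (a sketch; here ${ \Lambda }$ is just an arbitrary ${ d \times d }$ matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 2
lam = np.sort(rng.uniform(1, 10, n))[::-1]   # lambda_1 >= ... >= lambda_n > 0
D = np.diag(lam)
Lam = rng.standard_normal((d, d))
X = rng.standard_normal((n, d))

lagrangian = lambda Y: np.trace(Y.T @ D @ Y) - np.trace(Lam.T @ (Y.T @ Y - np.eye(d)))
grad = 2 * D @ X - X @ Lam.T - X @ Lam

t = 1e-6
fd = np.zeros((n, d))
for a in range(n):
    for b in range(d):
        E = np.zeros((n, d))
        E[a, b] = t
        fd[a, b] = (lagrangian(X + E) - lagrangian(X)) / t

print(np.allclose(fd, grad, atol=1e-4))  # True
```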
Hence at the maximizer ${ U }$ we have
$${ {\begin{cases} 2DU - U \Lambda ^T - U \Lambda = O \\ U ^T U = I _d \end{cases}} }$$
that is
$${ {\begin{cases} DU = U \left( \frac{\Lambda ^T + \Lambda }{2}\right) \\ U ^T U = I _d \end{cases}} }$$
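For a concrete instance of these stationarity conditions (a sketch; taking ${ U }$ to be the first ${ d }$ standard basis vectors, which the argument below shows is indeed a maximizer), the multiplier ${ \frac{\Lambda ^T + \Lambda}{2} = \text{diag}(\lambda _1, \ldots, \lambda _d) }$ satisfies both equations:

```python
import numpy as np

lam = np.array([5.0, 4.0, 3.0, 2.0, 1.0])   # lambda_1 >= ... >= lambda_n > 0
n, d = len(lam), 2
D = np.diag(lam)

U = np.eye(n)[:, :d]              # candidate maximizer: first d standard basis vectors
Lam_sym = np.diag(lam[:d])        # corresponding symmetric multiplier

print(np.allclose(D @ U, U @ Lam_sym))   # D U = U (Lambda^T + Lambda)/2 : True
print(np.allclose(U.T @ U, np.eye(d)))   # U^T U = I_d : True
```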
Recall the optimization is
$${ {\begin{align} &\, \underset{X \in \mathbb{R} ^{n \times d}}{\text{maximize}} \, \, \text{tr}(X ^T D X) \\ &\, \text{subject to } \, \, X ^T X = I _d . \end{align}} }$$
Since ${ \Lambda _{\text{sym}} = \frac{\Lambda ^T + \Lambda }{2} }$ is symmetric, we can write
$${ \Lambda _{\text{sym}} Q = Q D ^{'} }$$
with ${ Q = [Q _1, \ldots, Q _d] }$ orthonormal and ${ D ^{'} = \text{diag}(\lambda ^{'} _1, \ldots, \lambda ^{'} _d) }$ with ${ \lambda _1 ^{'} \geq \ldots \geq \lambda ^{'} _d . }$
Setting ${ \tilde{X} = X Q , }$ the optimization is
$${ {\begin{align} &\, \underset{\tilde{X} \in \mathbb{R} ^{n \times d}}{\text{maximize}} \, \, \text{tr}(\tilde{X} ^T D \tilde{X}) \\ &\, \text{subject to } \, \, \tilde{X} ^T \tilde{X} = I _d . \end{align}} }$$
Since ${ \text{tr}(\tilde{U} ^T D \tilde{U}) = \text{tr}(Q ^T U ^T D U Q) = \text{tr}(U ^T D U) }$ and ${ \tilde{U} ^T \tilde{U} = Q ^T U ^T U Q = I _d , }$ the point ${ \tilde{U} = U Q }$ is also a maximizer. At ${ \tilde{U} , }$ using ${ U = \tilde{U} Q ^T , }$ we have
$${ {\begin{cases} D\tilde{U} Q ^T = \tilde{U} Q ^T \Lambda _{\text{sym}} \\ \tilde{U} ^T \tilde{U} = I _d \end{cases}} }$$
that is
$${ {\begin{cases} D\tilde{U} = \tilde{U} D ^{'} \\ \tilde{U} ^T \tilde{U} = I _d . \end{cases}} }$$
Hence each column of ${ \tilde{U} }$ is a unit eigenvector of ${ D ; }$ in particular, when the ${ \lambda _i }$ are distinct, the columns are (up to sign) ${ d }$ elements of ${ \lbrace e _1, \ldots, e _n \rbrace . }$ Amongst these, picking ${ \tilde{U} = [e _1, \ldots, e _d] }$ maximises the objective function ${ \text{tr}(\tilde{U} ^T D \tilde{U}) = \lambda _1 + \ldots + \lambda _d . }$
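A brute-force check of this last step (a sketch enumerating all ways of picking ${ d }$ columns from ${ [e _1, \ldots, e _n] , }$ with arbitrary eigenvalues): the first ${ d }$ columns give the largest trace, namely ${ \lambda _1 + \ldots + \lambda _d . }$

```python
import numpy as np
from itertools import combinations

lam = np.array([5.0, 4.0, 3.0, 2.0, 1.0])   # lambda_1 >= ... >= lambda_n > 0
n, d = len(lam), 2
D = np.diag(lam)
I = np.eye(n)

# Objective value for every choice of d distinct standard basis vectors as columns.
values = {S: np.trace(I[:, list(S)].T @ D @ I[:, list(S)])
          for S in combinations(range(n), d)}
best = max(values, key=values.get)
print(best, values[best])   # (0, 1) 9.0, i.e. the first d eigenvectors and lambda_1 + lambda_2
```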
Hence a maximizer for
$${ {\begin{align} &\, \underset{\tilde{X} \in \mathbb{R} ^{n \times d}}{\text{maximize}} \, \, \text{tr}(\tilde{X} ^T D \tilde{X}) \\ &\, \text{subject to } \, \, \tilde{X} ^T \tilde{X} = I _d . \end{align}} }$$
is ${ \tilde{U} = [e _1, \ldots, e _d] , }$ whose columns are the top ${ d }$ eigenvectors of ${ D . }$
Rewriting this in terms of the original optimization (recall ${ V = P X }$), a maximizer for
$${ {\begin{align} &\, \underset{V \in \mathbb{R} ^{n \times d}}{\text{maximize}} \, \, \text{tr}(V ^T \Sigma V) \\ &\, \text{subject to } \, \, V ^T V = I _d . \end{align}} }$$
is ${ V = P \tilde{U} = [P _1, \ldots, P _d] , }$ whose columns are the top ${ d }$ eigenvectors of ${ \Sigma , }$ as needed.
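An end-to-end numerical sketch of the conclusion (numpy; dimensions and seed arbitrary): the value at ${ [P _1, \ldots, P _d] }$ equals ${ \lambda _1 + \ldots + \lambda _d , }$ and no randomly sampled feasible ${ V }$ exceeds it.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 6, 3

A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)             # symmetric positive definite
lam, P = np.linalg.eigh(Sigma)
lam, P = lam[::-1], P[:, ::-1]              # lambda_1 >= ... >= lambda_n

V_star = P[:, :d]                           # claimed maximizer: top d eigenvectors of Sigma
best = np.trace(V_star.T @ Sigma @ V_star)
print(np.isclose(best, lam[:d].sum()))      # True: value is lambda_1 + ... + lambda_d

# Random feasible points (orthonormal columns via QR) never do better.
for _ in range(1000):
    V, _ = np.linalg.qr(rng.standard_normal((n, d)))
    assert np.trace(V.T @ Sigma @ V) <= best + 1e-8
print("no sampled feasible V exceeded the top-d eigenvector value")
```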