I can't wrap my head around the difference between Sparse PCA and OMP (Orthogonal Matching Pursuit). Both try to find a sparse linear combination.
Of course, the optimization criteria are different.
In Sparse PCA we have: $$ \begin{aligned} \max_{x} \quad & x^{T} \Sigma x \\ \text{subject to} \quad & \|x\|_{2}=1 \\ & \|x\|_{0} \leq k \end{aligned} $$
In OMP we have: $$ \min _{x}\|f-D x\|_{2}^{2} \text { subject to }\|x\|_{0} \leq k $$
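Just to fix ideas about the first objective: $x$ is a direction in feature space and $x^{T}\Sigma x$ is the variance of the data projected onto it. Here is a tiny NumPy sketch of evaluating that objective for one candidate sparse direction (all names and numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))            # toy data: 200 samples, 10 features
Sigma = np.cov(X, rowvar=False)               # feature covariance matrix

# a candidate sparse direction: k = 2 nonzero entries, unit l2 norm
x = np.zeros(10)
x[[0, 3]] = [1.0, 1.0]
x /= np.linalg.norm(x)

print(np.count_nonzero(x) <= 2)               # ||x||_0 <= k
print(np.isclose(np.linalg.norm(x), 1.0))     # ||x||_2 = 1
print(x @ Sigma @ x)                          # objective: variance along x
```

Sparse PCA then looks for the feasible $x$ that maximizes that last quantity.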
Even though these are different, they resemble one another in my eyes. I'll explain: in PCA we wish to take the projection that accounts for the most variance. If we add the sparsity constraint, then we regularize and get a projection which is sparser, but which does not account for the most variance possible (i.e. the variance captured without the constraint).
In OMP (the algorithmic procedure here), we pretty much do the same thing, iteratively: we find the "atom" with the largest inner product against the current residual, i.e. the one most correlated with it (see the sketch right below).
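Here is a minimal sketch of the OMP procedure as I understand it (my own code, assuming the atoms are the unit-norm columns of $D$; the function and variable names are just for illustration):

```python
import numpy as np

def omp(D, f, k):
    """Greedy OMP: D has unit-norm atoms as columns, f is the signal, k the sparsity budget."""
    n_atoms = D.shape[1]
    residual = f.copy()
    support = []                       # indices of atoms chosen so far
    x = np.zeros(n_atoms)
    for _ in range(k):
        # greedy step: pick the atom most correlated with the current residual
        correlations = D.T @ residual
        j = int(np.argmax(np.abs(correlations)))
        support.append(j)
        # re-fit the coefficients of all chosen atoms by least squares
        coef, *_ = np.linalg.lstsq(D[:, support], f, rcond=None)
        x[:] = 0.0
        x[support] = coef
        residual = f - D @ x
    return x
```

The greedy step (the argmax over inner products) is the part that feels analogous to "take the direction with the most variance" in PCA.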
Differences I see:
- Different optimization problems (already said) - however, the "applied" view looks very similar to me, which is why I'm asking this question
- OMP is an iterative (greedy) procedure, while for Sparse PCA there are direct solutions? (see the sketch right after this list)
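On that second bullet: as far as I can tell, in practice Sparse PCA is also solved iteratively. For example, scikit-learn's `SparsePCA` replaces the hard $\|x\|_{0} \leq k$ constraint above with an $\ell_1$ penalty and solves that iteratively, while `OrthogonalMatchingPursuit` takes an explicit sparsity budget. A sketch of how I would call both (toy data, made-up sizes):

```python
import numpy as np
from sklearn.decomposition import SparsePCA
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))                   # data: 100 samples, 20 features

# Sparse PCA: sparse loading vectors, found by an iterative l1-penalized solver
spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
spca.fit(X)
print(spca.components_.shape)                        # (3, 20), many entries exactly zero

# OMP: approximate one target signal with at most k columns of a dictionary
D = rng.standard_normal((50, 30))                    # dictionary: 30 atoms of dimension 50
f = D[:, [2, 7, 11]] @ np.array([1.0, -0.5, 2.0])    # a signal built from 3 atoms
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3, fit_intercept=False)
omp.fit(D, f)
print(np.flatnonzero(omp.coef_))                     # ideally recovers atoms {2, 7, 11}
```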
Moreover, how about this minor modification: we assume that $f$ is "taken out" of $D$ (which in practice means that $f$ itself appears in the sparse combination, with coefficient $1$) - now we have a "basis" $D'$ of the remaining columns, where the OMP result gives us the "best" sparse approximation (variance-wise) of $f$, which resembles PCA on the covariance matrix, no?
Example: $D$ is our data matrix (columns are samples, rows are features). In SPCA we find the projections that account for the most variance of the data (done by defining $\Sigma=DD^{T}$ and applying Sparse PCA).
In OMP we do something similar: we "take" one sample, $f$, "out" of the matrix $D$ and try to approximate it with the other samples. This "forces" us to "use" $f$ as an atom (with coefficient $1.0$), and eventually we get the residual $y = f - D'x'$ with minimum norm (the leftover "variance"), which translates to $Dx$ where the entry of $x$ corresponding to that sample is fixed at $1.0$ (see the sketch below).
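Concretely, the experiment I have in mind is something like this (a sketch with made-up names and sizes, assuming samples are the columns of $D$):

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(1)
D = rng.standard_normal((50, 40))             # data matrix: columns are samples

i = 5                                         # index of the sample we "take out"
f = D[:, i]
D_prime = np.delete(D, i, axis=1)             # D' = D without the i-th column

# approximate f with at most k of the remaining samples
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3, fit_intercept=False)
omp.fit(D_prime, f)
residual = f - D_prime @ omp.coef_            # y = f - D'x'
print(np.linalg.norm(residual))
```

The residual norm here plays the role of the "unexplained variance" I am comparing to PCA.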
Thanks!