
I can't wrap my head around the difference between Sparse PCA and OMP (Orthogonal Matching Pursuit). Both try to find a sparse linear combination.

Of course, the optimization criteria are different.

In Sparse PCA we have: $$ \begin{aligned} \max_{x} \quad & x^{T} \Sigma x \\ \text{subject to} \quad & \|x\|_{2}=1 \\ & \|x\|_{0} \leq k \end{aligned} $$
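To be concrete about what I mean by the $\ell_0$-constrained problem, here is a minimal brute-force sketch (assuming NumPy; only feasible for small dimensions): enumerate every support of size $k$ and take the leading eigenvector of the corresponding $k \times k$ submatrix of $\Sigma$.

```python
import itertools
import numpy as np

def sparse_pca_bruteforce(Sigma, k):
    """Exact l0-constrained sparse PCA by enumerating all supports of size k.

    Only feasible for small dimensions, but it matches
    max x^T Sigma x  s.t.  ||x||_2 = 1, ||x||_0 <= k  exactly.
    """
    d = Sigma.shape[0]
    best_val, best_x = -np.inf, None
    for support in itertools.combinations(range(d), k):
        idx = list(support)
        sub = Sigma[np.ix_(idx, idx)]
        vals, vecs = np.linalg.eigh(sub)      # leading eigenpair of the submatrix
        if vals[-1] > best_val:
            best_val = vals[-1]
            best_x = np.zeros(d)
            best_x[idx] = vecs[:, -1]         # unit-norm by construction
    return best_val, best_x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 6))
Sigma = A.T @ A / 50                          # toy covariance matrix
val, x = sparse_pca_bruteforce(Sigma, k=2)
print("best 2-sparse variance:", val)
print("full PCA (largest eigenvalue):", np.linalg.eigvalsh(Sigma)[-1])
```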

In OMP we have: $$ \min _{x}\|f-D x\|_{2}^{2} \text { subject to }\|x\|_{0} \leq k $$

Even though these are different, they resemble one another in my eyes. I'll explain: in PCA we wish to take the projection that accounts for the most variance. If we add the "sparsity" constraint, then we regularize and get a projection which is sparser, but does not account for as much variance as the unconstrained solution.

In OMP (the algorithmic procedure here), we pretty much do the same thing, iteratively: we find the "atom" with the largest inner product with the current residual, i.e., the one most correlated with it.
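A minimal sketch of that greedy loop (assuming NumPy and unit-norm atoms; the helper name `omp` is just for illustration):

```python
import numpy as np

def omp(D, f, k):
    """Greedy Orthogonal Matching Pursuit sketch.

    D: (m, n) dictionary with unit-norm columns, f: (m,) signal,
    k: sparsity level. Returns an n-vector x with at most k nonzeros.
    """
    residual = f.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(k):
        # atom most correlated with the current residual
        j = np.argmax(np.abs(D.T @ residual))
        if j not in support:
            support.append(j)
        # least-squares refit on the current support, then update the residual
        coeffs, *_ = np.linalg.lstsq(D[:, support], f, rcond=None)
        residual = f - D[:, support] @ coeffs
    x[support] = coeffs
    return x

rng = np.random.default_rng(1)
D = rng.standard_normal((30, 10))
D /= np.linalg.norm(D, axis=0)                 # normalize atoms
f = 2.0 * D[:, 3] - 1.5 * D[:, 7]              # 2-sparse ground truth
print(np.round(omp(D, f, k=2), 3))
```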

Differences I see:

  1. Different optimization problems (as already said) - however, the "applied" view looks very similar to me, which is why I ask this question.
  2. OMP is an iterative (greedy) procedure, while Sparse PCA has direct (non-greedy) solution methods?

Moreover, how about this minor modification: assume that $f$ is "taken out" of $D$ (which in practice means that we include $f$ among our sparse vectors) - now we have a "basis" $D$, and the OMP result will give us the "best" sparse approximation (variance-wise) of $f$, which resembles PCA on the covariance matrix, no?

Example: $D$ is our data (samples and features). In Sparse PCA we find the projections that account for the most variance of the data (done by defining $\Sigma=DD^T$ and applying Sparse PCA).

In OMP we do something similar: we "take" one sample, $f$, "out" of the matrix $D$ and try to approximate it with the other samples. This "forces" us to "use" $f$ as an atom (with coefficient $1.0$), but eventually we get a residual $y = f - D'x'$ with minimum variance, which translates to a $Dx$ with coefficient $1.0$ in the entry corresponding to that sample.
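To make the experiment concrete, here is a sketch of what I have in mind (assuming scikit-learn's `OrthogonalMatchingPursuit`; the sizes and the removed index are arbitrary):

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(2)
D = rng.standard_normal((30, 10))
D /= np.linalg.norm(D, axis=0)           # samples as unit-norm columns

j = 4                                    # the sample f we "take out" of D
f = D[:, j]
D_rest = np.delete(D, j, axis=1)         # D': dictionary without that sample

# sparse approximation of f by the remaining samples
model = OrthogonalMatchingPursuit(n_nonzero_coefs=3, fit_intercept=False)
model.fit(D_rest, f)
residual = f - D_rest @ model.coef_
print("residual norm:", np.linalg.norm(residual))
```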

Thanks!

1 Answer


The image of the unit ball $B[0,1]$ under $x \mapsto \Sigma x$ is an ellipsoid. PCA tries to find the length of the longest axis of that ellipsoid. Sparse PCA asks how close you can get to that longest length when only considering sparse $x$. It's like saying, "If I project the ellipsoid onto a choice of $k$ coordinate axes, what is the longest length I can achieve?"
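For a concrete toy case, take $$\Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}.$$ Its eigenvalues are $3$ and $1$, so PCA attains $\max_{\|x\|_2=1} x^T \Sigma x = 3$ at $x = (1,1)/\sqrt{2}$. With the sparsity constraint $\|x\|_0 \leq 1$, the only feasible unit vectors are $\pm e_1$ and $\pm e_2$, each giving $x^T \Sigma x = 2$. Sparse PCA asks how much of that $3$ you can keep with a sparse $x$; here the answer is $2$.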

All of the above only has to do with $\Sigma$. You are asking for properties of that matrix.

The second problem is called "sparse signal recovery" (SSR). Let $f$ be a vector of data collected in some way. You are asking: how can I combine at most $k$ columns of $D$ so that I get as close as possible to $f$? Here, the goal is not to understand the matrix $D$, but to get close to the data vector $f$.
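To make that problem statement concrete, here is a minimal brute-force sketch (assuming NumPy; only feasible for tiny problems) that solves SSR exactly by trying every support. Note that the problem is well defined independently of any particular algorithm:

```python
import itertools
import numpy as np

def ssr_bruteforce(D, f, k):
    """Exact sparse signal recovery for tiny problems: try every support of
    size k, solve least squares on it, and keep the best fit."""
    n = D.shape[1]
    best_err, best_x = np.inf, None
    for support in itertools.combinations(range(n), k):
        idx = list(support)
        coeffs, *_ = np.linalg.lstsq(D[:, idx], f, rcond=None)
        err = np.linalg.norm(f - D[:, idx] @ coeffs)
        if err < best_err:
            best_err = err
            best_x = np.zeros(n)
            best_x[idx] = coeffs
    return best_x, best_err

rng = np.random.default_rng(3)
D = rng.standard_normal((20, 8))
f = D[:, 1] - 0.5 * D[:, 6]            # a 2-sparse combination of the columns
x, err = ssr_bruteforce(D, f, k=2)
print(np.round(x, 3), "residual:", err)
```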

OMP is a greedy algorithm that tries to solve SSR. There are other algorithms. For example, if $D$ satisfies the Restricted Isometry Property well enough, then you can solve a corresponding Basis Pursuit problem and get a solution to the SSR problem. (This is known as compressed sensing. See "An Introduction to Compressive Sampling" by Candès for details.)
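For instance, a minimal sketch of noiseless Basis Pursuit, $\min_x \|x\|_1$ subject to $Dx = f$, written as a linear program and solved with SciPy (assuming SciPy is available; this is only an illustration, not a full compressed-sensing pipeline):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(D, f):
    """Solve min ||x||_1  s.t.  D x = f  as a linear program.

    Split x = u - v with u, v >= 0; then ||x||_1 = 1^T (u + v) and the
    equality constraint becomes [D, -D] [u; v] = f.
    """
    m, n = D.shape
    c = np.ones(2 * n)
    A_eq = np.hstack([D, -D])
    res = linprog(c, A_eq=A_eq, b_eq=f, bounds=(0, None), method="highs")
    u, v = res.x[:n], res.x[n:]
    return u - v

rng = np.random.default_rng(4)
D = rng.standard_normal((15, 40))                  # underdetermined system
x_true = np.zeros(40)
x_true[[3, 17, 29]] = [1.0, -2.0, 0.5]             # sparse ground truth
f = D @ x_true
x_hat = basis_pursuit(D, f)
print("recovered support:", np.where(np.abs(x_hat) > 1e-6)[0])
```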

So there you have it. PCA is trying to understand a matrix. SSR is trying to match data. OMP tries to solve SSR.

Finally, in some cases, it is proved that OMP actually does solve SSR. See a paper entitled “Complex Orthogonal Matching Pursuit” for details.

NicNic8
  • Thank you for the detailed answer. How about this minor modification: assume that $f$ is "taken out" of $D$ (which in practice means that we include $f$ among our sparse vectors) - now we have a "basis" $D$, and the OMP result will give us the "best" sparse approximation (variance-wise) of $f$, which resembles PCA on the covariance matrix, no? – Natan ZB Jul 27 '21 at 15:24
  • I don’t understand the modification. What are you including f in? What covariance matrix? I don’t see the resemblance. – NicNic8 Jul 28 '21 at 02:32
  • Hello, @NicNic8, please see the last paragraph I added to the question. I tried to explain myself there. – Natan ZB Jul 28 '21 at 11:01