
When I studied regression using linear least squares (LLS), the solution to $\mathbf{X}\boldsymbol{\beta} = \mathbf{y}$ was always presented as

$$\boldsymbol{\hat{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \tag{1}$$

but I have recently learnt that a general solution to the least squares problem is given by

$$\boldsymbol{\beta}_{LS} = \mathbf{X}^{+}\mathbf{y} + (\mathbf{I} -\mathbf{X}^{+}\mathbf{X})\mathbf{z} \tag{2}$$

where $\mathbf{X}^{+}$ is the Moore–Penrose (pseudo-)inverse of $\mathbf{X}$ (obtained, for example, via its SVD) and $\mathbf{z}$ is an arbitrary vector.
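
(To make this concrete for myself, here is a minimal numerical sketch of $(2)$; the design matrix, response and choice of $\mathbf{z}$ below are made up purely for illustration, and I use NumPy's `np.linalg.pinv`, which computes $\mathbf{X}^{+}$ via the SVD.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up, deliberately rank-deficient design: the last column repeats the first,
# so X is 10 x 3 with rank 2 and the homogeneous term in (2) is non-trivial.
X = rng.normal(size=(10, 3))
X[:, 2] = X[:, 0]
y = rng.normal(size=10)

X_pinv = np.linalg.pinv(X)            # Moore-Penrose pseudo-inverse (via SVD)
beta_particular = X_pinv @ y          # the X^+ y part of (2)

P_null = np.eye(3) - X_pinv @ X       # projector onto the null space of X
z = rng.normal(size=3)                # arbitrary vector z
beta_general = beta_particular + P_null @ z

# The two coefficient vectors differ, yet they fit y equally well,
# because X @ P_null is (numerically) zero.
print(beta_particular, beta_general)
print(np.linalg.norm(y - X @ beta_particular),
      np.linalg.norm(y - X @ beta_general))   # residual norms agree to machine precision
```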

As I understand it, in cases where $\mathbf{X}$ has linearly independent columns, the matrix $\mathbf{X}^{T}\mathbf{X}$ is invertible and the pseudo-inverse matrix is equal to

$$\mathbf{X}^{+} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$$

which means that $(1)$ is equal to the first part on the right-hand side of $(2)$. But why do we never see the second part on the RHS of $(2)$ being used or mentioned in LLS solutions? It seems that this homogeneous part can be used to derive solutions which have other desirable properties (in addition to being least-squares), like the minimum-variance solution given by

$$\boldsymbol{\beta}_{MV} = \mathbf{X}^{+}\mathbf{y} + \mu(\mathbf{I} -\mathbf{X}^{+}\mathbf{X})\mathbf{1}, \tag{3}$$

where $\mathbf{1}$ is the vector of ones and

$$ \mu = \frac{\mathbf{1}^T\mathbf{X}^{+}\mathbf{y}}{\mathbf{1}^T\mathbf{X}^{+}\mathbf{X}\mathbf{1}}.$$
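
As a sanity check on $(3)$ (reading $\mathbf{1}$ as the vector of ones), here is a small self-contained sketch on another made-up rank-deficient $\mathbf{X}$; it only confirms that $\boldsymbol{\beta}_{MV}$ is still a least-squares solution, not its variance property:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up rank-deficient design: the last column is the sum of the first two,
# so the homogeneous term (I - X^+ X) 1 in (3) is non-zero.
X = rng.normal(size=(10, 3))
X[:, 2] = X[:, 0] + X[:, 1]
y = rng.normal(size=10)

X_pinv = np.linalg.pinv(X)
ones = np.ones(3)                                    # the vector 1 in (3)

mu = (ones @ X_pinv @ y) / (ones @ (X_pinv @ X) @ ones)
beta_mv = X_pinv @ y + mu * ((np.eye(3) - X_pinv @ X) @ ones)

# beta_mv generally differs from X^+ y, yet the residual norm is the same (minimal) value.
print(np.allclose(beta_mv, X_pinv @ y))              # False
print(np.linalg.norm(y - X @ beta_mv),
      np.linalg.norm(y - X @ (X_pinv @ y)))          # equal
```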

And what happens when $\mathbf{X}$ does not have linearly independent columns? I think in this case I was taught that there is "multicollinearity in the system" (or something like that) and that there are no solutions (or, if computed, then they will be "volatile", or the model will be "overfitted", or the standard errors will be large, or some other "dangers"). But $(2)$ tells us that there are solutions even in this case. I am confused: are there solutions in this case or are there no solutions? And is there something "special" about solutions in this case (if they exist)? And how do we reconcile this with all of those "dangers" that we were told about?
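
To illustrate what confuses me, here is a made-up example with exactly collinear columns; the textbook route via $(\mathbf{X}^T\mathbf{X})^{-1}$ breaks down, while the pseudo-inverse still returns a finite coefficient vector with a well-defined residual:

```python
import numpy as np

rng = np.random.default_rng(2)

# Thin (10 x 3) but rank-deficient design: exact multicollinearity.
X = rng.normal(size=(10, 3))
X[:, 2] = 2.0 * X[:, 1]               # the third column is an exact multiple of the second
y = rng.normal(size=10)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))     # 2, not 3: X^T X is singular

try:
    # Formula (1): with exactly collinear columns this typically raises LinAlgError
    # (or, at best, returns numerically meaningless coefficients).
    beta_normal_eq = np.linalg.solve(XtX, X.T @ y)
    print("normal equations gave:", beta_normal_eq)
except np.linalg.LinAlgError as err:
    print("normal equations failed:", err)

# Formula (2) with z = 0: always defined, and it minimizes the residual norm.
beta_svd = np.linalg.pinv(X) @ y
print(beta_svd, np.linalg.norm(y - X @ beta_svd))
```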

Add 1 (Following the comments)

Another case taught in elementary classes is that of an underdetermined system, where we are told that when there are more unknowns than equations there exists an infinite number of solutions, and one has to apply some further constraints to obtain a "particular" solution. One such constraint is that the solution has the minimum norm among all possible solutions, i.e. it minimizes $\|\boldsymbol{\beta}\|_2$, and we are given the following form for it:

$$\boldsymbol{\beta}_{MN} = \mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y} \tag{4}$$

But the form of the general LS solution in $(2)$ remains unchanged when we go from a "thin" $\mathbf{X}$ (with more rows than columns) to a "fat" $\mathbf{X}$ (with more columns than rows). I suspect that it is not the number of rows vs. columns that matters, but something else (maybe the column/row rank?).

(It also means that $(4)$ is also a least-squares solution, since the sum of squared residuals is $0$ in this case, i.e. the minimum possible, but this is never mentioned either....)
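
A quick numerical check of this point on a made-up "fat" $\mathbf{X}$ with full row rank (so that $\mathbf{X}\mathbf{X}^T$ is invertible): $(4)$ agrees with $\mathbf{X}^{+}\mathbf{y}$ and the residual is numerically zero, i.e. it is also a least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up underdetermined system: 4 equations, 7 unknowns, full row rank.
X = rng.normal(size=(4, 7))
y = rng.normal(size=4)

beta_mn = X.T @ np.linalg.solve(X @ X.T, y)   # formula (4): X^T (X X^T)^{-1} y
beta_svd = np.linalg.pinv(X) @ y              # X^+ y, i.e. (2) with z = 0

print(np.allclose(beta_mn, beta_svd))         # True: the two coincide here
print(np.linalg.norm(y - X @ beta_mn))        # ~0: residual vanishes, so it is also LS
```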

So, how come the SVD approach always gives us a solution, but the "classical" ones give one only when the matrix has full rank? What happens to the solutions when we move "between" full column rank and full row rank?

  • In this case there are infinitely many solutions that minimize the error between $X \beta$ and $y$. You may be interested in the "minimum norm solution" in this case. See here for a nice visualization: https://math.stackexchange.com/a/2253614/111671 – obareey May 30 '23 at 07:42
  • @obareey Thank you for taking time to read and comment on my post. I am familiar with the minimum norm solution, but I had the impression that it occurs when $\mathbf{X}$ is "fat" (i.e. when there are more columns than rows and so the system is "under-determined"), but with multicollinearity (which is what I assume you are referring to) we can still have a "thin" matrix (with more rows than columns), no? – Confounded May 30 '23 at 08:05
  • It is still applicable. It is applicable as long as $\operatorname{ker}(X) \neq \{0\}$. – obareey May 30 '23 at 09:27
  • Is $\beta = X^T(XX^T)^{-1}y$ correct? That seems suspect to me. It would also be nice to specify which norm you are minimizing; the most common one is $L^1$ minimization, i.e. basis pursuit, for which there is no closed form. – Andrew May 31 '23 at 01:14
  • @AndrewZhang I believe it is correct. The norm in "minimum norm" is $L^2$; the $L^1$ one gives the "maximum sparsity" solution (or close to it). – Confounded May 31 '23 at 23:22
