If I understand regularization correctly, it helps when a least-squares problem is not well-posed, i.e. when at least one of the following holds:

  1. the problem has no solution

  2. the problem has multiple solutions

  3. a small change in the input leads to a large change in the output

Let's say that for our regression problem $X\beta=y$ the matrix $X$ has full column rank. Isn't the least-squares solution $\beta = (X^* X)^{-1} X^* y$ then always defined and always unique? Is this not correct, or does regularization help with the third condition?
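For concreteness, here is a minimal numpy sketch (with made-up data, just to illustrate what I mean) of the full-rank case: the closed-form solution is well defined and matches numpy's least-squares solver.

```python
# Minimal sketch (made-up data): with a tall, full-column-rank X the
# normal-equations formula is well defined and matches numpy's solver.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))            # N = 100 points, p = 5 predictors
y = rng.standard_normal(100)

beta_closed = np.linalg.solve(X.T @ X, X.T @ y)    # (X^T X)^{-1} X^T y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_closed, beta_lstsq))        # True: same unique solution
```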

  • Doesn't LS always have at least one solution? – Rodrigo de Azevedo May 22 '22 at 19:41
  • From what I've read, the LS solution is unique iff the columns of $X$ are linearly independent, i.e. iff $X$ has full column rank, but correct me if I'm wrong. And as far as I know we also use regularization even when $X$ has full rank – jonithani123 May 22 '22 at 19:54
  • Well, it would be possible if $X$ does not have full rank, right? In that case I understand why we need regularization – jonithani123 May 22 '22 at 20:20
  • Which exact problem are you talking about? The linear system $\bf A x = b$ may not have a solution, but $\bf A^\top A x = A^\top b$ always has at least one solution, if I recall correctly. My memory may be failing me. – Rodrigo de Azevedo May 22 '22 at 20:26
  • It seems that my memory is not failing me. Take a look at this. – Rodrigo de Azevedo May 22 '22 at 20:29
  • Ah, nice. I thought that if $A^TA$ is not invertible there is no solution, but this makes sense. But I'm still wondering why we need regularization in the case where $\beta$ is uniquely defined – jonithani123 May 22 '22 at 20:32
  • If $X$ has less than full rank, then $X^*X$ will not be invertible, so that formula makes no sense. However, it still has a pseudoinverse, and using that will give you the solution of the least squares problem that has the smallest magnitude. I honestly do not know what regularization means here. And there's no need for anything when $X$ has full rank. (See the numerical sketch after these comments.) – Ted Shifrin May 22 '22 at 20:32
  • There are times when $\bf x$ is one's input, which costs money or energy. In such cases, you want to make $\| {\bf A x - b} \|_2^2$ small but not waste too much money / energy in the process. – Rodrigo de Azevedo May 22 '22 at 20:33
  • As in the comment under the other answer: in the case where you use regularization just so that the solution has a certain form, does this definition of regularization still make sense: "The construction of approximate solutions of ill-posed problems that are stable with respect to small perturbations of the initial data"? – jonithani123 May 22 '22 at 20:49
  • @jonithani123 Have you taken a look at this or this? – Rodrigo de Azevedo May 22 '22 at 20:51
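A small numerical sketch of the points raised in the comments (made-up data, not from the thread): even when $A$ is rank-deficient, the normal equations $A^\top A x = A^\top b$ are consistent, and the pseudoinverse returns the least-squares solution of smallest norm.

```python
# Sketch (made-up data): A is rank-deficient, yet the normal equations
# A^T A x = A^T b are consistent, and the pseudoinverse gives the
# least-squares solution of smallest norm.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
A[:, 3] = A[:, 0] + A[:, 1]                # force rank(A) = 3 < 4
b = rng.standard_normal(6)

x = np.linalg.pinv(A) @ b                  # minimum-norm LS solution
print(np.allclose(A.T @ A @ x, A.T @ b))   # True: normal equations hold
```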

1 Answer


Usually regularization is applied to regression problems in which $X$ is a fat matrix (as opposed to a tall matrix), i.e. $N\times p$ with $p>N$: you have more predictors than data points.

In that case the ordinary least-squares solution $(X^TX)^{-1}X^Ty$ doesn't make sense, since the $p\times p$ matrix $X^TX$ has rank at most $N<p$ and is therefore singular and non-invertible. In the original problem $X\beta = y$ there is more than one solution: moving within the null space of $X$ doesn't change the fit, so every $\beta\in \beta_0+\mathrm{null}(X)$ is a solution.
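A quick sketch of this non-uniqueness (the data are made up for illustration):

```python
# Sketch (made-up data): for a fat X (p > N), adding any null-space
# vector to a solution leaves the fit X @ beta unchanged.
import numpy as np

rng = np.random.default_rng(2)
N, p = 3, 5
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

beta0 = np.linalg.pinv(X) @ y              # one particular solution
_, _, Vt = np.linalg.svd(X)
v = Vt[N]                                  # a basis vector of null(X)
print(np.allclose(X @ beta0, X @ (beta0 + v)))    # True: same fit
print(np.linalg.matrix_rank(X.T @ X), "of", p)    # 3 of 5: X^T X singular
```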

Then introducing a regularization term, e.g. in ridge regression, makes the matrix being inverted nonsingular and the solution unique again, since $X^TX+\lambda I$ is positive definite for any $\lambda>0$: $$ \hat\beta_{\text{ridge}} = (X^TX+\lambda I)^{-1}X^Ty $$
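A minimal sketch of the ridge computation, again on made-up data:

```python
# Sketch (made-up fat X): X^T X + lambda * I is positive definite, so
# the solve always succeeds and the solution is unique.
import numpy as np

rng = np.random.default_rng(2)
N, p, lam = 3, 5, 0.1
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge)                          # a single well-defined vector
```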

You can argue this brings stability to the solution, since adding $\lambda$ reduces the condition number of $X^TX+\lambda I$, which makes the inversion more numerically stable.
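A small sketch of this conditioning effect, on a made-up, nearly collinear design matrix:

```python
# Sketch (made-up, nearly collinear design): the condition number of
# X^T X + lambda * I drops by orders of magnitude as lambda grows.
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 5))
X[:, 4] = X[:, 3] + 1e-6 * rng.standard_normal(100)   # near collinearity

G = X.T @ X
for lam in [0.0, 1e-3, 1e-1, 1.0]:
    print(lam, np.linalg.cond(G + lam * np.eye(5)))
```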


From a geometric viewpoint, we can also say the regularization adds a loss term that distinguishes between the solutions in the solution manifold ($\beta\in \beta_0+\mathrm{null}(X)$), which makes the solution unique. (See this famous illustration of the loss landscape of regularized regression.)
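A sketch of this selection effect (made-up data): as $\lambda \to 0$, the ridge estimate converges to the minimum-norm point $\operatorname{pinv}(X)\,y$ on the solution manifold, i.e. the penalty picks out one specific solution.

```python
# Sketch (made-up data): for tiny lambda, the ridge estimate lands (up
# to numerical error) on the minimum-norm solution pinv(X) @ y, one
# particular point of the manifold beta0 + null(X).
import numpy as np

rng = np.random.default_rng(2)
N, p = 3, 5
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

beta_minnorm = np.linalg.pinv(X) @ y
beta_ridge = np.linalg.solve(X.T @ X + 1e-8 * np.eye(p), X.T @ y)
print(np.allclose(beta_ridge, beta_minnorm, atol=1e-5))   # True
```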

  • I've found this definition of regularization: "The construction of approximate solutions of ill-posed problems that are stable with respect to small perturbations of the initial data", which fits the case you describe. But what about using regularization e.g. to get sparse solutions in the $N \geq p$ case? Does this definition still make sense, or is it incomplete? – jonithani123 May 22 '22 at 20:47
  • I think you are talking about regularizers like LASSO for sparse solutions? Indeed, it's harder to illustrate that point with my rank-deficiency argument. – Binxu Wang 王彬旭 May 22 '22 at 21:26
  • I updated the answer to include this part! :D – Binxu Wang 王彬旭 May 22 '22 at 21:32