
Consider the linear regression model in the case where the number of training samples $n$ is smaller than the number of parameters $d$ (overparametrized linear regression). The model is

$$y_i = \langle\beta,\mathbf{x}_i\rangle +\epsilon_i, \quad \beta\in\mathbb{R}^d,\ \epsilon_i\in\mathbb{R},\ i=1,\dots,n$$

where the $\epsilon_i$ are zero-mean noise terms. Let $\mathbf{X}\in\mathbb{R}^{n\times d}$ be the data matrix:

\begin{align} \mathbf{X} &= \begin{bmatrix} \mathbf{x}_1^T\\ \mathbf{x}_2^T\\ \vdots \\ \mathbf{x}_n^T \end{bmatrix} \end{align}

Now, since $n<d$, $\mathbf{X}^T\mathbf{X}\in\mathbb{R}^{d\times d}$ is not invertible, because $$\text{rank}(\mathbf{X})=\text{rank}(\mathbf{X}^T) \leq \min \{n,d\}=n\implies \text{rank}(\mathbf{X}^T\mathbf{X})\leq n<d$$

Assuming $\mathbf{X}$ has full row rank (see the EDIT below), there are then infinitely many $\beta$ such that

$$\mathbf{y}=\mathbf{X}\beta$$
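As a quick sanity check (just an illustrative numpy sketch with randomly generated Gaussian data; the dimensions and seed are arbitrary), one can verify that $\mathbf{X}^T\mathbf{X}$ is rank-deficient and that $\mathbf{X}$ has a nontrivial null space, so the solution set is indeed infinite:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                               # fewer samples than parameters (n < d)
X = rng.standard_normal((n, d))            # rows play the role of the x_i
y = X @ rng.standard_normal(d)             # some consistent right-hand side

G = X.T @ X                                # d x d Gram matrix
print(G.shape, np.linalg.matrix_rank(G))   # (20, 20) 5  -> singular, since 5 < 20

# a nonzero null-space direction of X shows the solution set is infinite
_, _, Vt = np.linalg.svd(X)
z = Vt[-1]                                 # last right singular vector: X @ z ~ 0
print(np.allclose(X @ z, 0))               # True
```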

I'm told that the minimum norm solution is

$$\beta^*=\mathbf{X}^{\dagger}\mathbf{y}$$

where, in this specific case $n < d$, $\mathbf{X}^{\dagger}=\mathbf{X}^T(\mathbf{X} \mathbf{X}^T)^{-1}$. Can you help me prove this in a constructive manner? I tried to premultiply by $\mathbf{X}^T\in\mathbb{R}^{d\times n}$ (which is legit since $\mathbf{y}\in\mathbb{R}^n$), but I get

$$\mathbf{X}^T\mathbf{y}=\mathbf{X}^T\mathbf{X}\beta$$

and now I'm stuck because, as noted above, $\mathbf{X}^T\mathbf{X}$ is not invertible.

EDIT: the solution

$$\beta^*=\mathbf{X}^{\dagger}\mathbf{y}$$

is valid iff $\mathbf{X}\mathbf{X}^T$ is invertible. However, $\mathbf{X}\mathbf{X}^T\in\mathbb{R}^{n\times n}$ with $\text{rank}(\mathbf{X}\mathbf{X}^T)=\text{rank}(\mathbf{X})\leq n$, and $\mathbf{X}$ is a random matrix, meaning that its rows are random vectors sampled from an (unknown) probability distribution; for most nondegenerate probability distributions, $\mathbf{X}\mathbf{X}^T$ will therefore be full rank, hence invertible.
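A quick numerical illustration of this point (again just a sketch with randomly generated Gaussian rows; the distribution and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20
X = rng.standard_normal((n, d))       # rows drawn from a continuous distribution

K = X @ X.T                           # n x n
print(np.linalg.matrix_rank(K))       # 5 = n: full rank, hence invertible
print(np.linalg.cond(K))              # finite (and moderate) condition number
```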

EDIT2: a question was mentioned in the comments, but it isn't related to mine. I explicitly asked for a constructive proof of the fact that, when $n<d$, the minimum norm solution is

$$\beta^*=\mathbf{X}^T(\mathbf{X} \mathbf{X}^T)^{-1}\mathbf{y}$$

instead of the usual one, which applies when $\mathbf{X}$ has full column rank:

$$\beta^*=(\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

Thus, answers that don't constructively derive this expression are not answers to this question.
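For concreteness, here is a small numerical check (an illustrative numpy sketch with random data) that $\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y}$ coincides with the Moore-Penrose pseudoinverse solution and with numpy's minimum-norm least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

beta_star  = X.T @ np.linalg.solve(X @ X.T, y)      # X^T (X X^T)^{-1} y
beta_pinv  = np.linalg.pinv(X) @ y                  # Moore-Penrose pseudoinverse
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]   # minimum-norm lstsq solution

print(np.allclose(beta_star, beta_pinv))    # True
print(np.allclose(beta_star, beta_lstsq))   # True
print(np.allclose(X @ beta_star, y))        # True: it interpolates the data
```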

DeltaIV
  • https://math.stackexchange.com/questions/2253443/difference-between-least-squares-and-minimum-norm-solution –  Jan 31 '22 at 11:26
  • It is a property of the pseudoinverse to be the one that minimizes the norm, so you just need to prove that $X^\dagger=X^T(X X^T)^{-1}$. It is also known that the pseudoinverse is unique, and you can verify easily that $X^+=X^T(X X^T)^{-1}$ satisfies $X X^+ X=X$ and $X^+ X X^+=X^+$, therefore $X^\dagger=X^+$. I'm not sure what you consider to be a constructive proof; my guess is that you will not consider this to be an answer to your question. – P. Quinton Jan 31 '22 at 11:54
  • @d.k.o. no answer to the question you point to answers my question. I'm asking specifically why the minimum norm solution has the expression $\beta^*=\mathbf{X}^T(\mathbf{X} \mathbf{X}^T)^{-1}\mathbf{y}$. The expression $\mathbf{X}^T(\mathbf{X} \mathbf{X}^T)^{-1}$ doesn't appear in any of those answers, so they don't answer my question. – DeltaIV Jan 31 '22 at 11:55
  • @P.Quinton you guessed wrong. It's not exactly the answer I was hoping for, but I think with a couple extra details it could work. Would you be willing to take this discussion to chat? I don't remember how to do it, though, so I'd ask you to do that. Thanks in advance – DeltaIV Jan 31 '22 at 12:59
  • I have no clue how to do that (even after a quick search on the global internet), but I think that everything I say is in the wikipedia page of the pseudo inverse, in particular https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse#Linearly_independent_rows and https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse#Projectors. The fact that it is a projector is exactly the property of having minimal norm that you are looking for. – P. Quinton Jan 31 '22 at 14:07
  • @P.Quinton gah, that's a whole lot of properties! My point is that I didn't want to memorize the expression of the Moore-Penrose pseudo-inverse...I'm writing an answer in a few minutes where I try to derive the right expression. I'd really appreciate if you could comment on it. – DeltaIV Jan 31 '22 at 20:50

1 Answer


(Note to whoever was thinking about answering my question: I'm not going to accept my own answer anytime soon, so if you were going to write an answer of yours, don't feel discouraged!)

Proof #1

By Lagrange multipliers: we're looking for

$$\min_{\beta}\frac{1}{2}||\beta||^2 \quad \text{s.t.}\quad \mathbf{X}\beta=\mathbf{y}$$

Convert to an unconstrained problem by introducing a Lagrange multiplier $\lambda\in\mathbb{R}^n$ and looking for stationary points of the Lagrangian

$$\mathcal{L}(\beta,\lambda)=\frac{1}{2}||\beta||^2+\lambda^T(\mathbf{X}\beta-\mathbf{y})$$

Set gradients to 0:

$$\begin{aligned} \nabla_{\beta}{\mathcal{L}} &= \beta +\mathbf{X}^T\lambda=0 \\ \nabla_{\lambda}{\mathcal{L}} &= \mathbf{X}\beta-\mathbf{y}=0 \end{aligned}$$

The first equation gives $\beta=-\mathbf{X}^T\lambda$; substituting into the second yields $-\mathbf{X}\mathbf{X}^T\lambda=\mathbf{y}$. Since $\mathbf{X}\mathbf{X}^T$ is invertible, we get $$\lambda=-(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y}\implies\beta=-\mathbf{X}^T\lambda=\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y}\ \square.$$

The nice thing about this proof is that you don't have to commit the expression of the pseudo-inverse to memory.
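As a numerical double check of this derivation (an illustrative sketch with random Gaussian data; scipy's generic SLSQP solver is only used here as an independent way of solving the constrained problem), the closed form agrees with what a constrained optimizer finds:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 5, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# closed form from the Lagrangian derivation
lam  = -np.linalg.solve(X @ X.T, y)      # lambda = -(X X^T)^{-1} y
beta = -X.T @ lam                        # beta   = X^T (X X^T)^{-1} y

# independent check: minimize 0.5 ||b||^2 subject to X b = y with a generic solver
res = minimize(lambda b: 0.5 * b @ b, x0=np.zeros(d), method="SLSQP",
               constraints=[{"type": "eq", "fun": lambda b: X @ b - y}])

print(np.allclose(X @ beta, y))               # the constraint holds
print(np.allclose(res.x, beta, atol=1e-4))    # True, up to solver tolerance
```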

Proof #2

Not as satisfying. Define $\beta^*=\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y}$: we want to prove that

  1. it's a solution of $\mathbf{X}\beta=\mathbf{y}$
  2. it's the min norm solution

1) is trivial: $\mathbf{X}\beta^*=\mathbf{X}\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y}=\mathbf{y}$. To prove 2), it's sufficient to show that, for any other solution $\beta$ of $\mathbf{X}\beta=\mathbf{y}$, the difference $\mathbf{u}=\beta-\beta^*$ is orthogonal to $\beta^*$. The assertion then follows immediately: $$||\beta||^2=\beta^T\beta=(\mathbf{u}+\beta^*)^T(\mathbf{u}+\beta^*)=||\mathbf{u}||^2+2\mathbf{u}^T\beta^*+||\beta^*||^2=||\mathbf{u}||^2+||\beta^*||^2\geq||\beta^*||^2$$

To prove that $(\beta^*)^T\mathbf{u}=0$, we write

$$(\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y})^T\mathbf{u}=\mathbf{y}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{u}=\mathbf{y}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}(\beta-\beta^*)=\mathbf{y}^T(\mathbf{X}\mathbf{X}^T)^{-1}(\mathbf{y}-\mathbf{y})=0$$

where we used the fact that $\mathbf{X}\mathbf{X}^T$ is symmetric and so is its inverse.
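This orthogonality argument is easy to check numerically as well (an illustrative sketch with random data; the projector $\mathbf{I}-\mathbf{X}^{\dagger}\mathbf{X}$ onto the null space of $\mathbf{X}$ is used to build another solution):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

beta_star = X.T @ np.linalg.solve(X @ X.T, y)             # minimum-norm solution
P_null = np.eye(d) - np.linalg.pinv(X) @ X                # projector onto null(X)
beta_other = beta_star + P_null @ rng.standard_normal(d)  # another exact solution
u = beta_other - beta_star

print(np.allclose(X @ beta_other, y))                     # still solves X beta = y
print(np.isclose(u @ beta_star, 0))                       # u is orthogonal to beta*
print(np.linalg.norm(beta_other) >= np.linalg.norm(beta_star))  # True
```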

DeltaIV