3

I'm currently learning machine learning, and I came across this equation for least squares regression.

X is a matrix of inputs and w is the vector of weights. Their product Xw is the prediction y hat, which is supposed to be (approximately) equal to y.

We want to minimize the squared error given by this equation by changing w.

$$\lVert Xw - y \rVert^2 = \sum_i (\hat y_i - y_i)^2$$

w can be found by taking the derivative of this function with respect to w and setting the derivative equal to zero.
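
For concreteness, here is a minimal NumPy sketch of that result (the data and the names `X`, `y`, `true_w` are made up for illustration): setting the gradient of the squared error to zero gives the normal equations $X^\top X w = X^\top y$, and the resulting $w$ agrees with a numerical least squares solver.

```python
import numpy as np

# Toy data: 50 samples, 3 features (made-up numbers, purely for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)   # y is roughly X @ true_w plus noise

# Setting the gradient of ||Xw - y||^2 to zero gives the normal equations
#   X^T X w = X^T y,   so   w = (X^T X)^{-1} X^T y   (when X^T X is invertible).
w_closed_form = np.linalg.solve(X.T @ X, X.T @ y)

# NumPy's least squares routine minimizes the same squared error numerically.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_closed_form, w_lstsq))   # True: both give the same minimizer
```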

The question is: what does this mean intuitively?

I know that a derivative measures the rate of change. But what does it intuitively mean for the rate of change to equal 0?

  • It means that the function is at an extremum of some kind. At these points, the function doesn't change much if you change your variable a little. Consider $f(x) = x^2$. At zero, it has a derivative of zero, and if you move just a little away from zero, the function values don't change much from zero. If instead you took $x=20$, then if you change $x$ to $20.1$, the function values change quite a bit. – Cameron L. Williams Jul 08 '15 at 04:16
  • Thanks for your reply! So it sounds like at the minimum, the change in w does not affect the function much. But how does this lead to the minimizing w? – user1157751 Jul 08 '15 at 04:41
  • @CameronWilliams: no, not every critical point is "an extremum of some kind". But every extremum (of a differentiable function) is a critical point. – Robert Israel Jul 08 '15 at 04:41
  • The way I always deal with this problem is this: Let $Py$ be the orthogonal projection of $y$ onto the column space of $X$; thus $Py= X\hat w$ for some vector $\hat w$. Let $Qy=(I-P)y$ be the complementary orthogonal projection onto the orthogonal complement of the column space of $X$. Then $|Xw-y|^2 = |(Xw-Py)+(I-P)y|^2$ $= |Xw-Py|^2 + |(I-P)y|^2$ by orthogonality. One can choose $w$ so as to make the first square $0$; that's the value of $w$ that minimizes the sum. And it's not hard to show that the value of $w$ that does that is the one given in the formula. ${}\qquad{}$ – Michael Hardy Jul 08 '15 at 05:14
  • Notice that $X$ typically has many more rows than columns, and $X^\top X$ is invertible if and only if $X$ has linearly independent columns. If $X$ does not have linearly independent columns then the value of $w$ that minimizes the sum of squares is not unique. If $X$ does have linearly independent columns then $X$ has a left inverse, which is $(X^\top X)^{-1}X^\top$. ${}\qquad{}$ – Michael Hardy Jul 08 '15 at 05:19
  • @RobertIsrael of course you're right. I was trying to give some intuition without being overly verbose. Can't fit a full explanation in a comment. – Cameron L. Williams Jul 08 '15 at 05:37
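
The projection argument in the comments above can also be checked numerically. Below is a minimal NumPy sketch (made-up data; `P` is the projector $X(X^\top X)^{-1}X^\top$ onto the column space of $X$, assuming $X$ has linearly independent columns):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))          # tall matrix with independent columns
y = rng.normal(size=20)

# Orthogonal projector onto the column space of X.
P = X @ np.linalg.solve(X.T @ X, X.T)

# For an arbitrary w, the squared error splits by orthogonality:
#   ||Xw - y||^2 = ||Xw - Py||^2 + ||(I - P)y||^2
w = rng.normal(size=3)
lhs = np.linalg.norm(X @ w - y) ** 2
rhs = np.linalg.norm(X @ w - P @ y) ** 2 + np.linalg.norm(y - P @ y) ** 2
print(np.allclose(lhs, rhs))          # True

# The second term does not depend on w, so the minimum is reached when Xw = Py,
# i.e. w = (X^T X)^{-1} X^T y  (the left inverse mentioned above, applied to y).
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(X @ w_hat, P @ y))  # True
```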

2 Answers

3

The derivative of a function $f(x)$ being zero at a point $p$ means that $p$ is a stationary point. That is, the function is not "moving" there (its rate of change is $0$). There are a few things that could happen at such a point.

The function has either a local maximum, a local minimum, or a saddle point there. To determine which, you need to find out what happens around the point. For example, $f(x)=x^2$ has a minimum at $x=0$, $f(x)=-x^2$ has a maximum at $x=0$, and $f(x)=x^3$ has neither. You can see this by looking at the sign of the derivative to the left and to the right of the point. If there is a sign change, it's an extremum; if there's no sign change, it's a saddle point. I'll leave it to you to figure out which sign change means maximum and which means minimum.
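
As a quick illustration of the sign-change test on these three examples, here is a small Python sketch using centered finite differences (the step size and the test points $\pm 0.1$ are arbitrary choices):

```python
import numpy as np

def derivative_sign(f, x, h=1e-6):
    """Sign of a centered finite-difference estimate of f'(x)."""
    return np.sign((f(x + h) - f(x - h)) / (2 * h))

for name, f in [("x^2", lambda x: x**2),
                ("-x^2", lambda x: -x**2),
                ("x^3", lambda x: x**3)]:
    left, right = derivative_sign(f, -0.1), derivative_sign(f, 0.1)
    print(f"{name}: derivative sign left of 0 = {left:+.0f}, right of 0 = {right:+.0f}")

# x^2:  -1 -> +1   (sign change, so x = 0 is an extremum)
# -x^2: +1 -> -1   (sign change, so x = 0 is an extremum)
# x^3:  +1 -> +1   (no sign change, so x = 0 is not an extremum)
```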

1

The least squares objective is convex, so the stationary point is a minimum. In fact, the least squares solutions form a convex set.

To show the set of least squares minimizers is convex, consider the linear system $Ax = b$ where the system matrix $A\in\mathbb{C}^{m\times n}$, the data vector $b\in\mathbb{C}^{m}$, and the solution vector $x\in\mathbb{C}^{n}$. The least squares solution $x_{LS}$ is defined as $$ x_{LS} = \left\{x\in\mathbb{C}^{n} \colon \lVert Ax - b \rVert_{2}^{2} \text{ is minimized} \right\}. $$

Take a vector in the null space, $\eta\in\mathcal{N}(A)$. Then $A(x_{LS}+\eta) = Ax_{LS}$, because $A\eta = 0$, so $x_{LS}+\eta$ is also a minimizer. The computation below shows that a convex combination of minimizers is again a minimizer, which proves that the set of minimizers is convex.

Given $0 \le \lambda \le 1$, $$ \begin{align} \lVert A\bigl(\lambda x_{LS} + (1 - \lambda)(x_{LS} + \eta)\bigr) - b\rVert_{2}^{2} &= \lVert \lambda Ax_{LS} + A x_{LS} + \underbrace{A \eta}_{0} - \lambda A x_{LS} - \underbrace{\lambda A \eta}_{0} - b \rVert_{2}^{2} \\ &= \lVert Ax_{LS} - b \rVert_{2}^{2}. \end{align} $$ The convex combination attains the same minimal value, so it is itself a minimizer; hence the set of minimizers is convex.
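
As a minimal numerical check of this argument (real-valued, made-up data; the third column of `A` is deliberately the sum of the first two, so the null space is nontrivial and the minimizer is not unique):

```python
import numpy as np

rng = np.random.default_rng(2)
# Rank-deficient A: the third column equals the sum of the first two,
# so A has a nontrivial null space and the least squares minimizer is not unique.
B = rng.normal(size=(10, 2))
A = np.hstack([B, B[:, :1] + B[:, 1:2]])
b = rng.normal(size=10)

# One particular least squares minimizer (the minimum-norm solution from lstsq).
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# A null-space vector: [1, 1, -1]^T, since col3 = col1 + col2.
eta = np.array([1.0, 1.0, -1.0])
print(np.allclose(A @ eta, 0))                                  # True

def sq_residual(x):
    return np.linalg.norm(A @ x - b) ** 2

lam = 0.3                                                       # any lambda in [0, 1]
combo = lam * x_ls + (1 - lam) * (x_ls + eta)                   # convex combination
print(np.isclose(sq_residual(x_ls), sq_residual(x_ls + eta)))   # True: same residual
print(np.isclose(sq_residual(x_ls), sq_residual(combo)))        # True: still minimal
```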

dantopa
  • 10,768