I was looking at Andrew Ng's machine learning course, where for linear regression he defines a hypothesis function $h(x) = \theta_0 + \theta_1x_1 + \dots + \theta_nx_n$, where $x$ is a vector of feature values. The goal of linear regression is to find the $\theta$ that most closely estimates the real result. To measure how wrong the hypothesis is compared to how the data is actually distributed, he uses the squared error $$ \mathrm{error} = (h(x) - y)^2, $$ where $y$ is the real result. Since there are $m$ training examples in total, he aggregates them so that every error is accounted for, defining the cost function $$ J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h(x_i) - y_i\bigr)^2, $$ where $x_i$ is a single training example.
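For concreteness, here is a minimal NumPy sketch of how I understand this cost function; the data `X`, `y`, and `theta` are made up for illustration and are not from the course:

```python
import numpy as np

# Hypothetical data: m = 4 training examples, n = 2 features.
# A leading column of ones lets theta_0 act as the intercept term.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 1.0, 0.5],
              [1.0, 4.0, 2.0],
              [1.0, 3.0, 1.5]])
y = np.array([6.0, 2.5, 9.0, 6.5])
theta = np.array([0.5, 1.0, 1.0])

def cost(theta, X, y):
    """Least-squares cost J(theta) = (1 / (2m)) * sum_i (h(x_i) - y_i)^2."""
    m = len(y)
    residuals = X @ theta - y   # h(x_i) - y_i for every training example
    return residuals @ residuals / (2 * m)

print(cost(theta, X, y))
```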
He states that $J(\theta)$ is convex, with only one local optimum. I want to know why this function is convex.