17

There are two common ways to define the lasso.

Constrained definition: for some $t \geq 0$, $$\min\limits_{\beta} \|y-X\beta\|^2 \quad \text{subject to} \quad \sum\limits_{p}|\beta_p|\leq t.$$ Penalized definition: for some $\lambda \geq 0$, $$\min\limits_{\beta} \|y-X\beta\|^2+\lambda\sum\limits_{p}|\beta_p|.$$
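As a quick sanity check (not a proof), one can solve the penalized definition for some $\lambda$, set $t$ to the $\ell_1$ norm of that solution, and check that the constrained definition returns the same coefficients. A minimal sketch of such a check, assuming the cvxpy package is available and using synthetic data (this is only an illustration, not the proof I am asking for):

```python
import numpy as np
import cvxpy as cp

# Synthetic data, just for illustration
rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

lam = 5.0  # an arbitrary penalty level

# Penalized definition: min ||y - X b||^2 + lam * sum_p |b_p|
b_pen = cp.Variable(p)
cp.Problem(cp.Minimize(cp.sum_squares(y - X @ b_pen) + lam * cp.norm1(b_pen))).solve()

# Constrained definition, with t set to the l1 norm of the penalized solution
t = np.linalg.norm(b_pen.value, 1)
b_con = cp.Variable(p)
cp.Problem(cp.Minimize(cp.sum_squares(y - X @ b_con)),
           [cp.norm1(b_con) <= t]).solve()

# The two solutions should agree up to solver tolerance
print(np.max(np.abs(b_pen.value - b_con.value)))
```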

But how can one show that these two definitions are equivalent for some $t$ and $\lambda$? I think Lagrange multipliers are the key to relating them. However, I failed to work it out rigorously, because I had to assume that the constraint is active ($\sum\limits_{p}|\beta_p|=t$) in the constrained definition.

Can anyone show me a complete and rigorous proof that these two definitions are equivalent for some $t$ and $\lambda$?

Thank you very much if you can help.

EDIT: Following the comments below, I have edited my question.

  • I think you have a problem here... $y-X\beta$ is a vector, so the squared term is ill-posed. Furthermore, $|\beta|$ is a scalar, with no subscripts to sum over. I'm thinking you've put the summation in the wrong place. For instance, I suspect the penalty definition is $\left(\sum_i(y_i-X_i\beta)^2\right)+\lambda|\beta|$. – Michael Grant Jun 10 '13 at 04:20
  • Furthermore, there is certainly not a one-to-one correspondence between $\lambda$ and $t$ without further qualifications. For instance, let $\bar{\beta}$ be the minimizer of the penalty definition with $\lambda=0$. Then the optimal value of the constrained problem is the same for any $t\geq|\bar{\beta}|$. Thus all values of $t\in[|\bar{\beta}|,+\infty)$ correspond to $\lambda=0$. Similarly, for some choices of the norm $|\beta|$, there may be an infinite interval of $\lambda$ values corresponding to $t=0$. – Michael Grant Jun 10 '13 at 04:31
  • Your edits are not sufficient. First of all, $\beta_i$ is a scalar, so $\|\beta_i\|$ is just $|\beta_i|$, correct? Given that this is the LASSO I'd just replace the whole summation with $\|\beta\|_1$ and be done with it. But there is still the matter of the quantity $(y-X\beta)^2$, which is a vector, not a scalar. So the objective function is ill-posed. – Michael Grant Jun 10 '13 at 17:06

3 Answers

7

Here is one direction.

(1) The constrained problem is of the form \begin{array}{ll} \text{Find} & x \\ \text{To minimize} & f(x) \\ \text{such that} & g(x) \leqslant t \\ & \llap{-} g(x) \leqslant t. \end{array} Its Lagrangian is $$ L(x, \mu_1, \mu_2) = f(x) + \mu_1' ( g(x) - t ) + \mu_2' ( - g(x) - t ) $$ and the KKT conditions are \begin{align*} \nabla f + \mu_1' \nabla g - \mu_2' \nabla g &= 0 \\ \mu_1, \mu_2 &\geqslant 0 \\ \mu_1' ( g(x) - t ) &= 0 \\ \mu_2' ( - g(x) - t ) &= 0 . \end{align*}

(2) The penalized problem is just the minimization of $f(x) + \lambda' g(x)$. It is unconstrained, and the first order condition is $$ \nabla f + \lambda ' \nabla g = 0. $$

Given a solution of the constrained problem, the penalized problem with $\lambda = \mu_1 - \mu_2$ has the same solution. (For a complete proof, you also need to check that, in your situation, the KKT conditions and the first order condition are necessary and sufficient conditions.)
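To make this concrete for the lasso: with $f(\beta) = \|y-X\beta\|^2$ and $g(\beta) = \|\beta\|_1$, and with a subgradient $s \in \partial\|\beta\|_1$ in place of $\nabla g$ (since $g$ is not differentiable at coordinates equal to zero, as the comments below discuss), the first-order condition of the penalized problem reads $$ -2X^\top(y - X\beta) + \lambda s = 0, \qquad s_j = \operatorname{sign}(\beta_j) \ \text{if } \beta_j \neq 0, \quad s_j \in [-1,1] \ \text{if } \beta_j = 0, $$ equivalently $|X_j^\top(y - X\beta)| \leq \lambda/2$ for every coordinate $j$, with equality whenever $\beta_j \neq 0$.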

  • Aren't the first order conditions for the penalised problem formulation inapplicable here? The penalty function $g$ is not differentiable here. – Vossler Mar 26 '16 at 20:20
  • Also, the constrained problem is not of the form that you wrote, since it is a sum of absolute values, not the absolute value of the sum. – Vossler Mar 26 '16 at 21:16
  • @Vossler: the KKT (and first order) conditions are still applicable with a subgradient instead of a gradient: since $g$ is convex, it has a subgradient. The correct constraint can be recovered by setting $g(x) = \left| x \right| _1$; the $-g(x) \leqslant t$ constraint I had added is then no longer needed. For a better explanation of the equivalence between the constrained and penalized formulations of the lasso, one can check Statistical Learning with Sparsity, in particular exercises 5.2 to 5.4. – Vincent Zoonekynd Mar 27 '16 at 15:09
  • My questions were the same as @Vossler's and thanks to Vincent Zoonekynd for the pointers. The link you added is no longer active, but I found others and wrote the full answer below. – travelingbones Apr 20 '23 at 13:01
0

It's not really intuitive to see, but here is one way to look at it using only elementary arguments.

Suppose $\beta^{*}$ is a solution to the regression with penalty problem (with some $\lambda$) and $\beta^{**}$ is a solution to the regression with constraint problem with $t = |\beta^{*}|$ (where $|\cdot|$ denotes the $\ell_1$ norm: $|\beta| = \sum\limits_{p}|\beta_p|$). We show that the two problems are equivalent in the sense that $\beta^{*}$ is also a solution to the constraint problem and that $\beta^{**}$ is also a solution to the penalty problem.

  1. Because $\beta^{*}$ is a solution of the penalty problem, for all $\beta$ we have $\|y-X\beta\|^2+\lambda|\beta| \ge \|y-X\beta^{*}\|^2+\lambda|\beta^{*}|$,
    which implies that $\|y-X\beta\|^2 \ge \|y-X\beta^{*}\|^2$ for all $\beta$ such that $|\beta| \le t = |\beta^{*}|$, from which we conclude that $\beta^{*}$ is a solution to the constraint problem.
  2. Because $\beta^{**}$ is a solution of the constraint problem and $\beta^{*}$ is feasible for it (since $|\beta^{*}| = t$), we have $|\beta^{**}| \le t=|\beta^{*}|$ and $\|y-X\beta^{*}\|^2 \ge \|y-X\beta^{**}\|^2$, and because $\beta^{*}$ is a solution of the penalty problem we have
    $\forall \beta, \space \|y-X\beta\|^2+\lambda|\beta| \ge \|y-X\beta^{*}\|^2+\lambda|\beta^{*}|$.
    Those imply $\forall \beta, \space \|y-X\beta\|^2+\lambda|\beta| \ge \|y-X\beta^{**}\|^2+\lambda|\beta^{**}|$ which allows us to say that $\beta^{**}$ is a solution to the penalty problem.

We can easily see that $|\beta^{*}| = |\beta^{**}|$ and $\|y-X\beta^{*}\|^2 = \|y-X\beta^{**}\|^2$ but we don't really need this in the proof.
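Going the other way, from a given $t$ to a matching $\lambda$, is where the Lagrangian machinery in the other answers comes in: the multiplier of the $\ell_1$ constraint plays the role of $\lambda$. Below is a minimal numerical sketch of that correspondence, assuming the cvxpy package is available and using synthetic data; it illustrates the claim but is not part of the proof above.

```python
import numpy as np
import cvxpy as cp

# Synthetic data, just for illustration
rng = np.random.default_rng(1)
n, p = 50, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# Constrained problem for a fixed t (small enough that the constraint binds)
t = 1.0
b_con = cp.Variable(p)
con = cp.norm1(b_con) <= t
cp.Problem(cp.Minimize(cp.sum_squares(y - X @ b_con)), [con]).solve()

# The dual value of the l1 constraint supplies a matching penalty level
lam = con.dual_value
b_pen = cp.Variable(p)
cp.Problem(cp.Minimize(cp.sum_squares(y - X @ b_pen) + lam * cp.norm1(b_pen))).solve()

# The penalized solution should coincide with the constrained one (up to solver tolerance)
print(lam, np.max(np.abs(b_con.value - b_pen.value)))
```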

  • If you solve the penalty problem for some $\lambda$ and get $\beta^{*}$, then you set $t = |\beta^{*}|$ and solve the "constraint problem with $t$" to get $\beta^{**}$ (as you did in 1.), then indeed, $\beta^{*}$ is also a solution to the constraint problem. But 2 does not work as written b/c it is circular. Fix $t$ first and solve the constraint problem to get $\beta^{**}$. Now which $\lambda$ do you choose to define the penalty problem?? You don't yet have $\beta^{*}$ b/c you don't know which $\lambda$ to use to formulate the penalty problem. We need the Lagrangian with $g = |\beta|-t$, which isn't $C^1$ – travelingbones Apr 19 '23 at 01:12
0

Let $f$ be convex and $C^1$ (continuously differentiable), and let $g(x) = \|x\|_1$.

Note that $g$ is convex and not differentiable if some component $x(i)=0$.

Let (*) be the penalty problem: for fixed $\lambda \geq 0$, $$\text{arg}\min_x\; f(x) + \lambda g(x) \qquad (*)$$

Let (**) be the constrained optimization problem: For fixed $r>0$,

$$\text{arg}\min_x\; f(x) \quad \text{subject to } g(x) \leq r \qquad (**)$$

  1. Now the forward direction follows from step 1 of LE TRAN Duc Kinh's answer. I'll repeat it here with my notation:

Suppose $x_0$ is a solution to (*). Set $r = \|x_0\|_1$, so $g(x)\leq r$ is simply the condition that $x$ is in the $\ell_1$ ball of radius $r$, and let $x_1$ be a solution to the constrained problem (**) (for this $r$).

Since $\|x_0\|_1 = r$, $x_0$ is feasible for (**), so $f(x_1) \leq f(x_0)$. On the other hand, $f(x_0) + \lambda g(x_0) \leq f(x_1) + \lambda g(x_1)$ (since $x_0$ solves (*)), from which it follows that $f(x_0) \leq f(x_1)$, because $g(x_1) \leq r = g(x_0)$ and $\lambda \geq 0$. Hence $f(x_0) = f(x_1)$ and $x_0$ is a solution to (**)!

  2. For the reverse direction, we need the machinery of subgradients. I am following notes here and book here (Sec. 5.2.2).

Definition: $z$ is a subgradient of a function $h$ at $x$ iff for all $y$ we have $$ h(y) - h(x) \geq \langle z, y-x \rangle.$$ Let $\partial h(x) = \{z: z \text{ is a subgradient of } h \text{ at } x\}$.

It can be shown that if $h$ is differentiable at $x$, then $\partial h(x)$ is the singleton set $\{\nabla h(x)\}.$
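For the $\ell_1$ norm used here, the subdifferential can be written out explicitly: $$ \partial \|x\|_1 = \bigl\{\, z : z_i = \operatorname{sign}(x_i) \ \text{if}\ x_i \neq 0, \ \ z_i \in [-1, 1] \ \text{if}\ x_i = 0 \,\bigr\}, $$ so in one dimension $\partial\,|\cdot|\,(0) = [-1,1]$.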

Note that it follows immediately from the definition that $z \in \partial h(x)$ means $z$ points in a direction where $h$ does not decrease: for $\eta > 0$, $h(x+\eta z) - h(x) \geq \langle z, \eta z\rangle = \eta \|z\|_2^2 \geq 0.$

Theorem: If $f$ is differentiable at $x$ and $g$ is convex, then the "obvious" generalization of linearity of differentiation applies: $\partial(f+g)(x) = \{\nabla f(x) + z: z\in \partial g(x) \} = \nabla f(x) + \partial g(x)$.

Theorem (optimization of convex, non-differentiable $h$): $x_0 = \text{arg}\min_x h(x)$ iff $h(y) - h(x_0) \geq 0$ for all $y$. Using the definition above, this holds iff $0 \in \partial h(x_0).$

Theorem (constrained optimization of convex, non-differentiable $h$): $x_0 = \text{arg}\min_x h(x)$ subject to $x\in X$ ($X$ closed and convex) iff $h(y) - h(x_0) \geq 0$ for all $y\in X$. This holds iff there exists some $z \in \partial h(x_0)$ so that $\langle z, y-x_0\rangle \geq 0$ for all $y\in X$; the existence of such a $z$ is sufficient by the definition above, and for convex $h$ and convex $X$ it is also necessary.
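As a small worked example of the unconstrained optimality condition, consider the one-dimensional lasso $h(b) = (y-b)^2 + \lambda|b|$ with $\lambda > 0$. For $\hat b \neq 0$ the condition $0 \in \partial h(\hat b)$ reads $-2(y-\hat b) + \lambda\operatorname{sign}(\hat b) = 0$, and $\hat b = 0$ is optimal exactly when $0 \in -2y + \lambda[-1,1]$, i.e. $|y| \leq \lambda/2$; together this gives the soft-thresholding formula $$ \hat b = \operatorname{sign}(y)\,\max\!\left(|y| - \tfrac{\lambda}{2},\, 0\right). $$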

We are now prepared to show the reverse direction.

It is stated in the second reference that the Lagrangian/KKT conditions hold for convex (non-differentiable) functions with the subgradient in place of the gradient. We show how this works (because we need it). Draw a picture of the $\ell_1$ ball (a level set of $g$) and some level sets of $f$ to see what is going on.

Fix $r>0$ and let $x_1$ be a solution to (**). If $\|x_1\|_1 < r$, then $x_1$ is an unconstrained local minimum (hence, by convexity, a global one); set $\lambda_1 = 0$. If $\|x_1\|_1 = r$, then $\nabla f(x_1)$ must point into $\{ g \leq r \}$ (the gradient points directly uphill, so if it pointed into $\{ g > r \}$, stepping backwards along the gradient would give a lower value of $f$ inside the constraint region). This means $\nabla f(x_1)$ is parallel and opposite in direction to some $z_1 \in \partial g(x_1)$ (recall that $z_1$ points in a direction along which $g$ is not decreasing, by the definition of the subgradient). Hence there is some $\lambda_1 > 0$ so that $\nabla f(x_1) = -\lambda_1 z_1$ (if $\nabla f(x_1) = 0$, simply take $\lambda_1 = 0$ as before).

We can now see that the pair $(x_1, \lambda_1)$ is a stationary point (derivative equal to zero, in the subgradient sense) of the Lagrangian $L(x, \lambda) = f(x) + \lambda (g(x)-r)$, considered only where $\lambda \geq 0$.
Specifically, this follows from the fact that $\nabla f(x_1) + \lambda_1 z_1 = 0 \in \partial_x L(x_1, \lambda_1)$.

We have now shown that a solution to the constrained problem $(**)$ gives a stationary point of the Lagrangian with some $\lambda \geq 0$.

To finish, consider (*) with $\lambda = \lambda_1$. It follows immediately that $0 = \nabla f(x_1) +\lambda_1 z_1 \in \partial (f + \lambda_1 g)(x_1)$, so by the unconstrained optimality theorem above, $x_1$ is a solution to (*) with $\lambda = \lambda_1$. This completes the reverse direction.