35

Tikhonov regularization (or ridge regression) adds a constraint that $\|\beta\|^2$, the $L^2$-norm of the parameter vector, is not greater than a given value (say $c$). Equivalently, it may solve an unconstrained minimization of the least-squares penalty with $\alpha\|\beta\|^2$ added, where $\alpha$ is a constant (this is the Lagrangian form of the constrained problem).

The above is from Wikipedia. Why is the unconstrained LS with $\alpha\|\beta\|^2$ added to the cost equivalent to the LS problem with an additional constraint that $\|\beta\|^2 \leq c$?

What is the relation between $\alpha$ and $c$?

Thanks!

Tim
  • 49,162

3 Answers3

38

Let us first define the two problems:

  • Problem 1: \begin{equation} \min_{\beta} ~f_\alpha(\beta):=\frac{1}{2}\Vert y-X\beta\Vert^2 +\alpha\Vert \beta\Vert^2\end{equation}
  • Problem 2: \begin{align} \min_{\beta} ~&\frac{1}{2}\Vert y-X\beta\Vert^2\\ s.t.~&\Vert \beta\Vert^2-c\leq 0\end{align}

The Lagrangian for Problem 2 reads: \begin{equation} \mathcal{L}(\beta,\lambda)=\frac{1}{2}\Vert y-X\beta\Vert^2+\lambda (\Vert \beta\Vert^2-c) \end{equation} and you probably already see the resemblance with Problem 1 (identical except for the constant term $-\lambda c$).

Now let us look at the necessary conditions for optimality. For Problem 1, these read: \begin{equation} \nabla_\beta f_\alpha(\beta^*(\alpha))=0 \end{equation} where we voluntarily write $\beta^*(\alpha)$ to show that this is the optimal solution for a given $\alpha$.

For Problem 2, the KKT conditions imply that we have: \begin{align*} \nabla_\beta \mathcal{L}(\beta^*,\lambda^*)&=\nabla_\beta f_\lambda(\beta^*)=0\\ \lambda^* (\Vert \beta^*\Vert^2-c)&=0 \end{align*} The first line says that the gradient of the Lagrangian with respect to $\beta$ should be null and the second is the complementary condition. (We also need $\lambda^* \geq 0$, but this is less important for our discussion). Also observe that the gradient of the Lagrangian is equal to the gradient of $f_\lambda$ (objective function of problem 1 but with $\lambda$ instead of $\alpha$).

Now suppose we solve Problem 1 for a given $\alpha$ and obtain its solution $\beta^*(\alpha)$. Let $c=\Vert \beta^*(\alpha)\Vert^2$, the squared norm of the solution to Problem 1. Then $\lambda^*=\alpha$ and $\beta^*=\beta^*(\alpha)$ satisfy the KKT conditions for Problem 2, showing that both Problems have the same solution. Conversely, if you solved Problem 2, you could set $\alpha=\lambda^*$ to retrieve the same solution by solving Problem 1.

To sum it up, both problems are equivalent when $c=\Vert \beta^*(\alpha)\Vert^2$.

  • That's clear, thanks! – Tim Apr 07 '13 at 16:43
  • 3
    I am not sure, how does this equivalence works in practice, when we want to really bound $|\beta|^2$ by a certain amount $c$ ,but by using regularization. Shall we solve many regularized optimization problems with different regularization parameter $\alpha$ and each time look at $| \beta^*(\alpha)|$ to see if it matches our certain bound? Is it tractable at all? – user25004 Dec 17 '13 at 23:26
  • If $\alpha=0$, does it mean the corresponding $c$ can be any non-negative number, according to the second KKT condition? – Ziyuan Apr 06 '15 at 21:57
  • 1
    "Conversely, if you solved Problem 2, you could set α=λ∗ to retrieve the same solution by solving Problem 1". This does not address the second KKT condition. In general, the minimizer of the Lagrangian (w.r.t beta) must not be primal optimal. To complete the proof you must point out why the second KKT condition can be ignored in this case. – Leo Dec 29 '18 at 23:19
  • 2
    Here's a counter example. Suppose problem 2 is to minimize the function $f(x)=0$ under $x^2<=1$. We have $\lambda^=0$ so the Lagrangian is $0$. If we set $\alpha=\lambda^$, $x=100$ solves problem 1 but not 2 since it is not primal feasible. The problems are equivalent when the Lagrangian has a unique solution, which occurs for $\lambda>0$. – Leo Jan 04 '19 at 11:54
  • The second comment is still an unanswered question. Given problem 1, can we pose problem 2 (know the constraint $c$) without solving problem 1. This answer implies that given problem 1, one would need to find $\beta^(\alpha)$, then we know $c>= |\beta(\alpha)|^2$. How do we find the constraint before solving problem 1? E.g. see question here: https://stats.stackexchange.com/questions/610517/can-we-convert-the-optimization-of-a-loss-function-with-regularization-to-the-la – travelingbones Mar 23 '23 at 22:18
3

Joe's answer looks good, but if you're also looking for a citation, this paper covers it as well in Theorem 1: http://papers.nips.cc/paper/3675-efficient-and-accurate-lp-norm-multiple-kernel-learning (Note: The meat of the proof is actually in the supplemental materials).

Kloft et al, "Efficient and Accurate Lp-Norm Multiple Kernel Learning". NIPS 2009.

jhclark
  • 131
2

You can do this directly if you want to. To solve the optimization problem \begin{align} \min_{\beta} ~&\Vert y-X\beta\Vert^2\\ \mathrm{s.t.}~&\Vert \beta\Vert^2\le c\ , \end{align} as in the standard primal-dual procedure, first let \begin{align} g(\lambda)=&\inf_\beta\mathcal{L}(\beta,\lambda)\\ =&\inf_\beta\Vert y-X\beta\Vert^2+\lambda (\Vert \beta\Vert^2- c)\\ =& \Vert y-X(X^\mathrm{T}X+\lambda I)^{-1}X^\mathrm{T}y\Vert^2 + \lambda (\Vert(X^\mathrm{T}X+\lambda I)^{-1}X^\mathrm{T}y\Vert^2-c)\ , \end{align} then solve $\max_{\lambda\ge 0} g(\lambda)$. You will find that $$ \frac{\partial g}{\partial\lambda}=y^\mathrm{T}X(X^\mathrm{T}X+\lambda I)^{-2}X^\mathrm{T}y-c=0\iff c=\Vert\beta^*_{\mathrm{ridge}}(\lambda)\Vert^2\ . $$

The matrix derivatives \begin{align} \frac{\partial AU(x)B}{\partial x} = & A\frac{\partial U(x)}{\partial x}B\\ \frac{\partial U(x)^{-1}}{\partial x} = &-U(x)^{-1} \frac{\partial U(x)}{\partial x}U(x)^{-1} \end{align} will be helpful.

Update:

By the way you can prove when $\lambda$ increases, $c$ doesn't increase. More generally, Let $L(x;\lambda)=f(x)+\lambda g(x)$, and $x_i^*=\mathrm{arg\,min}_xL(x;\lambda_i)\,(i=1,2)$. Suppose $\lambda_2>\lambda_1$ and $g(x_2^*)>g(x_1^*)$, we have \begin{align} &(\lambda_2-\lambda_1)(g(x_2^*)-g(x_1^*))>0\\ \Longrightarrow & \lambda_1g(x_1^*)+\lambda_2g(x_2^*)>\lambda_1g(x_2^*)+\lambda_2g(x_1^*)\\ \Longrightarrow & [f(x_1^*)+\lambda_1g(x_1^*)]+[f(x_2^*)+\lambda_2g(x_2^*)]>[f(x_2^*)+\lambda_1g(x_2^*)]+[f(x_1^*)+\lambda_2g(x_1^*)] \ge [f(x_1^*)+\lambda_1g(x_1^*)]+[f(x_2^*)+\lambda_2g(x_2^*)] \end{align} which is a contradiction, so $g(x^*)$ doesn't increase when $\lambda$ increases. In the context of OP's problem, $c$ doesn't increase when $\lambda$ increases.

Ziyuan
  • 2,171