
My understanding of LASSO regression is that the regression coefficients are selected to solve the minimisation problem:

$$\min_\beta \|y - X \beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \leq t$$

In practice this is done using a Lagrange multiplier, making the problem to solve

$$\min_\beta \|y - X \beta\|_2^2 + \lambda \|\beta\|_1 $$

What is the relationship between $\lambda$ and $t$? Wikipedia unhelpfully states simply that it is "data dependent".

Why do I care? Firstly, out of intellectual curiosity: having searched this site and others, I can't find any explicit discussion of this link.

Beyond being curious, I am concerned about the consequences for selecting $\lambda$ by cross-validation.

Specifically, if I'm doing $n$-fold cross-validation, I fit $n$ different models to $n$ different subsets of my training data, and then compare the accuracy of each model on the held-out data for a given $\lambda$. But the same $\lambda$ implies a different constraint $t$ for different subsets of the data (i.e., $t = f(\lambda)$ is "data dependent").

Isn't the cross-validation problem I really want to solve to find the $t$ that gives the best bias–variance trade-off?

I can get a rough idea of the size of this effect in practice by calculating $\|\beta\|_1$ for each cross-validation split and each $\lambda$, and looking at the resulting distribution. In some cases the implied constraint $t$ can vary quite substantially across my cross-validation subsets; by "substantially" I mean that the coefficient of variation of $t$ is well above zero ($\mathrm{CV}(t) \gg 0$).
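For concreteness, here is a minimal sketch of how I do this check, using scikit-learn's `Lasso` (whose `alpha` parameter corresponds to $\lambda$ up to scikit-learn's $1/(2n)$ rescaling of the squared loss); the data are simulated purely for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

# Simulated data, purely for illustration
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(200)

lam = 0.1  # one fixed penalty (sklearn's alpha)
t_values = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    beta = Lasso(alpha=lam).fit(X[train_idx], y[train_idx]).coef_
    t_values.append(np.abs(beta).sum())  # implied constraint t = ||beta||_1

t_values = np.array(t_values)
print("implied t per fold:", t_values)
print("coefficient of variation of t:", t_values.std() / t_values.mean())
```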

So why is it OK to find an optimal $\lambda$, where the amount of regularisation implied by $\lambda$ depends on how I slice the data, instead of explicitly finding the optimal amount of regularisation (i.e., $t$)?

Jean-Claude Arbaut

1 Answer


The relationship between $\lambda$ and $t$ is given via the KKT conditions; specifically, that of complementary slackness, which states that

$$\lambda(\|\widehat{\beta}\|_1 - t) = 0,$$

where $\widehat{\beta}$ is the solution to the LASSO problem (and is itself a function of $\lambda$). Thus, for nonzero $\lambda$ (i.e., when the LASSO solution is not just the OLS estimator), the connection between $\lambda$ and $t$ is $\|\widehat{\beta}\|_1 = t$.

Unfortunately, this is likely the most that can be said in general, as the LASSO estimator does not have a closed-form solution. However, the connection can be made more explicit outside the high-dimensional setting, i.e., when $n \geq p$.

In this setting, suppose that $X$ is orthogonal (otherwise, use the QR decomposition of $X$ and perform the change of variables $\gamma = R\beta$; then $X\beta = Q\gamma$, so we can equivalently run the LASSO to estimate $\gamma$ with the orthogonal design $Q$). Then each component of the LASSO estimator has a closed form: $$\widehat{\beta}_i = \operatorname{sign}(\beta^{OLS}_i)\cdot(|\beta^{OLS}_i| - n\lambda)^+.$$
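As a minimal numerical sketch of this soft-thresholding formula (keeping the $n\lambda$ threshold above, which depends on how the squared loss is scaled), one might write:

```python
import numpy as np

def lasso_orthogonal(beta_ols, lam, n):
    # Componentwise soft-thresholding: sign(b) * (|b| - n*lam)^+,
    # matching the closed form above for an orthogonal design.
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - n * lam, 0.0)
```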

By the KKT condition, we have the relationship $$\sum_{i=1}^p(|\beta^{OLS}_i| - n\lambda)^+ = t.$$

Now the relationship is easily seen: as $\lambda$ increases, more components of the sum are set to zero, so $t$ must decrease, and vice versa. The exact relationship depends on $n$ and on the OLS estimator $\beta^{OLS} = X^\top y$. However, as long as $X$ and $y$ are given, one can use the above formula to solve exactly for the relationship between $\lambda$ and $t$.
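Continuing the sketch above (reusing the hypothetical `lasso_orthogonal` helper, with invented data for illustration), one can trace out the implied $t$ over a grid of $\lambda$ values and observe the monotone, piecewise-linear relationship:

```python
import numpy as np

n = 100
rng = np.random.default_rng(1)
beta_ols = rng.standard_normal(8)  # stands in for X^T y under an orthogonal design

# t(lambda) is non-increasing and piecewise linear; it reaches 0 once
# n*lam exceeds max_i |beta_ols_i|.
for lam in np.linspace(0.0, np.abs(beta_ols).max() / n, 6):
    t = np.abs(lasso_orthogonal(beta_ols, lam, n)).sum()
    print(f"lambda = {lam:.4f}  ->  t = {t:.4f}")
```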

Mathuss