Questions tagged [ridge-regression]

A regularization method for regression models that shrinks coefficients towards zero.

Ridge regression is a technique that penalizes the size of regression coefficients in order to deal with multicollinear variables or ill-posed statistical problems. It is a special case of Tikhonov regularization, named after the mathematician Andrey Tikhonov.

Given a set of training data $(x_1,y_1),\ldots,(x_n,y_n)$ where $x_i \in \mathbb{R}^{p}$, the estimation problem is:

$$\min_\beta \sum\limits_{i=1}^{n} (y_i - x_i'\beta)^2 + \lambda \sum\limits_{j=1}^p \beta_j^2$$

for which the solution is given by

$$\widehat{\beta}_{ridge} = (X'X + \lambda I)^{-1}X'y$$

which is similar to the OLS estimator but includes the tuning parameter $\lambda$ and the Tikhonov matrix (here the identity matrix $I$, though other choices are possible). Note that, unlike $X'X$ in OLS, the matrix $X'X + \lambda I$ is always invertible for $\lambda > 0$, even when the model has more parameters than observations, so the estimation problem always has a unique solution.
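
For concreteness, here is a minimal NumPy sketch of the closed-form estimator above; the data are simulated and all settings ($n$, $p$, $\lambda$) are arbitrary choices for illustration:

```python
import numpy as np

# Simulated data for illustration only
rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 1.0  # the tuning parameter lambda

# Ridge solution: (X'X + lambda*I)^{-1} X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# OLS solution for comparison (requires X'X to be invertible)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```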

Bayesian derivation

Ridge regression can also be derived as the maximum a posteriori (MAP) estimate in Bayesian linear regression with a Normal prior on $\beta$. Define the likelihood:

$$L(X,Y;\beta,\sigma^2) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i - \beta^Tx_i)^2}{2\sigma^2}}$$

and place a normal prior with mean $0$ and covariance matrix $\alpha I_p$ on $\beta$:

$$\beta \sim N(0,\alpha I_p)$$

Using Bayes' rule, we obtain the posterior distribution:

$$P(\beta | X,Y) \propto L(X,Y;\beta,\sigma^2)\pi(\beta) $$ $$ \propto \big[\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i - \beta^Tx_i)^2}{2\sigma^2}}\big]e^{-\frac12\beta^T(\alpha I_p)^{-1}\beta}$$

Maximizing the posterior is equivalent to minimizing the negative log-posterior. Up to additive constants,

$$\log P(\beta | X,Y) \propto -\frac12\big(\frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \beta^Tx_i)^2 + \frac{1}{\alpha}\beta^T\beta\big)$$

so maximizing the posterior amounts to solving

$$\min_\beta \sum_{i=1}^{n}(y_i - \beta^Tx_i)^2 + \frac{\sigma^2}{\alpha}\sum_{j=1}^{p}\beta_j^2$$

where $\frac{\sigma^2}{\alpha}$ plays the role of the tuning parameter $\lambda$ from above.
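
As a quick numerical sanity check of this equivalence, the sketch below (simulated data; the values of $\sigma^2$ and $\alpha$ are arbitrary) minimizes the negative log-posterior directly and compares the result with the closed-form ridge solution for $\lambda = \sigma^2/\alpha$:

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data; sigma^2 and alpha are arbitrary illustrative values
rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

sigma2, alpha = 1.0, 2.0
lam = sigma2 / alpha  # implied ridge tuning parameter

def neg_log_posterior(beta):
    # negative log-posterior, up to an additive constant
    return ((y - X @ beta) ** 2).sum() / (2 * sigma2) + beta @ beta / (2 * alpha)

beta_map = minimize(neg_log_posterior, np.zeros(p)).x
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(beta_map, beta_ridge, atol=1e-5))  # True, up to optimizer tolerance
```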

The tuning parameter $\lambda$ determines the degree of shrinkage of the regression coefficients. The idea is to accept some bias in exchange for a lower variance (see the bias-variance trade-off). With highly multicollinear variables, trading a small increase in bias for a large reduction in variance can substantially improve the estimator, as the simulation sketch below illustrates.
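
The sketch uses two nearly collinear predictors: across repeated simulated samples, the ridge coefficient estimates vary far less than the OLS estimates (all settings are arbitrary choices for illustration):

```python
import numpy as np

# Repeatedly simulate data with two nearly collinear predictors and compare
# the sampling variance of the OLS and ridge coefficient estimates.
rng = np.random.default_rng(2)
n, reps, lam = 50, 500, 10.0
beta_true = np.array([1.0, 1.0])

ols_estimates, ridge_estimates = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.01, size=n)  # x2 is almost identical to x1
    X = np.column_stack([x1, x2])
    y = X @ beta_true + rng.normal(size=n)
    ols_estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge_estimates.append(np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y))

print(np.var(ols_estimates, axis=0))    # large: OLS coefficients are unstable
print(np.var(ridge_estimates, axis=0))  # far smaller, at the cost of some bias
```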

The bias of the ridge regression estimator is $$\text{Bias}(\widehat{\beta}_{ridge}) = -\lambda (X'X + \lambda I)^{-1} \beta$$ It is always possible to find a $\lambda$ such that the MSE of the ridge regression estimator is smaller than that of the OLS estimator.

Note that as $\lambda \rightarrow 0$, $\widehat{\beta}_{ridge} \rightarrow \widehat{\beta}_{OLS}$, and as $\lambda \rightarrow \infty$, $\widehat{\beta}_{ridge} \rightarrow 0$. The choice of $\lambda$ is therefore important. Common methods for making this choice include information criteria (AIC or BIC) and (generalized) cross-validation.
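
In practice the search over $\lambda$ is usually delegated to a library routine. For example, scikit-learn's RidgeCV (where the tuning parameter is called `alpha`) selects it by efficient leave-one-out cross-validation; a minimal sketch on simulated data:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Simulated data for illustration only
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(size=200)

# Search a log-spaced grid of candidate tuning parameters
model = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X, y)
print(model.alpha_)  # the selected tuning parameter
```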

34 questions
7
votes
1 answer

When do regression models outperform the naive method?

This follows on from this question. Case 1: I have the following task: train on 3 consecutive days to predict each 4th day. Each day's data is one CSV file of dimension 24x25. Every datapoint of each CSV file is…
4
votes
2 answers

Extremely high MSE/MAE for Ridge Regression (sklearn) when the label is directly calculated from the features

Edit: Removing TransformedTargetRegressor and adding more info as requested. Edit2: There were 18K rows where the relation did not hold. I'm sorry :(. After removing those rows and upon @Ben Reiniger's advice, I used LinearRegression and the…
RAbraham
  • 197
  • 8
4
votes
2 answers

What does a negative coefficient of determination mean for evaluating ridge regression?

Judging by the negative result displayed by my ridge.score(), I am guessing that I am doing something wrong. Maybe someone could point me in the right direction? # Create a practice data set for exploring Ridge Regression data_2 =…
Ethan
  • 1,657
  • 9
  • 25
  • 39
3
votes
1 answer

Does ridge regression always reduce coefficients by equal proportions?

Below is an excerpt from the book Introduction to Statistical Learning in R (chapter: Linear Model Selection and Regularization): "In ridge regression, each least squares coefficient estimate is shrunken by the same proportion." On a simple dataset, I…
3
votes
3 answers

Can ridge regression be used for feature selection?

I'm trying to figure out whether Ridge Regression can be used as regularization to produce a sparser hypothesis; however, it seems to me that ridge will never actually bring any coefficients to zero, only really close to it. So can ridge…
3
votes
1 answer

Does it matter whether we put the regularization parameter ($C$) on the error term or the weight term in kernel ridge regression?

Kernel ridge regression associate a regularization parameter $C$ with weight term ($\beta$): $\text{Minimize}: {KRR}=C\frac{1}{2} \left \|\beta\right\|^{2} + \frac{1}{2}\sum_{i=1}^{\mathcal{N}}\left\|e_i \right \|_2^{2} \\ \text{Subject to}:\…
Chandan Gautam
  • 311
  • 3
  • 13
3
votes
1 answer

Regression model outperforms every model

This follows on from this question. Case 1: I have the following task: train on 3 consecutive days to predict each fourth day. Each day's data represents one CSV file, which has dimensions 24x25. The datapoints of each CSV file are pixels. In this case,…
S. M.
  • 125
  • 17
3
votes
2 answers

Constraining linear regressor parameters in scikit-learn?

I'm using sklearn.linear_model.Ridge to use ridge regression to extract the coefficients of a polynomial. However, some of the coefficients have physical constraints that require them to be negative. Is there a way to impose a constraint on those…
2
votes
3 answers

How does Lasso regression shrink coefficients to zero, and why does ridge regression not shrink them to zero?

How does Lasso regression help with feature selection by shrinking coefficients to zero? I have seen the diagram below. Can anyone please explain in simple terms how to relate the diagram to: how Lasso shrinks the…
star
  • 1,521
  • 7
  • 20
  • 31
2
votes
2 answers

How do standardization and normalization impact the coefficients of linear models?

One benefit of creating a linear model is that you can look at the coefficients the model learns and interpret them. For example, you can see which features have the most predictive power and which do not. How, if at all, does feature…
2
votes
0 answers

Reverse engineering what stocks are in a dummy ETF using regression (lasso, ridge, etc) in Python

I'm trying to reverse engineer what stocks are in an ETF using Python. In my code, I create a fake ETF that is equal-weighted across 20 random stocks. I then try to reverse engineer what's in my ETF using price data for a universe of 200+ stocks. No matter…
Mac
  • 29
  • 1
1
vote
1 answer

Why do we take $\alpha\sum B_j^2$ as the penalty in Ridge Regression?

$$RSS_{RIDGE}=\sum_{i=1}^n(\hat{y}_i-y_i)^2+\alpha\sum_{j=1}^pB_j^2$$ Why are we taking $\alpha\sum B_j^2$ as a penalty here? We add this term to reduce the variance of the machine learning model. But how does this term reduce variance? If I add…
1
vote
1 answer

Do the benefits of ridge regression diminish with larger datasets?

I have a question about ridge regression and about its benefits (relative to OLS) when the datasets are big. Do the benefits of ridge regression disappear when the datasets are larger (e.g. 50,000 vs 1000)? When the dataset is large enough, wouldn't…
1
vote
2 answers

What is the meaning of the sparsity parameter

Sparse methods such as LASSO contain a parameter $\lambda$ which is associated with the minimization of the $l_1$ norm. The higher the value of $\lambda$ ($>0$), the more coefficients are shrunk to zero. What is unclear to me is how does…
Sm1
  • 541
  • 5
  • 19
1
vote
0 answers

What other metrics can I use to estimate the quality of a model predicting an income range (an interval estimation task)?

I trained a model that predicts a customer's income given the features: age, declared income, number of outstanding instalments, overdue total amount, active credit limit, total credit limit, total amount. The output is a prediction: lower-upper bound for…