feature importance via random forest and linear regression are different

Question

Applied Lasso to rank the features and got the following results:

rank feature prob.
==================================
1       a     0.1825477951589229
2       b     0.07858498115577893
3       c     0.07041793111843796

Note that the data set has 3 labels. The ranking of the features is the same for each label.

Then applied random forest to the same data set:

rank feature score
===================================
1       b     0.17504808300002753
6       a     0.05132699243632827
8       c     0.041690685195283385

Notice that the rankings are very different from the ones produced by Lasso.

How should I interpret the difference? Does it imply that the underlying model is inherently nonlinear?

Sandeep S. Sandhu · Accepted Answer · 2016-06-23T19:54:58.747

Your question compares the variable importance derived from a linear regression model with the one derived from a random forest.

The lasso fits a linear regression model with an L1 penalty, which shrinks the coefficients and can set some of them exactly to zero. A popular approach to rank a variable's importance in a linear regression model is to decompose $R^2$ into contributions attributed to each variable. But variable importance is not straightforward in linear regression when the predictors are correlated, because the share of $R^2$ credited to one variable depends on which other variables are already in the model. Refer to the document describing the PMD method (Feldman, 2005) in the references below.

Another popular approach is averaging over orderings (LMG, 1980). The LMG works like this:

1. Find the squared semi-partial correlation of each predictor in the model, e.g. for variable $a$ we have $SS_a/SS_{total}$. It tells you how much $R^2$ would increase if variable $a$ were added to the model.
2. Calculate this value for each variable for every order in which the variables can be introduced into the model, e.g. {$a,b,c$} ; {$b,a,c$} ; {$b,c,a$} and so on for all permutations.
3. Average the squared semi-partial correlations of each variable over all of these orders. This is the average over orderings.
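The averaging over orderings described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the helper names `r_squared` and `lmg_importance` and the synthetic data (with $a$ and $b$ deliberately correlated) are made up for the example, and plain least squares is used rather than any particular package's LMG implementation.

```python
import itertools
import numpy as np

def r_squared(X, y, cols):
    """R^2 of an ordinary least-squares fit (with intercept) of y on the given columns."""
    if not cols:
        return 0.0
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - (resid @ resid) / ss_total

def lmg_importance(X, y):
    """Average each variable's R^2 increment over every order of entry."""
    p = X.shape[1]
    shares = np.zeros(p)
    orderings = list(itertools.permutations(range(p)))
    for order in orderings:
        included, r2_prev = [], 0.0
        for j in order:
            included.append(j)
            r2_now = r_squared(X, y, included)
            shares[j] += r2_now - r2_prev  # increment credited to variable j
            r2_prev = r2_now
    return shares / len(orderings)

# Hypothetical data: columns 0, 1, 2 play the roles of a, b, c.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = 0.8 * a + 0.6 * rng.normal(size=500)  # b is correlated with a
c = rng.normal(size=500)
X = np.column_stack([a, b, c])
y = 2.0 * a + 1.0 * b + 0.5 * c + rng.normal(size=500)

shares = lmg_importance(X, y)
print(dict(zip("abc", shares)))
```

The increments telescope within each ordering, so the shares add up to the $R^2$ of the full model. The number of orderings grows as $p!$, so this brute-force version is only practical for a handful of predictors.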

The random forest algorithm fits many trees, and each tree in the forest is built using randomly selected features from the dataset. The nodes of each tree are built by choosing the feature and split point that achieve the maximum variance reduction. When predicting on the test dataset, the outputs of the individual trees are averaged to obtain the final output. To measure importance, the values of each variable are permuted in turn, and the difference in out-of-sample error before and after the permutation is calculated. The variables with the largest increase in error are considered the most important, and the ones with smaller increases are less important.

The way the model is fit to the training data is very different for a linear regression model than for a random forest. But neither model contains any structural relationships between the variables.
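The permutation step can be illustrated independently of the forest itself. The sketch below is model-agnostic and makes several assumptions: `permutation_importance` here is a hand-written helper (not a library call), `toy_model` is a hypothetical stand-in for the `predict` method of a fitted random forest, and the held-out data are synthetic.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in held-out MSE when one column at a time is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((y - predict(X)) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j breaks its link with y but keeps its distribution.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((y - predict(X_perm)) ** 2) - baseline)
        importances[j] = np.mean(increases)
    return importances

def toy_model(X):
    """Hypothetical fitted model: relies heavily on column 1, weakly on column 0, ignores column 2."""
    return 0.5 * X[:, 0] + 3.0 * X[:, 1]

# Hypothetical held-out data.
rng = np.random.default_rng(1)
X_holdout = rng.normal(size=(300, 3))
y_holdout = toy_model(X_holdout) + 0.1 * rng.normal(size=300)

importances = permutation_importance(toy_model, X_holdout, y_holdout)
print(importances)
```

A column the model never uses gets an importance of zero, because shuffling it leaves every prediction unchanged. The scores are on the scale of the error metric, so they are not directly comparable with lasso coefficients.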

Regarding your query about non-linearity of the dependent variable: the lasso is essentially a linear model, so it will not give predictions as good as those of tree-based models when the underlying process is non-linear. You can check this by comparing the models' performance on a set-aside test set. If the random forest performs better, the underlying process may be non-linear. Alternatively, you could add interaction effects and higher-order terms built from a, b and c to the lasso model and check whether it performs better than a lasso using only a linear combination of a, b and c. If it does, the underlying process might be non-linear.
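The second check can be sketched as follows. Everything here is an assumption made for illustration: the small coordinate-descent solver `lasso_cd` stands in for whatever lasso implementation you use, `expand` and `holdout_mse` are hypothetical helpers, the penalty `alpha=0.01` is arbitrary, and the data are generated from a deliberately non-linear process.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=200):
    """Coordinate-descent lasso: minimise (1/2n)||y - Xw||^2 + alpha * ||w||_1.
    Assumes the columns of X and the response y are already centred."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            partial_resid = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_resid / n
            # Soft-thresholding update for coefficient j.
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
    return w

def expand(X):
    """Append squares and pairwise products of the columns."""
    p = X.shape[1]
    extra = [X[:, i] * X[:, j] for i in range(p) for j in range(i, p)]
    return np.column_stack([X] + extra)

def holdout_mse(X_train, y_train, X_test, y_test, alpha=0.01):
    """Fit on standardised training data, report MSE on the held-out set."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    y_mean = y_train.mean()
    w = lasso_cd((X_train - mu) / sd, y_train - y_mean, alpha)
    pred = ((X_test - mu) / sd) @ w + y_mean
    return float(np.mean((y_test - pred) ** 2))

# Hypothetical non-linear process in a, b, c (columns 0, 1, 2).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = X[:, 0] * X[:, 1] + X[:, 2] ** 2 + 0.1 * rng.normal(size=600)
train, test = slice(0, 400), slice(400, 600)

mse_linear = holdout_mse(X[train], y[train], X[test], y[test])
mse_expanded = holdout_mse(expand(X[train]), y[train], expand(X[test]), y[test])
print(mse_linear, mse_expanded)
```

If the expanded model has a clearly lower held-out error, as it should for data generated this way, that points to non-linearity or interactions that the plain linear lasso cannot capture.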

References:
