Searching interactions with RandomForest and/or GBM

Question

I'm trying to model a count variable and a continuous variable > 0 with GLMs in R. To improve the quality of the regression, I want to add interaction terms that are useful for the model. As I'm a newbie in machine learning, I'd like to know whether RF and GBM can help me identify useful interactions. I saw that interact.gbm can assess the relative strength of interaction effects in non-linear models. My question: is it "mathematically" correct to add the variable pairs that show strong interaction effects to the GLM in order to reduce MSE/deviance?

Thank you!

wabbit · Answer 1 · 2016-06-28T14:50:10.687

Adding interactions among variables often reduces the bias of a model. This is especially true when the effect of one independent variable on the target depends on the values of other independent variables. I don't think there's anything mathematically incorrect in doing this.

For example, say you are trying to predict revenue as a function of advertising. Here it's reasonable to assume that the effect on revenue of one extra unit of TV advertising depends on the existing level of advertising on, say, Facebook. The true data-generating function (if you had access to it) might be something like: $$Revenue=Ad_{TV}^{\beta_{1}}Ad_{Print}^{\beta_{2}}...$$ where $Ad_{TV}$ is the number of advertisements shown on TV, and so on. If you give the model the opportunity to handle such interaction terms, you will be closer to modeling the true data-generating function.
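
To make the dependence explicit, here is a short derivation under my own simplifying assumptions: only two channels, with the exponents labeled $\beta_1$ and $\beta_2$. Differentiating the multiplicative form with respect to TV advertising gives

$$\frac{\partial Revenue}{\partial Ad_{TV}}=\beta_{1}Ad_{TV}^{\beta_{1}-1}Ad_{Print}^{\beta_{2}}$$

so the marginal effect of TV advertising scales with the level of the other channel, which is exactly what an interaction means. Taking logs of the same form gives

$$\log Revenue=\beta_{1}\log Ad_{TV}+\beta_{2}\log Ad_{Print}$$

which is additive. Whether you need an explicit interaction term therefore depends on the scale (link function and predictor transformations) on which the GLM is fitted.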

However, adding more features increases the complexity (capacity) of the model, so you might have to use regularization wisely to prevent over-fitting.
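
A minimal sketch of how the whole workflow could look in R, assuming the gbm package is installed. The data are simulated and the names `x1`, `x2`, `x3`, `y` and `pois_dev` are hypothetical, chosen only for illustration. The idea is to use `interact.gbm` (Friedman's H-statistic) to screen candidate pairs, and then check on held-out data whether the candidate interaction actually helps the GLM, rather than trusting the screen alone.

```r
# Sketch only: simulated data and hypothetical variable names.
library(gbm)  # assumed to be installed; provides gbm() and interact.gbm()

set.seed(1)
n <- 2000
d <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
# Simulated counts with an x1:x2 interaction on the log scale; x3 is additive.
d$y <- rpois(n, exp(0.2 + 0.5 * d$x1 + 0.5 * d$x2 +
                    1.5 * d$x1 * d$x2 + 0.3 * d$x3))

train <- d[1:1500, ]
test  <- d[1501:2000, ]

# 1. Screen variable pairs with a boosted model and Friedman's H-statistic.
fit_gbm <- gbm(y ~ x1 + x2 + x3, data = train, distribution = "poisson",
               n.trees = 500, interaction.depth = 3, shrinkage = 0.05,
               verbose = FALSE)
pairs <- combn(c("x1", "x2", "x3"), 2, simplify = FALSE)
h <- sapply(pairs, function(p) {
  interact.gbm(fit_gbm, data = train, i.var = p, n.trees = 500)
})
names(h) <- sapply(pairs, paste, collapse = ":")
print(sort(h, decreasing = TRUE))

# 2. Fit the GLM with and without the candidate interaction.
glm_main <- glm(y ~ x1 + x2 + x3, family = poisson, data = train)
glm_int  <- glm(y ~ x1 + x2 + x3 + x1:x2, family = poisson, data = train)
print(anova(glm_main, glm_int, test = "Chisq"))

# 3. Compare Poisson deviance on held-out data to guard against over-fitting.
pois_dev <- function(y, mu) {
  2 * sum(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
}
dev_main <- pois_dev(test$y, predict(glm_main, newdata = test, type = "response"))
dev_int  <- pois_dev(test$y, predict(glm_int,  newdata = test, type = "response"))
print(c(main_effects = dev_main, with_interaction = dev_int))
```

The training deviance can only go down when a term is added to a nested GLM, so the held-out comparison in step 3 is the more informative check. With many candidate pairs, a penalized fit (for example a lasso-type GLM) would be one way to apply the regularization mentioned above.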
