While I'm not sure what you mean when you say "adjust for confounders", I suppose your question is about model choice (or variable/feature selection).
Here are some thoughts on this problem:
- Clearly define what you want to achieve: If you want good predictions (so you are not aiming at causal inference), choose a suitable metric to measure model fit. For a classification problem, you could look at the confusion matrix or the AUC (but there are many more options); both are computed in the sketch below.
- Choose a baseline model which you believe is a good starting point for your problem. Look at relevant metrics of this model.
- Try to come up with improvements: You could, for instance, work on your features and check whether your "new" model performs better (based on the chosen metrics) than the baseline model. You can also add interaction terms (between variables) by simply multiplying variables. So when you have a model $y=\beta_0+\beta_1 x_1+\beta_2 x_2 + u$, you can also go for $y=\beta_0+\beta_1 x_1+\beta_2 x_2 + \beta_3 x_1 x_2 + u$ (note: the $\beta$ are the regression coefficients, $u$ is the error term). The sketch below compares exactly such a baseline against an interaction model.
Make sure you train your model on one part of your data and test it on another part (train/test split).
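To make the points above concrete, here is a minimal sketch using scikit-learn on simulated data (all names like `x1`, `x2`, `y` are hypothetical placeholders, not from your problem): a train/test split, a baseline logistic regression, and an "improved" model with the interaction term, compared on test-set AUC and the confusion matrix:

```python
# Minimal sketch: baseline vs. interaction model on simulated data.
# All names (x1, x2, y) are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
# Simulate a binary outcome that genuinely depends on the interaction x1*x2
p = 1 / (1 + np.exp(-(0.5 * x1 + 0.5 * x2 + 1.5 * x1 * x2)))
y = rng.binomial(1, p)

X_base = np.column_stack([x1, x2])           # y ~ x1 + x2
X_int = np.column_stack([x1, x2, x1 * x2])   # y ~ x1 + x2 + x1*x2

for name, X in [("baseline", X_base), ("interaction", X_int)]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=42)
    fit = LogisticRegression().fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, fit.predict_proba(X_te)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")
    print(confusion_matrix(y_te, fit.predict(X_te)))
```

On this simulated data the interaction model should reach a visibly higher test AUC, which is exactly the kind of baseline-vs-candidate comparison meant above.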
Some notes on feature selection: Generally it is not a good idea to select features only by looking at significance. Why? Significance tells you whether the estimated confidence interval of a variable's coefficient "crosses zero" (the interval contains both positive and negative values). This says little about the contribution of a variable to overall model performance. Variables can also be "jointly significant", so even individually non-significant features can be important (in interplay with other features); the sketch below illustrates this. Also, a transformation of a feature may make it significant, e.g. adding polynomial terms (squared terms, for instance) or applying a log transformation. So in essence: don't kick out features based on significance alone. By doing so, you have a fair chance of introducing omitted-variable bias.
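To illustrate the "jointly significant" point, here is a small sketch with statsmodels (again simulated data, hypothetical names): two nearly collinear regressors whose individual p-values look harmless, while an F-test on both coefficients together rejects clearly:

```python
# Sketch: individually insignificant but jointly significant regressors.
# Simulated data; x2 is nearly collinear with x1, inflating both SEs.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)        # almost a copy of x1
y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))  # columns: const, x1, x2
res = sm.OLS(y, X).fit()

print(res.pvalues[1:].round(3))       # individual p-values: typically large
print(res.f_test("x1 = 0, x2 = 0"))   # joint test: clearly rejected
```

Selecting on individual p-values alone would discard both regressors here, even though together they clearly drive $y$.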
In case you have a lot of possibly relevant features and you are sick of doing model selection "by hand", you may also look into Lasso (or Ridge) regression. Under these approaches, coefficients are shrunk (automatically) towards zero when the corresponding features are not so useful for prediction; the Lasso can shrink them to exactly zero, which drops the feature. Here is a very good intro to Lasso/Ridge/Elastic Net by Trevor Hastie and Junyang Qian. The code is in R, but the tutorial is very good.
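Since your problem is a classification one, the scikit-learn counterpart would be an L1-penalised ("Lasso-style") logistic regression; a minimal sketch on simulated data (hypothetical features, only the first three of which actually matter):

```python
# Sketch: L1-penalised logistic regression; uninformative coefficients
# are shrunk to exactly zero. Simulated data, hypothetical features.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(2)
n, p = 500, 20
X = rng.normal(size=(n, p))
logits = X[:, 0] - X[:, 1] + 0.5 * X[:, 2]   # only 3 of 20 features matter
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

# Cross-validation picks the penalty strength automatically
model = LogisticRegressionCV(
    penalty="l1", solver="saga", Cs=10, cv=5, max_iter=5000
).fit(X, y)
print("features kept (non-zero coefficients):", np.flatnonzero(model.coef_))
```

Most of the 17 irrelevant coefficients should come out as exactly zero, which is the "automatic" selection referred to above.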
You would surely gain from looking at the book "Introduction to Statistical Learning". Chapter 4 covers logistic regression (Logit) in a very instructive way, and there is Python code for the labs in the book, which could give you a good entry point. Given your question, the book and its lab code would be a natural starting point for you.