
I have a question about MLlib in Spark (with Scala).

I'm trying to understand how LogisticRegressionWithLBFGS and LogisticRegressionWithSGD work. I usually use SAS or R for logistic regression, but I now have to do it in Spark to be able to analyze big data.

How is variable selection done? Does LogisticRegressionWithLBFGS or LogisticRegressionWithSGD try different combinations of variables? Is there something like a significance test of the variables one by one, or a correlation calculation against the variable of interest? Is there any calculation of BIC or AIC to choose the best model?

Because the model only returns weights and an intercept...
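For illustration, here is a minimal sketch with made-up toy data (assuming an existing SparkContext named sc); as far as I can tell, the weights and intercept printed at the end are everything the model gives back:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Toy data, purely for illustration; `sc` is an existing SparkContext.
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(1.2, 0.5)),
      LabeledPoint(1.0, Vectors.dense(3.4, 1.1)),
      LabeledPoint(1.0, Vectors.dense(2.9, 0.8)),
      LabeledPoint(0.0, Vectors.dense(0.7, 0.2))
    ))

    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .setIntercept(true)   // the intercept is off by default in this API
      .run(training)

    // This is everything the fitted model exposes:
    println(model.weights)    // one coefficient per feature
    println(model.intercept)  // no p-values, no AIC/BIC, no selection info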

How can I understand these Spark functions and relate them to what I'm used to in SAS or R?

SparkUser

2 Answers


First, the Spark programming guide for LogisticRegressionWithSGD recommends using L-BFGS instead, so perhaps focus on that one. As for variable selection, the model description on the MLlib page for regression has a nice explanation of how models are constructed and optimized, but it does not address variable selection. This leads me to believe that it considers all variables and simply fits the model that best matches the data over all of them; any screening would have to happen before training, as in the sketch below.
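For example, if you want the correlation-with-the-label screening mentioned in the question, you can compute it yourself with Statistics.corr from mllib.stat before calling the trainer. A minimal sketch; featureLabelCorrelations is a hypothetical helper name, not a Spark API:

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.stat.Statistics
    import org.apache.spark.rdd.RDD

    // Hypothetical helper (not part of MLlib): Pearson correlation of each
    // feature with the label, as a do-it-yourself screening step.
    def featureLabelCorrelations(data: RDD[LabeledPoint]): Seq[Double] = {
      val labels = data.map(_.label)
      val numFeatures = data.first().features.size
      (0 until numFeatures).map { i =>
        Statistics.corr(data.map(_.features(i)), labels, "pearson")
      }
    }

You could then keep only the features whose absolute correlation exceeds some threshold of your choosing and retrain on that subset.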

j.a.gartner

You could always do a Lasso regression by setting the elastic-net mixing parameter to 1:

    val reg = new LogisticRegression().setElasticNetParam(1.0)

The Lasso penalizes the L1 norm of the coefficients, which drives some of them to exactly zero, so it is indirectly doing variable selection.

See the Spark MLlib documentation for details.
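A fuller sketch with the DataFrame-based API (toy data, assuming an existing SparkSession named spark; the regParam value here is an arbitrary choice you would normally tune):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.linalg.Vectors

    // Toy data, purely for illustration; `spark` is an existing SparkSession.
    val training = spark.createDataFrame(Seq(
      (0.0, Vectors.dense(1.2, 0.5, 0.0)),
      (1.0, Vectors.dense(3.4, 1.1, 0.1)),
      (1.0, Vectors.dense(2.9, 0.8, 0.0)),
      (0.0, Vectors.dense(0.7, 0.2, 0.2))
    )).toDF("label", "features")

    val lasso = new LogisticRegression()
      .setElasticNetParam(1.0) // 1.0 = pure L1 (Lasso) penalty
      .setRegParam(0.1)        // penalty strength; 0.1 is arbitrary here

    val model = lasso.fit(training)

    // Coefficients driven to exactly 0.0 are the "deselected" variables:
    println(model.coefficients)
    println(model.intercept)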

Claudio