
I've been analyzing a data set of ~400k records and 9 variables. The dependent variable is binary. I've fitted a logistic regression, a regression tree, a random forest, and a gradient-boosted tree. All of them give virtually identical goodness-of-fit numbers when I validate them on another data set.

Why is this so? I'm guessing it's because my observations-to-variables ratio is so high. If this is correct, at what observations-to-variables ratio will different models start to give different results?

JenSCDC

4 Answers


This result means that whatever method you use, you are able to get reasonably close to the optimal decision rule (aka the Bayes rule). The underlying reasons have been explained in Hastie, Tibshirani and Friedman's "The Elements of Statistical Learning". They demonstrate how the different methods perform by comparing Figs. 2.1, 2.2, 2.3, 5.11 (in my first edition, in the section on multidimensional splines), 12.2 and 12.3 (support vector machines), and probably some others. If you have not read that book, you need to drop everything RIGHT NOW and read it. (I mean, it isn't worth losing your job over, but it is worth missing a homework or two if you are a student.)

I don't think the observations-to-variables ratio is the explanation. In light of the rationale offered above, it is the relatively simple form of the boundary separating your classes in the multidimensional space that all of the methods you tried have been able to identify.
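Here is a minimal simulation sketch of that point, using scikit-learn. The data-generating process is made up purely for illustration (and uses 50k rows rather than your 400k, for speed): when the true boundary is a single hyperplane plus label noise, four very different learners land on nearly the same held-out accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 50_000, 9
X = rng.normal(size=(n, p))
# The true boundary is a single hyperplane plus label noise,
# i.e. a "simple" Bayes rule that every method below can approximate.
logits = X @ rng.normal(size=p)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5),
    "forest": RandomForestClassifier(n_estimators=100, n_jobs=-1),
    "gbm": GradientBoostingClassifier(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name:>8}: held-out accuracy = {model.score(X_te, y_te):.4f}")
```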

StasK

It's also worth looking at the training errors (see the sketch at the end of this answer).

Basically, I disagree with your analysis. If logistic regression and the tree-based methods are all giving the same results, that suggests the 'best model' is a very simple one, one that all of these models can fit equally well (e.g., basically linear).

So then the question becomes: why is the best model a simple one? It might suggest that your variables are not very predictive. It is of course hard to analyse without knowing the data.
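A sketch of the training-error check, assuming scikit-learn. The `make_classification` split is only a stand-in so the snippet runs end-to-end; substitute your own data, and extend the (abbreviated) model list as needed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

# Placeholder data so the sketch runs; replace with your own train/validation split.
X_all, y_all = make_classification(n_samples=20_000, n_features=9, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_all, y_all, test_size=0.25, random_state=0
)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "gbm": GradientBoostingClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_ll = log_loss(y_train, model.predict_proba(X_train)[:, 1])
    valid_ll = log_loss(y_valid, model.predict_proba(X_valid)[:, 1])
    # Similar losses across models with no train/validation gap point to a
    # simple signal; a large gap would instead point to overfitting.
    print(f"{name:>8}: train log-loss {train_ll:.4f} | valid log-loss {valid_ll:.4f}")
```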

seanv507

I'm guessing it's because my observations-to-variables ratio is so high.

I think this explanation makes perfect sense.

If this is correct, at what observations-to-variables ratio will different models start to give different results?

This will probably depend very much on your specific data (for instance, even whether your nine variables are continuous, factors, ordinal, or binary), as well as on any tuning decisions you made while fitting your models.

But you can play around with the observations-to-variables ratio, not by increasing the number of variables but by decreasing the number of observations: randomly draw 100 observations, fit the models, and see whether they yield different results. (I guess they will.) Do this multiple times with different samples drawn from your total number of observations. Then look at subsamples of 1,000 observations... 10,000 observations... and so forth.
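A sketch of that experiment, assuming scikit-learn. The `make_classification` call is only a stand-in for your actual 400k-row data set, and only two of the four models are shown to keep it short:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for your full data set; replace X, y with your own arrays.
X, y = make_classification(n_samples=400_000, n_features=9, random_state=0)

rng = np.random.default_rng(42)
for n_sub in (100, 1_000, 10_000):
    for draw in range(5):  # several independent draws per subsample size
        idx = rng.choice(len(y), size=n_sub, replace=False)
        for name, model in {
            "logistic": LogisticRegression(max_iter=1000),
            "forest": RandomForestClassifier(n_estimators=100, n_jobs=-1),
        }.items():
            auc = cross_val_score(model, X[idx], y[idx],
                                  cv=5, scoring="roc_auc").mean()
            print(f"n={n_sub:>6}  draw={draw}  {name:>8}  AUC={auc:.3f}")
```

If the spread in AUC across models shrinks as the subsample grows, that is direct evidence for the ratio explanation on your particular data.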

Stephan Kolassa

As @seanv507 suggested, the similar performance may simply be due to the data being best separated by a linear model. But in general, the claim that it is because the "observations-to-variables ratio is so high" is incorrect. Even as your ratio of sample size to number of variables goes to infinity, you should not expect different models to perform nearly identically, unless they all provide the same predictive bias. For example, a purely linear model can never match a tree ensemble on a strongly nonlinear boundary, no matter how many observations you have.

bogatron