
As far as I know, the usual way of thinking in the machine learning approach is to split the data into train and test subsets. The first one is for fitting the model (with the support of a validation subset) and the second one is for computing model performance under different metrics (normally by minimizing a loss function). It makes sense to do it that way because we want to be sure there is no over-fitting, so the model should show good performance on the test set, not only on the training set. We could also be interested in dividing the data into k folds, leaving one fold out as a test set each time, doing the same as explained before (fit the model on the train subset and compute performance on the test subset) k times, and finally taking averages.
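
To make concrete what I mean, here is a rough sketch of that workflow (just an illustration using scikit-learn; the model and the loss are arbitrary choices):

```python
# Sketch of the train/test and k-fold workflow described above.
# Assumes scikit-learn; the ridge model and MSE loss are illustrative choices.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)

# 1) Single train/test split: fit on train, score on the held-out test set.
#    (A validation subset for tuning could be split off from the training part.)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = Ridge(alpha=1.0).fit(X_train, y_train)
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))

# 2) k-fold cross-validation: repeat the split k times and average the scores.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mse = []
for train_idx, test_idx in kf.split(X):
    m = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[test_idx], m.predict(X[test_idx])))
print("5-fold mean MSE:", np.mean(fold_mse))
```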

A year ago I decided to get into the Bayesian approach, and everything became messy for me with regard to model validation. As far as I have studied, in the Bayesian approach a test set is never used. Instead, there are information criterion metrics (BIC, AIC, WAIC, etc.) that are used to estimate the "deviance" on the test set, i.e. on future data.

I don't understand why so much effort goes into estimating the deviance on future data instead of leaving an out-of-sample subset and computing the deviance there. On the other hand, I don't feel very comfortable using these information criterion metrics (BIC, AIC, WAIC, etc.) because they are computed on the training set, so I always try to compute the deviance on test data; but I'm not comfortable with that either, as nobody in the community does it (everybody computes information criteria on the training data set).
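
For a simple Gaussian linear model, this is the contrast I mean: AIC is built from the training deviance, whereas I would rather just compute the deviance on held-out data. A rough sketch of my own (with equal-sized train and test sets so the numbers are comparable; everything here is illustrative):

```python
# Sketch: estimating out-of-sample deviance via AIC vs. measuring it on a test set.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=n)

# Equal-sized train and test sets, so the deviances are directly comparable.
X_tr, y_tr = X[:100], y[:100]
X_te, y_te = X[100:], y[100:]

# Maximum-likelihood fit on the training data only.
beta = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]
sigma = (y_tr - X_tr @ beta).std()       # MLE of the noise scale
k = p + 1                                # fitted parameters: beta and sigma

def deviance(Xs, ys):
    """Deviance = -2 * log-likelihood under the fitted model."""
    return -2.0 * norm.logpdf(ys, loc=Xs @ beta, scale=sigma).sum()

print("train deviance:            ", deviance(X_tr, y_tr))
print("AIC  (train deviance + 2k):", deviance(X_tr, y_tr) + 2 * k)
print("test deviance (held out):  ", deviance(X_te, y_te))
```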

I would be grateful if someone could enlighten me and clear up my doubts.

Any source for further reading would be welcome. Also, any other way of extending the question would be welcome, because I have been stuck thinking about this topic for a long time and have not found a way to move forward.

Thank you very much in advance.

PS: When I say "everybody" or "nobody", I mean it in a figurative sense.

  • Can you link to an example where test data is not used for model validation? – littleO Dec 28 '18 at 12:33
  • Take a look at all the solved exercises in this book:

    https://xcelab.net/rm/statistical-rethinking/

    or take a look at this master's thesis:

    https://brage.bibsys.no/xmlui/bitstream/handle/11250/2352708/13619_FULLTEXT.pdf?sequence=1&isAllowed=y

    Anyway, almost every study from the Bayesian approach does not use a test set; they always use information-criterion error estimation.

    – Sergio Marrero Marrero Dec 28 '18 at 12:45

2 Answers

3
  • Why does the Bayesian approach validate models using a different way of thinking (WAIC, etc.) than the ML community?

There are numerous metrics for validating models, depending upon application criteria. For instance, does one want to minimize the expected classification error? Or probability of the "worst" case error? Or include some measure of the complexity of the model (relevant for certain hardware implementations, for instance)?
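
As a rough illustration (the numbers and the choice of criteria here are arbitrary), two of these criteria computed on held-out predictions might look like this; a complexity measure such as the parameter count could additionally be added as a penalty term:

```python
# Illustrative only: two different validation criteria for a classifier,
# computed on held-out predictions.
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 0, 1, 2, 2, 1])

# Expected classification error: mean 0-1 loss over the held-out examples.
expected_err = np.mean(y_true != y_pred)

# A "worst case" flavour: the error rate of the worst-classified class.
worst_class_err = max(np.mean(y_pred[y_true == c] != c) for c in np.unique(y_true))

print("expected error:   ", expected_err)
print("worst-class error:", worst_class_err)
```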

  • Does it make sense, from the Bayesian approach, to compute the deviance on the test set? If so, why does nobody do it?

Some Bayesian researchers do indeed compute the deviance on a test set.

  • Why doesn't the ML community use the Bayesian way of thinking for validating models?

Most ML researchers start with a foundation of Bayesian analysis and only deviate from it when it becomes too difficult to implement on realistic problems.

  • Thanks for your answer, but I do not feel free of doubts after reading it. – Sergio Marrero Marrero Dec 28 '18 at 08:23
  • Nobody can reply to vague responses such as "but I do not feel free of doubts." You can consult my book *Pattern classification* for more details. Good luck in your search. – David G. Stork Dec 28 '18 at 17:05
  • Hi. Regarding your answers: 1) Why does the ML community ignore the information-criteria methods used by Bayesian statistics? Both communities try to do inference, but their ways of testing are very different from each other: in ML you define a loss function and test its performance on the test set, while in the Bayesian approach the training set is used to try to estimate the deviance on test data. 2) Why are information criteria such as BIC, DIC, WAIC used so much instead of leaving out a test set? 3) The question is related to the testing step, not to the foundations.

    Thank you again.

    PS: Your book is awesome.

    – Sergio Marrero Marrero Jan 03 '19 at 13:51
1

(Note: I have posted a very similar answer to a very similar question here)

I am not an expert on this topic, but I often had to deal with model validation, in particular model selection. When working on this, I came up with the following argument for myself.

Imagine you have a model that has no free parameters to tune. This model may be, for instance, a straight line or a curve that is prescribed by domain knowledge. In this particular case, you would need no test set in order to evaluate your model: you would simply fit the model to the available data and measure how well it does (I realise that "well" is rather vague here) using some criterion such as likelihood (likelihood is one of the elements involved in the criteria you mention: AIC, BIC, ...).
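
As a small sketch of what I mean (all numbers here are made up), a model that is fully prescribed in advance can be scored by its likelihood on the available data, with no held-out set involved:

```python
# Sketch of the "no free parameters" case: a fixed model prescribed by domain
# knowledge, scored by its log-likelihood on the available data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)

# The model is fixed in advance: y = 2x + 1 with Gaussian noise of sd 0.3.
# Nothing is fitted, so nothing can over-adapt to this particular dataset.
log_lik = norm.logpdf(y, loc=2.0 * x + 1.0, scale=0.3).sum()
print("log-likelihood of the fixed model:", log_lik)
```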

Bayesians set the goal of integrating out all free parameters as opposed to optimising them. If you are able to integrate out all parameters, which is analytically possible only in certain mathematically convenient cases (e.g. priors conjugate to the likelihood), then you are left with something that looks very similar to the case above: a model with no free parameters to tune. It is my personal understanding that these two situations are essentially the same. Hence, I will risk saying that if you have integrated out all your free parameters, you obtain the marginal likelihood, which is your criterion for how much the observed data support your model, and no other data, i.e. test data, will be necessary.
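
For instance, in a conjugate case such as Bernoulli data with a Beta prior, the free parameter can be integrated out exactly and the marginal likelihood comes out in closed form (a minimal sketch with arbitrary numbers):

```python
# Sketch of a conjugate case where the free parameter can be integrated out
# exactly: Bernoulli data with a Beta(a, b) prior on the success probability.
import numpy as np
from scipy.special import betaln

y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])   # observed coin flips (illustrative)
a, b = 1.0, 1.0                                  # Beta prior hyperparameters
k, n = y.sum(), y.size

# log p(y) = log B(a + k, b + n - k) - log B(a, b); no parameter is left to tune,
# so this single number already scores how much the data support the model.
log_marginal = betaln(a + k, b + n - k) - betaln(a, b)
print("log marginal likelihood:", log_marginal)
```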

On the other hand, if instead of integrating out the free parameters we choose to optimise them, then we risk overly adapting our model to the data (i.e. overfitting). Hence, in order to assess our model, we can't use the training data to which we have already adapted; we need an independent dataset.

Applying the Bayesian method involves integrating out the free parameters. However, as mentioned above, this is rarely possible. Hence, approximations become necessary. Such approximations may involve integrating out only a subset of the parameters and optimising the rest, or perhaps only approximately integrating out the free parameters. Such approximations may adapt to the data. One type of approximation that does adapt to the data is empirical Bayes: typically, empirical Bayes integrates out the parameters and optimises the hyperparameters, thus adapting to the data. In such cases, whenever we adapt to the data, it becomes necessary again to subject the approximations to cross-validation schemes (see for example the work by Aki Vehtari on LOO-CV in the Bayesian setting).
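
As a rough sketch of the cross-validation side (a naive brute-force version of my own; Vehtari and colleagues use Pareto-smoothed importance sampling precisely to avoid refitting n times), leave-one-out CV for a fitted model looks like this:

```python
# Naive leave-one-out CV for a model whose parameters are fitted rather than
# integrated out: refit without each point, score its held-out log predictive density.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 40
x = rng.uniform(0, 1, size=n)
y = 1.5 * x + rng.normal(scale=0.2, size=n)
X = np.column_stack([np.ones(n), x])     # intercept + slope design matrix

loo_lpd = 0.0
for i in range(n):
    keep = np.arange(n) != i
    beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    sigma = (y[keep] - X[keep] @ beta).std()
    # Log predictive density of the single held-out observation.
    loo_lpd += norm.logpdf(y[i], loc=X[i] @ beta, scale=sigma)

print("LOO log predictive density (sum over points):", loo_lpd)
```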

– ngiann