
I am currently training a neural network and I cannot decide which quantity to use for my early stopping criterion: the validation loss or a metric such as accuracy/F1-score/AUC computed on the validation set.

In my research, I came upon articles defending both standpoints. Keras seems to default to the validation loss but I have also come across convincing answers for the opposite approach (e.g. here).

Does anyone have guidance on when it is preferable to use the validation loss and when to use a specific metric?

qmeeus

6 Answers


TL;DR: Monitor the loss rather than the accuracy

I will answer my own question since I think that the answers received missed the point and someone might have the same problem one day.

First, let me quickly clarify that using early stopping is perfectly normal when training neural networks (see the relevant sections in Goodfellow et al.'s Deep Learning book, most DL papers, and the documentation for Keras' EarlyStopping callback).

Now, regarding the quantity to monitor: prefer the loss to the accuracy. Why? The loss quantifies how certain the model is about a prediction (ideally the predicted probability is close to 1 for the correct class and close to 0 for the other classes), whereas the accuracy merely counts the number of correct predictions. The same issue applies to any metric computed from hard predictions rather than probabilities.
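
To make the difference concrete, here is a minimal sketch (scikit-learn, with made-up labels and probabilities) where two models reach the same validation accuracy but very different log loss:

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = np.array([1, 0, 1, 1])  # made-up validation labels

# Model A is confident and correct; model B is barely correct on the same examples.
proba_a = np.array([0.95, 0.05, 0.90, 0.85])  # predicted P(class 1)
proba_b = np.array([0.55, 0.45, 0.51, 0.52])

for name, proba in [("A", proba_a), ("B", proba_b)]:
    hard = (proba >= 0.5).astype(int)  # hard predictions, as used by accuracy
    print(name,
          "accuracy:", accuracy_score(y_true, hard),
          "log loss:", round(log_loss(y_true, proba), 3))
# Both models reach accuracy 1.0, but model B's log loss is much higher:
# the loss still has room to improve while the accuracy has already saturated.
```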

Obviously, whatever metric you end up choosing, it has to be computed on a validation set and not on the training set (otherwise, you are completely missing the point of using EarlyStopping in the first place).
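
For reference, a minimal sketch of that setup in Keras; `model`, `x_train` and `y_train` are hypothetical placeholders for your own compiled model and data:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop when the validation loss has not improved for `patience` epochs and
# restore the weights from the best epoch (assumes `model` is already compiled).
early_stopping = EarlyStopping(
    monitor="val_loss",          # the quantity discussed above
    patience=10,
    restore_best_weights=True,
)

model.fit(
    x_train, y_train,
    validation_split=0.2,        # or validation_data=(x_val, y_val)
    epochs=200,
    callbacks=[early_stopping],
)
```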

qmeeus

In my opinion, this is subjective and problem-specific. You should use whatever is the most important factor in your mind as the driving metric, as this can keep your decisions on how to alter the model better focused.

Most metrics one can compute will be correlated/similar in many ways: e.g. if you use MSE for your loss, then also recording MAPE (mean absolute percentage error) or the simple $L_1$ loss will give you comparable loss curves.
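
As a quick illustration (scikit-learn ≥ 0.24, with made-up regression targets and predictions), these metrics tend to move together:

```python
import numpy as np
from sklearn.metrics import (mean_squared_error,
                             mean_absolute_error,
                             mean_absolute_percentage_error)

y_true = np.array([10.0, 12.0, 15.0, 20.0])       # made-up targets
preds_early = np.array([14.0, 9.0, 19.0, 25.0])   # predictions early in training
preds_late = np.array([10.5, 11.8, 15.6, 19.2])   # predictions later in training

for name, pred in [("early", preds_early), ("late", preds_late)]:
    print(name,
          "MSE:", mean_squared_error(y_true, pred),
          "MAE (L1):", mean_absolute_error(y_true, pred),
          "MAPE:", round(mean_absolute_percentage_error(y_true, pred), 3))
# All three metrics drop together from the "early" to the "late" predictions,
# which is why their curves usually look similar over the course of training.
```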

For example, if you will report an F1-score in your report/to your boss etc. (and assuming that is what they really care about), then using that metric could make the most sense. The F1-score takes precision and recall into account, i.e. it describes the relationship between two more fine-grained metrics.
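
For reference, the F1-score is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$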

Bringing those things together, computing metrics other than the raw loss can be nice for an overview and to see how your final metric is optimised over the course of the training iterations. That relationship could perhaps give you a deeper insight into the problem.

It is usually best to try several options, however, as optimising for the validation loss may allow training to run for longer, which eventually may also produce a superior F1-score. Precision and recall might sway around some local minimum, producing an almost static F1-score, so you would stop training. If you had been optimising for the pure loss, you might have recorded enough fluctuation in the loss to allow you to train for longer.
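
A hedged sketch of "trying several options" in Keras: run the same training twice, stopping on a different monitored quantity each time, and compare whichever final metric you actually report. `build_model`, `x_train`/`y_train` and `x_val`/`y_val` are hypothetical placeholders, and monitoring `val_accuracy` assumes the model was compiled with `metrics=["accuracy"]`:

```python
from tensorflow.keras.callbacks import EarlyStopping

results = {}
for monitored in ("val_loss", "val_accuracy"):
    model = build_model()                      # hypothetical factory for a compiled model
    stopper = EarlyStopping(monitor=monitored, patience=10,
                            restore_best_weights=True)
    model.fit(x_train, y_train,
              validation_data=(x_val, y_val),
              epochs=200, callbacks=[stopper], verbose=0)
    # evaluate the metric you actually care about (e.g. F1) on held-out data here
    results[monitored] = model.evaluate(x_val, y_val, verbose=0)
```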

n1k31t4

Usually the loss function is just a surrogate, because we cannot optimize the metric of interest directly. If the metric is representative of the task (ideally of the business value), then the value of that metric on the validation set will be a better stopping criterion than the loss on that set. For instance, if class imbalance is a serious problem, try the precision-recall (PR) curve.
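
A small sketch of that suggestion with scikit-learn, using made-up imbalanced labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Made-up, heavily imbalanced validation labels and predicted P(positive class).
y_val = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])
val_proba = np.array([0.05, 0.10, 0.20, 0.15, 0.08, 0.30, 0.40, 0.55, 0.60, 0.90])

precision, recall, thresholds = precision_recall_curve(y_val, val_proba)
ap = average_precision_score(y_val, val_proba)  # summarises the PR curve in one number
print("average precision:", round(ap, 3))
# Monitoring `ap` (or the full PR curve) on the validation set each epoch is one way
# to apply this suggestion when the classes are heavily imbalanced.
```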

Lerner Zhang

As n1k31t4 pointed out, this is rather problem-specific, but I would like to suggest a few points to consider:

  1. The loss is designed to help your model converge, whereas a validation metric is usually what best describes the performance of the model.
  2. Validation metrics are more "stable", as they represent "business logic" rather than a technical tool:
  • Losses often change (unless, of course, you are dealing with a very standard task): you may choose different losses and/or combinations of losses, and even the same loss can change slightly in how it is aggregated or when some loss-related hyper-parameter changes.
  • In some uncommon cases, the loss may even change dynamically during training.
  • Each time the loss changes, you will need to recalibrate your stopping criterion.
Mark Loyman

I am currently training a neural network and I cannot decide which quantity to use for my early stopping criterion: the validation loss or a metric such as accuracy/F1-score/AUC computed on the validation set.

If you are training a deep network, I highly recommend that you not use early stopping. In deep learning it is not very customary; instead, you can employ other techniques such as dropout to generalize well. If you insist on it, the choice of criterion depends on your task. If you have imbalanced data, you should use the F1 score and evaluate it on your cross-validation data. If you have balanced data, try to use accuracy on your cross-validation data. Other techniques depend highly on your task.

I highly encourage you to find a model which fits your data very well and then employ dropout. This is the most customary approach for deep models.
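
A minimal sketch of "employing dropout" in Keras; the layer sizes and the 0.5 rate are arbitrary placeholders:

```python
from tensorflow.keras import layers, models

# A small fully-connected binary classifier with dropout between the dense layers.
model = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(20,)),
    layers.Dropout(0.5),     # randomly zeroes 50% of activations during training
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```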

Green Falcon

I'm no expert, but I believe there is no objective answer, as it is context dependent. E.g. I am trying to work out the best metric to apply to a pneumonia classifier, and I know false negatives (missing pneumonia) are WAY worse than accidentally diagnosing a healthy person, who at worst receives useless treatment. False negatives can kill in this example. I opted for F1 in the end, as I figured the harmonic mean between recall and precision is useful for improving the model, since it penalises both false positives and false negatives. But if I were a REAL hospital/doctor worried about being sued, I might want to reduce the false negative rate as much as possible, because one FN can lead to serious harm to life and business, whereas a FP is just wasted time and money.
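
If false negatives matter more than false positives, a related option is the F-beta score with beta > 1, which weights recall more heavily than precision; a small scikit-learn sketch with made-up labels:

```python
from sklearn.metrics import f1_score, fbeta_score

# Made-up labels/predictions: one pneumonia case (1) is missed, no false alarms.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]

print("F1:", round(f1_score(y_true, y_pred), 3))             # weights precision and recall equally
print("F2:", round(fbeta_score(y_true, y_pred, beta=2), 3))  # weights recall (missed cases) more
# F2 comes out lower than F1 here because recall is the weak point,
# so a missed pneumonia case hurts the F2 score more than a false alarm would.
```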