
I need to calculate class-weights to train my deep learning model.

To simulate a real-world production scenario as closely as I can, I have excluded the test/inference dataset from the class-weight calculation, because once my model is in production I can NOT recalculate the class weights.

So I think that the test/inference dataset should NOT be taken into account when calculating class weights.

My question is: should the validation set be excluded as well?

I split the fit/validation/test sets by date. For example, data from 2021 to 2023 serves as the fitting set and 2024's data as the validation set. Although the class distributions are approximately equal, I'm wondering whether or not this makes sense. Thanks!

EvilRoach

1 Answer


So I think that the test/inference dataset should NOT be taken into account when calculating class weights.

Correct - the test set is only used for the final evaluation of the model. It cannot be part of the modelling process in any way, otherwise it loses its value as an independent and unseen sample of data.

My question is: should the validation set be excluded as well?

Yes - the class weights should only be computed using the training set. The validation set will be used to assess how well the resulting model performs on unseen data. If it performs badly on the validation set, you can use that finding to consider a different strategy.
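As a minimal sketch of that idea, the snippet below derives weights from the training labels alone. The helper `balanced_class_weights`, the label lists, and the choice of the common "balanced" heuristic (n_samples / (n_classes * class_count)) are all assumptions for illustration; your framework may expect a different weighting scheme or format.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Hypothetical helper: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up labels, purely for illustration.
y_train = [0] * 6 + [1] * 3 + [2]
y_valid = [0, 1, 1, 2, 2]

# Only the training labels are passed in; y_valid (and any test labels)
# never influence the weights.
class_weights = balanced_class_weights(y_train)
print(class_weights)
```

The same weights are then reused unchanged when evaluating on the validation and test sets, which mirrors production, where the weights cannot be recomputed.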

I split the fit/validation/test sets by date. For example, data from 2021 to 2023 serves as the fitting set and 2024's data as the validation set.

That looks fine to me. Since timepoints close together could be highly correlated, you could consider keeping a temporal gap between the training set and validation set in order to make the validation samples more like truly unseen data. For example:

  • Training set: 2021 - 2023

  • Gap (not used): e.g. 1 week, or something based on your data's characteristics

  • Validation set: starts after the gap (i.e. one week after the end of 2023 in this example)

  • A similar gap could be left before the test set
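The split described in the list above could be sketched roughly as follows. The function `split_with_gap`, the `(date, payload)` row layout, and the one-week gap are assumptions made for illustration only; the right gap length depends on how strongly nearby timepoints in your data are correlated.

```python
from datetime import date, timedelta


def split_with_gap(rows, cutoff, gap):
    """Hypothetical helper for a date-based split with a buffer.

    rows   : list of (date, payload) tuples
    cutoff : first date that no longer belongs to the training set
    gap    : timedelta of data discarded right after the cutoff
    """
    gap_end = cutoff + gap
    train = [r for r in rows if r[0] < cutoff]
    valid = [r for r in rows if r[0] >= gap_end]
    return train, valid


# Made-up daily records from 2023-12-25 to 2024-01-14.
rows = [(date(2023, 12, 25) + timedelta(days=i), i) for i in range(21)]

# Training ends with 2023; the first week of 2024 is dropped as the gap.
train, valid = split_with_gap(rows, date(2024, 1, 1), timedelta(weeks=1))
print(train[-1][0], valid[0][0])
```

Rows that fall inside the gap are simply discarded, so the trade-off is a slightly smaller dataset in exchange for validation samples that are less correlated with the training data.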