
I'm working on a dataset that doesn't come with a predefined train/test split, and I'm a bit concerned that a mismatch between the label distributions of the two sets might hurt the trained model's performance. Note that I'm using deep neural networks and the prediction task is regression.

Splitting the samples sequentially into test/train (20/80), I get the following label distributions, respectively.

[Histograms: test label distribution vs. train label distribution]
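For concreteness, here is a minimal sketch of the sequential split, with a placeholder label array standing in for my data; the ks_2samp check is just one way to quantify the mismatch between the two distributions:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
y = rng.normal(size=1_000)  # placeholder: substitute the real label array

# Sequential 80/20 split: first 80% of rows -> train, remaining 20% -> test
split = int(0.8 * len(y))
y_train, y_test = y[:split], y[split:]

# Two-sample KS test quantifies how far apart the two label distributions are
stat, pvalue = ks_2samp(y_train, y_test)
print(f"KS statistic = {stat:.3f}, p-value = {pvalue:.3f}")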

Since model performance is not improving as I tune hyperparameters, I'm wondering whether it's generally acceptable to try different random seeds for shuffling the dataset until the test/train label distributions look alike. Are there any references/best practices for this? I'm not worried about compromising generalization or overfitting, since I don't base the split on the form of the input data, only on the predicted outcome.

civy

1 Answer


Why don't you simply stratify your train/test split? Scikit-learn's train_test_split provides a built-in option (stratify=) to do this:

from sklearn.model_selection import train_test_split

# Stratified 80/20 split: both sets keep the same label proportions (y must be discrete)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, stratify=y, test_size=0.2, random_state=12)
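Note that stratify= expects discrete class labels, so passing a continuous regression target directly raises an error. A minimal sketch of a common workaround, assuming y is a 1-D numeric array (the toy X and y below are only illustrative): discretize the target into quantile bins and stratify on the bins, while still training on the continuous values:

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)
X = rng.normal(size=(1_000, 10))            # toy feature matrix
y = rng.exponential(scale=2.0, size=1_000)  # toy skewed continuous target

# Quintile bins of y, used only to balance the split
y_bins = np.digitize(y, np.quantile(y, [0.2, 0.4, 0.6, 0.8]))

# The split is stratified on the bins; the model still trains on the continuous y
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, stratify=y_bins, test_size=0.2, random_state=12
)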
Peter