
I'm working on a dataset that doesn't come with a predefined train/test split, and I'm a bit concerned that a mismatch between the label distributions of the two sets might hurt the trained model's performance. Note that I'm using deep neural networks and the prediction task is regression.

Splitting the samples sequentially into test/train (20/80), I get the following label distributions, respectively.

[Histograms: test label distribution vs. train label distribution]
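For concreteness, here is a minimal sketch of the sequential split, with a placeholder label array standing in for my data; the ks_2samp check is just one way to quantify the mismatch between the two distributions:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
y = rng.normal(size=1_000)  # placeholder: substitute the real label array

# Sequential 80/20 split: first 80% of rows -> train, remaining 20% -> test
split = int(0.8 * len(y))
y_train, y_test = y[:split], y[split:]

# Two-sample KS test quantifies how far apart the two label distributions are
stat, pvalue = ks_2samp(y_train, y_test)
print(f"KS statistic = {stat:.3f}, p-value = {pvalue:.3f}")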

Since model performance is not improving as I tune hyperparameters, I'm wondering whether it's generally acceptable to try different random seeds for shuffling the dataset until the test/train label distributions look alike. Are there any references/best practices for this? I'm not worried about compromising generalization or overfitting, since I don't base the split on the form of the input data, only on the predicted outcome.

civy

1 Answer


Why don't you simply stratify your train/test split? Scikit-learn's train_test_split provides a built-in option (stratify=) to do this:

from sklearn.model_selection import train_test_split

# Stratified 80/20 split: both sets keep the same label proportions (y must be discrete)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, stratify=y, test_size=0.2, random_state=12)
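Note that stratify= expects discrete class labels, so passing a continuous regression target directly raises an error. A minimal sketch of a common workaround, assuming y is a 1-D numeric array (the toy X and y below are only illustrative): discretize the target into quantile bins and stratify on the bins, while still training on the continuous values:

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)
X = rng.normal(size=(1_000, 10))            # toy feature matrix
y = rng.exponential(scale=2.0, size=1_000)  # toy skewed continuous target

# Quintile bins of y, used only to balance the split
y_bins = np.digitize(y, np.quantile(y, [0.2, 0.4, 0.6, 0.8]))

# The split is stratified on the bins; the model still trains on the continuous y
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, stratify=y_bins, test_size=0.2, random_state=12
)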
Peter