
I have a dataset of attributes describing machine breakdowns. The target variable is the machine status, coded as ones and zeros. The distribution of the two classes is given below:

 0 - 19628
 1 - 225

0 signifies that the machine is running normally, and 1 signifies that there was a breakdown.

Should I split the dataset directly using scikit-learn's train_test_split method, or should I first introduce artificial rows to reduce the imbalance between ones and zeros and then split the dataset?

What do I mean by artificial rows? Populating some random data with the target variable set to 1. But that would ultimately mislead the model, and I don't see any other options or alternatives.

Is there any way to make the classes balanced?

Stephen Rauch
James K J

1 Answer


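SMOTE (Synthetic Minority Over-sampling Technique) is a resampling method popularly used with imbalanced datasets like yours; in Python it is implemented in the imbalanced-learn library. It creates synthetic minority-class samples to produce a new, balanced dataset. You can find some example implementation here.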
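To make the idea concrete, here is a minimal sketch of the interpolation step behind SMOTE, written with the standard library only. The function name `smote_sketch`, its parameters, and the toy data are hypothetical and exist only for illustration; for real work, the `SMOTE` class in `imblearn.over_sampling` is the maintained implementation. Unlike purely random rows, each synthetic row lies on the line segment between an existing breakdown sample and one of its nearest breakdown neighbours.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic rows interpolated between minority-class neighbours.

    Hypothetical helper for illustration; not the imbalanced-learn API.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row (excluding the row itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random point on the segment between base and neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority-class rows (assumed values, two numeric features each)
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_rows = smote_sketch(minority, n_synthetic=6, k=2)
```

Because every new row is a convex combination of two real breakdown rows, it stays inside the region the minority class already occupies, which is why this tends to mislead a model less than arbitrary random data would.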

You may also find this question and its answers relevant, as they contain advice on limiting bias during the modeling and validation of a classifier for unbalanced data.
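One common way to limit such bias, offered here as a sketch rather than as what the linked answers say, is to split first and resample only the training portion, so that no synthetic rows leak into the evaluation set. The split itself can be stratified so that both parts keep roughly the original class ratio. The helper `stratified_split` below is hypothetical and uses only the standard library; scikit-learn's `train_test_split` offers the same behaviour through its `stratify` argument. The class counts are the ones from the question.

```python
import random
from collections import Counter


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling each class separately.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


# Class counts taken from the question: 19628 running, 225 breakdowns
labels = [0] * 19628 + [1] * 225
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
# Any oversampling (e.g. SMOTE) would then be applied to the training rows only.
```

The test rows stay untouched, so metrics computed on them reflect the real breakdown rate rather than the artificially balanced one.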

missrg