Suppose I have 100 positive samples. How many negative samples do I need for the classifier to work best? In many papers, I have noticed that the authors use 4 or 5 times as many negative samples as positive samples. Would such a data set be useful?
2 Answers
1
It is recommended to have a roughly equal number of instances in each class. If that is not possible, you should normalize the data to compensate for the imbalance. One class being 4 or 5 times larger than the other could produce a highly biased classifier. However, if your negative class is not well defined, there are ways to train a classifier with only one class, as described in the paper "Learning from positive examples when the negative class is undetermined - microRNA gene identification".
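One common way to compensate for unequal class sizes without discarding data is to weight each class inversely to its frequency. The snippet below is only a minimal sketch of that idea, not something taken from the paper above; the function name `balanced_class_weights` and the 100/500 label split are assumptions made for illustration. The formula `n_samples / (n_classes * count)` is the same heuristic scikit-learn applies for `class_weight="balanced"`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class, inversely proportional to its frequency.

    weight(c) = n_samples / (n_classes * count(c))
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical data set: 100 positives (label 1) and 500 negatives (label 0).
labels = [1] * 100 + [0] * 500
weights = balanced_class_weights(labels)
print(weights)  # {1: 3.0, 0: 0.6}
```

With these weights, both classes contribute the same total weight (100 × 3.0 = 500 × 0.6 = 300), so a learner that accepts per-class or per-sample weights is not pulled toward the larger class.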
Shayan Shafiq
Mangat Rai Modi
1
I guess you are not limited to these 100 samples. Generate more, and let every 5th sample be negative. Then reduce the number of positives by randomly removing 4/5 of them.
Also check out: Training imbalanced data set.
This is a small quantity of data, so you'd better have a 50:50 ratio of negatives to positives.
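To get to a 50:50 split by random removal, one option is to undersample the larger class down to the size of the smaller one. This is a rough sketch under assumptions: the helper name `undersample_to_balance`, the fixed seed, and the toy data (integers standing in for feature vectors, 100 positives and 500 negatives) are all made up for illustration.

```python
import random


def undersample_to_balance(samples, labels, seed=0):
    """Randomly drop items from larger classes until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible

    # Group the samples by their class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    # Every class is cut down to the size of the smallest class.
    n_min = min(len(xs) for xs in by_class.values())

    balanced = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, n_min):  # sampling without replacement
            balanced.append((x, y))

    rng.shuffle(balanced)  # avoid having all of one class first
    return balanced


# Hypothetical data: integers stand in for feature vectors.
samples = list(range(600))
labels = [1] * 100 + [0] * 500
balanced = undersample_to_balance(samples, labels)
n_pos = sum(1 for _, y in balanced if y == 1)
n_neg = sum(1 for _, y in balanced if y == 0)
print(n_pos, n_neg)  # 100 100
```

Undersampling throws data away, so with a set this small it is worth comparing against keeping all samples and weighting the classes instead.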