Questions tagged [class-imbalance]
Questions referring to classifiers or classification problems where some of the classes in the data are under-represented.
559 questions
59 votes · 6 answers
Should I go for a 'balanced' dataset or a 'representative' dataset?
My machine learning task is to separate benign Internet traffic from malicious traffic. In the real-world scenario, most (say 90% or more) Internet traffic is benign. Thus I felt that I should choose a similar data setup for training my…
pnp · 693 · 1 · 6 · 10
42 votes · 6 answers
Unbalanced multiclass data with XGBoost
I have 3 classes with this distribution:
Class 0: 0.1169
Class 1: 0.7668
Class 2: 0.1163
And I am using xgboost for classification. I know that there is a parameter called scale_pos_weight.
But how is it handled in the multiclass case, and how can…
shda · 585 · 1 · 5 · 10
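For context, scale_pos_weight only applies to the binary case; a common multiclass workaround is to pass one weight per training row instead. A dependency-free sketch of the usual "balanced" heuristic (the label distribution below is illustrative, roughly matching the question's 12%/77%/12% split — it is not taken from any answer):

```python
from collections import Counter

def balanced_sample_weights(y):
    """One weight per sample, inversely proportional to its class
    frequency: n_samples / (n_classes * count[class]).  Mirrors
    sklearn's class_weight='balanced' heuristic."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return [n / (k * counts[c]) for c in y]

# Illustrative labels, roughly matching the question's class split:
y = [0] * 12 + [1] * 77 + [2] * 11
weights = balanced_sample_weights(y)
```

Such per-row weights can then typically be handed to the booster through the `sample_weight` argument of `fit` in xgboost's sklearn wrapper, so each minority-class row counts more in the loss.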
35 votes · 4 answers
Quick guide into training highly imbalanced data sets
I have a classification problem with approximately 1000 positive and 10000 negative samples in the training set, so this data set is quite unbalanced. A plain random forest just marks all test samples as the majority class.
Some good answers…
IgorS · 5,474 · 11 · 34 · 43
31 votes · 4 answers
macro average and weighted average meaning in classification_report
I use classification_report (from sklearn.metrics import classification_report) in order to evaluate an imbalanced binary classification.
Classification Report:
              precision    recall  f1-score   support
0 1.00…
user10296606 · 1,906 · 6 · 18 · 33
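The two summary rows of that report can be reproduced by hand, which makes their meaning concrete. A small sketch with made-up per-class F1 scores and supports (not taken from the question's report):

```python
def macro_and_weighted_avg(per_class_f1, support):
    """Reproduce the two summary rows of classification_report:
    macro avg    = unweighted mean over the classes;
    weighted avg = mean weighted by each class's support."""
    macro = sum(per_class_f1) / len(per_class_f1)
    total = sum(support)
    weighted = sum(f * s for f, s in zip(per_class_f1, support)) / total
    return macro, weighted

# Hypothetical imbalanced binary case: the majority class scores well,
# the minority class poorly.
macro, weighted = macro_and_weighted_avg([0.95, 0.40], [900, 100])
# macro -> 0.675, weighted -> 0.895
```

The gap between the two (0.675 vs 0.895) is exactly why the choice matters on imbalanced data: the weighted average lets the majority class dominate, while the macro average treats every class equally.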
25 votes · 3 answers
How do you apply SMOTE on text classification?
Synthetic Minority Oversampling Technique (SMOTE) is an oversampling technique used for imbalanced dataset problems. So far I have an idea of how to apply it on generic, structured data. But is it possible to apply it on a text classification problem?…
catris25 · 369 · 1 · 3 · 5
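For background: SMOTE interpolates between numeric feature vectors, so it has no direct analogue on raw text — a common route (an assumption here, not something the question states) is to vectorize the text first (TF-IDF, embeddings) and interpolate in that space. The core interpolation step, sketched in plain Python:

```python
import random

def smote_synthetic(x, neighbor, rng):
    """Core SMOTE step: a synthetic point drawn uniformly on the line
    segment between a minority sample and one of its minority-class
    nearest neighbors."""
    u = rng.random()  # uniform in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]

# Two made-up minority-class feature vectors (e.g. TF-IDF rows):
a, b = [0.0, 1.0, 0.5], [1.0, 0.0, 0.5]
s = smote_synthetic(a, b, random.Random(42))
```

The synthetic point lies component-wise between the two inputs; a real SMOTE implementation additionally picks the neighbor from the k nearest minority samples.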
22 votes · 4 answers
Train/Test Split after performing SMOTE
I am dealing with a highly unbalanced dataset so I used SMOTE to resample it.
After SMOTE resampling, I split the resampled dataset into training/test sets using the training set to build a model and the test set to evaluate it.
However, I am…
Edamame · 2,785 · 5 · 25 · 34
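The pitfall this question circles is leakage: oversampling before the split puts synthetic near-copies of training points into the test set, inflating test scores. A stdlib-only sketch of the safe order — split first, then oversample only the training fold (plain duplication stands in for SMOTE here):

```python
import random

def split_then_oversample(X, y, test_frac=0.2, seed=0):
    """Split FIRST, then oversample only the training fold, so the test
    set keeps the real class distribution and shares no (near-)copies
    with the training data.  Plain duplication stands in for SMOTE."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    n_test = int(len(y) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    X_tr = [X[i] for i in train_idx]
    y_tr = [y[i] for i in train_idx]
    X_te = [X[i] for i in test_idx]
    y_te = [y[i] for i in test_idx]
    # Duplicate minority samples in the training fold until balanced.
    counts = {c: y_tr.count(c) for c in set(y_tr)}
    majority = max(counts.values())
    for c, n in counts.items():
        pool = [i for i, label in enumerate(y_tr) if label == c]
        for _ in range(majority - n):
            j = rng.choice(pool)
            X_tr.append(X_tr[j])
            y_tr.append(y_tr[j])
    return X_tr, y_tr, X_te, y_te
```

With imbalanced-learn, the same order applies: call train_test_split first, then fit SMOTE on the training arrays only.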
20 votes · 4 answers
Macro- or micro-average for imbalanced class problems
The question of whether to use macro- or micro-averages when the data is imbalanced comes up all the time.
Some googling shows that many bloggers tend to say that micro-average is the preferred way to go, e.g.:
Micro-average is preferable if there…
Krrr · 303 · 1 · 2 · 6
19 votes · 2 answers
Why does data science see class imbalance as a problem for supervised learning when statistics does not?
Why does data science treat class imbalance as a problem in supervised learning when statistics says it is not?
Data science seems to see class imbalance as problematic and in need of special techniques to remedy it.
For instance, this DS.SE…
Dave · 4,542 · 1 · 10 · 35
18 votes · 3 answers
When should we consider a dataset as imbalanced?
I'm facing a situation where the numbers of positive and negative examples in a dataset are imbalanced.
My question is, are there any rules of thumb that tell us when we should subsample the large category in order to force some kind of balancing in…
Rami · 604 · 2 · 6 · 16
16 votes · 4 answers
What are the implications for training a Tree Ensemble with highly biased datasets?
I have a highly biased binary dataset - I have 1000x more examples of the negative class than the positive class. I would like to train a Tree Ensemble (like Extra Random Trees or a Random Forest) on this data but it's difficult to create training…
gallamine · 428 · 3 · 8
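One standard remedy for tree ensembles on heavily skewed data is balanced bagging: give every tree an equal-sized bootstrap sample from each class, so no single tree is swamped by the 1000:1 majority. A hypothetical index-sampling sketch (the 20:1 toy labels below are illustrative, not the questioner's data):

```python
import random

def balanced_bootstrap(y, n_estimators=10, seed=0):
    """Index bags for 'balanced bagging': each tree in the ensemble
    trains on an equal-sized, with-replacement sample from every
    class, sized to the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for i, c in enumerate(y):
        by_class.setdefault(c, []).append(i)
    n_per_class = min(len(idx) for idx in by_class.values())
    bags = []
    for _ in range(n_estimators):
        bag = []
        for idx in by_class.values():
            bag += [rng.choice(idx) for _ in range(n_per_class)]
        bags.append(bag)
    return bags

# Toy 20:1 imbalance; the question's setting would be 1000:1.
labels = [0] * 100 + [1] * 5
bags = balanced_bootstrap(labels, n_estimators=3, seed=7)
```

The imbalanced-learn package ships a ready-made version of this idea (BalancedRandomForestClassifier), if a dependency is acceptable.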
16 votes · 4 answers
Why SMOTE is not used in prize-winning Kaggle solutions?
Synthetic Minority Over-sampling Technique (SMOTE) is a well-known method for tackling imbalanced datasets. There are many highly cited papers out there claiming that it is used to boost accuracy in unbalanced data scenarios.
But then, when I…
Carlos Mougan · 6,430 · 2 · 20 · 51
15 votes · 3 answers
How can I perform stratified sampling for multi-label multi-class classification?
I am asking this question for a few reasons:
The dataset in hand is imbalanced
I used the code below:
x = dataset[['Message']]
y = dataset[['Label1', 'Label2']]
train_data, test_data = train_test_split(x, test_size = 0.1, stratify=y, random_state =…
Divyanshu Shekhar · 587 · 1 · 5 · 15
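One simple approximation (a sketch, not a complete answer) is to stratify on the label combination: treat each distinct (Label1, Label2) pair as its own stratum and split every stratum proportionally. The label pairs below are made up for illustration:

```python
import random

def stratified_split_on_combos(labels, test_frac=0.1, seed=0):
    """Approximate multi-label stratification by treating each label
    *combination* -- e.g. the (Label1, Label2) pair -- as one stratum
    and splitting every stratum proportionally."""
    rng = random.Random(seed)
    strata = {}
    for i, lab in enumerate(labels):
        strata.setdefault(tuple(lab), []).append(i)
    train_idx, test_idx = [], []
    for idx in strata.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

# Made-up label pairs with an under-represented combination:
pairs = [(0, 1)] * 10 + [(1, 0)] * 40 + [(1, 1)] * 50
train_idx, test_idx = stratified_split_on_combos(pairs, test_frac=0.1, seed=3)
```

Note that combinations occurring only once land entirely in training; for large label spaces, true iterative stratification (e.g. scikit-multilearn's iterative_train_test_split) scales better than exact combination matching.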
15 votes · 2 answers
Why do we need to handle data imbalance?
I would like to know why we need to deal with data imbalance. I know how to deal with it and the different methods to solve the issue - by upsampling or downsampling or by using SMOTE.
For example, if I have a rare disease 1 percent out of 100, and…
sara · 481 · 7 · 15
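The core of the issue fits in a few lines: with the question's 1% prevalence, a model that never predicts the disease still scores 99% accuracy while being useless for the class that matters.

```python
# The question's setting: a disease present in 1% of patients.
y_true = [1] * 1 + [0] * 99     # one sick patient, 99 healthy
y_pred = [0] * 100              # a "model" that always predicts healthy

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = found / sum(y_true)    # fraction of sick patients detected

# accuracy is 0.99 while recall is 0.0: the headline metric looks
# excellent although no sick patient is ever identified.
```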
14 votes · 1 answer
Why doesn't class weight resolve the imbalanced classification problem?
I know that in imbalanced classification the classifier tends to predict all the test labels as the larger class's label, but if we use class weights in the loss function, it would be reasonable to expect the problem to be solved. So why do we need some…
user137927 · 389 · 1 · 3 · 11
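What class weights actually change is the loss, not the data. A plain-Python sketch of a weighted binary cross-entropy, with illustrative probabilities (made up for this example) showing how the weight reshapes what the classifier is penalized for:

```python
import math

def weighted_log_loss(y_true, p_pred, w_pos=1.0):
    """Binary cross-entropy where an error on the positive (minority)
    class costs w_pos times as much as one on the negative class --
    the usual 'class weight in the loss' fix."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += -w_pos * math.log(p) if y == 1 else -math.log(1 - p)
    return total / len(y_true)

# Illustrative probabilities: the model nearly ignores the one positive.
y = [1, 0, 0, 0]
p = [0.1, 0.1, 0.1, 0.1]
plain = weighted_log_loss(y, p, w_pos=1)        # the miss barely registers
upweighted = weighted_log_loss(y, p, w_pos=99)  # the same miss dominates
```

Reweighting shifts the decision boundary but cannot add information about the minority class, which is one reason it does not fully "solve" imbalance on its own.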
13 votes · 3 answers
Unbalanced classes -- How to minimize false negatives?
I have a dataset that has a binary class attribute. There are 623 instances with class +1 (cancer positive) and 101,671 instances with class -1 (cancer negative).
I've tried various algorithms (Naive Bayes, Random Forest, AODE, C4.5) and all of them…
user798275 · 293 · 2 · 3 · 5
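One common lever for cutting false negatives, independent of the learning algorithm, is moving the decision threshold on the classifier's scores. A sketch (with made-up scores, not the questioner's data) that picks the threshold needed to reach a target recall:

```python
import math

def threshold_for_recall(scores, labels, target_recall):
    """Smallest decision threshold t such that predicting 'positive'
    for score >= t recovers at least target_recall of the true
    positives.  Lowering t trades extra false positives for fewer
    false negatives -- the right trade when a miss (an undetected
    cancer case) is the expensive error."""
    pos_scores = sorted((s for s, l in zip(scores, labels) if l == 1),
                        reverse=True)
    k = max(1, math.ceil(target_recall * len(pos_scores)))
    return pos_scores[k - 1]

# Made-up classifier scores; label 1 = cancer positive.
scores = [0.9, 0.8, 0.4, 0.95, 0.2, 0.3]
labels = [1,   1,   1,   0,    0,   0]
t = threshold_for_recall(scores, labels, target_recall=0.9)
```

At the returned threshold all three positives are caught, at the cost of one extra false positive (the 0.95-scoring negative) — usually an acceptable trade in a cancer-screening setting.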