Questions tagged [class-imbalance]

Questions about classifiers or classification problems in which some classes in the data are under-represented.

559 questions
59
votes
6 answers

Should I go for a 'balanced' dataset or a 'representative' dataset?

My machine learning task is to separate benign Internet traffic from malicious traffic. In a real-world scenario, most (say 90% or more) of Internet traffic is benign. Thus I felt that I should choose a similar data setup for training my…
pnp
42
votes
6 answers

Unbalanced multiclass data with XGBoost

I have 3 classes with this distribution: Class 0: 0.1169, Class 1: 0.7668, Class 2: 0.1163. I am using xgboost for classification. I know that there is a parameter called scale_pos_weight, but how is it handled in the multiclass case, and how can…
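
One workaround often suggested for the multiclass case is to give each row a weight inversely proportional to its class frequency and pass those weights to the library's per-sample weight argument. The sketch below only computes such weights in plain Python. The label list `y` is hypothetical and merely approximates the split quoted in the question, and the "balanced" formula (n_samples / (n_classes * class_count)) is one common convention, not something the question prescribes.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels, roughly 12% / 76% / 12% like the question's distribution.
y = [0] * 12 + [1] * 76 + [2] * 12

class_w = balanced_class_weights(y)
# One weight per row; rare classes get larger weights.
sample_w = [class_w[label] for label in y]
```

With this convention every class contributes the same total weight, so the weights sum to the number of rows.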
35
votes
4 answers

Quick guide into training highly imbalanced data sets

I have a classification problem with approximately 1000 positive and 10000 negative samples in the training set. So this data set is quite unbalanced. Plain random forest is just trying to mark all test samples as a majority class. Some good answers…
IgorS
31
votes
4 answers

macro average and weighted average meaning in classification_report

I use classification_report (from sklearn.metrics import classification_report) to evaluate an imbalanced binary classification. Classification Report: precision recall f1-score support 0 1.00…
user10296606
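
For reference, the two averages in such a report differ only in how per-class scores are combined: the macro average is the unweighted mean over classes, and the weighted average weights each class by its support (its number of true instances). A minimal plain-Python sketch for F1 follows; the labels and predictions are made up for illustration and are not the asker's data.

```python
def per_class_f1(y_true, y_pred, cls):
    """F1 for one class, treating that class as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_and_weighted_f1(y_true, y_pred):
    classes = sorted(set(y_true))
    f1 = {c: per_class_f1(y_true, y_pred, c) for c in classes}
    support = {c: y_true.count(c) for c in classes}
    macro = sum(f1.values()) / len(classes)                           # every class counts equally
    weighted = sum(f1[c] * support[c] for c in classes) / len(y_true)  # weighted by support
    return macro, weighted


# Hypothetical imbalanced example: 95 negatives, 5 positives, 4 positives missed.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [0] * 4 + [1] * 1
macro_f1, weighted_f1 = macro_and_weighted_f1(y_true, y_pred)
```

On data like this the weighted average stays high because the majority class dominates, while the macro average drops to reflect the poorly predicted minority class.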
25
votes
3 answers

How do you apply SMOTE on text classification?

Synthetic Minority Oversampling Technique (SMOTE) is an oversampling technique used in an imbalanced dataset problem. So far I have an idea of how to apply it to generic, structured data. But is it possible to apply it to a text classification problem?…
catris25
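
As background, the core SMOTE step interpolates between a minority sample and one of its nearest minority neighbours, so it needs numeric vectors. The sketch below assumes the documents have already been turned into fixed-length vectors (the `minority` vectors here are hypothetical stand-ins for, say, TF-IDF rows or embeddings) and illustrates only the interpolation idea; it is not a full SMOTE implementation.

```python
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating towards near neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base point (excluding itself).
        neighbours = sorted(
            (v for v in minority if v is not base),
            key=lambda v: sq_dist(base, v),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical vectorized minority-class documents.
minority = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
new_points = smote_like_samples(minority, n_new=5)
```

Each synthetic point lies on the line segment between two real minority vectors, which is why the result is a point in feature space rather than readable text.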
22
votes
4 answers

Train/Test Split after performing SMOTE

I am dealing with a highly unbalanced dataset so I used SMOTE to resample it. After SMOTE resampling, I split the resampled dataset into training/test sets using the training set to build a model and the test set to evaluate it. However, I am…
Edamame
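
The usual concern with this ordering is leakage: if resampling happens before the split, copies or interpolations of the same original rows can end up in both the training and test sets. A minimal sketch of the opposite order (split first, then resample only the training part) is shown below. It uses plain random oversampling in place of SMOTE to stay dependency-free, and the data and helper names are hypothetical.

```python
import random


def train_test_split_indices(n, test_fraction, rng):
    """Shuffle row indices and cut off the first part as the test set."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def random_oversample(rows, labels, rng):
    """Duplicate rows of smaller classes (with replacement) up to the largest class size."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


rng = random.Random(0)
rows = list(range(100))                       # row ids stand in for feature vectors
labels = [1 if i < 10 else 0 for i in rows]   # 10% minority class

# 1) Split first, on the original data.
train_idx, test_idx = train_test_split_indices(len(rows), 0.2, rng)
# 2) Resample the training part only; the test set keeps its original distribution.
train_rows, train_labels = random_oversample(
    [rows[i] for i in train_idx], [labels[i] for i in train_idx], rng
)
test_rows = [rows[i] for i in test_idx]
```

Because the test rows are set aside before any duplication, no resampled training row can be a copy of a test row.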
20
votes
4 answers

Macro- or micro-average for imbalanced class problems

The question of whether to use macro- or micro-averages when the data is imbalanced comes up all the time. Some googling shows that many bloggers tend to say that micro-average is the preferred way to go, e.g.: Micro-average is preferable if there…
Krrr
19
votes
2 answers

Why does data science see class imbalance as a problem for supervised learning when statistics does not?

Why does data science see class imbalance as a problem in supervised learning when statistics says it is not? Data science seems to see class imbalance as problematic and needing special techniques to remedy this problem. For instance, this DS.SE…
18
votes
3 answers

When should we consider a dataset as imbalanced?

I'm facing a situation where the numbers of positive and negative examples in a dataset are imbalanced. My question is, are there any rules of thumb that tell us when we should subsample the large category in order to force some kind of balancing in…
Rami
16
votes
4 answers

What are the implications for training a Tree Ensemble with highly biased datasets?

I have a highly biased binary dataset - I have 1000x more examples of the negative class than the positive class. I would like to train a Tree Ensemble (like Extra Random Trees or a Random Forest) on this data but it's difficult to create training…
16
votes
4 answers

Why SMOTE is not used in prize-winning Kaggle solutions?

Synthetic Minority Over-sampling Technique (SMOTE) is a well-known method to tackle imbalanced datasets. There are many highly cited papers out there claiming that it is used to boost accuracy in unbalanced data scenarios. But then, when I…
Carlos Mougan
15
votes
3 answers

How can I perform stratified sampling for multi-label multi-class classification?

I am asking this question for a few reasons: The dataset at hand is imbalanced. I used the code below: x = dataset[['Message']] y = dataset[['Label1', 'Label2']] train_data, test_data = train_test_split(x, test_size = 0.1, stratify=y, random_state =…
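
One simple way to think about stratifying on several label columns is to treat each distinct combination of labels as its own stratum and sample the test fraction from every stratum. The sketch below does that in plain Python. The function name and the example label rows are hypothetical, and very rare combinations (for instance a single row) end up entirely in the training set under this scheme.

```python
import random
from collections import defaultdict


def stratified_split_by_label_combo(label_rows, test_fraction, seed=0):
    """Split row indices so each label combination keeps roughly the same share."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for i, labels in enumerate(label_rows):
        groups[tuple(labels)].append(i)   # one stratum per label combination

    train_idx, test_idx = [], []
    for members in groups.values():
        rng.shuffle(members)
        # Python's round() rounds halves to the nearest even integer.
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical (Label1, Label2) pairs with an imbalanced joint distribution.
label_rows = [("a", "x")] * 80 + [("a", "y")] * 10 + [("b", "x")] * 10
train_idx, test_idx = stratified_split_by_label_combo(label_rows, test_fraction=0.1)
```

The resulting index lists can then be used to select matching rows from both the features and the labels.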
15
votes
2 answers

Why do we need to handle data imbalance?

I would like to know why we need to deal with data imbalance. I know how to deal with it and different methods to solve the issue - by upsampling or downsampling or by using SMOTE. For example, if I have a rare disease occurring in 1 out of 100 cases, and…
sara
14
votes
1 answer

Why doesn't class weight resolve the imbalanced classification problem?

I know that in imbalanced classification, the classifier tends to predict all the test labels as the larger class, but if we use class weights in the loss function, it would be reasonable to expect the problem to be solved. So why do we need some…
user137927
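
To make concrete what a class weight does inside a loss, the sketch below computes a weighted binary log loss in plain Python and compares two constant predictions. The data (9 negatives, 1 positive) and the weight of 9 are illustrative assumptions; this shows the mechanism only and does not settle the question being asked.

```python
import math


def weighted_log_loss(y_true, p_pred, pos_weight=1.0):
    """Weighted mean binary cross-entropy; positives count pos_weight times."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, p_pred):
        w = pos_weight if y == 1 else 1.0
        loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
        total += w * loss
        weight_sum += w
    return total / weight_sum


# Hypothetical data: 10% positives.
y = [0] * 9 + [1]

base_rate = [0.1] * len(y)   # always predict the class prior
fifty = [0.5] * len(y)       # always predict 0.5

unweighted = (weighted_log_loss(y, base_rate), weighted_log_loss(y, fifty))
weighted = (weighted_log_loss(y, base_rate, 9.0), weighted_log_loss(y, fifty, 9.0))
```

Without weights, predicting the base rate gives the lower loss; with the positive class weighted by 9, the constant 0.5 prediction does better. In other words the weight shifts predicted probabilities towards the minority class, which is similar in effect to resampling.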
13
votes
3 answers

Unbalanced classes -- How to minimize false negatives?

I have a dataset that has a binary class attribute. There are 623 instances with class +1 (cancer positive) and 101,671 instances with class -1 (cancer negative). I've tried various algorithms (Naive Bayes, Random Forest, AODE, C4.5) and all of them…
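
When false negatives are the main concern, one commonly discussed lever is the decision threshold: instead of predicting positive at a score of 0.5, pick the largest threshold that still reaches a target recall on held-out data. The sketch below assumes a model that outputs a score per instance; the scores, labels (here 1 marks the positive class) and the function name are hypothetical.

```python
import math


def threshold_for_recall(scores, labels, min_recall):
    """Largest threshold t such that 'score >= t' captures at least min_recall of positives."""
    pos_scores = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not pos_scores:
        raise ValueError("no positive examples to calibrate on")
    needed = max(1, math.ceil(min_recall * len(pos_scores)))
    return pos_scores[needed - 1]


def count_errors(scores, labels, threshold):
    """Return (false_negatives, false_positives) at the given threshold."""
    fn = sum(y == 1 and s < threshold for s, y in zip(scores, labels))
    fp = sum(y == 0 and s >= threshold for s, y in zip(scores, labels))
    return fn, fp


# Hypothetical validation scores; three positives, five negatives.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1, 0.4, 0.05]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

t = threshold_for_recall(scores, labels, min_recall=1.0)
errors_default = count_errors(scores, labels, 0.5)
errors_tuned = count_errors(scores, labels, t)
```

Lowering the threshold removes the missed positive at the cost of an extra false positive, so the trade-off should be chosen against the relative cost of each error type.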