
I want to build a predictive model to predict churn.

I have two datasets: one with the churn label and one without it (so I can later make predictions on it).

The issue is that my confusion matrix looks quite poor, since my target variable is highly unbalanced:

[image: class distribution of the target variable]

which mostly leads to this confusion matrix:

[image: confusion matrix]

(Similar values for both logistic regression and decision tree).

This is my workflow:

[image: Orange workflow]

Is there any way to balance the data? I can't find anything about this in the Orange documentation.


4 Answers


For unbalanced classes, use the Python Script widget together with the imblearn package; you will need to write some code.

Link to the thread on GitHub

Example
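A minimal sketch of what the script could look like, assuming imbalanced-learn is installed in the Python environment Orange runs in; SMOTE is just one of the resamplers it provides, and the `in_data`/`out_data` names are the Python Script widget's standard input and output variables:

```python
# Runs inside Orange's Python Script widget: the table connected to the
# widget's "Data" input is exposed as `in_data`, and whatever is assigned
# to `out_data` is sent downstream.
import Orange
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

# Features and class column of the input table.
X, y = in_data.X, in_data.Y

# Oversample the minority class until both classes have the same size.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Rebuild an Orange table with the original domain and pass it on.
out_data = Orange.data.Table.from_numpy(in_data.domain, X_res, y_res)
```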


Orange Data Mining is not suitable for most tasks because it lacks features. You can use it if you are a strong Python coder, because you have to add all of those missing features yourself in Script widgets. Imbalanced datasets are not the exception but the norm, and Orange does not handle them well, hence AUC values like these: no matter which learner you use, you get a poor AUC on validation data because the model is only finding results by chance. I have tested this a lot. I use Optuna (as a CLI tool) to tune hyperparameters for CatBoost with SMOTE, and Optuna reports a high AUC. Then I test those hyperparameters on validation data, and the result is a poor AUC of around 50%; the model finds hits by chance only.

ROC AUC is a weak metric on training and test data and says nothing about unseen data. The more imbalanced your dataset, the worse the AUC will be, and you always get a higher AUC on the majority class (I have never seen a raw dataset that was balanced). The only strengths Orange has are easy data manipulation and visualization. Use Keras, PyTorch or H2O instead; in my opinion those should be integrated into Orange, but they are not.

Sadly, if you want more out of data mining, coding is the only choice. Choose an IDE with an AI assistant, but AI assistants are also weak at coding: you get roughly 30-70% usable help from them, many are commercial, some are free but rely on commercial API keys, and some IDEs have GUIs overblown beyond end-user use. So there are a lot of hurdles, and almost every time you will need professional help; you won't solve it alone.
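For reference, a rough sketch of the kind of setup described above (Optuna tuning CatBoost with SMOTE inside an imblearn pipeline, then scoring ROC AUC on a held-out validation split); the synthetic data, parameter ranges and variable names are placeholders, not the poster's actual experiment:

```python
import optuna
from catboost import CatBoostClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic imbalanced data standing in for a real churn table.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

# Keep a validation split that the tuning never sees.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

def objective(trial):
    model = CatBoostClassifier(
        depth=trial.suggest_int("depth", 4, 10),
        learning_rate=trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        iterations=trial.suggest_int("iterations", 100, 500),
        verbose=0,
    )
    # SMOTE sits inside the pipeline, so it is re-fitted on each CV training
    # fold and resampled rows never leak into the scoring folds.
    pipe = Pipeline([("smote", SMOTE(random_state=42)), ("model", model)])
    return cross_val_score(pipe, X_train, y_train, cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)

# The number that matters: AUC on the untouched validation split.
final = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("model", CatBoostClassifier(**study.best_params, verbose=0)),
])
final.fit(X_train, y_train)
print("Validation AUC:", roc_auc_score(y_valid, final.predict_proba(X_valid)[:, 1]))
```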

jenkie

You can try to randomly delete samples from the majority class until there's a 50-50 split in the data. Then split the result 75%/25% for training and testing.

You can also try to generate more "Yes" samples via oversampling (for example, synthetic samples) or whatever means may be relevant to the given dataset.

Sometimes you have to make the most out of the data you have.
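A quick sketch of both ideas using imbalanced-learn; the synthetic data is a placeholder for the real churn table:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn table.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

# Option 1: randomly drop majority-class rows until the split is 50-50.
X_bal, y_bal = RandomUnderSampler(sampling_strategy=1.0, random_state=42).fit_resample(X, y)

# Option 2: randomly duplicate minority ("Yes") rows instead of deleting data.
# X_bal, y_bal = RandomOverSampler(sampling_strategy=1.0, random_state=42).fit_resample(X, y)

# Then the usual 75% / 25% train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.25, stratify=y_bal, random_state=42
)
```

In practice many people split first and resample only the training portion, so the test set keeps the original class distribution.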

Sterls

To balance the classes, check the Stratified sampling checkbox in the Data Sampler widget. Note that it only works when downsampling.

K3---rnc