
I want to build a predictive model to predict churn.

I have two datasets: one with the churn label and one without it (so I can later make predictions on it).

The issue is that my confusion matrix looks quite poor, since my target variable is highly unbalanced:

[image: class distribution of the target variable]

which mostly leads to this confusion matrix:

[image: confusion matrix]

(Similar values for both logistic regression and decision tree).

This is my workflow:

[image: Orange workflow]

Is there any way to balance the data? I can't find anything about this in the Orange documentation.


4 Answers


For unbalanced classes, use the Python Script widget together with the imblearn package; you will need to write some code.

Link to the thread on GitHub

Example
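A minimal sketch of what the script could look like, assuming imbalanced-learn is installed in the Python environment Orange runs in; SMOTE is just one of the resamplers it provides, and the `in_data`/`out_data` names are the Python Script widget's standard input and output variables:

```python
# Runs inside Orange's Python Script widget: the table connected to the
# widget's "Data" input is exposed as `in_data`, and whatever is assigned
# to `out_data` is sent downstream.
import Orange
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

# Features and class column of the input table.
X, y = in_data.X, in_data.Y

# Oversample the minority class until both classes have the same size.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Rebuild an Orange table with the original domain and pass it on.
out_data = Orange.data.Table.from_numpy(in_data.domain, X_res, y_res)
```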


Orange Data Mining is not suitable for most tasks because it lacks features. You can use it if you are a strong Python coder, because you have to add all of those missing features yourself in Script widgets. Imbalanced datasets are not the exception but the norm, and Orange does not handle them well, hence AUC values like these: no matter which learner you use, you get a poor AUC on validation data because the model is only finding results by chance. I have tested this a lot. I use Optuna (as a CLI tool) to tune hyperparameters for CatBoost with SMOTE, and Optuna reports a high AUC. Then I test those hyperparameters on validation data, and the result is a poor AUC of around 50%; the model finds hits by chance only.

ROC AUC is a weak metric on training and test data and says nothing about unseen data. The more imbalanced your dataset, the worse the AUC will be, and you always get a higher AUC on the majority class (I have never seen a raw dataset that was balanced). The only strengths Orange has are easy data manipulation and visualization. Use Keras, PyTorch or H2O instead; in my opinion those should be integrated into Orange, but they are not.

Sadly, if you want more out of data mining, coding is the only choice. Choose an IDE with an AI assistant, but AI assistants are also weak at coding: you get roughly 30-70% usable help from them, many are commercial, some are free but rely on commercial API keys, and some IDEs have GUIs overblown beyond end-user use. So there are a lot of hurdles, and almost every time you will need professional help; you won't solve it alone.
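For reference, a rough sketch of the kind of setup described above (Optuna tuning CatBoost with SMOTE inside an imblearn pipeline, then scoring ROC AUC on a held-out validation split); the synthetic data, parameter ranges and variable names are placeholders, not the poster's actual experiment:

```python
import optuna
from catboost import CatBoostClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic imbalanced data standing in for a real churn table.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

# Keep a validation split that the tuning never sees.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

def objective(trial):
    model = CatBoostClassifier(
        depth=trial.suggest_int("depth", 4, 10),
        learning_rate=trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        iterations=trial.suggest_int("iterations", 100, 500),
        verbose=0,
    )
    # SMOTE sits inside the pipeline, so it is re-fitted on each CV training
    # fold and resampled rows never leak into the scoring folds.
    pipe = Pipeline([("smote", SMOTE(random_state=42)), ("model", model)])
    return cross_val_score(pipe, X_train, y_train, cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)

# The number that matters: AUC on the untouched validation split.
final = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("model", CatBoostClassifier(**study.best_params, verbose=0)),
])
final.fit(X_train, y_train)
print("Validation AUC:", roc_auc_score(y_valid, final.predict_proba(X_valid)[:, 1]))
```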

jenkie

You can try to randomly delete samples from the majority class until there's a 50-50 split in the data. Then split the result 75%/25% for training and testing.

You can also try to generate more "Yes" samples via oversampling (for example, synthetic samples) or whatever means may be relevant to the given dataset.

Sometimes you have to make the most out of the data you have.
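A quick sketch of both ideas using imbalanced-learn; the synthetic data is a placeholder for the real churn table:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn table.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

# Option 1: randomly drop majority-class rows until the split is 50-50.
X_bal, y_bal = RandomUnderSampler(sampling_strategy=1.0, random_state=42).fit_resample(X, y)

# Option 2: randomly duplicate minority ("Yes") rows instead of deleting data.
# X_bal, y_bal = RandomOverSampler(sampling_strategy=1.0, random_state=42).fit_resample(X, y)

# Then the usual 75% / 25% train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.25, stratify=y_bal, random_state=42
)
```

In practice many people split first and resample only the training portion, so the test set keeps the original class distribution.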

Sterls

To balance the classes, check the Stratified sampling checkbox in the Data Sampler widget. Note that it only works when downsampling.

K3---rnc