
I am currently working on the Titanic dataset from Kaggle. The dataset is imbalanced, with roughly 61.5% negative and 38.5% positive examples.

I divided my training data into an 85% training set and a 15% validation set, and chose a support vector classifier as the model. I ran 10-fold stratified cross-validation on the training set and, for each fold, searched for the threshold that maximizes the F1 score. Averaged over the validation folds, the optimal threshold comes out to 35% +/- 10%.
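
In outline, the per-fold search looks like this (a minimal sketch: X and y stand for the features and labels of the 85% training split as numpy arrays, and the SVC settings are illustrative, not my exact configuration):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import f1_score

fold_thresholds = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    model = SVC(probability=True)  # probability=True enables predict_proba
    model.fit(X[train_idx], y[train_idx])
    probs = model.predict_proba(X[val_idx])[:, 1]

    # scan candidate thresholds and keep the F1-maximizing one for this fold
    grid = np.linspace(0.01, 0.99, 99)
    f1s = [f1_score(y[val_idx], (probs >= t).astype(int)) for t in grid]
    fold_thresholds.append(grid[int(np.argmax(f1s))])

print('threshold: %.2f +/- %.2f' % (np.mean(fold_thresholds), np.std(fold_thresholds)))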

After that, I tested the model on the validation set and estimated the threshold that maximizes the F1 score there. That threshold comes out to about 63%, which is very far from the one obtained during cross-validation.

I then tested the model on the holdout test set from Kaggle and could not get a good score with either threshold (35% from cross-validation on the training set, 63% from the validation set).
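
For reference, this is how I apply a fixed threshold to the test-set probabilities to produce the submitted labels (test_X is a placeholder for the Kaggle test features):

# apply each candidate threshold to the positive-class probabilities
probs = model.predict_proba(test_X)[:, 1]
preds_cv = (probs >= 0.35).astype(int)   # threshold from cross-validation
preds_val = (probs >= 0.63).astype(int)  # threshold from the validation set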


How does one determine, from the available data, an optimal threshold that will also work well on unseen data? Should I choose the threshold obtained from cross-validation or the one from the validation set? Or am I doing this completely wrong? I would appreciate any help and advice.

For this dataset, I am looking to maximize my score on the leaderboard by getting the highest accuracy.
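
Since the leaderboard metric is accuracy rather than F1, tuning the threshold against accuracy directly would look roughly like this (val_probs and val_y are placeholders for the validation-set probabilities and labels):

# variant: pick the threshold that maximizes accuracy, the leaderboard metric
import numpy as np
from sklearn.metrics import accuracy_score

grid = np.linspace(0.01, 0.99, 99)
accs = [accuracy_score(val_y, (val_probs >= t).astype(int)) for t in grid]
print('accuracy-optimal threshold: %.2f' % grid[int(np.argmax(accs))])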

Thank you.


1 Answer


In short, you should be the judge of that: it depends on how much precision (minimising "false alarms"/FPs) and how much recall (minimising "missed positives"/FNs) you want your classifier to have.

The appropriate way to look at precision-recall pairs across different thresholds is a precision-recall curve (PRC), especially if you want to focus on the minority class. Via a PRC you can find the threshold that is optimal for model performance as a function of precision and recall.

I copy below a minimal snippet:

from numpy import argmax
from sklearn.metrics import precision_recall_curve

model.fit(trainX, trainy)
# keep the probabilities for the positive class only
preds = model.predict_proba(testX)[:, 1]

# calculate the precision-recall curve
precision, recall, thresholds = precision_recall_curve(testy, preds)

# convert precision/recall pairs to F1 scores
fscore = (2 * precision * recall) / (precision + recall)

# locate the index of the largest F1 score
ix = argmax(fscore)
print('Best Threshold=%f, F-Score=%.3f' % (thresholds[ix], fscore[ix]))
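
Note that scikit-learn's SVC only exposes predict_proba when the model is constructed with probability=True.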


The PRC would look like this: [precision-recall curve plot]

You can alternatively follow the equivalent approach for ROC curves.
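
For instance, a minimal sketch of the ROC variant, picking the threshold that maximizes Youden's J statistic (TPR - FPR), with testy and preds as in the snippet above:

from numpy import argmax
from sklearn.metrics import roc_curve

# ROC-based alternative: maximize Youden's J statistic (TPR - FPR)
fpr, tpr, thresholds = roc_curve(testy, preds)
ix = argmax(tpr - fpr)
print('Best Threshold=%f, J=%.3f' % (thresholds[ix], (tpr - fpr)[ix]))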
