
I'm creating a multilabel classification approach based on sentence embeddings applied to text taken from a chatbot. We have the following:

  • a training dataset of 2,500 lines, where each line is a sentence associated with a particular label (the same sentence can end up with several labels)
  • a production dataset made up of sentences taken from our previous chatbot version. Each sentence has been carefully annotated with the corresponding labels (the number of labels per sentence ranges from 0 to 5 in most cases). There are about 1,000 annotated sentences in this dataset.

I'd like to maximize the performance and generalizability of the approach, but I have the following limitations:

  • the production dataset doesn't cover all possible labels, as some labels are rare but still important to detect
  • the training dataset covers all labels, but has been created from imaginary sentences.

The current process is as follows:

  • Evaluate performance with MultilabelStratifiedKFold on the training dataset
  • Select the best model and train the final model with all data
  • Evaluate performance on the production dataset and choose a threshold to separate what the model considers relevant from the rest (for the moment, I'm choosing the point that balances micro-averaged precision and micro-averaged recall on the production dataset; see the sketch after this list).
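
For concreteness, here is roughly what that last step looks like. This is only a sketch; probs_prod and Y_prod stand for the model's predicted probabilities and the annotation matrix on the production dataset:

import numpy as np
from sklearn.metrics import precision_score, recall_score

# probs_prod: (n_sentences, n_labels) predicted probabilities on the production set
# Y_prod:     (n_sentences, n_labels) binary matrix of annotated labels

def pick_balanced_threshold(Y_prod, probs_prod, grid=np.linspace(0.05, 0.95, 19)):
    """Return the threshold where micro precision and micro recall are closest."""
    best_threshold, best_gap = 0.5, float("inf")
    for t in grid:
        Y_pred = (probs_prod >= t).astype(int)
        p = precision_score(Y_prod, Y_pred, average="micro", zero_division=0)
        r = recall_score(Y_prod, Y_pred, average="micro", zero_division=0)
        if abs(p - r) < best_gap:
            best_gap, best_threshold = abs(p - r), t
    return best_threshold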

My question is this: When you choose the threshold, how do you avoid overfitting as much as possible in this context? We don't want overfitting on either the training or production datasets. Wouldn't optimizing the threshold on the production dataset bias the approach towards the labels available in the production dataset? In the same way, optimizing the threshold on the training dataset might bias the approach towards less "realistic" phrases from the real production context.

What do you think?

LasiusMind

2 Answers


Precision, recall, and other KPIs that rely on thresholding probabilistic predictions all suffer from the same issues as accuracy; see Why is accuracy not the best measure for assessing classification models?

When you threshold your probabilistic classifications, you are essentially deciding how to treat a given instance, given the probabilistic classification your model assigned to it. That is, you are deciding on an action or a decision. There can be more than two reasonable actions even if there are only two classes: a low probability of a disease could mean "discharge", a medium probability "take two aspirin and see me in the morning", a high probability "isolate immediately, call the authorities".

Which action is best in a given instance can never depend on the predictor data and the model alone. It must always include the cost of actions. Calling out the people in hazmat suits for a possible Ebola outbreak that turns out not to be one (a false positive) entails very different costs than sending a marketing email to someone your model falsely identified as likely to buy (also a false positive).

Therefore you need to think about what actions or decisions your thresholding implies, and about what costs the different actions will entail; see here for more.
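
For illustration, a minimal sketch of what cost-based thresholding could look like for a single label; the cost numbers below are made up and stand in for whatever your actions actually cost:

import numpy as np

# Purely illustrative costs: acting on a false positive vs. missing a true positive
COST_FP = 1.0   # e.g. a needless follow-up action
COST_FN = 5.0   # e.g. a missed label that mattered

def expected_cost(y_true, y_prob, threshold):
    """Total cost of the decisions implied by thresholding one label."""
    y_pred = (y_prob >= threshold).astype(int)
    false_positives = np.sum((y_pred == 1) & (y_true == 0))
    false_negatives = np.sum((y_pred == 0) & (y_true == 1))
    return COST_FP * false_positives + COST_FN * false_negatives

def min_cost_threshold(y_true, y_prob, grid=np.linspace(0.01, 0.99, 99)):
    """Pick the threshold that minimizes total cost on held-out data."""
    return min(grid, key=lambda t: expected_cost(y_true, y_prob, t))

With well-calibrated probabilities you do not even need the grid search: the cost-minimizing threshold is simply COST_FP / (COST_FP + COST_FN).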

Yes, this is a complex undertaking. But simply using precision, recall, etc. sweeps all this under the rug by assuming one very specific cost structure without disclosing it.

An alternative would be to stick with the probabilistic predictions and use proper scoring rules to assess these. This makes sense especially if you can't get a good handle on the costs, or if different users have widely divergent ideas about them.
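
As a sketch (array names are placeholders), here are two common proper scoring rules evaluated per label directly on the probabilities, with no threshold involved:

import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

# probs, Y_true: (n_sentences, n_labels) predicted probabilities and binary labels
def proper_scores(Y_true, probs):
    """Mean Brier score and mean log loss across labels (lower is better)."""
    brier = np.mean([brier_score_loss(Y_true[:, j], probs[:, j], pos_label=1)
                     for j in range(Y_true.shape[1])])
    logloss = np.mean([log_loss(Y_true[:, j], probs[:, j], labels=[0, 1])
                       for j in range(Y_true.shape[1])])
    return brier, logloss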

Stephan Kolassa

I would suggest validating your data properly. For example, make sure that the label distribution is the same in the train and validation sets. In case you don't know how to do that, use the following:

from sklearn.model_selection import train_test_split

# Hold out 20% of the data as a validation set, stratified on the label column
train_dataset, validation_dataset = train_test_split(
    df, test_size=0.2, random_state=2023, stratify=df['target']
)

This will create a validation set that can be used to validate your model. After that, you can use StratifiedKFold to make sure the model is not overfitting.
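
A sketch of that step for a single binary label, as in the rest of this answer (for the full multilabel case, the MultilabelStratifiedKFold from the question would take the place of StratifiedKFold; model, X and y are placeholders):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# X: feature matrix (e.g. sentence embeddings), y: binary labels for one target
def out_of_fold_probs(model, X, y, n_splits=5, seed=2023):
    """Collect out-of-fold predicted probabilities for later threshold tuning."""
    oof = np.zeros(len(y), dtype=float)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    return oof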

After using StratifiedKFold, you will have predictions on your validation data. Keep those predictions and use them to choose your threshold value. When choosing a threshold value, use the following:

import numpy as np
from sklearn.metrics import precision_score

def calculate_scores(y_true, y_probs):
    """Sweep thresholds and keep the one with the best precision for one label."""
    thresholds = np.arange(0, 1, 0.1)
    best_precision = 0
    best_threshold = 0

    for threshold in thresholds:
        # Turn probabilities into hard 0/1 predictions at this threshold
        y_pred = [1 if prob > threshold else 0 for prob in y_probs]
        precision = precision_score(y_true, y_pred, zero_division=0)

        if precision > best_precision:
            best_precision = precision
            best_threshold = threshold

    return best_precision, best_threshold

This will give you the threshold value with the best precision score, and you can adapt it to whichever metric you are using.
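
For example (placeholder names), combining it with the out-of-fold probabilities from the earlier sketch:

# Out-of-fold probabilities from the StratifiedKFold sketch above,
# then the precision-maximizing threshold for that label.
y_probs = out_of_fold_probs(model, X, y)
best_precision, best_threshold = calculate_scores(y, y_probs)
print(f"best precision {best_precision:.3f} at threshold {best_threshold:.1f}")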

Harshad Patil