
I'm working with an imbalanced multi-class dataset. I'm trying to tune the hyperparameters of a DecisionTreeClassifier, a RandomForestClassifier and a GradientBoostingClassifier using a randomized search and a Bayesian search.

So far I have used only accuracy for the scoring, which is not really suitable for assessing my models' performance (something I'm not doing with it anyway). Is it also unsuitable for hyperparameter tuning?

I found that, for example, recall_micro and recall_weighted yield the same results as accuracy, and the same should hold for other micro-averaged metrics such as f1_micro.

So my question is: does the choice of scoring matter for tuning? I see that recall_macro gives lower scores because it does not take the number of samples per class into account. Which metric should I use?
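For reference, here is a small scikit-learn sketch (with made-up labels, not my actual data) of the equalities I mean:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Made-up multi-class labels, just to illustrate the equalities.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 0, 2, 2, 2, 1, 2, 2]

print(accuracy_score(y_true, y_pred))                    # 0.667
print(recall_score(y_true, y_pred, average="micro"))     # 0.667, same as accuracy
print(recall_score(y_true, y_pred, average="weighted"))  # 0.667, same as accuracy
print(f1_score(y_true, y_pred, average="micro"))         # 0.667, micro F1 equals accuracy too
print(recall_score(y_true, y_pred, average="macro"))     # 0.611, ignores class sizes
```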

Christian

4 Answers


The right evaluation metric depends on the goals of the project: which outcomes matter most, and which mistakes are most costly? Some projects value precision over recall, and others value recall over precision.

After you have clarity on the project goals, pick a single metric to provide a consistent scorecard when comparing different algorithms and hyperparameter combinations. One common evaluation metric for multi-class classification is the F-score. The F-score has a β parameter that weights recall and precision differently. You will also have to choose between micro-averaging (biased by class frequency) and macro-averaging (treating all classes as equally important). For macro-averaging, two different formulas can be used (both are sketched in the snippet below):

  1. The F-score computed from the arithmetic means of class-wise precision and recall.

  2. The arithmetic mean of class-wise F-scores (often more desirable).
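To make the two options concrete, here is a small sketch (illustrative labels, plain scikit-learn; the variable names are just for exposition):

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Illustrative multi-class labels (not from the question's data).
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 2, 0]

# Class-wise precision, recall and F1 (beta = 1).
prec, rec, f1_per_class, _ = precision_recall_fscore_support(y_true, y_pred)

# Option 1: F-score of the arithmetic means of class-wise precision and recall.
p_macro, r_macro = prec.mean(), rec.mean()
f1_of_means = 2 * p_macro * r_macro / (p_macro + r_macro)

# Option 2: arithmetic mean of class-wise F-scores
# (this is what scikit-learn's average="macro" computes).
mean_of_f1 = f1_per_class.mean()
assert np.isclose(mean_of_f1, f1_score(y_true, y_pred, average="macro"))

print(f1_of_means, mean_of_f1)  # generally close, but not identical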

Brian Spiering

You should use the same metric to evaluate the classifiers and to tune them. If you will evaluate the final classifier using accuracy, then you should use accuracy to tune the hyperparameters. If you think macro-averaged F1 is the right final evaluation metric, then use it for tuning as well.

As an aside, for multi-class problems I have not yet heard a convincing argument against using accuracy, but that is just me.
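For example, a minimal sketch of tuning with the same metric you plan to report (toy data and a made-up parameter grid, just to show where the scoring goes):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Toy imbalanced multi-class data as a stand-in for the real dataset.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.8, 0.15, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(100, 500),
                         "max_depth": randint(3, 20)},
    n_iter=20,
    scoring="f1_macro",  # same metric for tuning ...
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)

# ... and for the final evaluation.
print(f1_score(y_test, search.predict(X_test), average="macro"))
```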

Jacques Wainer

If your dataset is imbalanced, you can use Cohen's kappa score instead.
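If you want to use it directly as the tuning score, it can be wrapped with make_scorer; a sketch (the estimator and parameter grid are placeholders):

```python
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Wrap Cohen's kappa so the search can use it as its scoring function.
kappa_scorer = make_scorer(cohen_kappa_score)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={"max_depth": list(range(2, 20)),
                         "min_samples_leaf": list(range(1, 50))},
    n_iter=25,
    scoring=kappa_scorer,
    cv=5,
    random_state=0,
)
# search.fit(X_train, y_train)  # fit on your own (imbalanced) training data
```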

Rina

A simple solution is to assign each class an importance weight that is inversely related to its relative frequency in the training set, for example $\frac{1}{freq}$ or $e^{-freq}$. The choice of formula depends on how much importance you want to give to the less frequent classes: $\frac{1}{freq}$ boosts rare classes much more aggressively, while $e^{-freq}$ is a gentler re-weighting.
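A sketch of how such weights could be computed and passed to scikit-learn via class_weight (the label counts are made up; pick whichever formula suits your data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up imbalanced labels: 80 / 15 / 5 samples per class.
y_train = np.array([0] * 80 + [1] * 15 + [2] * 5)
X_train = np.random.rand(len(y_train), 4)  # dummy features, just to show the call

classes, counts = np.unique(y_train, return_counts=True)
freq = counts / counts.sum()

# 1/freq strongly boosts rare classes ...
weights_inv = {int(c): 1.0 / f for c, f in zip(classes, freq)}
# ... while exp(-freq) is a much gentler re-weighting.
weights_exp = {int(c): float(np.exp(-f)) for c, f in zip(classes, freq)}

clf = RandomForestClassifier(class_weight=weights_inv, random_state=0)
clf.fit(X_train, y_train)
```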

Mikedev