I need to compare two multi-class classifiers. To assess whether the difference between them is statistically significant, I have taken the following steps:
- obtain predictions on the test data using model 1
- obtain predictions on the test data using model 2
- construct a square contingency table (a "confusion matrix") that cross-tabulates the predictions from model 1 against the predictions from model 2
- use the Stuart-Maxwell test to test marginal homogeneity of that table and, in this way, check whether the difference between the two classifiers is significant
Is this a correct way to tackle this task?
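To make the steps above concrete, here is a minimal sketch of the procedure as I understand it, in Python with NumPy. The names `agreement_table`, `stuart_maxwell` and `chi2_sf` are my own illustrative helpers, not functions from any library. The sketch assumes the predictions are integer-encoded class labels 0..K-1, and that the reduced covariance matrix is invertible (this can fail if, for example, some class is never involved in a disagreement between the two models).

```python
import math
import numpy as np


def agreement_table(pred_1, pred_2, n_classes):
    """Cross-tabulate model 1 predictions (rows) against model 2 predictions (columns)."""
    table = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(table, (np.asarray(pred_1), np.asarray(pred_2)), 1)
    return table


def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square variable with integer df.

    Uses the recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1),
    starting from Q(1, h) = exp(-h) or Q(1/2, h) = erfc(sqrt(h)), with h = x / 2.
    """
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    while a < df / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q


def stuart_maxwell(table):
    """Stuart-Maxwell test of marginal homogeneity for a square K x K table.

    Returns (statistic, degrees of freedom, p-value).
    """
    n = np.asarray(table, dtype=float)
    k = n.shape[0]
    rows, cols = n.sum(axis=1), n.sum(axis=0)

    # Differences between row and column marginals; they sum to zero,
    # so the last category is dropped to avoid a singular system.
    d = (rows - cols)[:-1]

    # Covariance of the differences under the null hypothesis:
    #   S_ii = n_i. + n_.i - 2 * n_ii,   S_ij = -(n_ij + n_ji)
    s = -(n + n.T)
    np.fill_diagonal(s, rows + cols - 2.0 * np.diag(n))
    s = s[:-1, :-1]

    stat = float(d @ np.linalg.solve(s, d))
    df = k - 1
    return stat, df, chi2_sf(stat, df)
```

With my data this would be called as `stuart_maxwell(agreement_table(pred_1, pred_2, 10))`, where `pred_1` and `pred_2` are the two prediction vectors on the test set. For a 2x2 table the statistic reduces to McNemar's statistic without continuity correction. If statsmodels is available, I believe `SquareTable(table).homogeneity(method="stuart_maxwell")` from `statsmodels.stats.contingency_tables` computes the same test, but I have not checked that the two agree.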
I have chosen this approach because the dataset I am using is large (~1 million records) and my target variable has 10 classes. The dataset has been split into train/test/validation sets. In his 1998 paper, Thomas Dietterich recommended McNemar’s test for cases where cross-validation is expensive or impractical. Since the Stuart-Maxwell test extends McNemar’s test to more than two outcomes, I have chosen it to test my models.
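For comparison, this is my understanding of the McNemar setup in Dietterich's paper: it works on a 2x2 table of which test examples each model classifies correctly, so it needs the true labels, which my table of model 1 versus model 2 predictions does not use. A minimal sketch, assuming NumPy and aligned label arrays; `mcnemar_correctness` and `y_true` are hypothetical names I introduce here for illustration.

```python
import math
import numpy as np


def mcnemar_correctness(y_true, pred_1, pred_2):
    """McNemar's test (with continuity correction) on per-example correctness.

    Returns (statistic, p-value); the statistic is referred to a
    chi-square distribution with 1 degree of freedom.
    """
    y_true, pred_1, pred_2 = (np.asarray(a) for a in (y_true, pred_1, pred_2))
    ok_1 = pred_1 == y_true
    ok_2 = pred_2 == y_true

    n01 = int(np.sum(~ok_1 & ok_2))  # model 1 wrong, model 2 right
    n10 = int(np.sum(ok_1 & ~ok_2))  # model 1 right, model 2 wrong

    if n01 + n10 == 0:
        # The models never disagree on correctness: no evidence of a difference.
        return 0.0, 1.0

    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    # Chi-square upper tail with 1 degree of freedom: erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

Only the discordant examples (those that exactly one of the two models gets right) enter the statistic, regardless of how many classes the target has.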
I would really appreciate any opinion/advice on this!
Thank you!