Suppose that I am interested in three classes $c_1$, $c_2$, $c_3$. But my dataset actually contains several more real classes $(c_j)_{j=4}^n$.

The obvious answer is to define a new class $\hat c_4$ that covers all classes $c_j$, $j>3$, but I suspect this is not a good idea, since the samples in $\hat c_4$ will be rare and not very similar to each other.

To visualize what I mean, suppose I have the following two-variable space, where the classes $c_1$, $c_2$, $c_3$, $\hat c_4= \bigcup_{j=4}^n c_j$ are depicted in red, teal, green and black respectively. This is how I suspect my data would look.

[Figure: scatter plot of the two-variable space showing the four classes]

Is there any standard way to approach this problem? What would be the most efficient classifier and why?

h3h325

1 Answer

I would use a two-step approach, using the idea of the $\hat{c_4}$ class you mentioned.

In the first step, use a binary classifier (trained on the whole dataset) to decide whether a sample belongs to the class $\hat{c_4}$ (i.e. to any non-interesting class). For this step, you could also look into outlier detection methods, if the samples belonging to the "interesting" classes are very different from the rest.

If the result is negative, move on to the second step: a new classifier trained only on samples belonging to the classes $c_1,c_2,c_3$, whose prediction is your final one.
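The two steps could be sketched roughly as follows. This is only an illustration under assumptions: the toy data, the helper names `fit_two_stage` and `predict_two_stage`, and the choice of scikit-learn's `RandomForestClassifier` (with `class_weight="balanced"` because $\hat{c_4}$ is rare) are mine, and any other classifier could be swapped in.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_two_stage(X, y, interesting=(1, 2, 3)):
    """Fit step 1 (interesting vs. rest) and step 2 (c_1, c_2, c_3 only)."""
    is_other = (~np.isin(y, interesting)).astype(int)
    # Step 1: binary classifier trained on the whole dataset.
    rejector = RandomForestClassifier(class_weight="balanced", random_state=0)
    rejector.fit(X, is_other)
    # Step 2: multiclass classifier trained only on the interesting classes.
    keep = is_other == 0
    classifier = RandomForestClassifier(random_state=0)
    classifier.fit(X[keep], y[keep])
    return rejector, classifier

def predict_two_stage(rejector, classifier, X, other_label=4):
    """Label a sample `other_label` if step 1 rejects it, else use step 2."""
    pred = np.full(len(X), other_label)
    accepted = rejector.predict(X) == 0
    if accepted.any():
        pred[accepted] = classifier.predict(X[accepted])
    return pred

# Toy data (an assumption): three compact classes plus rare, scattered others.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X_int = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in centers])
y_int = np.repeat([1, 2, 3], 100)
X_oth = rng.uniform(-3, 8, size=(30, 2))
y_oth = rng.integers(4, 10, size=30)  # the "real" classes c_4, ..., c_9
X = np.vstack([X_int, X_oth])
y = np.concatenate([y_int, y_oth])

rejector, classifier = fit_two_stage(X, y)
pred = predict_two_stage(rejector, classifier, X)
```

Here `pred` is evaluated on the training data only to keep the sketch short; in practice you would score both steps on held-out samples.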

I think that even a simple clustering approach would still be useful as a first step, e.g. k-means with $k=4$, using as initial centroids the per-class means $cent_j = \frac{\sum\limits_{x_i\in D: y_i=j}x_i}{\sum\limits_{x_i\in D: y_i=j}1}$ for each of $c_1,c_2,c_3, \hat{c_4}$.
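A minimal sketch of that initialization, assuming scikit-learn's `KMeans` (which accepts an array of starting centroids through `init`) and made-up toy data; the helper `class_centroids` is a hypothetical name that computes $cent_j$ as the mean of the samples labelled $j$.

```python
import numpy as np
from sklearn.cluster import KMeans

def class_centroids(X, labels):
    """Mean feature vector of each label, in sorted label order."""
    return np.vstack([X[labels == j].mean(axis=0) for j in np.unique(labels)])

# Toy data (an assumption): three compact classes plus rare, scattered others.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X_int = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in centers])
X_oth = rng.uniform(-3, 8, size=(30, 2))
X = np.vstack([X_int, X_oth])
y = np.concatenate([np.repeat([1, 2, 3], 100), rng.integers(4, 10, size=30)])

# Collapse every class c_j with j > 3 into the single label 4 (i.e. \hat c_4).
labels = np.where(y <= 3, y, 4)

# Start k-means from the four class means; n_init=1 because init is explicit.
init = class_centroids(X, labels)
km = KMeans(n_clusters=4, init=init, n_init=1).fit(X)

# Cluster index k was seeded with the centroid of label k + 1.
clusters = km.predict(X) + 1
```

Because each cluster is seeded with one class mean, cluster index `k` can be read as label `k + 1`, although k-means is free to move the centroids away from the class means during fitting.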

Bogas