
Suppose I have some data set with two classes. I could draw a decision boundary around each data point belonging to one of these classes, and hence, separate the data, like so:

[Figure: two-dimensional data set with two classes]

The red lines are the decision boundaries drawn around the data points belonging to the star class.

Obviously this model overfits really badly, but nevertheless, have I not shown that this data set is separable?

I ask because in an exercise book, a question asks "Is the above data set separable? If it is, is it linearly separable or non-linearly separable?"

I would say "Yes it is separable, but non-linearly separable."

No answers are provided, so I'm not sure, but I think my logic seems reasonable.

The only exception I see is when two data points belong to different classes but have identical features. For instance, if one of the stars in the figure above perfectly overlapped one of the circles. I suppose this is quite rare in practice. Hence, I ask: is nearly all data separable?
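To illustrate the point, here is a small sketch (the points and labels are randomly generated stand-ins for the star/circle data in the figure): a 1-nearest-neighbour classifier memorizes the training set, so it "separates" any data set perfectly as long as no two points with different labels coincide.

```python
import random

# Assumed setup: random 2-D points with coin-flip class labels,
# mimicking the star/circle data in the figure.
random.seed(0)
X = [(random.random(), random.random()) for _ in range(40)]
y = [random.choice(["star", "circle"]) for _ in range(40)]

def predict(point):
    # Classify by the label of the closest training point.
    # For a training point, the closest point is itself (distance 0).
    nearest = min(X, key=lambda p: (p[0] - point[0])**2 + (p[1] - point[1])**2)
    return y[X.index(nearest)]

train_acc = sum(predict(p) == label for p, label in zip(X, y)) / len(X)
print(train_acc)  # 1.0 -- the training data is perfectly "separated"
```

The only way this fails is the exception above: two points with identical coordinates but different labels, in which case no classifier can tell them apart.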

Data

3 Answers


TL;DR

Yes, with overfitting all data becomes (non-linearly) separable (as long as the points don't precisely overlap).

Explanation

The problem with your argument is that you use circular decision boundaries in the 2D plane, which are difficult for most models to learn. However, I think your argument can be made stronger with a decision tree.

(0.2, 3.1)? --> yes -> star
            \-> no  -> (1.2, 4.5)? --> yes -> circle
                                   \-> no  -> (x1, x2)? --> yes ...
                                                        \-> no  ...

Decision trees are well-accepted models, but note that they are non-linear. With a tree like this, it is easy to argue that any data set becomes separable.

However, the issue is overfitting: a model like this behaves erratically on data points it has never seen before. So just because the training data is separable does not mean that a model fitted to it is of any use.
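A quick sketch of that failure mode (assumed setup: random points with coin-flip labels, as in the figure): a memorizing 1-nearest-neighbour model is perfect on its training set but scores near chance level on fresh points drawn from the same uniform distribution.

```python
import random

random.seed(1)

def sample(n):
    # Uniform random 2-D points with labels assigned by coin flip,
    # so the labels carry no information about the features.
    return ([(random.random(), random.random()) for _ in range(n)],
            [random.choice([0, 1]) for _ in range(n)])

X_train, y_train = sample(100)
X_test, y_test = sample(100)

def predict(point):
    # Return the label of the nearest memorized training point.
    i = min(range(len(X_train)),
            key=lambda j: (X_train[j][0] - point[0])**2
                        + (X_train[j][1] - point[1])**2)
    return y_train[i]

train_acc = sum(predict(p) == t for p, t in zip(X_train, y_train)) / 100
test_acc = sum(predict(p) == t for p, t in zip(X_test, y_test)) / 100
print(train_acc, test_acc)  # train is 1.0; test hovers around 0.5
```

Training accuracy is perfect by construction, while test accuracy is no better than guessing, because there is nothing real to generalize.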

Bruno Lubascher

I consulted my professor, who wrote the exercise-book question featured in the OP. Here is their perspective:

Groups of data points can always be separated. The exception is when two points are at the same location.

However, the thing to consider is whether or not your decision boundary can separate unseen data, generated by the same underlying distribution from which the training data came.

In the example shown in the question, the data is generated from a uniform random distribution. If we generate unseen data from the same distribution, you could draw a decision boundary anywhere, and your classifier would never perform significantly better than making random guesses when classifying this unseen data, e.g. using the outcome of a coin flip for classification.

So the classes from the example in the question are not separable.

Data

Here is my stab at the answer: separation basically means that the types of cases are separated, but cases of the same type are not.

In your case I presume that the stars in your graph are of the same type, so they shouldn't be separated from one another, but connected. In this case the data is not separable.

If, on the other hand, you had eleven types of cases and each star in your graph were of a separate type, your solution would be correct. In that case the data are separable, but not linearly separable.

I like the answer @BrunoGL gave. Nevertheless, the decision tree singles out every "star" case individually. The resulting overfitting is basically the same as treating each star as a separate type and afterwards putting them together into one class in the classification (as the "non-circle" class).

Nino Rode