Most discussions of KNN mention Euclidean, Manhattan and Hamming distances, but they don't mention the cosine similarity metric. Is there a reason for this?
4 Answers
Short answer: Cosine distance is not the overall best performing distance metric out there.
Although similarity is often expressed using a distance metric, a similarity measure is in fact more flexible, as it is not required to be symmetric or to fulfill the triangle inequality. Nevertheless, it is very common to use a proper distance metric like the Euclidean or Manhattan distance when applying nearest neighbour methods, due to their proven performance on real-world datasets. They are therefore often mentioned in discussions of KNN.
You might find this review from 2017 informative. It attempts to answer the question "which distance measures to be used for the KNN classifier among a large number of distance and similarity measures?" The authors also consider inner-product measures like the cosine distance.
In short, they conclude that (no surprise) no single distance metric is optimal for all types of datasets: the results show that each dataset favors a specific distance metric, which is consistent with the no-free-lunch theorem. It is clear that, among the metrics tested, the cosine distance isn't the overall best performing metric and even performs among the worst (lowest precision) at most noise levels. It does however outperform other tested distances in 3/28 datasets.
So can I use cosine similarity as a distance metric in a KNN algorithm? Yes, and for some datasets, like Iris, it should even yield better performance (p.30) than Euclidean.
As Lejafar mentioned, cosine distance violates the triangle inequality; however, maybe this repo will help you.
Although cosine similarity is not a proper distance metric as it fails the triangle inequality, it can be useful in KNN.
However, be aware that cosine similarity is greatest when the vectors point in the same direction: cos(0º) = 1, cos(90º) = 0. The closest neighbours are therefore the ones with the greatest cosine similarity, not the smallest; alternatively, you may want to use a quantity that grows with the angle, such as sine.
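To make the "greatest similarity means closest" point concrete, here is a minimal sketch of a KNN vote that ranks neighbours by descending cosine similarity. The function names (`cosine_similarity`, `knn_predict`) and the toy data are made up for illustration and do not come from any library mentioned in this thread.

```python
import math
from collections import Counter

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points most similar to query."""
    # Sort by similarity in DESCENDING order: the largest cosine is the nearest.
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: cosine_similarity(pair[0], query),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical toy data: class "a" points roughly along x, class "b" along y.
train_X = [(1.0, 0.1), (2.0, 0.3), (0.9, 0.0), (0.1, 1.0), (0.2, 2.0), (0.0, 0.8)]
train_y = ["a", "a", "a", "b", "b", "b"]

# The query is far from every point in Euclidean terms but points along x.
prediction = knn_predict(train_X, train_y, (5.0, 1.0), k=3)
print(prediction)
```

Sorting ascending by `1 - cosine_similarity(...)` would give the same ranking; the only thing that matters is that a larger cosine counts as nearer.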
If there is a reason, it probably has to do with the fact that cosine distance is not a proper distance metric. Nevertheless, it's still a useful little thing.
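For anyone wondering what "not a proper distance metric" means in practice, here is a small check, assuming cosine distance is defined as 1 minus cosine similarity (a common convention, though not the only one). The three vectors are chosen purely for illustration.

```python
import math

def cosine_distance(u, v):
    """Cosine distance, taken here as 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# b sits at 45 degrees, halfway between a and c.
a, b, c = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)

d_ac = cosine_distance(a, c)  # angle of 90 degrees
d_ab = cosine_distance(a, b)  # angle of 45 degrees
d_bc = cosine_distance(b, c)  # angle of 45 degrees

# The triangle inequality would require d_ac <= d_ab + d_bc.
violates = d_ac > d_ab + d_bc
print(d_ac, d_ab + d_bc, violates)
```

Going directly from `a` to `c` costs more than the detour through `b`, which is exactly what the triangle inequality forbids.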