
I have a group of $n$ sets for which I need to calculate a sort of "uniqueness" or "similarity" value. I've settled on the Jaccard index as a suitable metric. Unfortunately, the Jaccard index only operates on two sets at a time, so calculating the similarity across all $n$ sets requires on the order of $n^2$ Jaccard calculations.

(If it helps, $n$ is usually between 10 and 10,000, and each set contains on average 500 elements. Also, in the end, I don't care how similar any two specific sets are; I only care about the internal similarity of the whole group of sets, i.e. the mean (or at least a sufficiently accurate approximation of the mean) of all pairwise Jaccard indices in the group.)
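For concreteness, here's a brute-force Python sketch of exactly the quantity I mean (the names are mine):

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)

def mean_pairwise_jaccard(sets: list[set]) -> float:
    """Mean Jaccard index over all n*(n-1)/2 unordered pairs --
    the O(n^2) computation I'd like to avoid or approximate."""
    pairs = list(combinations(sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```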

Two questions:

  1. Is there a way to still use the Jaccard index without the $n^2$ complexity?
  2. Is there a better way to calculate set similarity/uniqueness across a group of sets than the way I've suggested above?
rinogo

2 Answers


An option would be to use the signature scheme of [1] with size-based filtering: a scheme that uses set-size information to reduce the number of set pairs that need to be considered.
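To illustrate just the size-filtering part (a sketch under my reading of [1], not their full signature scheme): since $J(A,B) \le \min(|A|,|B|) / \max(|A|,|B|)$, sorting the sets by size lets you skip every pair whose size ratio is already below the similarity threshold $t$.

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def similar_pairs_size_filtered(sets: list[set], t: float):
    """Yield all pairs with Jaccard >= t. Because
    J(A, B) <= min(|A|,|B|) / max(|A|,|B|), a pair whose size
    ratio is below t can never reach the threshold and is skipped."""
    by_size = sorted(sets, key=len)
    for i, a in enumerate(by_size):
        for b in by_size[i + 1:]:
            if len(a) < t * len(b):
                break  # remaining sets are even larger; no pair can qualify
            if jaccard(a, b) >= t:
                yield a, b
```

Note that this prunes a threshold join (finding all pairs with similarity at least $t$), which is the problem [1] addresses; it doesn't by itself reduce the cost of an exact mean over all pairs.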

They also experiment with a weighted form, where the weights are IDF-based.
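One plausible reading of that weighted form (my sketch; the exact weighting in [1] may differ) is a weighted Jaccard in which each element contributes its IDF weight to the intersection and union sums, so rare elements count more:

```python
import math
from collections import Counter

def idf_weights(sets: list[set]) -> dict:
    """IDF weight per element: w(x) = log(n / df(x)), where df(x)
    is the number of sets containing x."""
    df = Counter(x for s in sets for x in s)
    n = len(sets)
    return {x: math.log(n / c) for x, c in df.items()}

def weighted_jaccard(a: set, b: set, w: dict) -> float:
    """Weighted Jaccard: summed weights of the intersection
    divided by summed weights of the union."""
    inter = sum(w[x] for x in a & b)
    union = sum(w[x] for x in a | b)
    return inter / union if union else 0.0
```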

[1] Arasu, Arvind, Venkatesh Ganti, and Raghav Kaushik. "Efficient Exact Set-Similarity Joins." In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB '06), 918–929. VLDB Endowment, 2006.

A T

Another option would be to employ locality-sensitive hashing. I've seen it used for community similarity detection by Wu and Zou ("An incremental community detection method for social tagging systems using locality-sensitive hashing", Neural Networks 58:14–28), which is essentially detecting similarity between sets of integers or strings.
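For Jaccard similarity specifically, MinHash is the standard LSH family: the probability that two sets share the same minimum value under a random hash function equals their Jaccard index. A minimal sketch (the hashing details here are illustrative, not taken from the paper):

```python
import random

def minhash_signature(s: set, k: int = 128, seed: int = 0) -> list:
    """k-component MinHash signature: for each of k salted hash
    functions, keep the minimum hash value over the set's elements.
    Two signatures agree in a given component with probability
    equal to the Jaccard index of the underlying sets."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]
    return [min(hash((salt, x)) for x in s) for salt in salts]

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching components estimates the Jaccard index."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Each set is reduced once to a $k$-integer signature, so a pairwise comparison costs $O(k)$ instead of a full set intersection, and the signatures can additionally be banded to generate candidate pairs without examining all $n^2$ combinations.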

dinos66