So I'm working on a problem at work related to matching the authors of millions of documents. I currently have minhash sets for each document's syntax (sets of 10 numbers with 8-10 digits each). I now need to figure out the most efficient way to partition the documents by their minhash sets so that I only have to compare documents whose minhash sets are vaguely similar. I've arrived at the following problem:
Given a collection of millions of sets, each containing 10 numbers of 8-10 digits, how would you go about partitioning the collection so that every set in a partition has at least 4 elements in common with at least 1 other set in that partition? Additionally, no set in a partition may share 4 or more elements with a set in any other partition.
Basic example:
Given the sets of...
A={1, 2, 3, 4, 5, 6, 7, 8}
B={21, 22, 23, 24, 25, 26, 27, 28}
C={41, 42, 43, 44, 45, 46, 47, 48}
D={1, 2, 3, 4, 41, 42, 43, 44}
...we should have the sets separated into 2 partitions:
Partition_1 = {A, C, D}
Partition_2 = {B}
This, however, needs to scale to tens of millions of sets instead of just 4.
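To make the grouping rule concrete, here's a naive sketch of what I mean (names and structure are just illustrative, not my actual code): treat "shares ≥ 4 elements" as an edge between two sets and take connected components via union-find. This brute-force O(n²) pairwise comparison reproduces the toy example above, but it's exactly the part that won't scale to millions of sets:

```python
from itertools import combinations

sets = {
    "A": {1, 2, 3, 4, 5, 6, 7, 8},
    "B": {21, 22, 23, 24, 25, 26, 27, 28},
    "C": {41, 42, 43, 44, 45, 46, 47, 48},
    "D": {1, 2, 3, 4, 41, 42, 43, 44},
}

# Union-find over the set labels
parent = {label: label for label in sets}

def find(x):
    # Find the root of x, with path halving
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb

# O(n^2) pairwise check: merge any two sets sharing >= 4 elements
for a, b in combinations(sets, 2):
    if len(sets[a] & sets[b]) >= 4:
        union(a, b)

# Group labels by their root to get the partitions
partitions = {}
for label in sets:
    partitions.setdefault(find(label), set()).add(label)

print(sorted(map(sorted, partitions.values())))
# [['A', 'C', 'D'], ['B']]
```

Note that D shares exactly 4 elements with A and 4 with C, so A, C, and D end up in one component even though A and C share nothing directly. What I need is a way to get the same components without comparing all pairs.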
I'm happy to provide more details and answer any questions. I hope my question was clearly worded. Thanks for your responses!