So I'm working on a problem at work related to matching the authors of millions of documents. I currently have minhash sets for each document's syntax (sets of 10 numbers with 8-10 digits each). I now need to figure out the most efficient way to partition the documents by their minhash sets so that I only have to compare documents whose minhash sets are vaguely similar. I've arrived at the following problem:
Given a collection of millions of sets, each containing 10 numbers of 8-10 digits, how would you go about partitioning the collection so that every set in a partition has at least 4 elements in common with at least 1 other set in that partition? Additionally, no set in a partition may share 4 or more elements with a set in any other partition.
Basic example:
Given the sets of...
A={1, 2, 3, 4, 5, 6, 7, 8}
B={21, 22, 23, 24, 25, 26, 27, 28}
C={41, 42, 43, 44, 45, 46, 47, 48}
D={1, 2, 3, 4, 41, 42, 43, 44}
...we should have the sets separated into 2 partitions:
Partition_1 = {A, C, D}
Partition_2 = {B}
This, however, needs to scale to tens of millions of sets instead of just 4.
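To make the grouping rule concrete, here's a naive sketch of what I mean (names and structure are just illustrative, not my actual code): treat "shares ≥ 4 elements" as an edge between two sets and take connected components via union-find. This brute-force O(n²) pairwise comparison reproduces the toy example above, but it's exactly the part that won't scale to millions of sets:

```python
from itertools import combinations

sets = {
    "A": {1, 2, 3, 4, 5, 6, 7, 8},
    "B": {21, 22, 23, 24, 25, 26, 27, 28},
    "C": {41, 42, 43, 44, 45, 46, 47, 48},
    "D": {1, 2, 3, 4, 41, 42, 43, 44},
}

# Union-find over the set labels
parent = {label: label for label in sets}

def find(x):
    # Find the root of x, with path halving
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb

# O(n^2) pairwise check: merge any two sets sharing >= 4 elements
for a, b in combinations(sets, 2):
    if len(sets[a] & sets[b]) >= 4:
        union(a, b)

# Group labels by their root to get the partitions
partitions = {}
for label in sets:
    partitions.setdefault(find(label), set()).add(label)

print(sorted(map(sorted, partitions.values())))
# [['A', 'C', 'D'], ['B']]
```

Note that D shares exactly 4 elements with A and 4 with C, so A, C, and D end up in one component even though A and C share nothing directly. What I need is a way to get the same components without comparing all pairs.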
I'm happy to provide more details and answer any questions. I hope my question was clearly worded. Thanks for your responses!