
I'm facing an issue where I have a massive amount of data that I need to cluster. As we know, clustering algorithms can have very high time complexity (often quadratic or worse in the number of points), and I'm looking for ways to reduce my algorithm's running time.

I want to try a few different approaches, such as pre-clustering (canopy clustering), subspace clustering, correlation clustering, etc.

However, there is one approach I haven't seen discussed, and I wonder why: is it viable to simply take a representative sample from my dataset, run the clustering on that, and generalize the resulting model to the whole dataset? Why is this, or why isn't this, a viable approach? Thank you!

lte__

2 Answers


I would get a sufficiently large random/representative sample and cluster that.
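As a minimal sketch of that idea, assuming k-means as the model (via scikit-learn); the dataset X, the sample size, and k below are placeholders, not values from the question:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1_000_000, 10))  # stand-in for the real dataset

    # Fit the clustering model on a random sample only.
    sample_idx = rng.choice(len(X), size=10_000, replace=False)
    model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X[sample_idx])

    # Generalize: assign every point in the full dataset
    # to its nearest learned centroid.
    full_labels = model.predict(X)

Note that the generalization step presupposes a model you can apply to unseen points (k-means centroids, mixture components, and so on); purely combinatorial methods such as agglomerative clustering have no natural predict step, so the choice of algorithm matters here.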

To determine what counts as a sufficient sample, draw two such samples and cluster each to get cluster solutions c1 and c2. If the matching clusters of c1 and c2 have (approximately) the same model parameters, then you probably have representative samples.

You can match the clusters across the two solutions by looking at how c1 and c2 assign the same drawn data points to clusters.
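A rough sketch of that check, again assuming k-means (scikit-learn plus SciPy; all sizes are placeholders): cluster two independent samples, match the clusters with the Hungarian algorithm on centroid distances, then compare the matched parameters and the label agreement on a common probe set:

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100_000, 10))  # stand-in for the real dataset
    k, sample_size = 5, 5_000

    # Cluster two independent random samples.
    idx1 = rng.choice(len(X), size=sample_size, replace=False)
    idx2 = rng.choice(len(X), size=sample_size, replace=False)
    c1 = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[idx1])
    c2 = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X[idx2])

    # Match clusters across the two solutions: Hungarian assignment
    # on the pairwise distances between centroids.
    cost = cdist(c1.cluster_centers_, c2.cluster_centers_)
    rows, cols = linear_sum_assignment(cost)
    print("centroid drift per matched cluster:", cost[rows, cols])

    # Compare how the two models label the same drawn points; the
    # adjusted Rand index is invariant to cluster relabeling, so no
    # explicit matching is needed for this part.
    probe = X[rng.choice(len(X), size=2_000, replace=False)]
    print("label agreement:",
          adjusted_rand_score(c1.predict(probe), c2.predict(probe)))

If the matched centroids are close and the agreement score is near 1, the sample size is probably sufficient; otherwise, increase it and repeat.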

kangaroo_cliff

It's definitely viable, but there is a catch-22.

In order to get a truly representative sample from your dataset, you would have to sample from every cluster. But if you could already sample from every cluster, you would already know the clusters, and hence you wouldn't need unsupervised learning in the first place.

Noah Weber