
I'd like to perform clustering analysis on a dataset with 1,300 columns and 500,000 rows.

I've seen that clustering algorithms are available in scikit-learn, but I'm worried that they will be inefficient on a dataset of this size.

Is scikit-learn slow at this scale, and if it is, what's the fastest clustering package available in Python?

Connor

3 Answers


Depending on your platform, processor, memory, etc., you may want to check out the Intel Extension for Scikit-learn: https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html. Some of the clustering algorithms are highly optimized.
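
A minimal sketch of how the extension is typically enabled (this assumes the scikit-learn-intelex package is installed; the data shape is a random stand-in, not your dataset):

    import numpy as np
    from sklearnex import patch_sklearn

    patch_sklearn()  # must run before importing the estimators you want accelerated

    from sklearn.cluster import KMeans

    X = np.random.rand(50_000, 1_300).astype(np.float32)  # stand-in data
    model = KMeans(n_clusters=10, n_init=3, random_state=0).fit(X)
    print(model.inertia_)

The patched estimators keep the regular scikit-learn API, so existing code only needs the two extra lines at the top.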

brewmaster321

It should be fairly easy to assess the computational requirements: just try it out without worrying about the accuracy of the model.
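
For example, a rough timing sketch on a subsample (the sizes and algorithm here are placeholders; MiniBatchKMeans is just one scalable choice):

    import time
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    # Random stand-in; replace with a subsample of the real data.
    X = np.random.rand(50_000, 1_300).astype(np.float32)

    start = time.perf_counter()
    MiniBatchKMeans(n_clusters=10, batch_size=4096, n_init=3,
                    random_state=0).fit(X)
    print(f"fit took {time.perf_counter() - start:.1f}s")

Scaling the subsample up by factors of two gives a quick sense of how the runtime grows.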

I don't know whether packages such as Numba, CuPy, PyTorch, or PyCUDA come with the specific clustering algorithm you are searching for, but you can implement a very fast clustering method by accelerating Python with the GPU, as long as you structure the code to be parallelizable.
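
As an illustration of the GPU route, here is a minimal k-means (Lloyd's algorithm) sketch in PyTorch; kmeans_gpu is a hypothetical helper written for this answer, and it assumes the whole dataset fits in GPU memory:

    import torch

    def kmeans_gpu(X, k, n_iter=50):
        # Pick k random rows as initial centroids (fancy indexing copies them).
        centroids = X[torch.randperm(X.shape[0], device=X.device)[:k]]
        for _ in range(n_iter):
            # Assignment step: nearest centroid for every point.
            labels = torch.cdist(X, centroids).argmin(dim=1)
            # Update step: each centroid becomes the mean of its points.
            for j in range(k):
                mask = labels == j
                if mask.any():
                    centroids[j] = X[mask].mean(dim=0)
        return labels, centroids

    device = "cuda" if torch.cuda.is_available() else "cpu"
    X = torch.rand(100_000, 1_300, device=device)  # stand-in data
    labels, centroids = kmeans_gpu(X, k=10)

Both the distance computation and the mean updates are embarrassingly parallel, which is exactly the structure that makes the GPU worthwhile here.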

nammerkage

I would go with HDBSCAN, a hierarchical version of the DBSCAN algorithm. It is not necessarily easy to install, though, so you might want to go with the sklearn DBSCAN implementation instead.
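
For reference, a minimal sketch of the sklearn DBSCAN route (the eps and min_samples values below are placeholders you would need to tune for your data):

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.rand(10_000, 50).astype(np.float32)  # stand-in data
    labels = DBSCAN(eps=0.5, min_samples=5, n_jobs=-1).fit_predict(X)
    print(np.unique(labels))  # label -1 marks noise points

Be aware that DBSCAN's neighborhood queries can become memory-hungry at 500,000 rows, so testing on a subsample first is sensible; newer scikit-learn releases (1.3+) also include an HDBSCAN implementation in sklearn.cluster, which avoids the separate install.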

Lucas Morin