
I'd like to perform clustering analysis on a dataset with 1,300 columns and 500,000 rows.

I've seen that clustering algorithms are available in scikit-learn, but I'm worried that they will be inefficient on a dataset of this size.

Is scikit-learn slow at this scale, and if it is, what's the fastest clustering package available in Python?

Connor

3 Answers


Depending on your platform, processor, memory, etc., you may want to check out the Intel Extension for Scikit-learn: https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html. Some of the clustering algorithms are highly optimized.
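
A minimal sketch of how the extension is typically enabled (this assumes the scikit-learn-intelex package is installed; the data shape is a random stand-in, not your dataset):

    import numpy as np
    from sklearnex import patch_sklearn

    patch_sklearn()  # must run before importing the estimators you want accelerated

    from sklearn.cluster import KMeans

    X = np.random.rand(50_000, 1_300).astype(np.float32)  # stand-in data
    model = KMeans(n_clusters=10, n_init=3, random_state=0).fit(X)
    print(model.inertia_)

The patched estimators keep the regular scikit-learn API, so existing code only needs the two extra lines at the top.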

brewmaster321

It should be fairly easy to assess the computational requirements: just try it out without worrying about the accuracy of the model.
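
For example, a rough timing sketch on a subsample (the sizes and algorithm here are placeholders; MiniBatchKMeans is just one scalable choice):

    import time
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    # Random stand-in; replace with a subsample of the real data.
    X = np.random.rand(50_000, 1_300).astype(np.float32)

    start = time.perf_counter()
    MiniBatchKMeans(n_clusters=10, batch_size=4096, n_init=3,
                    random_state=0).fit(X)
    print(f"fit took {time.perf_counter() - start:.1f}s")

Scaling the subsample up by factors of two gives a quick sense of how the runtime grows.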

I don't know whether packages such as Numba, CuPy, PyTorch, or PyCUDA come with the specific clustering algorithm you are searching for, but you can implement a very fast clustering method by accelerating Python with the GPU, as long as you structure the code to be parallelizable.
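
As an illustration of the GPU route, here is a minimal k-means (Lloyd's algorithm) sketch in PyTorch; kmeans_gpu is a hypothetical helper written for this answer, and it assumes the whole dataset fits in GPU memory:

    import torch

    def kmeans_gpu(X, k, n_iter=50):
        # Pick k random rows as initial centroids (fancy indexing copies them).
        centroids = X[torch.randperm(X.shape[0], device=X.device)[:k]]
        for _ in range(n_iter):
            # Assignment step: nearest centroid for every point.
            labels = torch.cdist(X, centroids).argmin(dim=1)
            # Update step: each centroid becomes the mean of its points.
            for j in range(k):
                mask = labels == j
                if mask.any():
                    centroids[j] = X[mask].mean(dim=0)
        return labels, centroids

    device = "cuda" if torch.cuda.is_available() else "cpu"
    X = torch.rand(100_000, 1_300, device=device)  # stand-in data
    labels, centroids = kmeans_gpu(X, k=10)

Both the distance computation and the mean updates are embarrassingly parallel, which is exactly the structure that makes the GPU worthwhile here.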

nammerkage

I would go with HDBSCAN, a hierarchical version of the DBSCAN algorithm. It is not necessarily easy to install, though, so you might want to go with the sklearn DBSCAN implementation instead.
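
For reference, a minimal sketch of the sklearn DBSCAN route (the eps and min_samples values below are placeholders you would need to tune for your data):

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.rand(10_000, 50).astype(np.float32)  # stand-in data
    labels = DBSCAN(eps=0.5, min_samples=5, n_jobs=-1).fit_predict(X)
    print(np.unique(labels))  # label -1 marks noise points

Be aware that DBSCAN's neighborhood queries can become memory-hungry at 500,000 rows, so testing on a subsample first is sensible; newer scikit-learn releases (1.3+) also include an HDBSCAN implementation in sklearn.cluster, which avoids the separate install.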

Lucas Morin