I'm looking for scalable tools to build a kNN graph over sparse data points.
Both the dimensionality and the number of data points can be in the millions.
What I have tried already:
- sklearn.neighbors.kneighbors_graph: does a brute-force search for sparse data, so the running time is quadratic in the number of points.
- flann: only supports dense arrays.
- pysparnn: the running time is not very satisfactory (maybe because it's written in Python).
- knn search in mlpack: only supports dense data.
- scipy.spatial.KDTree: converts the sparse data to a dense array.
- SparseLSH: implemented in Python, so I'm not sure how well it scales.
- elasticsearch: it seems to only support indexing documents, not sparse feature vectors.
  - The reason I thought of elasticsearch: kNN over sparse data can be framed as retrieving the top-k "documents" in IR (information retrieval), with each data point acting as both a query and a document.
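
To make the IR framing concrete, here is a minimal sketch of what I have in mind. It is not taken from any of the tools above; it assumes cosine similarity as the metric, and the function name `knn_graph_sparse` and the `chunk_size` parameter are made up for illustration. The transposed matrix plays the role of an inverted index, so only points that share at least one nonzero feature are ever scored. It is still exact search, so I would not expect it to scale to millions of points unless the data is very sparse.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import normalize


def knn_graph_sparse(X, k, chunk_size=1000):
    """Cosine-similarity kNN graph over the rows of a sparse matrix.

    Hypothetical helper for illustration. Returns an (n, n) CSR matrix
    whose row i holds the similarities to at most k neighbours of point i.
    """
    # L2-normalise rows so that a dot product equals cosine similarity.
    X = normalize(sp.csr_matrix(X, dtype=np.float64), norm="l2", axis=1)
    Xt = X.T.tocsr()  # feature-by-point layout, i.e. an inverted index
    n = X.shape[0]
    rows, cols, vals = [], [], []
    for start in range(0, n, chunk_size):
        # Sparse product: only pairs sharing a nonzero feature get a score.
        sims = (X[start:start + chunk_size] @ Xt).tocsr()
        for i in range(sims.shape[0]):
            lo, hi = sims.indptr[i], sims.indptr[i + 1]
            idx, val = sims.indices[lo:hi], sims.data[lo:hi]
            keep = idx != start + i  # drop the self-match
            idx, val = idx[keep], val[keep]
            if len(val) > k:
                # Keep the k largest similarities (unordered).
                top = np.argpartition(-val, k - 1)[:k]
                idx, val = idx[top], val[top]
            rows.extend([start + i] * len(idx))
            cols.extend(idx.tolist())
            vals.extend(val.tolist())
    return sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
```

Note that a point gets fewer than k neighbours here if fewer than k other points share a feature with it. What I'm really after is a tool that does this kind of retrieval at scale, ideally approximately.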
Thanks for any comments/answers :)