2

I'm looking for scalable tools to build kNN graph over sparse data points.

The dimension and number of data points can be both up to millions.

What I have tried already:

  • sklearn.neighbors.kneighbors_graph: which does brute-force search for sparse data, giving quadratic time.
  • flann: only supports dense arrays
  • pysparnn: the running time is not very satisfatory (maybe because it's written in Python)
  • knn search in mlpack: which only supports dense data
  • scipy.spatial.KDTree: which converts the sparse data to dense one
  • SparseLSH: which is implemented in Python, so I'm not quite sure about the scalability
  • elasticsearch: it seems to only support indexing documents, instead of sparse features.
    • the reason I thought of elasticsearch is: knn over sparse data can be framed as retrieving the top-k "documents" in IR.

Thanks for any comments/answers :)

xiaohan2012
  • 163
  • 4

1 Answers1

2

I think what you might be looking for is,

L2Knng: Fast Exact K-Nearest Neighbor Graph Construction with L2-Norm Pruning

They have multiple runtime options specifically for different kinds of datasets (including sparse data). The link for the same is : http://glaros.dtc.umn.edu/gkhome/node/1162

Sujay_K
  • 111
  • 2