You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by kno10 <gi...@git.apache.org> on 2015/08/13 00:42:39 UTC

[GitHub] flink pull request: [FLINK-1745] [ml] [WIP] Add exact k-nearest-ne...

Github user kno10 commented on the pull request:

    https://github.com/apache/flink/pull/696#issuecomment-130469648
  
    R-trees are hard to parallelize.
    For distributed and gigabyte size data, an approximative approach is preferable, like the one we discuss in this article:
    
    E. Schubert, A. Zimek, H.-P. Kriegel
    Fast and Scalable Outlier Detection with Approximate Nearest Neighbor Ensembles
    In Proceedings of the 20th International Conference on Database Systems for Advanced Applications (DASFAA), Hanoi, Vietnam: 19–36, 2015. 
    
    We discuss an approach that is easy to parallelize. It needs sorting and a sliding window (or blocks), so it is not strict MapReduce, but it should be a good match for Flink. The hardest part is to get the different space filling curves right and efficient. The other components (random projections to reduce dimensionality, ensemble to improve quality, and list inversions to also build reverse kNN that then allow accelerating methods such as LOF are much easier).
    
    The main drawback of most of these kNN-join approaches (including ours) is that they only work with Minkowski norms. There are much more interesting distance functions than that...
    
    We also discuss why the space filling curves appear to give better results for kNN, while LSH etc. work better for radius joins. LSH is another option, but it cannot guarantee to find k neighbors and parameter tuning is tricky. So you may want to have a look at this recent ensemble approach instead.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---