Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/08/13 00:42:46 UTC

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

    [ https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694338#comment-14694338 ] 

ASF GitHub Bot commented on FLINK-1745:
---------------------------------------

Github user kno10 commented on the pull request:

    https://github.com/apache/flink/pull/696#issuecomment-130469648
  
    R-trees are hard to parallelize.
    For distributed and gigabyte-scale data, an approximate approach is preferable, like the one we discuss in this article:
    
    E. Schubert, A. Zimek, H.-P. Kriegel
    Fast and Scalable Outlier Detection with Approximate Nearest Neighbor Ensembles
    In Proceedings of the 20th International Conference on Database Systems for Advanced Applications (DASFAA), Hanoi, Vietnam: 19–36, 2015. 
    
    We discuss an approach that is easy to parallelize. It needs sorting and a sliding window (or blocks), so it is not strictly MapReduce, but it should be a good match for Flink. The hardest part is getting the different space-filling curves right and efficient. The other components (random projections to reduce dimensionality, an ensemble to improve quality, and list inversions to also build the reverse kNN, which then allows accelerating methods such as LOF) are much easier.
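
    To illustrate the core idea (not the paper's actual implementation), here is a minimal single-curve sketch in Python: map each point to a Z-order (Morton) code, sort by that code, and take only a sliding window of curve-neighbors as kNN candidates. The function and parameter names (`morton_code`, `approx_knn`, `window`) are made up for this sketch; the published method additionally combines several shifted/projected curves as an ensemble to improve recall.

    ```python
    import heapq

    def morton_code(point, bits=10):
        """Interleave the bits of the non-negative integer coordinates to
        get the point's position on a Z-order space-filling curve."""
        code = 0
        for bit in range(bits):
            for dim in range(len(point)):
                code |= ((point[dim] >> bit) & 1) << (bit * len(point) + dim)
        return code

    def approx_knn(points, k, window=8):
        """Approximate kNN: sort points along one Z-order curve, then for
        each point consider only the `window` predecessors and successors
        on the curve as candidates, keeping the k nearest by squared
        Euclidean distance. A single curve misses some true neighbors;
        an ensemble of curves (as in the paper) reduces that error."""
        order = sorted(range(len(points)), key=lambda i: morton_code(points[i]))
        neighbors = {}
        for pos, i in enumerate(order):
            lo, hi = max(0, pos - window), min(len(order), pos + window + 1)
            cands = [order[j] for j in range(lo, hi) if j != pos]
            dist = lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            neighbors[i] = heapq.nsmallest(k, cands, key=dist)
        return neighbors
    ```

    The sort-then-window structure is what makes this shape parallelize well: sorting and windowed (or blocked) scans are primitives that a dataflow engine like Flink supports directly, unlike the pointer-chasing traversal of an R-tree.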
    
    The main drawback of most of these kNN-join approaches (including ours) is that they only work with Minkowski norms. There are many more interesting distance functions than those...
    
    We also discuss why the space-filling curves appear to give better results for kNN, while LSH etc. work better for radius joins. LSH is another option, but it cannot guarantee finding k neighbors, and its parameter tuning is tricky. So you may want to have a look at this recent ensemble approach instead.


> Add exact k-nearest-neighbours algorithm to machine learning library
> --------------------------------------------------------------------
>
>                 Key: FLINK-1745
>                 URL: https://issues.apache.org/jira/browse/FLINK-1745
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>              Labels: ML, Starter
>
> Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite simple, it is still widely used as a means to classify data and to do regression. This issue focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as proposed in [2].
> Could be a starter task.
> Resources:
> [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
> [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]
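
The block nested-loop idea behind H-BNLJ referenced in the description can be sketched as follows. This is an illustrative Python sketch of the general scheme, not Flink code or the paper's exact algorithm; the names (`hbnlj_knn`, `num_blocks`) are invented here.

```python
import heapq
import itertools

def hbnlj_knn(points, k, num_blocks=2):
    """Exact kNN join via block nested loops: partition the data into
    blocks, compute candidate distances for every pair of blocks (in a
    MapReduce/dataflow setting each block pair goes to one task), then
    merge the partial candidate lists per point into the global top-k.
    The result is exact because every point pair is examined once."""
    blocks = [points[i::num_blocks] for i in range(num_blocks)]
    ids = [list(range(i, len(points), num_blocks)) for i in range(num_blocks)]
    partial = {i: [] for i in range(len(points))}
    # the "nested loop" over all block pairs (parallelizable per pair)
    for (ba, ia), (bb, ib) in itertools.product(zip(blocks, ids), repeat=2):
        for p, pid in zip(ba, ia):
            for q, qid in zip(bb, ib):
                if pid != qid:
                    d = sum((x - y) ** 2 for x, y in zip(p, q))
                    partial[pid].append((d, qid))
    # merge step: keep the k nearest candidates for each point
    return {i: [j for _, j in heapq.nsmallest(k, cand)]
            for i, cand in partial.items()}
```

The quadratic pairwise work is what makes this a "starter" exact baseline; the approximate curve-based approaches discussed in the comment above trade that cost for candidate pruning.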



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)