Posted to issues@flink.apache.org by "Daniel Blazevski (JIRA)" <ji...@apache.org> on 2016/04/06 05:06:25 UTC

[jira] [Comment Edited] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library

    [ https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227632#comment-15227632 ] 

Daniel Blazevski edited comment on FLINK-1934 at 4/6/16 3:05 AM:
-----------------------------------------------------------------

[~chiwanpark] [~till.rohrmann]

I have a Flink version -- still a bit preliminary -- of the approximate kNN up and running.  The exact kNN using a quadtree performs quite badly in moderate-to-high spatial dimensions (e.g. with 20,000 test and training points in 6D the quadtree is slower, but no worries -- I took care of this, and the exact kNN now decides whether or not to use the quadtree).
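
Roughly, the kind of check I mean looks like this (a minimal sketch with a made-up name and cutoff, not the actual code in the branch):

    // Sketch: fall back to brute force once the dimension gets too high,
    // since the quadtree stops paying off there.
    def useQuadtree(dim: Int): Boolean = {
      val maxQuadtreeDim = 4  // hypothetical cutoff
      dim <= maxQuadtreeDim
    }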

https://github.com/danielblazevski/flink/blob/FLINK-1934/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/zknn.scala

A preliminary test shows good scaling as the number of test and training points is increased.

8,000 points in 6D (i.e. 8,000 test points and 8,000 training points):
Elapsed time (approx): 2s
Elapsed time (exact): 27s

64,000 points in 6D:
Elapsed time (approx): 6s
(didn't run the exact version, since we know it's O(N^2))

I will have to clean things up, add edge cases, etc., which may slow down the run-time a bit, but it will definitely not increase the complexity of the algorithm with respect to the number of test/training points.

This still uses a cross product, which I was hoping to avoid, but I'm not sure that's possible.  Any thoughts?  Basically the idea is to hash the test/train sets to 1D (I use the z-value hash based on [1]).
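
To illustrate the hashing step, here is a rough standalone sketch of a z-value (Morton order) hash along the lines of [1]/[3] -- interleave the bits of the integer coordinates so that nearby points tend to get nearby 1D keys (again just an illustration, not the actual code in the branch):

    // Interleave the bits of the integer coordinates, most significant bit first.
    // BigInt is used because dim * bitsPerDim can exceed 64 bits (e.g. 6 * 16 = 96).
    def zValue(coords: Array[Int], bitsPerDim: Int = 16): BigInt = {
      var key = BigInt(0)
      for (bit <- (bitsPerDim - 1) to 0 by -1; d <- coords.indices) {
        key = (key << 1) | ((coords(d) >> bit) & 1)
      }
      key
    }

Both the test and training sets get hashed to these 1D keys; the idea is that points close in the original space tend to end up with nearby keys, so candidate neighbours can be restricted to a window on the 1D line.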

I still have not implemented the ideas in [1] in full.  The full solution is quite complex.  They do a bunch of load balancing that I'm still learning, and I'm not quite sure of the payoff.  One option could be that I clean up and optimize what I have now, since it's already performing well, and we open a new issue to do all the steps in [1].

There are still many things to clean up, but any cleaning/edge cases will not add computational complexity with respect to the number of test points.  For example, I currently convert the coordinates to integers and ignore the decimal part, which causes lots of collisions in the z-value hash; normalizing the data and using a fixed max number of bits to compute the z-value should fix this (this is described towards the end of [3]).
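
The fix I have in mind looks roughly like this (a sketch assuming per-dimension min/max are known from one pass over the data; the name is hypothetical):

    // Map a coordinate into [0, 1) and quantize it to a fixed number of bits,
    // so every point uses the same resolution when the z-value is computed.
    def quantize(x: Double, min: Double, max: Double, bits: Int = 16): Int = {
      val nCells = 1 << bits
      val normalized = if (max > min) (x - min) / (max - min) else 0.0
      // clamp so that x == max lands in the last cell instead of overflowing
      math.min((normalized * nCells).toInt, nCells - 1)
    }

Each coordinate would be quantized with its dimension's min/max and the resulting integers fed to the z-value hash, which should avoid the collisions caused by simply truncating the decimal part.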

Any thoughts?



> Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-1934
>                 URL: https://issues.apache.org/jira/browse/FLINK-1934
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Daniel Blazevski
>              Labels: ML
>
> kNN is still a widely used algorithm for classification and regression. However, due to the computational costs of an exact implementation, it does not scale well to large amounts of data. Therefore, it is worthwhile to also add an approximative kNN implementation as proposed in [1,2].  Reference [3] is cited a few times in [1], and gives necessary background on the z-value approach.
> Resources:
> [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
> [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf
> [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)