You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by danielblazevski <gi...@git.apache.org> on 2016/05/30 00:10:46 UTC

[GitHub] flink pull request: [Flink 1934] Add approximative k-nearest-neigh...

GitHub user danielblazevski opened a pull request:

    https://github.com/apache/flink/pull/2048

    [Flink 1934] Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library

    I added approximate knn algorithms.  In another PR, there are two exact methods, one basic algorithms using a prirority queue and another using a quadtree (see: https://github.com/apache/flink/pull/1220 ).  
    
    For this PR, I added z-value based knn and LSH (Locality Sensitive Hashing) based knn.  Z-values are good for low-to-moderate dimension.  For details, see the paper [2] someone put on the exact JIRA issue: 
    https://issues.apache.org/jira/browse/FLINK-1745
    https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
    
    As the z-values aren't applicable for larger dimension, so I used -- as the paper suggests -- a more standard LSH approach. 
    
    The paper describes a fairly sophisticated MapReduce (MR) design, which I did not use.  Using the same MR design pattern as the exact method, I found really good performance improvement!  In JIRA, I ran this by @tillrohrmann, and he was OK with a less optimized version for now. Here is a link for a talk I recently gave on this, which includes links for the video and slides:
    http://www.meetup.com/ny-scala/events/231163636/
    
    Because both the LSH and z-value use the same MR design pattern, I reformatted the codebase from the PR for exact version a bit to make it more modular.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/danielblazevski/flink FLINK-1934

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2048.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2048
    
----
commit c7e5056c6d273f6f0f841f77e0fdd91ca221602d
Author: Chiwan Park <ch...@apache.org>
Date:   2015-06-30T08:41:25Z

    [FLINK-1745] [ml] Add exact k-nearest-neighbor join

commit 9d0c7942c09086324fadb29bdce749683a0d1a7e
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-15T21:49:05Z

    modified kNN test to familiarize with Flink and KNN.scala

commit 611248e57166dc549f86f805b590dd4e45cb3df5
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-15T21:49:17Z

    modified kNN test to familiarize with Flink and KNN.scala

commit 1fd8231ce194b52b5a1bd55bbc5e135b3fa5775b
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-16T01:26:57Z

    nightly commit, minor changes:  got the filter to work, working on mapping the training set to include box lables

commit 15d7d2cb308b23e24c43d103b85a76b0e665cbd3
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-22T02:02:51Z

    commit before incporporating quadtree

commit 8f2da8a66516565c59df8828de2715b45397cb7f
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-22T15:49:25Z

    did a basic import of QuadTree and Test; to-do:  modify QuadTree to allow KNN.scala to make use of

commit e1cef2c5aea65c6f204caeff6348e2778231f98d
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-22T21:03:04Z

    transfered ListBuffers for objects in leaf nodes to Vectors

commit c3387ef2ef59734727b56ea652fdb29af957d20b
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-23T00:41:29Z

    basic test on 2D unit box seems to work -- need to generalize, e.g. to include automated bounding box

commit 48294ff37a5f800e5111280da5a3c03f4375028d
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-23T15:03:06Z

    had to debug quadtree -- back to testing 2D

commit 6403ba14e240ed8d67a296ac789e7e00dece800d
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-23T15:22:46Z

    Testing 2D looks good, strong improvement in run time compared to brute-force method

commit 426466a40bc2625f390fe0d912f56a346e46c8f8
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-23T19:04:52Z

    added automated detection of bounding box based on min/max values of both training and test sets

commit c35543b828384aa4ce04d56dfcb3d73db46d1e6d
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-24T00:28:56Z

    added automated radius about test point to define localized neighborhood, result runs.  TO-DO:  Lots of tests

commit 8e2d2e78f8533d4192aebe9b4baa7efbfa5928a5
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-24T00:54:06Z

    Note for future:  previous commit passed test of Chiwan Park had in intial knn implementation

commit d6fd40cb88d6e198e52c368e829bf7d32d432081
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-24T01:56:38Z

    Note for future:  previous commit passed 3D version of the test that Chiwan Park had in the intial knn implementation

commit 0ec1f4866157ca073341672e7fe9a50871ac0b7c
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-24T14:27:20Z

    changed filename of QuadTreeTest to QuadTreeSuite, about to make test more comprehensive and similar format to other Flink tests

commit ac81561cad27b65d158ae08fd0fb15bdb51d1c8b
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-24T19:51:32Z

    refactored testing of QuadTree, and added more tests

commit b17f82d5ce0214617c8dbc4a387410057d6f3832
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-24T22:49:10Z

    added KNNBenchmark to check runtimes

commit 530565835d4b5934fcac9e0e51105bb669fec9be
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-24T22:49:17Z

    added KNNBenchmark to check runtimes

commit 1f946cb30450604e92bbd0f5959ce9a60eb4c41b
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-25T01:13:00Z

    fixed bug -- in find siblings, needed to search for minimal bounding boxes

commit 22e4eb7b57795ad1ca4392ca1c1a8bdae76afa8e
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-27T03:34:16Z

    added more thorough benchmark files; about to modify  bounding box to only bound training set and modify search for the testing set

commit 3723f6b09ec7d45f6444df70a5f699cbf998a4bb
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-27T03:34:21Z

    added more thorough benchmark files; about to modify  bounding box to only bound training set and modify search for the testing set

commit c41d3e1029bf81a37cf3594f202b904e2d99e3ac
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-28T03:53:12Z

    major simplification in choosing the radius to look at nearby neighbors

commit 7c77ea20fd9e8a0c4a33c81b83187c84b380d6b2
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-28T04:03:59Z

    cleaned up; commit before deleting previous sibling search

commit cf4aa5d75611db19040466040c3d29432cb0e5f7
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-28T14:02:51Z

    added new devel branch to push temp changes on github

commit ec6ddb0a57136075b4f77616e6e48eb5bcc50a11
Author: danielblazevski <da...@gmail.com>
Date:   2015-10-01T00:17:33Z

    cleaned up; removed comments and a unused method

commit 7ed9926d8207b5f59b4ceb968d7ebd732029f5c3
Author: danielblazevski <da...@gmail.com>
Date:   2015-10-01T20:28:31Z

    fixed bug in bFiltVect, and renamed to trainingFiltered

commit 4b3bb2ec92396bf754d0d207ffc6853406ce7c39
Author: danielblazevski <da...@gmail.com>
Date:   2015-10-01T21:05:46Z

    fixed bug:  if there are fewer than maxPerBox total training points, do not do heap construction, just make siblingsQueue = root.objects

commit 1662b38822dfdfbbca272d377fa3b94f8246e9e6
Author: danielblazevski <da...@gmail.com>
Date:   2015-10-01T22:03:42Z

    added metric to constructor, and added a flag to test whether it is Euclidean or SquaredEuclidean

commit f654b841cc8d91fa861f188469831404c288b227
Author: danielblazevski <da...@gmail.com>
Date:   2015-10-01T22:48:51Z

    changed name of test that uses the Quadtree along with KNN -- modified from CHiwan Park's test to ensure flag to use Quadtree will pass

commit 7928798281dba5554eeb63df7e67400a42e7a381
Author: danielblazevski <da...@gmail.com>
Date:   2015-10-01T22:54:25Z

    fixed QuadTree test to conform to using a min-heap

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [Flink 1934] Add approximative k-nearest-neigh...

Posted by danielblazevski <gi...@git.apache.org>.
GitHub user danielblazevski reopened a pull request:

    https://github.com/apache/flink/pull/2048

    [Flink 1934] Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library

    I added approximate knn algorithms.  In another PR, there are two exact methods, one basic algorithms using a prirority queue and another using a quadtree (see: https://github.com/apache/flink/pull/1220 ).  
    
    For this PR, I added z-value based knn and LSH (Locality Sensitive Hashing) based knn.  Z-values are good for low-to-moderate dimension.  For details, see the paper [2] someone put on the exact JIRA issue: 
    https://issues.apache.org/jira/browse/FLINK-1745
    https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
    
    The z-value approach isn't applicable for larger dimensions, so I used -- as the paper suggests -- a more standard LSH approach. 
    
    The paper describes a fairly sophisticated MapReduce (MR) design, which I did not use.  Using the same MR design pattern as the exact method, I found really good performance improvement!  In JIRA, I ran this by @tillrohrmann, and he was OK with a less optimized version for now. Here is a link for a talk I recently gave on this, which includes links for the video and slides:
    http://www.meetup.com/ny-scala/events/231163636/
    
    Because both the LSH and z-value use the same MR design pattern as the exact versions, I reformatted the codebase from the PR for exact version a bit to make it more modular.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/danielblazevski/flink FLINK-1934

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2048.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2048
    
----
commit c7e5056c6d273f6f0f841f77e0fdd91ca221602d
Author: Chiwan Park <ch...@apache.org>
Date:   2015-06-30T08:41:25Z

    [FLINK-1745] [ml] Add exact k-nearest-neighbor join

commit 9d0c7942c09086324fadb29bdce749683a0d1a7e
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-15T21:49:05Z

    modified kNN test to familiarize with Flink and KNN.scala

commit 611248e57166dc549f86f805b590dd4e45cb3df5
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-15T21:49:17Z

    modified kNN test to familiarize with Flink and KNN.scala

commit 1fd8231ce194b52b5a1bd55bbc5e135b3fa5775b
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-16T01:26:57Z

    nightly commit, minor changes:  got the filter to work, working on mapping the training set to include box lables

commit 15d7d2cb308b23e24c43d103b85a76b0e665cbd3
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-22T02:02:51Z

    commit before incporporating quadtree

commit 8f2da8a66516565c59df8828de2715b45397cb7f
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-22T15:49:25Z

    did a basic import of QuadTree and Test; to-do:  modify QuadTree to allow KNN.scala to make use of

commit e1cef2c5aea65c6f204caeff6348e2778231f98d
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-22T21:03:04Z

    transfered ListBuffers for objects in leaf nodes to Vectors

commit c3387ef2ef59734727b56ea652fdb29af957d20b
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-23T00:41:29Z

    basic test on 2D unit box seems to work -- need to generalize, e.g. to include automated bounding box

commit 48294ff37a5f800e5111280da5a3c03f4375028d
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-23T15:03:06Z

    had to debug quadtree -- back to testing 2D

commit 6403ba14e240ed8d67a296ac789e7e00dece800d
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-23T15:22:46Z

    Testing 2D looks good, strong improvement in run time compared to brute-force method

commit 426466a40bc2625f390fe0d912f56a346e46c8f8
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-23T19:04:52Z

    added automated detection of bounding box based on min/max values of both training and test sets

commit c35543b828384aa4ce04d56dfcb3d73db46d1e6d
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-24T00:28:56Z

    added automated radius about test point to define localized neighborhood, result runs.  TO-DO:  Lots of tests

commit 8e2d2e78f8533d4192aebe9b4baa7efbfa5928a5
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-24T00:54:06Z

    Note for future:  previous commit passed test of Chiwan Park had in intial knn implementation

commit d6fd40cb88d6e198e52c368e829bf7d32d432081
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-24T01:56:38Z

    Note for future:  previous commit passed 3D version of the test that Chiwan Park had in the intial knn implementation

commit 0ec1f4866157ca073341672e7fe9a50871ac0b7c
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-24T14:27:20Z

    changed filename of QuadTreeTest to QuadTreeSuite, about to make test more comprehensive and similar format to other Flink tests

commit ac81561cad27b65d158ae08fd0fb15bdb51d1c8b
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-24T19:51:32Z

    refactored testing of QuadTree, and added more tests

commit b17f82d5ce0214617c8dbc4a387410057d6f3832
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-24T22:49:10Z

    added KNNBenchmark to check runtimes

commit 530565835d4b5934fcac9e0e51105bb669fec9be
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-24T22:49:17Z

    added KNNBenchmark to check runtimes

commit 1f946cb30450604e92bbd0f5959ce9a60eb4c41b
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-25T01:13:00Z

    fixed bug -- in find siblings, needed to search for minimal bounding boxes

commit 22e4eb7b57795ad1ca4392ca1c1a8bdae76afa8e
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-27T03:34:16Z

    added more thorough benchmark files; about to modify  bounding box to only bound training set and modify search for the testing set

commit 3723f6b09ec7d45f6444df70a5f699cbf998a4bb
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-27T03:34:21Z

    added more thorough benchmark files; about to modify  bounding box to only bound training set and modify search for the testing set

commit c41d3e1029bf81a37cf3594f202b904e2d99e3ac
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-28T03:53:12Z

    major simplification in choosing the radius to look at nearby neighbors

commit 7c77ea20fd9e8a0c4a33c81b83187c84b380d6b2
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-28T04:03:59Z

    cleaned up; commit before deleting previous sibling search

commit cf4aa5d75611db19040466040c3d29432cb0e5f7
Author: danielblazevski <da...@gmail.com>
Date:   2015-09-28T14:02:51Z

    added new devel branch to push temp changes on github

commit ec6ddb0a57136075b4f77616e6e48eb5bcc50a11
Author: danielblazevski <da...@gmail.com>
Date:   2015-10-01T00:17:33Z

    cleaned up; removed comments and a unused method

commit 7ed9926d8207b5f59b4ceb968d7ebd732029f5c3
Author: danielblazevski <da...@gmail.com>
Date:   2015-10-01T20:28:31Z

    fixed bug in bFiltVect, and renamed to trainingFiltered

commit 4b3bb2ec92396bf754d0d207ffc6853406ce7c39
Author: danielblazevski <da...@gmail.com>
Date:   2015-10-01T21:05:46Z

    fixed bug:  if there are fewer than maxPerBox total training points, do not do heap construction, just make siblingsQueue = root.objects

commit 1662b38822dfdfbbca272d377fa3b94f8246e9e6
Author: danielblazevski <da...@gmail.com>
Date:   2015-10-01T22:03:42Z

    added metric to constructor, and added a flag to test whether it is Euclidean or SquaredEuclidean

commit f654b841cc8d91fa861f188469831404c288b227
Author: danielblazevski <da...@gmail.com>
Date:   2015-10-01T22:48:51Z

    changed name of test that uses the Quadtree along with KNN -- modified from CHiwan Park's test to ensure flag to use Quadtree will pass

commit 7928798281dba5554eeb63df7e67400a42e7a381
Author: danielblazevski <da...@gmail.com>
Date:   2015-10-01T22:54:25Z

    fixed QuadTree test to conform to using a min-heap

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [Flink 1934] Add approximative k-nearest-neigh...

Posted by danielblazevski <gi...@git.apache.org>.
Github user danielblazevski commented on the pull request:

    https://github.com/apache/flink/pull/2048#issuecomment-222390974
  
    btw, not sure what the standards are, should I squash all this into one commit before doing a PR?  Or do you occasionally like to see many commits of a PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [Flink 1934] Add approximative k-nearest-neigh...

Posted by chiwanpark <gi...@git.apache.org>.
Github user chiwanpark commented on the pull request:

    https://github.com/apache/flink/pull/2048#issuecomment-222397141
  
    Hi @danielblazevski, thanks for opening pull request. I would like to split approximative knn and exact knn. If the approximative method needs exact knn implementation, could you wait until merging #1220? I'll merge it in few days (maybe today or tomorrow at latest).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [Flink-1934] Add approximative k-nearest-neigh...

Posted by danielblazevski <gi...@git.apache.org>.
Github user danielblazevski closed the pull request at:

    https://github.com/apache/flink/pull/2048


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [Flink 1934] Add approximative k-nearest-neigh...

Posted by danielblazevski <gi...@git.apache.org>.
Github user danielblazevski closed the pull request at:

    https://github.com/apache/flink/pull/2048


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [Flink 1934] Add approximative k-nearest-neigh...

Posted by chiwanpark <gi...@git.apache.org>.
Github user chiwanpark commented on the pull request:

    https://github.com/apache/flink/pull/2048#issuecomment-222398107
  
    I think that closing this and re-opening after #1220 is merged is better because reviewing is easier.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [Flink-1934] Add approximative k-nearest-neigh...

Posted by danielblazevski <gi...@git.apache.org>.
Github user danielblazevski commented on the pull request:

    https://github.com/apache/flink/pull/2048#issuecomment-222489226
  
    Closing this and making a new PR to avoid conflicts.  My guess is that the conflicts are there since I opened the PR initially before the exact KNN PR was merged


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [Flink 1934] Add approximative k-nearest-neigh...

Posted by danielblazevski <gi...@git.apache.org>.
Github user danielblazevski commented on the pull request:

    https://github.com/apache/flink/pull/2048#issuecomment-222398251
  
    Sounds good, about to close


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [Flink 1934] Add approximative k-nearest-neigh...

Posted by danielblazevski <gi...@git.apache.org>.
Github user danielblazevski commented on the pull request:

    https://github.com/apache/flink/pull/2048#issuecomment-222397683
  
    @chiwanpark happy to wait. Should I close and re-open a PR when that's done or just leave this open knowing that it will not begin being reviewed until the other is merged?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---