Posted to issues@spark.apache.org by "Sen Fang (JIRA)" <ji...@apache.org> on 2015/04/05 18:28:33 UTC

[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

    [ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396306#comment-14396306 ] 

Sen Fang commented on SPARK-2336:
---------------------------------

I'm tentatively going to give the hybrid spill tree implementation described in the OP's post a try. Specifically, I'm going to follow the implementation described in http://dx.doi.org/10.1109/WACV.2007.18

According to the paper, the algorithm scales well in the number of observations, bounded by the available memory across the cluster (billions of points in the paper's example). The original hybrid spill tree paper claims the algorithm scales much better than the alternatives (LSH, metric tree) in the number of features (up to hundreds). Furthermore, random projection is often used to reduce the dimensionality of the feature space; it is out of scope for my implementation and should probably be implemented as a separate dimensionality reduction technique (in the paper, 4000 features are projected down to 100).
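For context only, here is a minimal sketch of such a random projection against the MLlib RDD API; the function name and the Gaussian construction are my own illustrative assumptions, not a proposed API:

import scala.util.Random
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Illustrative sketch (assumed names): project d-dimensional points down to
// k dimensions with a random Gaussian matrix, as in the paper's 4000 -> 100 step.
def randomProject(data: RDD[Vector], k: Int, seed: Long = 42L): RDD[Vector] = {
  val d = data.first().size
  val rng = new Random(seed)
  // k x d Gaussian matrix; the 1/sqrt(k) scaling approximately preserves distances.
  val proj = Array.fill(k, d)(rng.nextGaussian() / math.sqrt(k))
  val projBc = data.sparkContext.broadcast(proj)
  data.map { v =>
    val arr = v.toArray
    Vectors.dense(projBc.value.map(row => row.indices.map(i => row(i) * arr(i)).sum))
  }
}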

The average runtime for both training and prediction is generally O(log n). Training runtime will suffer if we turn up the buffer size: the search is only approximate at overlapping nodes, and the larger the buffer size, the more accurate the search. Prediction runtime will suffer when a backtracking search is needed at a non-overlapping node (which becomes more likely as the balance threshold increases).

More specifically, a metric tree promises accurate but less performant search, while a spill tree creates trees with overlapping nodes and uses the more efficient defeatist search (no backtracking), whose accuracy is governed by the buffer size *tau* (the larger it is, the more accurate the search, but the deeper the tree). A hybrid spill tree is constructed with both overlapping (spill tree) and non-overlapping (metric tree) nodes and uses the balance threshold *rho* to balance tree depth against search efficiency.
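To make the tau/rho interplay concrete, here is a hedged sketch of the node-splitting rule; the type names and the signedDistance helper are my own illustrative choices, not from the paper or MLlib:

import org.apache.spark.mllib.linalg.Vector

sealed trait TreeNode
case class LeafNode(points: IndexedSeq[Vector]) extends TreeNode
// Overlapping node: children share the points within tau of the decision
// boundary, which is what makes defeatist (no-backtracking) search safe.
case class SpillNode(left: TreeNode, right: TreeNode) extends TreeNode
// Non-overlapping node: exact split; queries near the boundary may backtrack.
case class MetricNode(left: TreeNode, right: TreeNode) extends TreeNode

// Assumed helper: signed distance from a point to a splitting hyperplane,
// here simply the perpendicular bisector of two pivot points.
def signedDistance(points: IndexedSeq[Vector]): Vector => Double = {
  val a = points.head.toArray
  val b = points.last.toArray
  v => {
    val arr = v.toArray
    a.indices.map(i => (b(i) - a(i)) * (arr(i) - (a(i) + b(i)) / 2)).sum
  }
}

// Hypothetical splitting rule: try an overlapping split first; if either child
// would keep more than rho of the points (so the tree stops shrinking), fall
// back to an exact metric split at this node instead.
def split(points: IndexedSeq[Vector], tau: Double, rho: Double, leafSize: Int): TreeNode = {
  if (points.size <= leafSize) return LeafNode(points)
  val margin = signedDistance(points)
  val left = points.filter(p => margin(p) <= tau)   // left child plus buffer
  val right = points.filter(p => margin(p) >= -tau) // right child plus buffer
  if (left.size > rho * points.size || right.size > rho * points.size) {
    val (l, r) = points.partition(p => margin(p) <= 0.0)
    MetricNode(split(l, tau, rho, leafSize), split(r, tau, rho, leafSize))
  } else {
    SpillNode(split(left, tau, rho, leafSize), split(right, tau, rho, leafSize))
  }
}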


A high level overview of the algorithm is as follows (see the sketch after this list):
1. Sample M data points (M = number of partitions)
2. Build the top-level metric tree from the sample
3. Repartition the RDD by assigning each point to a leaf node of the above tree
4. Build a hybrid spill tree at each partition
This concludes the training phase of kNN.
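In Spark terms, the four steps might look like the following hedged sketch (reusing the TreeNode type from the earlier sketch; buildMetricTree, assignToLeaf and buildHybridSpillTree are assumed helpers stubbed out here, not existing MLlib APIs):

import org.apache.spark.HashPartitioner
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Assumed helpers, signatures only; TreeNode comes from the sketch above.
def buildMetricTree(points: Seq[Vector]): TreeNode = ???
def assignToLeaf(tree: TreeNode, v: Vector): Int = ???
def buildHybridSpillTree(points: IndexedSeq[Vector], tau: Double, rho: Double): TreeNode = ???

// Hypothetical training pipeline following steps 1-4 above.
def train(data: RDD[Vector], numPartitions: Int, tau: Double, rho: Double): (TreeNode, RDD[TreeNode]) = {
  // 1. Sample M data points (M = number of partitions).
  val samples = data.takeSample(withReplacement = false, num = numPartitions)
  // 2. Build the top-level metric tree on the driver.
  val topTree = buildMetricTree(samples)
  val topTreeBc = data.sparkContext.broadcast(topTree)
  // 3. Repartition the RDD so each point lands in the partition of its leaf.
  val repartitioned = data
    .map(v => (assignToLeaf(topTreeBc.value, v), v))
    .partitionBy(new HashPartitioner(numPartitions))
    .values
  // 4. Build a hybrid spill tree within each partition.
  val subtrees = repartitioned.mapPartitions(
    it => Iterator(buildHybridSpillTree(it.toIndexedSeq, tau, rho)),
    preservesPartitioning = true)
  (topTree, subtrees)
}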

Prediction is achieved by identifying the subtree a query point falls into, then running the search on that subtree.
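A matching hedged sketch of prediction, continuing from the previous sketches: route the query through the broadcast top tree to find its partition, then search only that partition's subtree.

import org.apache.spark.broadcast.Broadcast

// Assumed helper: defeatist (or, at metric nodes, backtracking) k-NN search
// within a single subtree, returning (neighbor, distance) pairs.
def searchTree(tree: TreeNode, query: Vector, k: Int): Seq[(Vector, Double)] = ???

// Hypothetical single-query prediction against the structures built in train().
def query(topTreeBc: Broadcast[TreeNode], subtrees: RDD[TreeNode],
          q: Vector, k: Int): Array[(Vector, Double)] = {
  val target = assignToLeaf(topTreeBc.value, q) // same routing as training
  subtrees.mapPartitionsWithIndex { (idx, it) =>
    if (idx == target) it.flatMap(t => searchTree(t, q, k)) else Iterator.empty
  }.collect()
}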


Let me know if anyone has any thoughts, concerns or corrections on the things I said above.


> Approximate k-NN Models for MLLib
> ---------------------------------
>
>                 Key: SPARK-2336
>                 URL: https://issues.apache.org/jira/browse/SPARK-2336
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Brian Gawalt
>            Priority: Minor
>              Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to also offer approximate k-Nearest Neighbor. A promising approach would involve building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with much lower latency than the plain-vanilla kNN versions.


