Posted to issues@spark.apache.org by "hucheng zhou (JIRA)" <ji...@apache.org> on 2015/04/21 09:37:58 UTC

[jira] [Comment Edited] (SPARK-6567) Large linear model parallelism via a join and reduceByKey

    [ https://issues.apache.org/jira/browse/SPARK-6567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504473#comment-14504473 ] 

hucheng zhou edited comment on SPARK-6567 at 4/21/15 7:37 AM:
--------------------------------------------------------------

We have already implemented model parallelism for logistic regression, i.e., we treat the model/gradient/margin as RDDs and partition them accordingly. Three joins are involved: between data and weight, between margin and data, and between gradient and weight. Simply using broadcastJoin, which first sends the smaller RDD to the driver and then has the driver broadcast its entire contents to all executors, results in sub-optimal performance. Involving the driver is unnecessary, since the data can be exchanged directly among executors, and each partition actually needs only part of the data rather than the entire contents. We therefore abstracted and implemented a new join interface called partialBroadcastJoin, which removes the driver bottleneck by sending each partition of the smaller RDD only to the corresponding partition of the larger RDD.
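
For illustration only, here is a minimal Scala sketch of the partial-broadcast idea; the routing table (featureToDataPartitions) and the function name are assumptions for this sketch, not the actual partialBroadcastJoin interface:

    import org.apache.spark.rdd.RDD

    // weights: the smaller RDD, keyed by feature index -> weight value.
    // featureToDataPartitions: hypothetical routing table mapping each feature index
    //   to the ids of the data partitions whose training points reference that feature.
    // Result: weight entries re-keyed by data-partition id, so each data partition
    //   receives only the weights it actually touches, with no driver round-trip.
    def partialBroadcast(
        weights: RDD[(Long, Double)],
        featureToDataPartitions: RDD[(Long, Seq[Int])]): RDD[(Int, (Long, Double))] = {
      weights
        .join(featureToDataPartitions)                                       // (feature, (w, partIds))
        .flatMap { case (f, (w, partIds)) => partIds.map(p => (p, (f, w))) } // route to each partition
    }

The shuffle produced by the flatMap happens executor-to-executor, which is the point: the driver never has to hold or rebroadcast the full weight vector.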



was (Author: hucheng):
We have already implemented model parallelism for logistic regression, i.e., we treat the model/gradient/margin as RDDs and partition them accordingly. Several joins are involved: between data and weight, between margin and data, between gradient and data, and between gradient and weight. Simply using broadcastJoin, which first sends the smaller RDD to the driver and then has the driver broadcast its entire contents to all executors, results in sub-optimal performance. Involving the driver is unnecessary, since the data can be exchanged directly among executors, and each partition actually needs only part of the data rather than the entire contents. We therefore abstracted and implemented a new join interface called partialBroadcastJoin, which removes the driver bottleneck by sending each partition of the smaller RDD only to the corresponding partition of the larger RDD.


> Large linear model parallelism via a join and reduceByKey
> ---------------------------------------------------------
>
>                 Key: SPARK-6567
>                 URL: https://issues.apache.org/jira/browse/SPARK-6567
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>            Reporter: Reza Zadeh
>
> To train a linear model, each training point in the training set needs its dot product computed against the model, per iteration. If the model is large (too large to fit in memory on a single machine) then SPARK-4590 proposes using a parameter server.
> There is an easier way to achieve this without parameter servers. In particular, if the data is held as a BlockMatrix and the model as an RDD, then each block can be joined with the relevant part of the model, followed by a reduceByKey to compute the dot products.
> This obviates the need for a parameter server, at least for linear models. However, it's unclear how it compares performance-wise to parameter servers.
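
A minimal Scala sketch of the join-and-reduceByKey dot product described in the issue; the coordinate-form data layout and the names used here are assumptions for illustration, not a proposed API:

    import org.apache.spark.rdd.RDD

    // data:    training set in coordinate form, keyed by feature index -> (rowId, featureValue).
    // weights: model RDD, keyed by feature index -> weight value.
    // Returns one dot product (margin) per training row.
    def dotProducts(
        data: RDD[(Long, (Long, Double))],
        weights: RDD[(Long, Double)]): RDD[(Long, Double)] = {
      data
        .join(weights)                                     // (feature, ((row, x), w))
        .map { case (_, ((row, x), w)) => (row, x * w) }   // partial product per (row, feature)
        .reduceByKey(_ + _)                                // sum partials -> dot product per row
    }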



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org