Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2016/04/25 22:34:12 UTC

[jira] [Commented] (SPARK-14880) Parallel Gradient Descent with less map-reduce shuffle overhead

    [ https://issues.apache.org/jira/browse/SPARK-14880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256996#comment-15256996 ] 

Joseph K. Bradley commented on SPARK-14880:
-------------------------------------------

Thanks for this suggestion.  To get this feature merged, we would likely need (a) more theoretical evidence supporting the algorithm and (b) significant performance testing demonstrating the improvements.  For (a), as I recall, the Zinkevich work requires the loss to be smooth, which would rule out support for L1 regularization.  Also, has the higher-level iteration been analyzed to establish its effect on convergence?

This could be a good algorithm to post as a Spark package.  Would you be interested in doing that?

I'm going to close this issue for now, but discussion can continue on the closed JIRA.

> Parallel Gradient Descent with less map-reduce shuffle overhead
> ---------------------------------------------------------------
>
>                 Key: SPARK-14880
>                 URL: https://issues.apache.org/jira/browse/SPARK-14880
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Ahmed Mahran
>              Labels: performance
>
> The current implementation of (Stochastic) Gradient Descent performs one map-reduce shuffle per iteration. Moreover, when the sampling fraction gets smaller, the algorithm becomes shuffle-bound instead of CPU-bound.
> {code}
> (1 to numIterations or convergence) {
>  rdd
>   .sample(fraction)
>   .map(Gradient)
>   .reduce(Update)
> }
> {code}
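> For illustration, a rough sketch of that per-iteration pattern with the plain RDD API; data, initialWeights, miniBatchFraction, and the vector helpers gradient, add, scale, and update are placeholders, not the MLlib internals:
> {code}
> // One sampled pass and one reduce per iteration: every iteration launches a
> // new job and re-draws the mini-batch from the distributed dataset.
> var weights = initialWeights
> for (i <- 1 to numIterations) {                    // or until convergence
>   val w = weights                                  // capture the current weights in the closure
>   val (gradSum, count) = data
>     .sample(withReplacement = false, fraction = miniBatchFraction, seed = i)
>     .map(p => (gradient(w, p), 1L))
>     .reduce { case ((g1, c1), (g2, c2)) => (add(g1, g2), c1 + c2) }
>   weights = update(weights, scale(gradSum, 1.0 / count))
> }
> {code}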
> A more performant variation requires only one map-reduce pass regardless of the number of iterations: a local mini-batch SGD could be run on each partition, and the resulting models then averaged. This is based on Zinkevich, Martin, Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic gradient descent." In Advances in Neural Information Processing Systems, 2010 (http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf).
> {code}
> rdd
>  .shuffle()
>  .mapPartitions((1 to numIterations or convergence) {
>    iter.sample(fraction).map(Gradient).reduce(Update)
>  })
>  .reduce(Average)
> {code}
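> For concreteness, a minimal sketch of how the above could look with the actual RDD API, assuming the same placeholder helpers (gradient, add, scale, step) and using a random repartition in place of shuffle():
> {code}
> // Redistribute the data once, run SGD locally inside each partition,
> // then average the per-partition weight vectors in a single reduce.
> val localModels = data
>   .repartition(numPartitions)                    // the only shuffle, independent of numIterations
>   .mapPartitions { iter =>
>     val points = iter.toVector                   // note: materializes each partition in memory
>     var w = initialWeights
>     val rnd = new scala.util.Random()
>     for (_ <- 1 to numIterations) {              // or until convergence
>       val batch = points.filter(_ => rnd.nextDouble() < miniBatchFraction)
>       if (batch.nonEmpty) {
>         val g = batch.map(p => gradient(w, p)).reduce(add)
>         w = step(w, scale(g, 1.0 / batch.length))
>       }
>     }
>     Iterator.single((w, 1L))
>   }
> val (wSum, numModels) = localModels.reduce { case ((w1, c1), (w2, c2)) => (add(w1, w2), c1 + c2) }
> val averagedWeights = scale(wSum, 1.0 / numModels)
> {code}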
> A higher-level iteration could enclose the above variation, shuffling the data before the local mini-batches and feeding the average weights from the last iteration back in as the starting point. This allows more variability in the sampling of the mini-batches, with the possibility of covering the whole dataset. Here is a Spark-based implementation: https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala
> {code}
> (1 to numIterations1 or convergence) {
>  rdd
>   .shuffle()
>   .mapPartitions((1 to numIterations2 or convergence) {
>     iter.sample(fraction).map(Gradient).reduce(Update)
>   })
>   .reduce(Average)
> }
> {code}
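> Under the same assumptions, a sketch of the outer loop: the weights averaged in one epoch are broadcast back to seed the partition-local SGD of the next, and the reshuffle between epochs changes which points end up in the same partition (localSGD stands for the partition-local routine sketched above):
> {code}
> var w = initialWeights
> for (_ <- 1 to numIterations1) {                 // or until convergence
>   val wBc = sc.broadcast(w)                      // feed the last average back to the executors
>   val (wSum, numModels) = data
>     .repartition(numPartitions)                  // reshuffle so local mini-batches see different points
>     .mapPartitions(iter => Iterator.single((localSGD(iter, wBc.value), 1L)))
>     .reduce { case ((w1, c1), (w2, c2)) => (add(w1, w2), c1 + c2) }
>   w = scale(wSum, 1.0 / numModels)
>   wBc.destroy()
> }
> {code}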



