Posted to user@spark.apache.org by Alex Minnaar <am...@verticalscope.com> on 2015/03/16 13:57:12 UTC

Iterative Algorithms with Spark Streaming

I wanted to ask a basic question about the types of algorithms that are possible to apply to a DStream with Spark streaming.  With Spark it is possible to perform iterative computations on RDDs like in the gradient descent example


  val points = spark.textFile(...).map(parsePoint).cache()
  var w = Vector.random(D) // current separating plane
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map(p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    ).reduce(_ + _)
    w -= gradient
  }


which has a global state w that is updated after each iteration, with the updated value then used in the next iteration.  My question is whether this type of algorithm is possible if the points variable were a DStream instead of an RDD.  It seems like you could perform the same map as above, which would create a gradient DStream, and also use updateStateByKey to create a DStream for the w variable.  But the problem is that there doesn't seem to be a way to reuse the w DStream inside the map; I don't think DStreams can communicate with each other this way.  Am I correct that this is not possible with DStreams, or am I missing something?
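(One common workaround, sketched below and not tested here: keep w as a plain variable on the driver rather than as a DStream. The body of foreachRDD runs on the driver once per batch, so the map closure built in each batch captures the latest weights. The names Point, points, d, and ssc are illustrative assumptions, and the vector arithmetic assumes Breeze.)

```scala
import breeze.linalg.DenseVector

// Illustrative data type; in practice this would come from parsing the stream.
case class Point(x: DenseVector[Double], y: Double)

var w = DenseVector.rand(d)           // lives on the driver

points.foreachRDD { rdd =>            // this block runs on the driver, once per batch
  if (!rdd.isEmpty()) {
    val wLocal = w                    // snapshot the current weights for this batch's closure
    val gradient = rdd.map { p =>
      val scale = (1.0 / (1.0 + math.exp(-p.y * (wLocal dot p.x))) - 1.0) * p.y
      p.x * scale
    }.reduce(_ + _)
    w -= gradient                     // driver-side update, picked up by the next batch
  }
}
ssc.start()
```

This is essentially a per-batch gradient step, i.e. mini-batch SGD where each streaming batch is one mini-batch.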


Note:  The reason I ask is that many machine learning algorithms are trained by stochastic gradient descent (SGD).  SGD is similar to the gradient descent algorithm above, except that each iteration runs on a new "minibatch" of data points rather than on the same data points every iteration.  Spark Streaming seems to provide a natural way to stream in these minibatches (as RDDs), but if it cannot keep track of an updating global state variable then I don't think Spark Streaming can be used for SGD.


Thanks,


Alex

Re: Iterative Algorithms with Spark Streaming

Posted by Nick Pentreath <ni...@gmail.com>.
MLlib supports streaming linear models:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
and k-means:
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means

With an iteration parameter of 1, this amounts to mini-batch SGD where the
mini-batch is the Spark Streaming batch.
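For reference, a minimal sketch of using that streaming regression API (the numFeatures value, the ssc context, and the two input DStreams of LabeledPoint are assumptions here):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

// trainingStream, testStream: DStream[LabeledPoint]
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))
  .setStepSize(0.1)
  .setNumIterations(1)  // one gradient pass per streaming batch => mini-batch SGD

model.trainOn(trainingStream)  // weights are updated as each batch arrives
model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()
```

The model keeps its weight vector on the driver and updates it per batch, which is exactly the pattern asked about above.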

On Mon, Mar 16, 2015 at 2:57 PM, Alex Minnaar <am...@verticalscope.com>
wrote:
