Posted to user@spark.apache.org by foobar <he...@fb.com> on 2015/07/26 00:07:06 UTC

Multiple operations on same DStream in Spark Streaming

Hi, I'm working with Spark Streaming in Scala and am trying to figure out
the following problem. In my DStream[(Int, Int)], each record is a pair of
integers. For each batch, I would like to drop every record whose first
integer is below the batch average of the first integers, and then compute
the average of the second integers over the remaining records.
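
In per-batch terms, the computation I have in mind is roughly the sketch
below, where pairs stands for one batch of my stream as an RDD[(Int, Int)]
(names are illustrative; the question is how to express this on the
DStream):

import org.apache.spark.rdd.RDD

def aboveAvgMean(pairs: RDD[(Int, Int)]): Double = {
  // Average of the first integers in this batch.
  val firstAvg = pairs.map(_._1.toDouble).mean()
  // Keep records whose first integer is at or above that average,
  // then average their second integers.
  pairs.filter { case (a, _) => a >= firstAvg }
    .map(_._2.toDouble)
    .mean()
}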
What's the best practice for implementing this? Whatever I tried kept
throwing a NotSerializableException, because it's hard to share per-batch
values (such as the batch average of the first integers) between the driver
and the workers. Any suggestions? Thanks!



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-operations-on-same-DStream-in-Spark-Streaming-tp23995.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Multiple operations on same DStream in Spark Streaming

Posted by Dean Wampler <de...@gmail.com>.
Is this average supposed to be across all partitions? If so, it will
require one of the reduce operations in every batch interval. If that's too
slow for the data rate, I would investigate using
PairDStreamFunctions.updateStateByKey to compute the sum and count of the
2nd integers, per 1st integer, then do the filtering and final averaging
"downstream" if you can, i.e., where you actually need the final value. If
you need it on every batch iteration, then you'll have to do a reduce per
iteration.
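
For example, something along these lines (an untested sketch; pairs is the
DStream[(Int, Int)] from the original post, and the (sum, count) state type
and all names here are mine):

import org.apache.spark.streaming.dstream.DStream

// Running (sum, count) of the 2nd integers, keyed by the 1st integer.
def updateSumCount(newValues: Seq[Int],
                   state: Option[(Long, Long)]): Option[(Long, Long)] = {
  val (sum, count) = state.getOrElse((0L, 0L))
  Some((sum + newValues.map(_.toLong).sum, count + newValues.size))
}

def runningSumCounts(pairs: DStream[(Int, Int)]): DStream[(Int, (Long, Long))] = {
  // Stateful operations require a checkpoint directory, e.g.
  // ssc.checkpoint("/tmp/spark-checkpoints")  // path is illustrative
  pairs.updateStateByKey(updateSumCount _)
}

Downstream, the average of the 2nd integers for a key is then just
sum.toDouble / count, computed wherever you actually need it.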

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Tue, Jul 28, 2015 at 3:14 AM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> One approach would be to store the batch data in an intermediate store
> (like HBase or MySQL, or even ZooKeeper), and inside your filter function
> read the previously stored value back from that store and do whatever
> operation you need to do.
>
> Thanks
> Best Regards

Re: Multiple operations on same DStream in Spark Streaming

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
One approach would be to store the batch data in an intermediate store
(like HBase or MySQL, or even ZooKeeper), and inside your filter function
read the previously stored value back from that store and do whatever
operation you need to do.
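
Something along these lines (an untested sketch; KVStore is a stand-in for
whichever HBase/MySQL/ZooKeeper client you use, so its API here is
hypothetical):

import org.apache.spark.streaming.dstream.DStream

// Hypothetical key-value client; substitute your real HBase/MySQL/
// ZooKeeper client here.
trait KVStore extends Serializable {
  def putDouble(key: String, value: Double): Unit
  def getDouble(key: String): Double
}

def run(pairs: DStream[(Int, Int)], connect: () => KVStore): Unit = {
  // Each batch writes its average of the 1st integers to the store
  // (this part runs on the driver).
  pairs.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      connect().putDouble("first-int-avg", rdd.map(_._1.toDouble).mean())
    }
  }
  // The filter reads the previously stored value back, connecting once
  // per partition on the executors, so nothing non-serializable gets
  // captured in the closure.
  val filtered = pairs.transform { rdd =>
    rdd.mapPartitions { iter =>
      val avg = connect().getDouble("first-int-avg")
      iter.filter { case (a, _) => a >= avg }
    }
  }
  filtered.print()
}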

Thanks
Best Regards
