Posted to user@spark.apache.org by Henggang Cui <cu...@gmail.com> on 2014/06/10 03:00:00 UTC

Merging all Spark Streaming RDDs to one RDD

Hi,

I'm wondering whether it's possible to continuously merge the RDDs coming
from a stream into a single RDD efficiently.

One thought is to use the union() method. But each call to union() gives me
a new RDD, and I don't know how I should keep references to all of these
RDDs, because I remember that Spark discourages users from building up an
array of RDDs.
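
For concreteness, here is roughly what I have in mind (just a sketch, not
working code; the Record case class, the "stream" DStream, and the "ssc"
StreamingContext are placeholder names I made up for illustration):

  import org.apache.spark.rdd.RDD

  case class Record(id: Long, value: Double, label: String)

  // Start from an empty RDD and fold each batch into it, so only one
  // reference ("merged") ever needs a name.
  var merged: RDD[Record] = ssc.sparkContext.emptyRDD[Record]

  stream.foreachRDD { rdd =>
    merged = merged.union(rdd).cache()
    // Each union extends the lineage by one step; marking the merged
    // RDD for checkpointing (requires sc.setCheckpointDir to be set)
    // lets Spark truncate the lineage at the next action so the DAG
    // does not grow without bound.
    merged.checkpoint()
  }

But I'm not sure whether the growing lineage and the checkpointing cost
make this efficient in the long run.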

Another possible solution is to follow the "StatefulNetworkWordCount"
example, which uses the updateStateByKey() method. But my RDDs do not hold
key-value pairs (each record is a struct with multiple fields). Is there a
workaround?
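
The only workaround I can think of is to turn each record into a key-value
pair myself, keyed by one of its fields (or by a constant key if there is
no natural one). Something like this sketch, again with made-up names
(Record and "stream" are placeholders):

  import org.apache.spark.streaming.StreamingContext._  // pair-DStream implicits

  case class Record(id: Long, value: Double, label: String)

  // Key each struct by one of its fields so updateStateByKey applies.
  val pairs = stream.map(r => (r.id, r))

  // Accumulate all records seen so far for each key as the state.
  // (updateStateByKey requires ssc.checkpoint(dir) to be set.)
  val merged = pairs.updateStateByKey[Seq[Record]] {
    (newRecords: Seq[Record], state: Option[Seq[Record]]) =>
      Some(state.getOrElse(Seq.empty) ++ newRecords)
  }

But this feels like an abuse of the API, so I'd be glad to hear about a
cleaner way.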

Thanks,
Cui

Re: Merging all Spark Streaming RDDs to one RDD

Posted by un...@gmail.com.
I have much the same issue.

While I haven't totally solved it yet, I have found the "window" method useful for batching up archive blocks. But updateStateByKey is probably what we want to use, perhaps applied multiple times, if that works.
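
Roughly what I have been doing, as a sketch (the durations and the "stream"
name are placeholders, not real code from my app):

  import org.apache.spark.streaming.Minutes

  // Every minute, treat the last ten minutes of data as one block.
  val archiveBlocks = stream.window(Minutes(10), Minutes(1))

  archiveBlocks.foreachRDD { rdd =>
    // write the block somewhere durable here
  }

That batches things up, but it is still a sliding window, not a true
running merge of everything seen so far.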

My bigger worry now is storage. Unlike in non-streaming apps, we tend to build up state that cannot be regenerated, and Hadoop files don't seem to be the best solution for keeping it.

Jeremy Lee   BCompSci (Hons)
The Unorthodox Engineers
