Posted to user@spark.apache.org by Jon Chase <jo...@gmail.com> on 2015/07/16 03:58:40 UTC

Possible to combine all RDDs from a DStream batch into one?

I'm currently doing something like this in my Spark Streaming program
(Java):

        dStream.foreachRDD((rdd, batchTime) -> {
            log.info("processing RDD from batch {}", batchTime);
            ....
            // my rdd processing code
            ....
        });

Instead of having my rdd processing code called once for each RDD in the
batch, is it possible to group all of the RDDs from the batch into a
single RDD with a single partition, so that I can operate on all of the
elements in the batch at once?

My goal here is to do an operation exactly once for every batch.  As I
understand it, foreachRDD is going to do the operation once for each RDD in
the batch, which is not what I want.

I've looked at DStream.repartition(int), but the docs make it sound like it
only changes the number of partitions in the batch's existing RDDs, not the
number of RDDs.
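
To illustrate, this is roughly the shape I'm after (just a sketch; I'm
guessing rdd.coalesce(1) is one way to force everything into a single
partition, and handle() is a hypothetical method of mine):

        dStream.foreachRDD((rdd, batchTime) -> {
            // pull the whole batch into a single partition first
            rdd.coalesce(1).foreachPartition(elements -> {
                while (elements.hasNext()) {
                    // operate on every element of the batch in one place
                    handle(elements.next());
                }
            });
        });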

Re: Possible to combine all RDDs from a DStream batch into one?

Posted by Ted Yu <yu...@gmail.com>.
Looks like this method should serve Jon's needs:

  def reduceByWindow(
      reduceFunc: (T, T) => T,
      windowDuration: Duration,
      slideDuration: Duration
    ): DStream[T]
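
In the Java API, that could look something like the following (just a
sketch; the combine function and the 30-second window/slide values are
placeholder assumptions):

  // dStream is assumed to be the JavaDStream<String> from Jon's example.
  JavaDStream<String> combined = dStream.reduceByWindow(
      (a, b) -> a + "\n" + b,    // hypothetical associative combine function
      Durations.seconds(30),     // window duration
      Durations.seconds(30));    // slide == window, so windows don't overlap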

Re: Possible to combine all RDDs from a DStream batch into one?

Posted by N B <nb...@gmail.com>.
Hi Jon,

In Spark Streaming, 1 batch = 1 RDD; the terms are essentially used
interchangeably, so foreachRDD already runs exactly once per batch
interval. If you are trying to collect multiple batches of a DStream
into a single RDD, look at the window() operations.
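
For example (a sketch against the Java API; the 60-second window and
slide values are placeholders, and both must be multiples of the batch
interval):

        // window(length, slide) groups several upstream batches into
        // each RDD of the returned stream.
        JavaDStream<String> windowed =
                dStream.window(Durations.seconds(60), Durations.seconds(60));

        windowed.foreachRDD((rdd, time) -> {
            // invoked once per window, covering all the batches in it
        });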

Hope this helps,
Nikunj

Re: Possible to combine all RDDs from a DStream batch into one?

Posted by Jon Chase <jo...@gmail.com>.
I should note that the amount of data in each batch is very small, so I'm
not concerned about the performance implications of grouping it all into a
single RDD.
