Posted to dev@spark.apache.org by Bin Wang <wb...@gmail.com> on 2015/03/24 09:08:26 UTC

Optimize the first map reduce of DStream

Hi,

I'm learning Spark, and I think there may be room for optimization in the
current streaming implementation. Correct me if I'm wrong.

The current streaming implementation puts the data of one batch into memory
(as an RDD), but that does not always seem necessary.

For example, if I want to count the lines that contain the word "Spark", I
just need to map over every line to check whether it contains the word,
then reduce with a sum. After that, there is no reason to keep the line in
memory.

That is to say, if the DStream has only a map and/or reduce operation on
it, it is not necessary to keep all of the batch data in memory; something
like a pipeline should be enough.
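
For concreteness, the kind of job I have in mind looks roughly like this in
the DStream API (just a sketch; the socket source, host/port and batch
interval are made up for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CountSparkLines {
  def main(args: Array[String]): Unit = {
    // Hypothetical setup: a local socket text source, 10-second batches.
    val conf = new SparkConf().setAppName("CountSparkLines")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Keep only lines containing "Spark", then reduce each batch to a count.
    val count = lines.filter(_.contains("Spark")).count()
    count.print()

    ssc.start()
    ssc.awaitTermination()
  }
}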

Is it difficult to implement on top of the current implementation?

Thanks.

---
Bin Wang

Re: Optimize the first map reduce of DStream

Posted by Zoltán Zvara <zo...@gmail.com>.
AFAIK Spark Streaming cannot work that way. Transformations are
defined on DStreams, and a DStream basically holds (time,
allocatedBlocksForBatch) pairs. Blocks are allocated to batches by the
JobGenerator, and unallocated block infos are collected by the
ReceivedBlockTracker. In Spark Streaming you define transformations and
actions on DStreams; the operators define RDD chains, and the tasks are
created by spark-core. You manipulate DStreams, not single units of data.
Flink, for example, uses a continuous model, which optimizes for memory
usage and latency. See the Spark Streaming paper and the Spark paper for
more background.
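
To make the batch model concrete, a rough sketch of what such a job boils
down to (assuming lines is a DStream[String]): every batch_interval the
DStream yields one RDD, and the operators run on it as an ordinary Spark
job built by spark-core.

// Sketch only: each batch interval materializes one RDD, and the DStream
// operators become a normal RDD chain scheduled by spark-core.
lines.foreachRDD { (rdd, time) =>
  val matches = rdd.filter(_.contains("Spark")).count()
  println(s"Batch $time: $matches lines contain Spark")
}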

Zvara Zoltán



mail, hangout, skype: zoltan.zvara@gmail.com

mobile, viber: +36203129543

bank: 10918001-00000021-50480008

address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a

elte: HSKSJZ (ZVZOAAI.ELTE)

2015-03-24 15:03 GMT+01:00 Bin Wang <wb...@gmail.com>:

> I'm not looking to limit the block size.
>
> Here is another example. Say we want to count the lines from the stream in
> one hour. In a normal program, we might write it like this:
>
> int sum = 0;
> String line;
> while ((line = getFromStream()) != null) {
>     store(line); // persist the line to storage instead of keeping it in memory
>     sum++;
> }
>
> This could be seen as a reduce. The only memory used here is the single
> variable "line"; there is no need to keep all the lines in memory (as long
> as they are not used anywhere else). For fault tolerance, we could simply
> write the lines to storage instead of keeping them in memory. Could Spark
> Streaming work this way? Does Flink work like this?
>
>
>
>
>
> On Tue, Mar 24, 2015 at 7:04 PM Zoltán Zvara <zo...@gmail.com>
> wrote:
>
>> There is a BlockGenerator on each worker node, next to the
>> ReceiverSupervisorImpl, which generates Blocks out of an ArrayBuffer every
>> block_interval. These Blocks are handed to the ReceiverSupervisorImpl,
>> which puts them into the BlockManager for storage. BlockInfos are passed
>> to the driver. Mini-batches are created by the JobGenerator component on
>> the driver every batch_interval. I guess what you are looking for is
>> provided by a continuous model like Flink's. We create mini-batches to
>> provide fault tolerance.
>>
>> Zvara Zoltán
>>
>>
>>
>> mail, hangout, skype: zoltan.zvara@gmail.com
>>
>> mobile, viber: +36203129543
>>
>> bank: 10918001-00000021-50480008
>>
>> address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a
>>
>> elte: HSKSJZ (ZVZOAAI.ELTE)
>>
>> 2015-03-24 11:55 GMT+01:00 Arush Kharbanda <ar...@sigmoidanalytics.com>:
>>
>>> The block size is configurable, so I think you can reduce the block
>>> interval to keep each block in memory only for a shorter interval.
>>> Is that what you are looking for?
>>>
>>> On Tue, Mar 24, 2015 at 1:38 PM, Bin Wang <wb...@gmail.com> wrote:
>>>
>>> > Hi,
>>> >
>>> > I'm learning Spark, and I think there may be room for optimization in
>>> > the current streaming implementation. Correct me if I'm wrong.
>>> >
>>> > The current streaming implementation puts the data of one batch into
>>> > memory (as an RDD), but that does not always seem necessary.
>>> >
>>> > For example, if I want to count the lines that contain the word
>>> > "Spark", I just need to map over every line to check whether it
>>> > contains the word, then reduce with a sum. After that, there is no
>>> > reason to keep the line in memory.
>>> >
>>> > That is to say, if the DStream has only a map and/or reduce operation
>>> > on it, it is not necessary to keep all of the batch data in memory;
>>> > something like a pipeline should be enough.
>>> >
>>> > Is it difficult to implement on top of the current implementation?
>>> >
>>> > Thanks.
>>> >
>>> > ---
>>> > Bin Wang
>>> >
>>>
>>>
>>>
>>> --
>>>
>>> *Arush Kharbanda* || Technical Teamlead
>>>
>>> arush@sigmoidanalytics.com || www.sigmoidanalytics.com
>>>
>>

Re: Optimize the first map reduce of DStream

Posted by Bin Wang <wb...@gmail.com>.
I'm not looking to limit the block size.

Here is another example. Say we want to count the lines from the stream in
one hour. In a normal program, we might write it like this:

int sum = 0;
String line;
while ((line = getFromStream()) != null) {
    store(line); // persist the line to storage instead of keeping it in memory
    sum++;
}

This could be seen as a reduce. The only memory used here is the single
variable "line"; there is no need to keep all the lines in memory (as long
as they are not used anywhere else). For fault tolerance, we could simply
write the lines to storage instead of keeping them in memory. Could Spark
Streaming work this way? Does Flink work like this?
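
For comparison, as far as I understand the DStream API, the hourly count
itself could be written as a windowed count, although it still goes through
the mini-batch machinery. A rough sketch, assuming ssc is a StreamingContext
and lines is a DStream[String] as in the sketch in my first mail (the
checkpoint path is just a placeholder):

import org.apache.spark.streaming.Minutes

// Windowed counting keeps running state, so a checkpoint directory is needed.
ssc.checkpoint("/tmp/streaming-checkpoint") // placeholder path
val hourlyCount = lines.countByWindow(Minutes(60), Minutes(60))
hourlyCount.print()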





On Tue, Mar 24, 2015 at 7:04 PM Zoltán Zvara <zo...@gmail.com> wrote:

> There is a BlockGenerator on each worker node, next to the
> ReceiverSupervisorImpl, which generates Blocks out of an ArrayBuffer every
> block_interval. These Blocks are handed to the ReceiverSupervisorImpl,
> which puts them into the BlockManager for storage. BlockInfos are passed
> to the driver. Mini-batches are created by the JobGenerator component on
> the driver every batch_interval. I guess what you are looking for is
> provided by a continuous model like Flink's. We create mini-batches to
> provide fault tolerance.
>
> Zvara Zoltán
>
>
>
> mail, hangout, skype: zoltan.zvara@gmail.com
>
> mobile, viber: +36203129543
>
> bank: 10918001-00000021-50480008
>
> address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a
>
> elte: HSKSJZ (ZVZOAAI.ELTE)
>
> 2015-03-24 11:55 GMT+01:00 Arush Kharbanda <ar...@sigmoidanalytics.com>:
>
>> The block size is configurable, so I think you can reduce the block
>> interval to keep each block in memory only for a shorter interval.
>> Is that what you are looking for?
>>
>> On Tue, Mar 24, 2015 at 1:38 PM, Bin Wang <wb...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I'm learning Spark, and I think there may be room for optimization in
>> > the current streaming implementation. Correct me if I'm wrong.
>> >
>> > The current streaming implementation puts the data of one batch into
>> > memory (as an RDD), but that does not always seem necessary.
>> >
>> > For example, if I want to count the lines that contain the word
>> > "Spark", I just need to map over every line to check whether it
>> > contains the word, then reduce with a sum. After that, there is no
>> > reason to keep the line in memory.
>> >
>> > That is to say, if the DStream has only a map and/or reduce operation
>> > on it, it is not necessary to keep all of the batch data in memory;
>> > something like a pipeline should be enough.
>> >
>> > Is it difficult to implement on top of the current implementation?
>> >
>> > Thanks.
>> >
>> > ---
>> > Bin Wang
>> >
>>
>>
>>
>> --
>>
>> *Arush Kharbanda* || Technical Teamlead
>>
>> arush@sigmoidanalytics.com || www.sigmoidanalytics.com
>>
>

Re: Optimize the first map reduce of DStream

Posted by Zoltán Zvara <zo...@gmail.com>.
There is a BlockGenerator on each worker node, next to the
ReceiverSupervisorImpl, which generates Blocks out of an ArrayBuffer every
block_interval. These Blocks are handed to the ReceiverSupervisorImpl,
which puts them into the BlockManager for storage. BlockInfos are passed to
the driver. Mini-batches are created by the JobGenerator component on the
driver every batch_interval. I guess what you are looking for is provided
by a continuous model like Flink's. We create mini-batches to provide
fault tolerance.
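
If the concern is keeping received data only in memory: the receiver's
storage level and the write-ahead log can already push received blocks to
disk. A rough sketch of the relevant knobs (paths and values are
placeholders, not recommendations):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder settings: log received blocks to a write-ahead log and let
// the BlockManager spill them to disk instead of keeping them memory-only.
val conf = new SparkConf()
  .setAppName("ReceiverStorageSketch")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/streaming-checkpoint") // the WAL lives under this directory

val lines = ssc.socketTextStream("localhost", 9999,
  StorageLevel.MEMORY_AND_DISK_SER)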

Zvara Zoltán



mail, hangout, skype: zoltan.zvara@gmail.com

mobile, viber: +36203129543

bank: 10918001-00000021-50480008

address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a

elte: HSKSJZ (ZVZOAAI.ELTE)

2015-03-24 11:55 GMT+01:00 Arush Kharbanda <ar...@sigmoidanalytics.com>:

> The block size is configurable, so I think you can reduce the block
> interval to keep each block in memory only for a shorter interval.
> Is that what you are looking for?
>
> On Tue, Mar 24, 2015 at 1:38 PM, Bin Wang <wb...@gmail.com> wrote:
>
> > Hi,
> >
> > I'm learning Spark, and I think there may be room for optimization in
> > the current streaming implementation. Correct me if I'm wrong.
> >
> > The current streaming implementation puts the data of one batch into
> > memory (as an RDD), but that does not always seem necessary.
> >
> > For example, if I want to count the lines that contain the word
> > "Spark", I just need to map over every line to check whether it
> > contains the word, then reduce with a sum. After that, there is no
> > reason to keep the line in memory.
> >
> > That is to say, if the DStream has only a map and/or reduce operation
> > on it, it is not necessary to keep all of the batch data in memory;
> > something like a pipeline should be enough.
> >
> > Is it difficult to implement on top of the current implementation?
> >
> > Thanks.
> >
> > ---
> > Bin Wang
> >
>
>
>
> --
>
> *Arush Kharbanda* || Technical Teamlead
>
> arush@sigmoidanalytics.com || www.sigmoidanalytics.com
>

Re: Optimize the first map reduce of DStream

Posted by Arush Kharbanda <ar...@sigmoidanalytics.com>.
The block size is configurable, so I think you can reduce the block
interval to keep each block in memory only for a shorter interval. Is that
what you are looking for?
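
For reference, a rough sketch of the setting I mean (the values are only
examples; a smaller block interval produces more, smaller blocks per batch):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Example values only. The block interval controls how often the receiver
// cuts a new block; the batch interval is set when the context is created.
val conf = new SparkConf()
  .setAppName("BlockIntervalExample")
  .set("spark.streaming.blockInterval", "100") // milliseconds

val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batches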

On Tue, Mar 24, 2015 at 1:38 PM, Bin Wang <wb...@gmail.com> wrote:

> Hi,
>
> I'm learning Spark, and I think there may be room for optimization in the
> current streaming implementation. Correct me if I'm wrong.
>
> The current streaming implementation puts the data of one batch into memory
> (as an RDD), but that does not always seem necessary.
>
> For example, if I want to count the lines that contain the word "Spark", I
> just need to map over every line to check whether it contains the word,
> then reduce with a sum. After that, there is no reason to keep the line in
> memory.
>
> That is to say, if the DStream has only a map and/or reduce operation on
> it, it is not necessary to keep all of the batch data in memory; something
> like a pipeline should be enough.
>
> Is it difficult to implement on top of the current implementation?
>
> Thanks.
>
> ---
> Bin Wang
>



-- 

*Arush Kharbanda* || Technical Teamlead

arush@sigmoidanalytics.com || www.sigmoidanalytics.com