Posted to dev@spark.apache.org by Zoltán Zvara <zo...@gmail.com> on 2015/03/11 14:58:27 UTC

Spark Streaming - received block allocation to batch

I'm trying to understand the block allocation mechanism Spark uses to
generate batch jobs and a JobSet.

The JobGenerator.generateJobs tries to allocate received blocks to a batch;
effectively, ReceivedBlockTracker.allocateBlocksToBatch creates
a streamIdToBlocks map, where stream IDs (Int) are mapped to Seq[ReceivedBlockInfo]
using getReceivedBlockQueue. This is where it gets tricky for me.
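To make this concrete, here is a reduced sketch of how I read the allocation path. The names are borrowed from the Spark source, but the types and signatures are simplified for illustration, so this is my reading, not the actual implementation:

```scala
import scala.collection.mutable

// Simplified stand-ins for the real Spark classes; fields reduced.
case class ReceivedBlockInfo(streamId: Int, numRecords: Long)
case class AllocatedBlocks(streamIdToAllocatedBlocks: Map[Int, Seq[ReceivedBlockInfo]])

class ReceivedBlockTrackerSketch(streamIds: Seq[Int]) {
  // stream ID -> queue of blocks not yet assigned to any batch
  private val streamIdToUnallocatedBlockQueues =
    mutable.HashMap[Int, mutable.Queue[ReceivedBlockInfo]]()

  private def getReceivedBlockQueue(streamId: Int): mutable.Queue[ReceivedBlockInfo] =
    streamIdToUnallocatedBlockQueues.getOrElseUpdate(
      streamId, new mutable.Queue[ReceivedBlockInfo])

  // Blocks reported by receivers land here (the part I am asking about).
  def addBlock(info: ReceivedBlockInfo): Unit =
    getReceivedBlockQueue(info.streamId) += info

  // Drain every stream's queue into one immutable snapshot for the batch.
  def allocateBlocksToBatch(): AllocatedBlocks = {
    val streamIdToBlocks = streamIds.map { id =>
      val q = getReceivedBlockQueue(id)
      val blocks = q.toList
      q.clear()
      id -> blocks
    }.toMap
    AllocatedBlocks(streamIdToBlocks)
  }
}
```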

getReceivedBlockQueue of class ReceivedBlockTracker reads
streamIdToUnallocatedBlockQueues,
which should be populated with ReceivedBlockQueues. Who inserts these
ReceivedBlockQueues into streamIdToUnallocatedBlockQueues, and where is it
written? I've only found usages where the value is read.

At one point streamIdToBlocks gets packed into a case class,
AllocatedBlocks. Why is that necessary?

Also, in JobGenerator.generateJobs, at the line where receivedBlockInfos is
created, shouldn't it be empty, given that streamIdToUnallocatedBlockQueues
is never written to? What am I missing? How is JobGenerator.generateJobs
able to retrieve the received block infos?

Thanks,

ZZ

Re: Spark Streaming - received block allocation to batch

Posted by Tathagata Das <td...@databricks.com>.
See responses inline.

On Wed, Mar 11, 2015 at 6:58 AM, Zoltán Zvara <zo...@gmail.com>
wrote:

> I'm trying to understand the block allocation mechanism Spark uses to
> generate batch jobs and a JobSet.
>
> The JobGenerator.generateJobs tries to allocate received blocks to a
> batch; effectively, ReceivedBlockTracker.allocateBlocksToBatch creates
> a streamIdToBlocks map, where stream IDs (Int) are mapped to
> Seq[ReceivedBlockInfo] using getReceivedBlockQueue. This is where it gets
> tricky for me.
>
> getReceivedBlockQueue of class ReceivedBlockTracker reads
> streamIdToUnallocatedBlockQueues,
> which should be populated with ReceivedBlockQueues. Who inserts these
> ReceivedBlockQueues into streamIdToUnallocatedBlockQueues, and where is it
> written? I've only found usages where the value is read.
>
Inserted here:
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceivedBlockTracker.scala#L84
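In other words, the write path you were looking for is addBlock: each block a receiver reports ends up enqueued into the per-stream queue. A reduced, hypothetical sketch of that one method (the real one also logs the insertion to the write ahead log and has a richer signature):

```scala
import scala.collection.mutable

case class ReceivedBlockInfo(streamId: Int, numRecords: Long)

class TrackerSketch {
  private val streamIdToUnallocatedBlockQueues =
    mutable.HashMap[Int, mutable.Queue[ReceivedBlockInfo]]()

  // Called once per block a receiver reports; creates the stream's
  // queue lazily on first insert, then appends the block info to it.
  def addBlock(info: ReceivedBlockInfo): Boolean = {
    streamIdToUnallocatedBlockQueues
      .getOrElseUpdate(info.streamId, new mutable.Queue[ReceivedBlockInfo]) += info
    true // the real method also records the insertion in the write ahead log
  }

  def unallocatedCount(streamId: Int): Int =
    streamIdToUnallocatedBlockQueues.get(streamId).map(_.size).getOrElse(0)
}
```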


> At one point streamIdToBlocks gets packed into a case class,
> AllocatedBlocks. Why is that necessary?
>

Just a container that captures all the blocks allocated to a batch. It is
used both for tracking in memory and for writing out to the write ahead
log.
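As a sketch (simplified fields, not the exact Spark definition), the case class is little more than a typed wrapper around the map, which makes it easy to keep one object per batch time and to serialize the whole allocation as a single write-ahead-log record:

```scala
case class ReceivedBlockInfo(streamId: Int, numRecords: Long)

// One immutable record per batch: everything allocated to that batch.
// Being a case class, it gets equality and serialization support for free.
case class AllocatedBlocks(streamIdToAllocatedBlocks: Map[Int, Seq[ReceivedBlockInfo]]) {
  def getBlocksOfStream(streamId: Int): Seq[ReceivedBlockInfo] =
    streamIdToAllocatedBlocks.getOrElse(streamId, Seq.empty)
}
```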


>
> Also, in JobGenerator.generateJobs, at the line where receivedBlockInfos
> is created, shouldn't it be empty, given that
> streamIdToUnallocatedBlockQueues is never written to? What am I missing?
> How is JobGenerator.generateJobs able to retrieve the received block
> infos?
>
I think the line number linked above answers that.
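To spell out the ordering with a reduced, hypothetical sketch (simplified types, same sequence of events as the real code): addBlock fills the per-stream queues while blocks arrive, allocateBlocksToBatch drains them into a per-batch-time map at the batch boundary, and generateJobs then looks up the allocation for its batch time, which is why the block infos it sees are not empty:

```scala
import scala.collection.mutable

case class ReceivedBlockInfo(streamId: Int, numRecords: Long)
case class AllocatedBlocks(streamIdToAllocatedBlocks: Map[Int, Seq[ReceivedBlockInfo]])

class TrackerFlowSketch {
  private val queues = mutable.HashMap[Int, mutable.Queue[ReceivedBlockInfo]]()
  private val timeToAllocatedBlocks = mutable.HashMap[Long, AllocatedBlocks]()

  // 1. Receivers report blocks as they arrive.
  def addBlock(info: ReceivedBlockInfo): Unit =
    queues.getOrElseUpdate(info.streamId, new mutable.Queue[ReceivedBlockInfo]) += info

  // 2. At the batch boundary the queues are drained into one allocation.
  def allocateBlocksToBatch(batchTime: Long): Unit = {
    val allocated = queues.map { case (id, q) =>
      val blocks = q.toList
      q.clear()
      id -> (blocks: Seq[ReceivedBlockInfo])
    }.toMap
    timeToAllocatedBlocks(batchTime) = AllocatedBlocks(allocated)
  }

  // 3. This is what generateJobs effectively reads for its batch time.
  def getBlocksOfBatch(batchTime: Long): Map[Int, Seq[ReceivedBlockInfo]] =
    timeToAllocatedBlocks.get(batchTime)
      .map(_.streamIdToAllocatedBlocks).getOrElse(Map.empty)
}
```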


> Thanks,
>
> ZZ
>