
Posted to dev@spark.apache.org by fr...@typesafe.com on 2014/12/18 20:04:58 UTC

Spark Streaming Data flow graph

I’ve been trying to produce an updated box diagram to refresh:
http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617/26


… after SPARK-3129 and other changes (a surprising number of comments still mention NetworkReceiver).


Here’s what I have so far:
https://www.dropbox.com/s/q79taoce2ywdmf1/SparkStreaming.pdf?dl=0


This is not supposed to respect any particular convention (ER, ORM, …). Data flow, up to just before RDD creation, is shown with bold arrows; metadata flow is shown with normal-width arrows.


This diagram is still very much a WIP (see the Todo list below), but I wanted to share it to ask:
- what’s wrong?
- what are the glaring omissions?
- how can I make this better (i.e. what should I add first to the Todo list below)?


I’ll be happy to share this (including sources) with whoever asks for it. 


Todo:
- mark private/public classes
- mark queues in Receiver, ReceivedBlockHandler, BlockManager
- mark the type of information carried on each channel, e.g. Actor message, ReceivedBlockInfo



—
François Garillot

Re: Spark Streaming Data flow graph

Posted by François Garillot <fr...@typesafe.com>.

Thanks a lot for your answer! I've updated the diagram, at the same
address:
https://www.dropbox.com/s/q79taoce2ywdmf1/SparkStreaming.pdf?dl=0

I've addressed your more straightforward remarks directly in the diagram. A
couple questions:

- The location of instances (Executor, Master, Driver) is now marked. Did I
make any mistakes there?

- Given that the communication between instances and their members (e.g.
ReceiverSupervisor / ReceivedBlockHandler) is deliberately omitted, have I
forgotten any communication channels?

- I've represented some queues / buffers using a red trapezoid. I'm thus
starting an inventory of queues and buffers, and I'm interested in adding
the 'implicit' ones as well (e.g. jobSets in JobScheduler, which is indexed
by time in ms). I'd be happy with pointers on where to look: ideally, I'm
trying to find every place in the data flow where data sits idle for any
length of time, waiting to be chunked somehow (whether it's at the RDD or
block level doesn't really matter to me; I'm interested in all types of
'chunking').
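
To make the kind of 'chunking' in question concrete, here is a minimal
Python sketch of a buffer where records sit idle until an interval boundary
cuts them into blocks indexed by time in ms. This is not Spark code: the
names IntervalBuffer, pending and blocks are hypothetical, and the 200 ms
interval is an arbitrary value chosen for illustration.

```python
class IntervalBuffer:
    """Toy buffer: records wait in `pending` until an interval boundary passes."""

    def __init__(self, interval_ms):
        self.interval_ms = interval_ms
        self.pending = []   # (timestamp_ms, record) pairs sitting idle
        self.blocks = {}    # cut blocks, keyed by interval start time in ms

    def add(self, timestamp_ms, record):
        # Data enters the buffer and waits here until the next cut.
        self.pending.append((timestamp_ms, record))

    def cut(self, now_ms):
        """Move every record from a finished interval into its block."""
        boundary = (now_ms // self.interval_ms) * self.interval_ms
        still_pending = []
        for ts, record in self.pending:
            if ts < boundary:
                start = (ts // self.interval_ms) * self.interval_ms
                self.blocks.setdefault(start, []).append(record)
            else:
                # Belongs to the interval still in progress: keep waiting.
                still_pending.append((ts, record))
        self.pending = still_pending
        return boundary


buf = IntervalBuffer(interval_ms=200)
buf.add(10, "a")
buf.add(150, "b")
buf.add(210, "c")
buf.cut(now_ms=250)   # "a" and "b" are chunked; "c" keeps waiting
```

In this toy model, the time a record spends in `pending` is exactly the
idle time the inventory is meant to capture.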

Naturally, this is intended to be a developer document exclusively (which
is why I'm not publicising it on the user mailing list).


On Mon, Jan 5, 2015 at 10:57 PM, Tathagata Das <ta...@gmail.com>
wrote:

> Hey François,
>
> Well, at a high level, here is what I thought about the diagram.
>
> - ReceiverSupervisor handles only one Receiver.
> - BlockGenerator is part of ReceiverSupervisor, not ReceivedBlockHandler.
> - The blocks are inserted into BlockManager and, if enabled, into
> WriteAheadLogManager in parallel, not through BlockManager as the
> diagram seems to imply.
> - It would be good to have a clean visual separation of what runs in
> Executor (better term than Worker) and what is in Driver ... Driver
> stuff on left and Executor stuff on right, or vice versa.
>
> More importantly, a word of caution: all the internal stuff
> like ReceivedBlockHandler, ReceiverSupervisor, etc. is subject to change
> at any time as we keep refactoring. So highlighting these internal
> details too much, too publicly, may lead to future confusion.
>
> TD



-- 
François Garillot

Re: Spark Streaming Data flow graph

Posted by Tathagata Das <ta...@gmail.com>.

Hey François,

Well, at a high level, here is what I thought about the diagram.

- ReceiverSupervisor handles only one Receiver.
- BlockGenerator is part of ReceiverSupervisor, not ReceivedBlockHandler.
- The blocks are inserted into BlockManager and, if enabled, into
WriteAheadLogManager in parallel, not through BlockManager as the
diagram seems to imply.
- It would be good to have a clean visual separation of what runs in
Executor (better term than Worker) and what is in Driver ... Driver
stuff on left and Executor stuff on right, or vice versa.
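
The first three remarks above can be sketched as a toy model in Python.
This is only an illustration under stated assumptions, not Spark's actual
implementation: ToyBlockStore and ToySupervisor are hypothetical names, the
in-memory dictionaries merely stand in for BlockManager and the write-ahead
log, and the thread pool is just one way to show the two writes being
issued side by side rather than one through the other.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyBlockStore:
    """Hypothetical stand-in for a block store (BlockManager or a write-ahead log)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def put(self, block_id, records):
        self.blocks[block_id] = list(records)
        return self.name


class ToySupervisor:
    """Hypothetical supervisor for exactly one receiver; it owns the block buffer."""

    def __init__(self, block_manager, write_ahead_log=None):
        self.block_manager = block_manager
        self.write_ahead_log = write_ahead_log  # None models "WAL not enabled"
        self.buffer = []    # block generation state lives in the supervisor
        self.next_id = 0

    def on_record(self, record):
        self.buffer.append(record)

    def push_block(self):
        """Cut the buffer into a block and store it in every enabled target."""
        block_id, records = self.next_id, self.buffer
        self.next_id += 1
        self.buffer = []
        targets = [self.block_manager]
        if self.write_ahead_log is not None:
            targets.append(self.write_ahead_log)
        # Both writes are submitted independently: the log is not fed
        # through the block manager.
        with ThreadPoolExecutor(max_workers=len(targets)) as pool:
            futures = [pool.submit(t.put, block_id, records) for t in targets]
            return [f.result() for f in futures]


bm = ToyBlockStore("block_manager")
wal = ToyBlockStore("write_ahead_log")
supervisor = ToySupervisor(bm, wal)
supervisor.on_record(1)
supervisor.on_record(2)
stored_in = supervisor.push_block()
```

Constructing ToySupervisor without a second store models the case where the
write-ahead log is not enabled, so blocks go to the block manager only.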

More importantly, a word of caution: all the internal stuff
like ReceivedBlockHandler, ReceiverSupervisor, etc. is subject to change
at any time as we keep refactoring. So highlighting these internal
details too much, too publicly, may lead to future confusion.

TD

