You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@tez.apache.org by Gal Vinograd <ga...@crosswise.com> on 2016/02/02 17:16:57 UTC

Re: Question regarding instability of EdgeProperty DataSourceType

Thank you so much, this is a really well described answer :)

On Sun, Jan 31, 2016 at 7:57 PM Hitesh Shah <hi...@apache.org> wrote:

> There are 3 types defined as you have noticed:
>
> persisted_reliable: assumes a vertex output is stored in a reliable store
> like HDFS. This states that if the node on which the task ran disappears,
> the output is still available.
> persisted: vertex output stored on local disk where the task ran.
> ephemeral: vertex output stored in task memory.
>
> From a data transmission point of view, all data is always transmitted
> over network unless there is a case where the downstream task is running on
> same machine as the task that generated the output. In that case, it can
> read from local disk if needed.
>
> You are right that the in-memory support is not built out so a co-located
> task potentially reading from another task’s memory is therefore not
> supported today. The network channel requires a bit more explanation.
> Generally, all data is persisted to disk. This means that data transferred
> over the network is first written locally and then eventually pulled/pushed
> to a downstream task as needed. This does not mean that all the data needs
> to be generated first before being sent downstream. Data can be still
> generated in “chunks” and then sent downstream as when as a chunk becomes
> available. ( this functionality is internally called “pipelined shuffle” if
> you end up searching through the code/jiras ). However, again to clarify,
> there is no pure streaming support yet where data is kept in memory and
> pushed downstream.
>
> To add, the in-memory approach requires changing Tez to have a different
> fault tolerance model to be applied and it also needs more cluster capacity
> to ensure that both upstream and downstream tasks can run concurrently. Do
> you see this as a requirement for something that you are planning to use
> Tez for?
>
> thanks
> — Hitesh
>
> On Jan 31, 2016, at 12:30 AM, Gal Vinograd <ga...@crosswise.com> wrote:
>
> > Hey,
> >
> > I read though tez-api code and noticed that PERSISTED is the only stable
> data source type. Does that mean that data isn't transmitted between
> vertices though in-memory or network channels?
> >
> > Thanks :)
>
>