You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@tez.apache.org by Gal Vinograd <ga...@crosswise.com> on 2016/01/31 09:30:19 UTC

Question regarding instability of EdgeProperty DataSourceType

Hey,

I read though tez-api code and noticed that PERSISTED is the only stable
data source type. Does that mean that data isn't transmitted between
vertices though in-memory or network channels?

Thanks :)

Re: Question regarding instability of EdgeProperty DataSourceType

Posted by Gal Vinograd <ga...@crosswise.com>.
Thank you so much, this is a really well described answer :)

On Sun, Jan 31, 2016 at 7:57 PM Hitesh Shah <hi...@apache.org> wrote:

> There are 3 types defined as you have noticed:
>
> persisted_reliable: assumes a vertex output is stored in a reliable store
> like HDFS. This states that if the node on which the task ran disappears,
> the output is still available.
> persisted: vertex output stored on local disk where the task ran.
> ephemeral: vertex output stored in task memory.
>
> From a data transmission point of view, all data is always transmitted
> over network unless there is a case where the downstream task is running on
> same machine as the task that generated the output. In that case, it can
> read from local disk if needed.
>
> You are right that the in-memory support is not built out so a co-located
> task potentially reading from another task’s memory is therefore not
> supported today. The network channel requires a bit more explanation.
> Generally, all data is persisted to disk. This means that data transferred
> over the network is first written locally and then eventually pulled/pushed
> to a downstream task as needed. This does not mean that all the data needs
> to be generated first before being sent downstream. Data can be still
> generated in “chunks” and then sent downstream as when as a chunk becomes
> available. ( this functionality is internally called “pipelined shuffle” if
> you end up searching through the code/jiras ). However, again to clarify,
> there is no pure streaming support yet where data is kept in memory and
> pushed downstream.
>
> To add, the in-memory approach requires changing Tez to have a different
> fault tolerance model to be applied and it also needs more cluster capacity
> to ensure that both upstream and downstream tasks can run concurrently. Do
> you see this as a requirement for something that you are planning to use
> Tez for?
>
> thanks
> — Hitesh
>
> On Jan 31, 2016, at 12:30 AM, Gal Vinograd <ga...@crosswise.com> wrote:
>
> > Hey,
> >
> > I read though tez-api code and noticed that PERSISTED is the only stable
> data source type. Does that mean that data isn't transmitted between
> vertices though in-memory or network channels?
> >
> > Thanks :)
>
>

Re: Question regarding instability of EdgeProperty DataSourceType

Posted by Hitesh Shah <hi...@apache.org>.
There are 3 types defined as you have noticed:

persisted_reliable: assumes a vertex output is stored in a reliable store like HDFS. This states that if the node on which the task ran disappears, the output is still available. 
persisted: vertex output stored on local disk where the task ran. 
ephemeral: vertex output stored in task memory. 

From a data transmission point of view, all data is always transmitted over network unless there is a case where the downstream task is running on same machine as the task that generated the output. In that case, it can read from local disk if needed.

You are right that the in-memory support is not built out so a co-located task potentially reading from another task’s memory is therefore not supported today. The network channel requires a bit more explanation. Generally, all data is persisted to disk. This means that data transferred over the network is first written locally and then eventually pulled/pushed to a downstream task as needed. This does not mean that all the data needs to be generated first before being sent downstream. Data can be still generated in “chunks” and then sent downstream as when as a chunk becomes available. ( this functionality is internally called “pipelined shuffle” if you end up searching through the code/jiras ). However, again to clarify, there is no pure streaming support yet where data is kept in memory and pushed downstream. 

To add, the in-memory approach requires changing Tez to have a different fault tolerance model to be applied and it also needs more cluster capacity to ensure that both upstream and downstream tasks can run concurrently. Do you see this as a requirement for something that you are planning to use Tez for? 

thanks
— Hitesh

On Jan 31, 2016, at 12:30 AM, Gal Vinograd <ga...@crosswise.com> wrote:

> Hey,
> 
> I read though tez-api code and noticed that PERSISTED is the only stable data source type. Does that mean that data isn't transmitted between vertices though in-memory or network channels?
> 
> Thanks :)