Posted to user@tez.apache.org by David Capwell <dc...@gmail.com> on 2014/07/25 19:23:33 UTC

DataMovementType impls

I'm going through the code and am not sure where the real logic
of DataMovementType gets used.

I see that DagTypeConverters can convert between DataMovementType
and PlanEdgeDataMovementType, but once that happens I don't really see a
way to implement any of these types.  Where are the implementations defined?
Is there any way to define my own impls?

Thanks for your time.

Re: DataMovementType impls

Posted by "Jianfeng (Jeff) Zhang" <jz...@hortonworks.com>.

Hi David,

DataMovementType is used to create the EdgeManager for each kind of data
movement. (Check the method createEdgeManager() in Edge.java.)

You can define your own custom data movement by defining your own
EdgeManager. Could you let us know what kind of custom data movement you'd
like to implement?
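
As a rough illustration of the idea above (a movement type selecting a
routing manager, with a custom type deferring to user code), here is a
minimal, self-contained sketch. It is not the Tez code: MovementKind,
RoutingManager, EdgeSketch and createManager are hypothetical stand-ins
invented for this example, and the real createEdgeManager() in Edge.java
may be structured differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the Tez classes.
enum MovementKind { ONE_TO_ONE, BROADCAST, CUSTOM }

// Decides which destination tasks receive the output of a given source task.
interface RoutingManager {
    List<Integer> destinationsFor(int sourceTask, int numDestTasks);
}

final class EdgeSketch {
    // Picks a routing manager from the movement kind; CUSTOM uses the
    // manager supplied by the caller.
    static RoutingManager createManager(MovementKind kind, RoutingManager custom) {
        switch (kind) {
            case ONE_TO_ONE:
                // Source task i feeds only destination task i.
                return (src, n) -> Collections.singletonList(src);
            case BROADCAST:
                // Every source task feeds every destination task.
                return (src, n) -> {
                    List<Integer> all = new ArrayList<>();
                    for (int i = 0; i < n; i++) {
                        all.add(i);
                    }
                    return all;
                };
            case CUSTOM:
                if (custom == null) {
                    throw new IllegalArgumentException("CUSTOM needs a manager");
                }
                return custom;
            default:
                throw new IllegalStateException("unknown kind: " + kind);
        }
    }
}
```

In this sketch, a custom data movement is just another RoutingManager
passed in alongside MovementKind.CUSTOM, so the built-in kinds stay
untouched.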



Best Regards,
Jeff Zhang




-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

RE: DataMovementType impls

Posted by Bikas Saha <bi...@hortonworks.com>.

Please feel free to work out your use case and then outline it on this
thread. We may be able to help you figure out what exactly you would need
to do in order to integrate with Tez.

Bikas


Re: DataMovementType impls

Posted by David Capwell <dc...@gmail.com>.

It's more of a persisted service at the moment.  I'll take a look at
defining this the way you described.  Thanks!



Re: DataMovementType impls

Posted by Siddharth Seth <ss...@apache.org>.

Doing something like that would involve writing new Outputs / Inputs, or
modifying the existing ones to write to a different sink. We have
prototyped such changes in the past (writing to HDFS, for example), and
the changes are not very complicated.

This involves changing how the existing Outputs write data, modifying
DataMovementEvent payloads to contain the relevant data (where to fetch
from), and changing the Inputs to process this DataMovementEvent payload
to actually fetch the data.

One thing to consider, though: if you're writing directly to your own
service, will the data be persisted there until it's read by the downstream
vertex, or does it effectively need to be streamed through? In the first
case consumer and producer tasks can run independently of each other; in
the second they must run at the same time.
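
The three steps described above (the Output writes to a different sink,
the event payload says where to fetch from, the Input reads the payload
and fetches) can be sketched as follows. This is only an illustration
under stated assumptions: SinkSketch and its methods are hypothetical, an
in-memory map stands in for the external service, and the payload is
assumed to be a plain UTF-8 key. None of this is Tez or HDFS API.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an external storage service; not a Tez or HDFS API.
final class SinkSketch {
    // In-memory map playing the role of the persisted service.
    private final Map<String, byte[]> store = new HashMap<>();

    // Output side: write the data to the sink and return an event payload
    // that tells the consumer where to fetch it from.
    byte[] writeAndDescribe(String key, byte[] data) {
        store.put(key, data.clone());
        return key.getBytes(StandardCharsets.UTF_8);
    }

    // Input side: interpret the payload as a key and fetch the data.
    byte[] fetch(byte[] payload) {
        String key = new String(payload, StandardCharsets.UTF_8);
        byte[] data = store.get(key);
        if (data == null) {
            throw new IllegalStateException("no data stored under " + key);
        }
        return data.clone();
    }
}
```

Because the data stays in the store until it is fetched, this sketch
corresponds to the persisted case, where producer and consumer do not need
to run at the same time. A streamed variant would need both sides alive
together.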



Re: DataMovementType impls

Posted by David Capwell <dc...@gmail.com>.

I was looking into letting two vertexes that share data choose whether to
share it over disk or over our internal system (so, over the network).  In
cases where data persistence isn't needed and the vertexes can be on the
same node, this system would be skipped.

The use case isn't really fleshed out at the moment.  I'm looking to
prototype to see how it would all play together.



Re: DataMovementType impls

Posted by Siddharth Seth <ss...@apache.org>.

DataSourceType isn't really used at the moment. Eventually, it would serve
more as a scheduling and failure-recovery mechanism than as a way of
deciding how data gets persisted between stages. (This property could
potentially be used by some of the Inputs/Outputs to alter the way they
persist data, but that isn't currently on the cards.)

This primarily applies to data written on Edges. Are you somehow looking
to modify that, or to use the data generated by an intermediate Vertex in a
separate process?

Getting a little more info on the use case would be helpful in figuring out
how Tez can be used. Are you looking to read data from this internal
service, publish to it, or something else?


Re: DataMovementType impls

Posted by David Capwell <dc...@gmail.com>.

Sorry, copy/paste issue.  I was looking at DataSourceType and trying to see
how data gets saved and read between tasks.  The use case is that we have
an internal service that might be helpful for us, so I wanted to prototype
how feasible it would be to share data over a different mechanism.



Re: DataMovementType impls

Posted by Hitesh Shah <hi...@apache.org>.

DataMovementEvent is a construct defined for an Input/Output pair to
communicate with each other. The framework does not interpret the
information passed between the two; it treats it only as a byte payload to
be handed off from the source to the destination. Users are not expected to
create derived classes of this type, but to use the payload within the
object to pass information around.

For example, most of the currently implemented Input-Output pairs (for
shuffle/broadcast edges) use the payload to pass the URL specifying the
location of the data to be fetched.
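
To make the "opaque byte payload" idea concrete, here is a minimal sketch
of a producer packing a fetch URL into bytes and a consumer unpacking it.
PayloadSketch, encode and decode are hypothetical names, and UTF-8 text is
an assumption made for this example; the encoding the existing
Input-Output pairs actually use is not stated in this thread.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tez: shows a URL round-tripping
// through an opaque byte payload.
final class PayloadSketch {
    // Producer (Output) side: pack the fetch location into bytes.
    static byte[] encode(String url) {
        return url.getBytes(StandardCharsets.UTF_8);
    }

    // Consumer (Input) side: recover the fetch location from the bytes.
    static String decode(byte[] payload) {
        return new String(payload, StandardCharsets.UTF_8);
    }
}
```

Anything that moves the bytes between the two sides only needs to carry
them unchanged; only the paired Output and Input have to agree on what
the bytes mean.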

thanks
Hitesh
