Posted to user@tez.apache.org by Oleg Zhurakousky <oz...@hortonworks.com> on 2015/05/18 16:27:18 UTC

Tez processor with multiple inputs

Is it possible to allow a Tez processor implementation that has multiple inputs to become available as soon as at least one input is available to be read?
This could allow some computation to begin while waiting for other inputs. Other inputs could (if logic allows) be processed as they become available.


Thanks
Oleg

Re: Tez processor with multiple inputs

Posted by Hitesh Shah <hi...@apache.org>.

This is with respect to how work is assigned to a Task. For a shuffle edge, a Task’s input is determined by the partitions and how partitions are assigned to a Task. For a vertex reading data from HDFS (initial input), this is effectively random, as the input data is split up and then assigned to tasks.

When trying to combine the data, the user would need to write a custom vertex manager that assigns data from the initial input and the shuffle edge in a deterministic manner (and handles other user-specific conditions, such as trying to do a “join”) for the processing to be done correctly.

I believe Hive has a couple of cases where this is implemented. You should ask on the dev@hive list for more details. 

— Hitesh

On May 18, 2015, at 9:00 AM, Oleg Zhurakousky <oz...@hortonworks.com> wrote:

> Also, while trying something related to this i’ve noticed the following: "A vertex with an Initial Input and a Shuffle Input are not supported at the moment”.
> Is there a target timeframe for this? JIRA?
> 
> Thanks
> Oleg
> 
>> On May 18, 2015, at 10:27 AM, Oleg Zhurakousky <oz...@hortonworks.com> wrote:
>> 
>> Is it possible to allow Tez processor implementation which has multiple inputs to become available as soon as at least one input is available to be read.
>> This could allow for some computation to begin while waiting for other inputs. Other inputs could (if logic allows) be processed as they become available. 
>> 
>> 
>> Thanks
>> Oleg
>

Re: Tez processor with multiple inputs

Posted by Jeff Zhang <zj...@gmail.com>.

Sorry, the edge between A and C is a broadcast edge.

On Wed, May 20, 2015 at 5:01 PM, Jeff Zhang <zj...@gmail.com> wrote:

> >- which means that join stage could process hash and wait for probe
> making it a 3 stage DAG. However what you see is a 4 stage DAG, since join
> will require shuffle on the ‘hash’.
>
> I guess the 3 stage DAG means the dag in spark, and 4 stage DAG means the
> dag you build in tez. However, I think you can use 3 stage(vertex) dag in
> tez as following. A process the hash data and B process the probe data.
> Edge between A and C is one-to-one edge while edge between B and C is
> scatter gather.   C is a little tricky that it do both the reduce and join
> as long as the reduce key and join key are the same.
>
>
> A     B
>   \   /
>    C
>
>
>
> On Wed, May 20, 2015 at 3:25 PM, Siddharth Seth <ss...@apache.org> wrote:
>
>> I'm assuming you intend on having one vertex for 'joined'. This vertex
>> processes 'hash' while waiting for data to come in from 'probe' (which is
>> doing a shuffle / partition) ?
>> The Processor on the 'joined' vertex can absolutely go ahead and process
>> hash while the Shuffle from the other side is happening. It can notified
>> when the Shuffle is complete via waitForInputReady() - if that's useful for
>> this scenario.
>>
>> Data organization would probably make a difference though - and how hash
>> is to be consumed by different partitions.
>>
>>
>> On Tue, May 19, 2015 at 4:19 AM, Oleg Zhurakousky <
>> ozhurakousky@hortonworks.com> wrote:
>>
>>>  Thanks Sid
>>>
>>>  Let’s take a classic join as an example, which contains hash and probe
>>> inputs. It is currently assumed that both inputs will be DataSources or
>>> Shuffles. This means that even in the cases where you would not need a
>>> shuffle you now have to create one.
>>> Let me describe it via Spark
>>>
>>>  val hash = rdd.map(..)
>>>
>>>  val probe = rdd.map(..).reduceByKey(..)
>>>
>>>  val joined = hash.join(probe)
>>>
>>>
>>>  As you can see from the above, the computation of ‘hash’ does not
>>> require shuffle, while ‘probe’ does, which means that join stage could
>>> process hash and wait for probe making it a 3 stage DAG. However what you
>>> see is a 4 stage DAG, since join will require shuffle on the ‘hash’.
>>> Again, this is not about Spark, just using it to explain semantics.
>>>
>>>
>>>  Now as far as ProcessContext. I know I have a handle to it in my
>>> processor. But by that time it is too late since processor is not created
>>> until its inputs are available. That is why I was thinking that may be in
>>> the DAG API there is some flag to set. Am i missing something?
>>>
>>>  Thanks
>>> Oleg
>>>
>>>
>>>  On May 19, 2015, at 2:10 AM, Siddharth Seth <ss...@apache.org> wrote:
>>>
>>>  These APIs are available during execution of the Processor. They're a
>>> mechanism to get notified and wait till certain Inputs are ready, or get
>>> notified on an Input being ready while another is being processed. There's
>>> nothing on the DAG API for this. What are you looking to do ? One thing to
>>> note though - the Inputs are not thread safe, and should be consumed from
>>> the same thread or with external synchronization.
>>>
>>> On Mon, May 18, 2015 at 11:07 AM, Oleg Zhurakousky <
>>> ozhurakousky@hortonworks.com> wrote:
>>>
>>>> Thanks Sid
>>>>
>>>>  So, any pointer on how one would interact with it. I mean all I do is
>>>> assemble DAG and I can’t seem to see anything on the Vertex that would
>>>> allow me to do that.
>>>>
>>>>  Thanks
>>>>  Oleg
>>>>
>>>>
>>>>
>>>>
>>>>  On May 18, 2015, at 2:00 PM, Siddharth Seth <ss...@apache.org> wrote:
>>>>
>>>>  There's APIs on the ProcessorContext - waitForAllInputsReady,
>>>> waitForAnyInputReady - which can be used to figure out when a specific
>>>> Input is ready for consumption. That should solve the first question.
>>>>
>>>>  Regarding vertices with multiple Inputs and Shuffle - that requires a
>>>> custom VertexManager plugin to figure out how the splits are to be
>>>> distributed to the various tasks. Also have to make sure that the number of
>>>> tasks is setup correctly - likely according to the Shuffle edge.
>>>>
>>>> On Mon, May 18, 2015 at 9:00 AM, Oleg Zhurakousky <
>>>> ozhurakousky@hortonworks.com> wrote:
>>>>
>>>>> Also, while trying something related to this i’ve noticed the
>>>>> following: "A vertex with an Initial Input and a Shuffle Input are not
>>>>> supported at the moment”.
>>>>> Is there a target timeframe for this? JIRA?
>>>>>
>>>>> Thanks
>>>>> Oleg
>>>>>
>>>>> > On May 18, 2015, at 10:27 AM, Oleg Zhurakousky <
>>>>> ozhurakousky@hortonworks.com> wrote:
>>>>> >
>>>>> > Is it possible to allow Tez processor implementation which has
>>>>> multiple inputs to become available as soon as at least one input is
>>>>> available to be read.
>>>>> > This could allow for some computation to begin while waiting for
>>>>> other inputs. Other inputs could (if logic allows) be processed as they
>>>>> become available.
>>>>> >
>>>>> >
>>>>> > Thanks
>>>>> > Oleg
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Best Regards

Jeff Zhang

Re: Tez processor with multiple inputs

Posted by Jeff Zhang <zj...@gmail.com>.

>- which means that join stage could process hash and wait for probe making
it a 3 stage DAG. However what you see is a 4 stage DAG, since join will
require shuffle on the ‘hash’.

I guess the 3 stage DAG means the DAG in Spark, and the 4 stage DAG means
the DAG you build in Tez. However, I think you can use a 3 stage (vertex)
DAG in Tez as follows. A processes the hash data and B processes the probe
data. The edge between A and C is a one-to-one edge, while the edge between
B and C is scatter-gather. C is a little tricky in that it does both the
reduce and the join, as long as the reduce key and join key are the same.


A     B
  \   /
   C



On Wed, May 20, 2015 at 3:25 PM, Siddharth Seth <ss...@apache.org> wrote:

> I'm assuming you intend on having one vertex for 'joined'. This vertex
> processes 'hash' while waiting for data to come in from 'probe' (which is
> doing a shuffle / partition) ?
> The Processor on the 'joined' vertex can absolutely go ahead and process
> hash while the Shuffle from the other side is happening. It can notified
> when the Shuffle is complete via waitForInputReady() - if that's useful for
> this scenario.
>
> Data organization would probably make a difference though - and how hash
> is to be consumed by different partitions.
>
>
> On Tue, May 19, 2015 at 4:19 AM, Oleg Zhurakousky <
> ozhurakousky@hortonworks.com> wrote:
>
>>  Thanks Sid
>>
>>  Let’s take a classic join as an example, which contains hash and probe
>> inputs. It is currently assumed that both inputs will be DataSources or
>> Shuffles. This means that even in the cases where you would not need a
>> shuffle you now have to create one.
>> Let me describe it via Spark
>>
>>  val hash = rdd.map(..)
>>
>>  val probe = rdd.map(..).reduceByKey(..)
>>
>>  val joined = hash.join(probe)
>>
>>
>>  As you can see from the above, the computation of ‘hash’ does not
>> require shuffle, while ‘probe’ does, which means that join stage could
>> process hash and wait for probe making it a 3 stage DAG. However what you
>> see is a 4 stage DAG, since join will require shuffle on the ‘hash’.
>> Again, this is not about Spark, just using it to explain semantics.
>>
>>
>>  Now as far as ProcessContext. I know I have a handle to it in my
>> processor. But by that time it is too late since processor is not created
>> until its inputs are available. That is why I was thinking that may be in
>> the DAG API there is some flag to set. Am i missing something?
>>
>>  Thanks
>> Oleg
>>
>>
>>  On May 19, 2015, at 2:10 AM, Siddharth Seth <ss...@apache.org> wrote:
>>
>>  These APIs are available during execution of the Processor. They're a
>> mechanism to get notified and wait till certain Inputs are ready, or get
>> notified on an Input being ready while another is being processed. There's
>> nothing on the DAG API for this. What are you looking to do ? One thing to
>> note though - the Inputs are not thread safe, and should be consumed from
>> the same thread or with external synchronization.
>>
>> On Mon, May 18, 2015 at 11:07 AM, Oleg Zhurakousky <
>> ozhurakousky@hortonworks.com> wrote:
>>
>>> Thanks Sid
>>>
>>>  So, any pointer on how one would interact with it. I mean all I do is
>>> assemble DAG and I can’t seem to see anything on the Vertex that would
>>> allow me to do that.
>>>
>>>  Thanks
>>>  Oleg
>>>
>>>
>>>
>>>
>>>  On May 18, 2015, at 2:00 PM, Siddharth Seth <ss...@apache.org> wrote:
>>>
>>>  There's APIs on the ProcessorContext - waitForAllInputsReady,
>>> waitForAnyInputReady - which can be used to figure out when a specific
>>> Input is ready for consumption. That should solve the first question.
>>>
>>>  Regarding vertices with multiple Inputs and Shuffle - that requires a
>>> custom VertexManager plugin to figure out how the splits are to be
>>> distributed to the various tasks. Also have to make sure that the number of
>>> tasks is setup correctly - likely according to the Shuffle edge.
>>>
>>> On Mon, May 18, 2015 at 9:00 AM, Oleg Zhurakousky <
>>> ozhurakousky@hortonworks.com> wrote:
>>>
>>>> Also, while trying something related to this i’ve noticed the
>>>> following: "A vertex with an Initial Input and a Shuffle Input are not
>>>> supported at the moment”.
>>>> Is there a target timeframe for this? JIRA?
>>>>
>>>> Thanks
>>>> Oleg
>>>>
>>>> > On May 18, 2015, at 10:27 AM, Oleg Zhurakousky <
>>>> ozhurakousky@hortonworks.com> wrote:
>>>> >
>>>> > Is it possible to allow Tez processor implementation which has
>>>> multiple inputs to become available as soon as at least one input is
>>>> available to be read.
>>>> > This could allow for some computation to begin while waiting for
>>>> other inputs. Other inputs could (if logic allows) be processed as they
>>>> become available.
>>>> >
>>>> >
>>>> > Thanks
>>>> > Oleg
>>>>
>>>>
>>>
>>>
>>
>>
>


-- 
Best Regards

Jeff Zhang

Re: Tez processor with multiple inputs

Posted by Siddharth Seth <ss...@apache.org>.

I'm assuming you intend on having one vertex for 'joined'. This vertex
processes 'hash' while waiting for data to come in from 'probe' (which is
doing a shuffle / partition)?
The Processor on the 'joined' vertex can absolutely go ahead and process
hash while the Shuffle from the other side is happening. It can be notified
when the Shuffle is complete via waitForInputReady() - if that's useful for
this scenario.

Data organization would probably make a difference though - and how hash is
to be consumed by different partitions.
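
The overlap described above can be illustrated with a small, self-contained
Java sketch. It does not use the Tez API: the class OverlappedHashJoin, its
join method, and the sample data are hypothetical, and a CompletableFuture
merely stands in for the shuffle / reduce of 'probe'. The point is only the
ordering: the hash table is built while the probe side is still being
produced, and the code blocks at the moment the probe data is actually needed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;

final class OverlappedHashJoin {

    /**
     * Joins hashSide with the per-key sums of probeRecords.
     * Assumption: the reduce key and the join key are the same.
     */
    static Map<String, String> join(Map<String, String> hashSide,
                                    List<Map.Entry<String, Integer>> probeRecords) {
        // Stand-in for the shuffle: reduce the probe records by key on another thread.
        CompletableFuture<Map<String, Integer>> probe = CompletableFuture.supplyAsync(() -> {
            Map<String, Integer> sums = new TreeMap<>();
            for (Map.Entry<String, Integer> r : probeRecords) {
                sums.merge(r.getKey(), r.getValue(), Integer::sum);
            }
            return sums;
        });

        // Meanwhile, the calling thread builds the hash table from the 'hash' side.
        Map<String, String> table = new HashMap<>(hashSide);

        // Block only now, when the probe side is actually needed.
        Map<String, Integer> reduced = probe.join();

        // Probe the table; a TreeMap keeps the output order deterministic.
        Map<String, String> joined = new TreeMap<>();
        for (Map.Entry<String, Integer> e : reduced.entrySet()) {
            String match = table.get(e.getKey());
            if (match != null) {
                joined.put(e.getKey(), match + ":" + e.getValue());
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> hash = Map.of("a", "apple", "b", "banana");
        List<Map.Entry<String, Integer>> probe =
            List.of(Map.entry("a", 1), Map.entry("a", 2), Map.entry("c", 5));
        System.out.println(join(hash, probe)); // prints {a=apple:3}
    }
}
```

In a real Tez processor the wait would go through the ProcessorContext
rather than a future, and how 'hash' is organized across partitions would
decide whether each task can build its table locally.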


On Tue, May 19, 2015 at 4:19 AM, Oleg Zhurakousky <
ozhurakousky@hortonworks.com> wrote:

>  Thanks Sid
>
>  Let’s take a classic join as an example, which contains hash and probe
> inputs. It is currently assumed that both inputs will be DataSources or
> Shuffles. This means that even in the cases where you would not need a
> shuffle you now have to create one.
> Let me describe it via Spark
>
>  val hash = rdd.map(..)
>
>  val probe = rdd.map(..).reduceByKey(..)
>
>  val joined = hash.join(probe)
>
>
>  As you can see from the above, the computation of ‘hash’ does not
> require shuffle, while ‘probe’ does, which means that join stage could
> process hash and wait for probe making it a 3 stage DAG. However what you
> see is a 4 stage DAG, since join will require shuffle on the ‘hash’.
> Again, this is not about Spark, just using it to explain semantics.
>
>
>  Now as far as ProcessContext. I know I have a handle to it in my
> processor. But by that time it is too late since processor is not created
> until its inputs are available. That is why I was thinking that may be in
> the DAG API there is some flag to set. Am i missing something?
>
>  Thanks
> Oleg
>
>
>  On May 19, 2015, at 2:10 AM, Siddharth Seth <ss...@apache.org> wrote:
>
>  These APIs are available during execution of the Processor. They're a
> mechanism to get notified and wait till certain Inputs are ready, or get
> notified on an Input being ready while another is being processed. There's
> nothing on the DAG API for this. What are you looking to do ? One thing to
> note though - the Inputs are not thread safe, and should be consumed from
> the same thread or with external synchronization.
>
> On Mon, May 18, 2015 at 11:07 AM, Oleg Zhurakousky <
> ozhurakousky@hortonworks.com> wrote:
>
>> Thanks Sid
>>
>>  So, any pointer on how one would interact with it. I mean all I do is
>> assemble DAG and I can’t seem to see anything on the Vertex that would
>> allow me to do that.
>>
>>  Thanks
>>  Oleg
>>
>>
>>
>>
>>  On May 18, 2015, at 2:00 PM, Siddharth Seth <ss...@apache.org> wrote:
>>
>>  There's APIs on the ProcessorContext - waitForAllInputsReady,
>> waitForAnyInputReady - which can be used to figure out when a specific
>> Input is ready for consumption. That should solve the first question.
>>
>>  Regarding vertices with multiple Inputs and Shuffle - that requires a
>> custom VertexManager plugin to figure out how the splits are to be
>> distributed to the various tasks. Also have to make sure that the number of
>> tasks is setup correctly - likely according to the Shuffle edge.
>>
>> On Mon, May 18, 2015 at 9:00 AM, Oleg Zhurakousky <
>> ozhurakousky@hortonworks.com> wrote:
>>
>>> Also, while trying something related to this i’ve noticed the following:
>>> "A vertex with an Initial Input and a Shuffle Input are not supported at
>>> the moment”.
>>> Is there a target timeframe for this? JIRA?
>>>
>>> Thanks
>>> Oleg
>>>
>>> > On May 18, 2015, at 10:27 AM, Oleg Zhurakousky <
>>> ozhurakousky@hortonworks.com> wrote:
>>> >
>>> > Is it possible to allow Tez processor implementation which has
>>> multiple inputs to become available as soon as at least one input is
>>> available to be read.
>>> > This could allow for some computation to begin while waiting for other
>>> inputs. Other inputs could (if logic allows) be processed as they become
>>> available.
>>> >
>>> >
>>> > Thanks
>>> > Oleg
>>>
>>>
>>
>>
>
>

RE: Tez processor with multiple inputs

Posted by Bikas Saha <bi...@hortonworks.com>.

The processor is created along with the inputs/outputs (and not after inputs are available). In a container, for the first instance of a task in a vertex, Tez will auto-start inputs (this was done for Hive), but after that the processor must call start on its inputs before the inputs start doing anything. You may be seeing that auto-start behavior.

From: Oleg Zhurakousky [mailto:ozhurakousky@hortonworks.com]
Sent: Tuesday, May 19, 2015 4:20 AM
To: user@tez.apache.org
Subject: Re: Tez processor with multiple inputs

Thanks Sid

Let’s take a classic join as an example, which contains hash and probe inputs. It is currently assumed that both inputs will be DataSources or Shuffles. This means that even in the cases where you would not need a shuffle you now have to create one.
Let me describe it via Spark

val hash = rdd.map(..)

val probe = rdd.map(..).reduceByKey(..)

val joined = hash.join(probe)

As you can see from the above, the computation of ‘hash’ does not require shuffle, while ‘probe’ does, which means that join stage could process hash and wait for probe making it a 3 stage DAG. However what you see is a 4 stage DAG, since join will require shuffle on the ‘hash’.
Again, this is not about Spark, just using it to explain semantics.

Now as far as ProcessContext. I know I have a handle to it in my processor. But by that time it is too late since processor is not created until its inputs are available. That is why I was thinking that may be in the DAG API there is some flag to set. Am i missing something?

Thanks
Oleg

On May 19, 2015, at 2:10 AM, Siddharth Seth <ss...@apache.org> wrote:

These APIs are available during execution of the Processor. They're a mechanism to get notified and wait till certain Inputs are ready, or get notified on an Input being ready while another is being processed. There's nothing on the DAG API for this. What are you looking to do ? One thing to note though - the Inputs are not thread safe, and should be consumed from the same thread or with external synchronization.

On Mon, May 18, 2015 at 11:07 AM, Oleg Zhurakousky <oz...@hortonworks.com> wrote:
Thanks Sid

So, any pointer on how one would interact with it. I mean all I do is assemble DAG and I can’t seem to see anything on the Vertex that would allow me to do that.

Thanks
Oleg

On May 18, 2015, at 2:00 PM, Siddharth Seth <ss...@apache.org> wrote:

There's APIs on the ProcessorContext - waitForAllInputsReady, waitForAnyInputReady - which can be used to figure out when a specific Input is ready for consumption. That should solve the first question.

Regarding vertices with multiple Inputs and Shuffle - that requires a custom VertexManager plugin to figure out how the splits are to be distributed to the various tasks. Also have to make sure that the number of tasks is setup correctly - likely according to the Shuffle edge.

On Mon, May 18, 2015 at 9:00 AM, Oleg Zhurakousky <oz...@hortonworks.com> wrote:
Also, while trying something related to this i’ve noticed the following: "A vertex with an Initial Input and a Shuffle Input are not supported at the moment”.
Is there a target timeframe for this? JIRA?

Thanks
Oleg

> On May 18, 2015, at 10:27 AM, Oleg Zhurakousky <oz...@hortonworks.com> wrote:
>
> Is it possible to allow Tez processor implementation which has multiple inputs to become available as soon as at least one input is available to be read.
> This could allow for some computation to begin while waiting for other inputs. Other inputs could (if logic allows) be processed as they become available.
>
>
> Thanks
> Oleg

Re: Tez processor with multiple inputs

Posted by Oleg Zhurakousky <oz...@hortonworks.com>.

Thanks Sid

Let’s take a classic join as an example, which contains hash and probe inputs. It is currently assumed that both inputs will be DataSources or Shuffles. This means that even in the cases where you would not need a shuffle you now have to create one.
Let me describe it via Spark

val hash = rdd.map(..)

val probe = rdd.map(..).reduceByKey(..)

val joined = hash.join(probe)

As you can see from the above, the computation of ‘hash’ does not require shuffle, while ‘probe’ does, which means that join stage could process hash and wait for probe making it a 3 stage DAG. However what you see is a 4 stage DAG, since join will require shuffle on the ‘hash’.
Again, this is not about Spark, just using it to explain semantics.

Now, as far as the ProcessorContext goes: I know I have a handle to it in my processor, but by that time it is too late, since the processor is not created until its inputs are available. That is why I was thinking that maybe there is some flag to set in the DAG API. Am I missing something?

Thanks
Oleg

On May 19, 2015, at 2:10 AM, Siddharth Seth <ss...@apache.org> wrote:

These APIs are available during execution of the Processor. They're a mechanism to get notified and wait till certain Inputs are ready, or get notified on an Input being ready while another is being processed. There's nothing on the DAG API for this. What are you looking to do ? One thing to note though - the Inputs are not thread safe, and should be consumed from the same thread or with external synchronization.

On Mon, May 18, 2015 at 11:07 AM, Oleg Zhurakousky <oz...@hortonworks.com> wrote:
Thanks Sid

So, any pointer on how one would interact with it. I mean all I do is assemble DAG and I can’t seem to see anything on the Vertex that would allow me to do that.

Thanks
Oleg

On May 18, 2015, at 2:00 PM, Siddharth Seth <ss...@apache.org> wrote:

There's APIs on the ProcessorContext - waitForAllInputsReady, waitForAnyInputReady - which can be used to figure out when a specific Input is ready for consumption. That should solve the first question.

Regarding vertices with multiple Inputs and Shuffle - that requires a custom VertexManager plugin to figure out how the splits are to be distributed to the various tasks. Also have to make sure that the number of tasks is setup correctly - likely according to the Shuffle edge.

On Mon, May 18, 2015 at 9:00 AM, Oleg Zhurakousky <oz...@hortonworks.com> wrote:
Also, while trying something related to this i’ve noticed the following: "A vertex with an Initial Input and a Shuffle Input are not supported at the moment”.
Is there a target timeframe for this? JIRA?

Thanks
Oleg

> On May 18, 2015, at 10:27 AM, Oleg Zhurakousky <oz...@hortonworks.com> wrote:
>
> Is it possible to allow Tez processor implementation which has multiple inputs to become available as soon as at least one input is available to be read.
> This could allow for some computation to begin while waiting for other inputs. Other inputs could (if logic allows) be processed as they become available.
>
>
> Thanks
> Oleg

Re: Tez processor with multiple inputs

Posted by Siddharth Seth <ss...@apache.org>.

These APIs are available during execution of the Processor. They're a
mechanism to get notified and wait till certain Inputs are ready, or get
notified on an Input being ready while another is being processed. There's
nothing on the DAG API for this. What are you looking to do? One thing to
note though - the Inputs are not thread safe, and should be consumed from
the same thread or with external synchronization.

On Mon, May 18, 2015 at 11:07 AM, Oleg Zhurakousky <
ozhurakousky@hortonworks.com> wrote:

>  Thanks Sid
>
>  So, any pointer on how one would interact with it. I mean all I do is
> assemble DAG and I can’t seem to see anything on the Vertex that would
> allow me to do that.
>
>  Thanks
> Oleg
>
>
>
>
>  On May 18, 2015, at 2:00 PM, Siddharth Seth <ss...@apache.org> wrote:
>
>  There's APIs on the ProcessorContext - waitForAllInputsReady,
> waitForAnyInputReady - which can be used to figure out when a specific
> Input is ready for consumption. That should solve the first question.
>
>  Regarding vertices with multiple Inputs and Shuffle - that requires a
> custom VertexManager plugin to figure out how the splits are to be
> distributed to the various tasks. Also have to make sure that the number of
> tasks is setup correctly - likely according to the Shuffle edge.
>
> On Mon, May 18, 2015 at 9:00 AM, Oleg Zhurakousky <
> ozhurakousky@hortonworks.com> wrote:
>
>> Also, while trying something related to this i’ve noticed the following:
>> "A vertex with an Initial Input and a Shuffle Input are not supported at
>> the moment”.
>> Is there a target timeframe for this? JIRA?
>>
>> Thanks
>> Oleg
>>
>> > On May 18, 2015, at 10:27 AM, Oleg Zhurakousky <
>> ozhurakousky@hortonworks.com> wrote:
>> >
>> > Is it possible to allow Tez processor implementation which has multiple
>> inputs to become available as soon as at least one input is available to be
>> read.
>> > This could allow for some computation to begin while waiting for other
>> inputs. Other inputs could (if logic allows) be processed as they become
>> available.
>> >
>> >
>> > Thanks
>> > Oleg
>>
>>
>
>

Re: Tez processor with multiple inputs

Posted by Oleg Zhurakousky <oz...@hortonworks.com>.

Thanks Sid

So, any pointers on how one would interact with it? I mean, all I do is assemble the DAG, and I can’t seem to see anything on the Vertex that would allow me to do that.

Thanks
Oleg

On May 18, 2015, at 2:00 PM, Siddharth Seth <ss...@apache.org> wrote:

There's APIs on the ProcessorContext - waitForAllInputsReady, waitForAnyInputReady - which can be used to figure out when a specific Input is ready for consumption. That should solve the first question.

Regarding vertices with multiple Inputs and Shuffle - that requires a custom VertexManager plugin to figure out how the splits are to be distributed to the various tasks. Also have to make sure that the number of tasks is setup correctly - likely according to the Shuffle edge.

On Mon, May 18, 2015 at 9:00 AM, Oleg Zhurakousky <oz...@hortonworks.com> wrote:
Also, while trying something related to this i’ve noticed the following: "A vertex with an Initial Input and a Shuffle Input are not supported at the moment”.
Is there a target timeframe for this? JIRA?

Thanks
Oleg

> On May 18, 2015, at 10:27 AM, Oleg Zhurakousky <oz...@hortonworks.com> wrote:
>
> Is it possible to allow Tez processor implementation which has multiple inputs to become available as soon as at least one input is available to be read.
> This could allow for some computation to begin while waiting for other inputs. Other inputs could (if logic allows) be processed as they become available.
>
>
> Thanks
> Oleg

Re: Tez processor with multiple inputs

Posted by Siddharth Seth <ss...@apache.org>.

There are APIs on the ProcessorContext - waitForAllInputsReady,
waitForAnyInputReady - which can be used to figure out when a specific
Input is ready for consumption. That should solve the first question.

Regarding vertices with multiple Inputs and Shuffle - that requires a
custom VertexManager plugin to figure out how the splits are to be
distributed to the various tasks. You also have to make sure that the
number of tasks is set up correctly - likely according to the Shuffle edge.
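
The "wait for any input, consume it, repeat" pattern that these calls enable
can be sketched in plain Java. This is a simulation under stated
assumptions, not Tez code: SimInput, consumeAsReady and runDemo are
hypothetical names, and the local waitForAnyInputReady below only mimics
the idea of the ProcessorContext method of the same name, whose exact
signature should be checked against the Tez Javadoc. All inputs are
consumed from the calling thread.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

final class AnyInputReadyDemo {

    /** Hypothetical stand-in for a task input; this is not Tez's Input class. */
    static final class SimInput {
        final String name;
        final CompletableFuture<Void> ready = new CompletableFuture<>();
        SimInput(String name) { this.name = name; }
    }

    /** Blocks until at least one of the given inputs is ready and returns it. */
    static SimInput waitForAnyInputReady(Collection<SimInput> inputs) {
        CompletableFuture<?>[] signals =
            inputs.stream().map(i -> i.ready).toArray(CompletableFuture[]::new);
        CompletableFuture.anyOf(signals).join();
        for (SimInput input : inputs) {
            if (input.ready.isDone()) {
                return input;
            }
        }
        throw new IllegalStateException("no input is ready");
    }

    /** Consumes inputs in the order they become ready, all on the calling thread. */
    static List<String> consumeAsReady(List<SimInput> inputs) {
        List<SimInput> pending = new ArrayList<>(inputs);
        List<String> order = new ArrayList<>();
        while (!pending.isEmpty()) {
            SimInput next = waitForAnyInputReady(pending);
            order.add(next.name); // real code would read the input's records here
            pending.remove(next);
        }
        return order;
    }

    static List<String> runDemo() {
        SimInput hash = new SimInput("hash");
        SimInput probe = new SimInput("probe");
        hash.ready.complete(null); // available immediately
        // Simulated shuffle that finishes later on another thread.
        CompletableFuture.runAsync(() -> probe.ready.complete(null),
            CompletableFuture.delayedExecutor(200, TimeUnit.MILLISECONDS));
        return consumeAsReady(List.of(hash, probe));
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints [hash, probe]
    }
}
```

Waiting on the shrinking set of pending inputs, rather than on all of them
at once, is what lets work on the first input start before the second one
has arrived.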

On Mon, May 18, 2015 at 9:00 AM, Oleg Zhurakousky <
ozhurakousky@hortonworks.com> wrote:

> Also, while trying something related to this i’ve noticed the following:
> "A vertex with an Initial Input and a Shuffle Input are not supported at
> the moment”.
> Is there a target timeframe for this? JIRA?
>
> Thanks
> Oleg
>
> > On May 18, 2015, at 10:27 AM, Oleg Zhurakousky <
> ozhurakousky@hortonworks.com> wrote:
> >
> > Is it possible to allow Tez processor implementation which has multiple
> inputs to become available as soon as at least one input is available to be
> read.
> > This could allow for some computation to begin while waiting for other
> inputs. Other inputs could (if logic allows) be processed as they become
> available.
> >
> >
> > Thanks
> > Oleg
>
>

Re: Ten processor with multiple inputs

Posted by Oleg Zhurakousky <oz...@hortonworks.com>.

Also, while trying something related to this, I’ve noticed the following: “A vertex with an Initial Input and a Shuffle Input are not supported at the moment”.
Is there a target timeframe for this? JIRA?

Thanks
Oleg

> On May 18, 2015, at 10:27 AM, Oleg Zhurakousky <oz...@hortonworks.com> wrote:
> 
> Is it possible to allow Tez processor implementation which has multiple inputs to become available as soon as at least one input is available to be read.
> This could allow for some computation to begin while waiting for other inputs. Other inputs could (if logic allows) be processed as they become available. 
> 
> 
> Thanks
> Oleg

Re: Ten processor with multiple inputs

Posted by Hitesh Shah <hi...@apache.org>.

There is nothing that prevents a processor running and finishing without even reading any data from any input. The only point when the processor blocks is when it tries to read data from a particular input that has not yet finished fetching all of its data.

That said, a processor cannot yet query an Input to check if its data is available and ready to consume. The assumption currently in place is that a processor will only read data from an Input when it needs to do so.

thanks
— Hitesh

On May 18, 2015, at 7:27 AM, Oleg Zhurakousky <oz...@hortonworks.com> wrote:

> Is it possible to allow Tez processor implementation which has multiple inputs to become available as soon as at least one input is available to be read.
> This could allow for some computation to begin while waiting for other inputs. Other inputs could (if logic allows) be processed as they become available. 
> 
> 
> Thanks
> Oleg

RE: Ten processor with multiple inputs

Posted by Bikas Saha <bi...@hortonworks.com>.

All inputs have a waitForReady() method (with flavors) that can be used by the processor to wait as it deems fit.

-----Original Message-----
From: Oleg Zhurakousky [mailto:ozhurakousky@hortonworks.com] 
Sent: Monday, May 18, 2015 7:27 AM
To: user@tez.apache.org; dev@tez.apache.org
Subject: Ten processor with multiple inputs

Is it possible to allow Tez processor implementation which has multiple inputs to become available as soon as at least one input is available to be read.
This could allow for some computation to begin while waiting for other inputs. Other inputs could (if logic allows) be processed as they become available. 

Thanks
Oleg
