Posted to dev@tez.apache.org by Chang Chen <ba...@gmail.com> on 2015/03/25 04:43:58 UTC

Which computation model does Tez supports

Hi

According to Matei Zaharia's PhD dissertation
<http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf>, there are
four computation models on large-scale clusters:


   1. *Iterative algorithms*, such as graph processing and machine learning
   algorithms
   2. *Relational queries*
   3. *MapReduce*, a general parallel computation model
   4. *Stream processing*

Obviously, Tez supports #2 and #3, but I don't see any examples for #1 and
#4.

As for streaming, I guess that if we implement an appropriate input, there is
no reason in theory that Tez can't support it.

But for machine learning, how do we use vertices and edges to express
*Logistic Regression*?

Thanks
Chang

Re: Which computation model does Tez supports

Posted by Hitesh Shah <hi...@apache.org>.

Not today for sure. But there have been thoughts on how we can make the VertexManager more powerful. One jira that Gopal filed proposes short-circuiting the processing of a Vertex when certain conditions are met (https://issues.apache.org/jira/browse/TEZ-2103). This could then be leveraged in multiple ways. One approach could be for VertexManagers to send events to downstream Vertices, whose vertex managers could then short-circuit some of their processing.

Again, just to be clear, none of this is implemented, and the above is just hand-waving about what is possible. It all depends on folks in the community helping out, if anyone is interested in tackling this aspect.

thanks
— Hitesh


On Mar 25, 2015, at 10:24 AM, Johannes Zillmann <jz...@googlemail.com> wrote:

> Hey Hitesh, thanks for your thoughts!
> 
> If one chooses the multi-vertex approach, I guess there is no simple way to achieve n iterations where n depends on the output of iteration n-1.
> So you can’t do 
> - do max 20 iterations
> - stop in case certain conditions are met
> 
> !?
> 
> Johannes

RE: Which computation model does Tez supports

Posted by Bikas Saha <bi...@hortonworks.com>.

There are 2 modes of doing iterations.

1) Client controlled iterations (as in Spark) - where the client runs a DAG with 1 or more iterations and, upon its completion, evaluates the termination condition. If the condition is not met, the client submits more DAGs until it is. With Tez sessions and the shared object model for sharing data between tasks, this model can be supported efficiently. The final write of each set of iterations has to go to some distributed store, e.g. HDFS. We have seen this model work with performance comparable to Spark (in tests on an experimental prototype of Spark with Tez). 1-1 edges can be used to make per-vertex iterations really fast.

2) Job controlled iterations (as in Flink) - where the job itself determines the termination state and adds more iterations as needed. This is currently not supported in Tez (it would need addition of vertices or early DAG exit), but there are jiras open for those items.
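
A minimal sketch of the client-controlled mode (1), written in plain Python rather than against the Tez API. Everything here is hypothetical and for illustration only: `run_dag` stands in for submitting one DAG to a session and waiting for it, `step` for one iteration of the algorithm, and `converged` for the termination condition evaluated by the client. It also shows the "at most 20 iterations, stop early if a condition is met" pattern asked about earlier in the thread.

```python
from typing import Callable, Tuple


def run_dag(state: float, step: Callable[[float], float], iterations: int) -> float:
    """Stand-in for one submitted DAG: applies `step` a fixed number of times."""
    for _ in range(iterations):
        state = step(state)
    return state


def client_controlled_loop(
    initial: float,
    step: Callable[[float], float],
    converged: Callable[[float, float], bool],
    iterations_per_dag: int = 5,
    max_iterations: int = 20,
) -> Tuple[float, int]:
    """Submit DAGs until `converged` holds or `max_iterations` is reached.

    The termination check runs on the client, between DAGs, so it can only
    fire at DAG boundaries (every `iterations_per_dag` iterations).
    """
    state = initial
    done = 0
    while done < max_iterations:
        # The last DAG may be shorter so the iteration cap is respected.
        batch = min(iterations_per_dag, max_iterations - done)
        new_state = run_dag(state, step, batch)
        done += batch
        if converged(state, new_state):
            return new_state, done
        state = new_state
    return state, done


if __name__ == "__main__":
    # Toy "algorithm": halve the value until it barely changes between DAGs.
    value, iterations = client_controlled_loop(
        1.0,
        step=lambda x: x / 2,
        converged=lambda old, new: abs(old - new) < 1e-3,
    )
    print(value, iterations)
```

The trade-off is visible in `iterations_per_dag`: larger batches mean fewer DAG submissions and fewer writes to the distributed store, but the client notices convergence later.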

Bikas


Re: Which computation model does Tez supports

Posted by Johannes Zillmann <jz...@googlemail.com>.

Hey Hitesh, thanks for your thoughts!

If one chooses the multi-vertex approach, I guess there is no simple way to achieve n iterations where n depends on the output of iteration n-1.
So you can’t do 
- do max 20 iterations
- stop in case certain conditions are met

!?

Johannes



Re: Which computation model does Tez supports

Posted by Hitesh Shah <hi...@apache.org>.

Hi Johannes, 

You would likely not avoid the HDFS barrier if you went with the approach of multiple DAGs. For most iterative programs, you do need to checkpoint at some point. The checkpoint would likely need to be reliable, to reduce the amount of re-computation needed if the checkpointed data is lost. An option would be to use something like the HDFS in-memory storage tier (which lazily persists to disk) to reduce the perf overhead. Also, in terms of loop unrolling, a single DAG could be pre-constructed to run multiple iterations using multiple vertices, with the final vertex of the DAG acting as a checkpointing mechanism after N iterations/vertices.

Also, depending on the amount of data being written out, the overhead of writing to HDFS may not be too high. Furthermore, with Tez sessions, there is no real overhead in launching a new DAG (if some containers are retained) compared to doing the same with multiple MR jobs.
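
The loop-unrolling idea above can be sketched roughly as follows. This is plain Python, not the Tez API, and all names (`UnrolledDag`, `build_unrolled_dag`, `run`, the `store` dict standing in for HDFS) are made up for illustration. It builds a linear chain of N iteration vertices connected by one-to-one edges, followed by a final vertex that writes the checkpoint.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class UnrolledDag:
    vertices: List[str] = field(default_factory=list)
    edges: List[Tuple[str, str]] = field(default_factory=list)  # (upstream, downstream)


def build_unrolled_dag(n_iterations: int) -> UnrolledDag:
    """Pre-construct iter_1 -> iter_2 -> ... -> iter_N -> checkpoint."""
    names = [f"iter_{i}" for i in range(1, n_iterations + 1)] + ["checkpoint"]
    return UnrolledDag(vertices=names, edges=list(zip(names, names[1:])))


def run(
    dag: UnrolledDag,
    partitions: List[float],
    step: Callable[[float], float],
    store: Dict[str, List[float]],
) -> List[float]:
    """Execute the chain in order (the vertex list is already topologically sorted).

    With a one-to-one edge, task i of a vertex reads only the output of task i
    of the previous vertex, so each partition flows through independently.
    """
    data = list(partitions)
    for vertex in dag.vertices:
        if vertex == "checkpoint":
            store["checkpoint"] = list(data)  # the only reliable write
        else:
            data = [step(p) for p in data]
    return data


if __name__ == "__main__":
    dag = build_unrolled_dag(3)
    store: Dict[str, List[float]] = {}
    print(dag.edges)
    print(run(dag, [1.0, 2.0], lambda x: x * 2, store), store)
```

Only the last vertex touches the store, so intermediate iterations never pay for a reliable write; the cost is that N is fixed when the DAG is built.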

— Hitesh

On Mar 25, 2015, at 2:02 AM, Johannes Zillmann <jz...@googlemail.com> wrote:

> Hey Gopal,
> 
>> On 25 Mar 2015, at 05:26, Gopal Vijayaraghavan <go...@apache.org> wrote:
>> 
>> Hi,
>> 
>> Iterative algorithms are expressed as DAGs in a loop.
>> 
>> The acyclic nature of DAGs, whether in Tez or Spark (since you mention the
>> paper) make that the natural way to implement that - repeated application
>> of the same operation over the same data, with a decision condition
>> determining whether to stay in the loop or not.
> 
> Can you point to a piece of code which implements this approach?
> If each loop operation is a single DAG, how would that avoid the HDFS barrier?
> 
> Johannes

Re: Which computation model does Tez supports

Posted by Tsuyoshi Ozawa <oz...@apache.org>.

Hivemall has a MixServer, an external key-value store, for exchanging
messages between map tasks.

https://github.com/myui/hivemall/tree/master/src/main/java/hivemall/mix


FYI, Optimus tries to express iteration by rewriting DAGs at runtime.

http://research.microsoft.com/en-us/projects/optimus/
http://research.microsoft.com/pubs/185714/Optimus.pptx


Re: Which computation model does Tez supports

Posted by Johannes Zillmann <jz...@googlemail.com>.

Hey Gopal,

> On 25 Mar 2015, at 05:26, Gopal Vijayaraghavan <go...@apache.org> wrote:
> 
> Hi,
> 
> Iterative algorithms are expressed as DAGs in a loop.
> 
> The acyclic nature of DAGs, whether in Tez or Spark (since you mention the
> paper) make that the natural way to implement that - repeated application
> of the same operation over the same data, with a decision condition
> determining whether to stay in the loop or not.

Can you point to a piece of code which implements this approach?
If each loop operation is a single DAG, how would that avoid the HDFS barrier?

Johannes


Re: Which computation model does Tez supports

Posted by Gopal Vijayaraghavan <go...@apache.org>.

Hi,

Iterative algorithms are expressed as DAGs in a loop.

The acyclic nature of DAGs, whether in Tez or Spark (since you mention the
paper), makes that the natural way to implement them: repeated application
of the same operation over the same data, with a decision condition
determining whether to stay in the loop or not.

You might want to look at last year's Hadoop Summit presentations for a
direct example of iterative algorithms with Tez.

http://www.slideshare.net/Hadoop_Summit/pig-on-tez-low-latency-etl-with-big-data/25

Logistic regression needs you to use a library which implements that
specific algorithm [1].
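
For intuition only, here is a rough sketch of how one iteration of logistic regression could map onto two vertices: a "map" vertex whose tasks each compute a partial gradient over one data partition, and a "reduce" vertex that sums the partials and updates the weights. This is plain batch gradient descent in self-contained Python. It is not how Hivemall implements regression, and the function names, learning rate and tolerance are assumptions chosen for the example.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (features, label in {0, 1})


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def partial_gradient(weights: Sequence[float], partition: Sequence[Example]) -> List[float]:
    """'Map' task: log-loss gradient summed over one partition."""
    grad = [0.0] * len(weights)
    for x, y in partition:
        err = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
        for j, xi in enumerate(x):
            grad[j] += err * xi
    return grad


def combine_and_update(
    weights: Sequence[float], partials: Sequence[Sequence[float]], n_examples: int, lr: float
) -> List[float]:
    """'Reduce' task: sum the partial gradients and take one descent step."""
    total = [sum(column) for column in zip(*partials)]
    return [w - lr * g / n_examples for w, g in zip(weights, total)]


def train(
    partitions: Sequence[Sequence[Example]],
    n_features: int,
    lr: float = 0.5,
    max_iterations: int = 20,
    tol: float = 1e-4,
) -> Tuple[List[float], int]:
    """The loop around the two vertices: stop when weights barely move."""
    weights = [0.0] * n_features
    n = sum(len(p) for p in partitions)
    for iteration in range(1, max_iterations + 1):
        partials = [partial_gradient(weights, p) for p in partitions]
        new_weights = combine_and_update(weights, partials, n, lr)
        if max(abs(a - b) for a, b in zip(new_weights, weights)) < tol:
            return new_weights, iteration
        weights = new_weights
    return weights, max_iterations


if __name__ == "__main__":
    # Two partitions; first feature is a constant bias term.
    data = [
        [([1.0, -2.0], 0), ([1.0, -1.0], 0)],
        [([1.0, 1.0], 1), ([1.0, 2.0], 1)],
    ]
    print(train(data, n_features=2))
```

The edge from the map vertex to the reduce vertex carries only one small gradient vector per task, which is why the per-iteration data movement stays cheap even when the input partitions are large.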

On that note, something which needs incremental iteration can probably be
even more efficient in Tez than these approaches if you unroll the
iteration as 1-1 edges, with all of the final tasks ending up generating
outputs.

Cheers,
Gopal
[1] - https://github.com/myui/hivemall#regression
