Posted to user@tez.apache.org by Rohit Kochar <mn...@gmail.com> on 2014/01/10 12:10:05 UTC

Use of Tez sessions in traditional MR jobs

Hello all,

We are exploring Tez for our use case and would like some input.
In our use case we launch many small MR jobs at frequent intervals.
By small I mean each job takes under 30 seconds to complete, and we launch multiple such jobs every minute.
These jobs all use the same binary; they just run with different inputs each time.
Would using Tez sessions help us in this case?
Is there a way (with minimal code change) to run traditional MR jobs on Tez and also use Tez sessions across jobs?

Thanks
Rohit

RE: Use of Tez sessions in traditional MR jobs

Posted by Bikas Saha <bi...@hortonworks.com>.

Tez only runs on YARN. So you need to have a Hadoop 2.0 cluster with YARN
in order to run Tez.

If that is the case, please follow the install instructions to set up Tez on
your cluster. You will find them in the release artifact. Then in your
mapred-site.xml change the framework configuration to yarn-tez (it should
already be yarn). After that, when you run your MR jobs they will run on
Tez instead of MR. In general, for the straight MR case, you will not see
much performance improvement, since MR is already optimized for what it
does.
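
For reference, the change described above would look roughly like the
fragment below. The property name (mapreduce.framework.name) and the value
(yarn-tez) are the ones named in this thread; the surrounding layout is the
usual Hadoop configuration file format, and the rest of your mapred-site.xml
is assumed to stay as it is.

```xml
<!-- mapred-site.xml (fragment): run MR jobs on the Tez execution engine -->
<property>
  <name>mapreduce.framework.name</name>
  <!-- was "yarn"; "yarn-tez" routes MR jobs through Tez -->
  <value>yarn-tez</value>
</property>
```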

MR on Tez does not use sessions because MR is batch and does not benefit
from sessions. In your case if you did use sessions then (depending on
your session settings) you may be able to reuse the containers and avoid
1-3 seconds of JVM launch overhead. However, someone will need to add a
layer on the MR-upon-Tez code for multiple jobs to be submitted to the
same session.


-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Use of Tez sessions in traditional MR jobs

Posted by Hitesh Shah <hi...@apache.org>.

Hello Rohit 

> 
> I also wanted to check if there is a possibility of just enabling container re-use without changing my job client?
> 
> Also, as Siddharth mentioned that MapReduce on Tez is not fully feature complete, is there any document/jira which lists the missing features in MapReduce on Tez?
> 

I would probably suggest switching the mapreduce.framework.name to "yarn-tez" in mapred-site and just running your job against the tez execution engine. This will allow you to use the container re-use functionality without changing your jobclient. 

In the MR layer, there is still quite a bit of work left, mainly on the command-line tools integration as well as history. However, for the most part, it should be functional, barring any undiscovered bugs.

You can use the attached file as a reference doc for tez configs.

Re: Use of Tez sessions in traditional MR jobs

Posted by Hitesh Shah <hi...@apache.org>.

Yes - starting multiple sessions in parallel should work. It would behave the same way as the case where multiple users each start one session in parallel. The only gotcha is that you are dependent on the cluster/YARN settings, which may or may not allow enough application masters to be launched. One thing to note is that with container re-use (and the appropriate session timeout), the AM will keep containers around. Therefore, the amount of parallelism you get will depend on the YARN scheduling mechanism, the size of the cluster, and how many resources each session uses/retains.
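
To make the execution model concrete, here is a small sketch in plain Java.
It does not use any Tez API: SessionModel is a hypothetical stand-in in which
each "session" is a single worker thread, so jobs submitted to one session
run one after another, while separate sessions run side by side. The class
and method names, and the 50 ms of pretend work, are assumptions made only
for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSessionsSketch {

    /** Hypothetical model of a session: one worker, so its jobs run serially. */
    static final class SessionModel {
        private final String name;
        private final ExecutorService worker = Executors.newSingleThreadExecutor();

        SessionModel(String name) {
            this.name = name;
        }

        /** Queue a job on this session; it starts once earlier jobs finish. */
        Future<String> submit(String jobName) {
            return worker.submit(() -> {
                Thread.sleep(50); // stand-in for the real work of a small job
                return name + ":" + jobName;
            });
        }

        void close() {
            worker.shutdown();
        }
    }

    /** Start several sessions and spread jobs across them round-robin. */
    static List<String> runAll(int sessions, int jobsPerSession) throws Exception {
        List<SessionModel> pool = new ArrayList<>();
        for (int s = 0; s < sessions; s++) {
            pool.add(new SessionModel("session-" + s));
        }

        List<Future<String>> pending = new ArrayList<>();
        for (int j = 0; j < jobsPerSession; j++) {
            for (SessionModel session : pool) {
                pending.add(session.submit("job-" + j));
            }
        }

        // Collect results in submission order, then release the workers.
        List<String> finished = new ArrayList<>();
        for (Future<String> f : pending) {
            finished.add(f.get());
        }
        for (SessionModel session : pool) {
            session.close();
        }
        return finished;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAll(3, 2));
    }
}
```

In this model the number of sessions caps how many jobs run at once. On a
real cluster that cap is further limited by how many application masters and
containers YARN is able to grant, as described above.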

-- Hitesh



On Feb 12, 2014, at 2:52 AM, Rohit Kochar wrote:

> Hello,
> I understand that if I submit multiple DAGs to a session then all of them will execute in sequence, but is it possible to start multiple sessions in parallel?
> What I am trying to achieve here is to start multiple sessions in parallel and then submit jobs to each, so that jobs can execute in parallel.
> 
> Thanks
> Rohit

Re: Use of Tez sessions in traditional MR jobs

Posted by Rohit Kochar <mn...@gmail.com>.

Hello,
I understand that if I submit multiple DAGs to a session then all of them will execute in sequence, but is it possible to start multiple sessions in parallel?
What I am trying to achieve here is to start multiple sessions in parallel and then submit jobs to each, so that jobs can execute in parallel.

Thanks
Rohit



Re: Use of Tez sessions in traditional MR jobs

Posted by Hitesh Shah <hi...@apache.org>.

Hello Rohit, 

To add a third party jar to the classpath of a vertex, you need to use LocalResources. 

For this, you will need to upload the jar to HDFS and create a LocalResource instance pointing to the path on HDFS. By default, a LocalResource of type FILE will be available in the working directory of the Vertex's tasks, and given that ./* is already in the classpath, you will not need to do anything additional with respect to classpath modification. For example code for the above, look at TezClientUtils::setupTezJarsLocalResources().

Also, if the 3rd party jars do not change across jobs, a further optimization would be to have an admin upload the jars only once to HDFS and just reuse the same HDFS file/path across jobs.

thanks
-- Hitesh


On Jan 27, 2014, at 1:57 AM, Rohit Kochar wrote:

> As suggested by everybody, in order to use Tez sessions I changed my current MR job to create a Tez DAG and then submit it to the session (following steps similar to the OrderedWordCount example).
> One thing that I am not able to figure out is a way to add a third-party jar to the classpath of a vertex.
> 
> In a traditional MR job the way to achieve this was to copy the jar to an HDFS location and then add that path as “tmpjars” in the job conf.
> Is there a similar config for a Tez DAG as well?
>   
> Thanks
> Rohit 
> 
> On 13-Jan-2014, at 11:17 pm, Bikas Saha <bi...@hortonworks.com> wrote:
> 
>> Please be sure that you are using the 0.2 release of Tez. That is the one that has all these features.
>>  
>> Container reuse can be enabled via client side config in tez-site.xml. There are multiple configs. So please look at documentation. If the docs are not clear then please file a jira and we will fix the docs.
>>  
>> Yes, if you can change your pipeline to submit these jobs to Tez sessions then you may see significant reduction in overhead costs.
>>  
>> Sessions don’t currently support running multiple DAGs in parallel. So if you are running multiple jobs in parallel then you may not see some of the overhead reductions. However, if all those concurrent jobs can be grouped in 1 DAG (1 DAG containing a group of MR jobs) then you will see all the benefits.
>>  
>> There is a parallel email thread in which Jonathan has asked for a list of MR support. I remember seeing a reply to that thread that summarizes this. Hitesh?
>>  
>> Barring a few things all your MR jobs should still run perfectly fine on Tez. Why don’t you give it a shot and let us know if they don’t.
>>  
>> Bikas
>>  
>> From: Rohit Kochar [mailto:mnit.rohit@gmail.com] 
>> Sent: Monday, January 13, 2014 1:38 AM
>> To: user@tez.incubator.apache.org
>> Subject: Re: Use of Tez sessions in traditional MR jobs
>>  
>> Thanks Hitesh, Siddharth and Bikas for the detailed replies.
>> These are some more details about our use case:
>>  
>> "Have you tried running your current MR job using mapreduce.framework.name to "yarn-tez" to see how much of a difference Tez makes when re-using containers?"
>> Ans: No, I haven’t yet tried running my actual application with MapReduce on Tez, though I have tried the examples that ship along with the code.
>>  
>> "How many of the jobs runs in parallel, do they end up utilizing all capacity on your cluster, or part of it ?"
>> Ans: On average around 10 jobs run in parallel, and they occupy just a part of our cluster.
>> Since our jobs are really small, we are exploring Tez to save on the task launch and cleanup time.
>>  
>> As all of you suggested using sessions, I need to change my existing MR jobs to Tez DAGs. I will try that and give an update.
>>  
>> I also wanted to check if there is a possibility of just enabling container re-use without changing my job client?
>>  
>> Also, as Siddharth mentioned that MapReduce on Tez is not fully feature complete, is there any document/jira which lists the missing features in MapReduce on Tez?
>>  
>> Thanks
>> Rohit
>>  
>>  
>>  
>> On 11-Jan-2014, at 12:49 am, Hitesh Shah <hi...@apache.org> wrote:
>> 
>> 
>> Hello Rohit 
>> 
>> You have quite an interesting use-case. Sessions would definitely help if you plan to run a chain of jobs within the same AM. When re-using the AM as well as the containers, we have seen a huge performance boost for small jobs, because the overheads of launching the AM and new containers are avoided. However, there is a catch here. Performance boosts come in if the session keeps hold of containers for more re-use, though at times data locality can affect performance if the data is large enough. Depending on your cluster resources, it becomes a question of whether the subsequent jobs are submitted at a quick enough rate, or whether there would be a situation where containers are held by the session and lie idle (thereby reducing available resources for other jobs in the cluster). There is a configurable timeout of course, so this can be tuned based on your needs. Given that your job takes approx. 30 seconds and there are multiple jobs per minute, this may not be an issue.
>> 
>> There are 2 possible options we could try though both require changes:
>> 
>> i) Would you be open to trying to convert your MR job into a Tez job? An MR job is effectively a Tez DAG with 2 vertices connected by a shuffle edge. Given that the Tez API is a bit low-level, it may look daunting, but it is fairly straightforward.
>> 
>> Take a look at the WordCount example ( https://git-wip-us.apache.org/repos/asf?p=incubator-tez.git;a=blob_plain;f=tez-mapreduce-examples/src/main/java/org/apache/tez/mapreduce/examples/WordCount.java;hb=HEAD ) to get a general idea. The main part is DAG createDAG(). 
>> 
>> Once this is available, we would still need some changes to allow a new DAG to be submitted to an existing Tez session. Today, all examples launch a single process, start a session, submit multiple dags serially and close the session when the process completes. 
>> 
>> ii) The other option is to change the way the MR-on-Tez integration layer works today. MR jobs inherently assume that they are launching a new YARN application, and the command-line tools make the implicit assumption that a job id maps to a YARN application id. Have you tried running your current MR job with mapreduce.framework.name set to "yarn-tez" to see how much of a difference Tez makes when re-using containers? In this case, too, we would need to enhance Tez to support discovery of existing sessions to submit a job to. The main problem in this scenario may be that the existing command-line MR tools would not work, and the MR-specific job history would not be accessible either.
>> 
>> Let us know if you have any more questions. If you are willing to try out option (i), folks on this list will be able to help you migrate your MR job into a Tez native DAG. Also, could you file a jira for this? It is a good use-case which should be addressed in Tez.
>> 
>> thanks
>> -- Hitesh 

Re: Use of Tez sessions in traditional MR jobs

Posted by Rohit Kochar <mn...@gmail.com>.

As suggested by everybody, in order to use Tez sessions I changed my current MR job to create a Tez DAG and then submit it to the session (following steps similar to the OrderedWordCount example).
One thing that I am not able to figure out is a way to add a third-party jar to the classpath of a vertex.

In a traditional MR job the way to achieve this was to copy the jar to an HDFS location and then add that path as “tmpjars” in the job conf.
Is there a similar config for a Tez DAG as well?
  
Thanks
Rohit 

RE: Use of Tez sessions in traditional MR jobs

Posted by Bikas Saha <bi...@hortonworks.com>.

Please be sure that you are using the 0.2 release of Tez. That is the one
that has all these features.

Container reuse can be enabled via client-side config in tez-site.xml.
There are multiple configs, so please look at the documentation. If the
docs are not clear then please file a jira and we will fix the docs.

Yes, if you can change your pipeline to submit these jobs to Tez sessions
then you may see a significant reduction in overhead costs.

Sessions don’t currently support running multiple DAGs in parallel. So if
you are running multiple jobs in parallel then you may not see some of the
overhead reductions. However, if all those concurrent jobs can be grouped
in 1 DAG (1 DAG containing a group of MR jobs) then you will see all the
benefits.
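
To picture what "1 DAG containing a group of MR jobs" could look like, here is a toy sketch in plain Java. It does not use the Tez API at all; the class name, method name, and job names are made up for illustration. Each former MR job contributes a map vertex, a reduce vertex, and one edge between them, and all of them live in a single graph that would be submitted once.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: this is NOT the Tez API. It just shows the shape of a
// single DAG that contains several independent map -> reduce jobs.
class GroupedDagSketch {

    // Adjacency list: vertex name -> names of its downstream vertices.
    static Map<String, List<String>> groupJobs(List<String> jobNames) {
        Map<String, List<String>> dag = new LinkedHashMap<>();
        for (String job : jobNames) {
            String map = job + "_map";
            String reduce = job + "_reduce";
            // Each former MR job contributes two vertices and one edge.
            dag.put(map, new ArrayList<>(List.of(reduce)));
            dag.put(reduce, new ArrayList<>());
        }
        return dag;
    }

    public static void main(String[] args) {
        // Hypothetical job names standing in for the concurrent MR jobs.
        Map<String, List<String>> dag =
                groupJobs(List.of("jobA", "jobB", "jobC"));
        dag.forEach((vertex, downstream) ->
                System.out.println(vertex + " -> " + downstream));
    }
}
```

Because the map/reduce pairs share no edges, nothing in the graph forces one pair to wait for another; how they are actually scheduled is up to Tez.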

There is a parallel email thread in which Jonathan has asked for a list of
what is supported for MR. I remember seeing a reply to that thread that
summarizes this. Hitesh?

Barring a few things, all your MR jobs should still run perfectly fine on
Tez. Why don’t you give it a shot and let us know if they don’t?

Bikas

--
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to
which it is addressed and may contain information that is confidential,
privileged and exempt from disclosure under applicable law. If the reader
of this message is not the intended recipient, you are hereby notified that
any printing, copying, dissemination, distribution, disclosure or
forwarding of this communication is strictly prohibited. If you have
received this communication in error, please contact the sender immediately
and delete it from your system. Thank You.

Re: Use of Tez sessions in traditional MR jobs

Posted by Rohit Kochar <mn...@gmail.com>.

Thanks Hitesh, Siddharth and Bikas for the detailed replies.
These are some more details about our use case:

"Have you tried running your current MR job with mapreduce.framework.name set to "yarn-tez" to see how much of a difference Tez makes when re-using containers?"
Ans: No, I haven’t yet tried running my actual application with MapReduce on Tez, though I have tried the examples that ship with the code.

"How many of the jobs run in parallel, do they end up utilizing all capacity on your cluster, or part of it?"
Ans: On average around 10 jobs run in parallel, and they occupy only a part of our cluster.
Since our jobs are really small, we are exploring Tez to save on the task launch and cleanup time.

Since all of you suggested that to use sessions I need to change my existing MR jobs to Tez DAGs, I will try that and give an update.

I also wanted to check whether there is a possibility of just enabling container re-use without changing my job client?

Also, since Siddharth mentioned that MapReduce on Tez is not fully feature complete, is there any document/jira that lists the missing features in MapReduce on Tez?

Thanks
Rohit




Re: Use of Tez sessions in traditional MR jobs

Posted by Hitesh Shah <hi...@apache.org>.

Hello Rohit 

You have quite an interesting use-case. Sessions would definitely help if you plan to run a chain of jobs within the same AM. When re-using the AM as well as the containers, we have seen a huge performance boost for small jobs, because the overheads of launching the AM and new containers are avoided. However, there is a catch here. Performance boosts come in if the session keeps hold of containers for more re-use, though at times data locality can affect performance if the data is large enough. Depending on your cluster resources, it becomes a question of whether the subsequent jobs are submitted at a quick enough rate, or whether there would be a situation where containers are held by the session and lying idle (thereby reducing available resources for other jobs in the cluster). There is a configurable timeout, of course, so this can be tuned based on your needs. Given that your job takes approx. 30 seconds and there are multiple jobs per minute, this may not be an issue.
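
The idle-container trade-off above can be put into rough numbers. The following is a back-of-the-envelope sketch in plain Java, not a measurement and not Tez code. It assumes (these are assumptions, not facts from this thread) that the session runs one job at a time, that jobs arrive at a fixed interval, and that the idle timeout is long enough for containers to be held across every gap. The class and method names are made up.

```java
// Back-of-the-envelope model of held-but-idle containers in a session.
// Assumptions: one job at a time, fixed arrival interval, and an idle
// timeout long enough that containers are never released between jobs.
class SessionIdleSketch {

    // Fraction of wall-clock time during which held containers sit idle.
    static double idleFraction(double jobSec, double arrivalIntervalSec) {
        if (jobSec <= 0 || arrivalIntervalSec <= 0) {
            throw new IllegalArgumentException("durations must be positive");
        }
        // If jobs arrive faster than they finish, the session is always busy.
        double busy = Math.min(1.0, jobSec / arrivalIntervalSec);
        return 1.0 - busy;
    }

    public static void main(String[] args) {
        // 30 s jobs arriving every 20 s: work queues up, nothing idles.
        System.out.println(idleFraction(30, 20));
        // 30 s jobs arriving every 120 s: containers idle most of the time.
        System.out.println(idleFraction(30, 120));
    }
}
```

Under these assumptions, 30-second jobs arriving several times a minute leave no idle time, which matches the expectation that idle containers may not be an issue here. If the arrival rate drops, the idle share grows quickly and the timeout becomes the knob to tune.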

There are 2 possible options we could try, though both require changes:

i) Would you be open to trying to convert your MR job into a Tez job? An MR job is effectively a Tez DAG with 2 vertices connected by a shuffle edge. Given that the Tez API is a bit low-level, it may look daunting but is fairly straightforward.

Take a look at the WordCount example ( https://git-wip-us.apache.org/repos/asf?p=incubator-tez.git;a=blob_plain;f=tez-mapreduce-examples/src/main/java/org/apache/tez/mapreduce/examples/WordCount.java;hb=HEAD ) to get a general idea. The main part is DAG createDAG(). 

Once this is available, we would still need some changes to allow a new DAG to be submitted to an existing Tez session. Today, all examples launch a single process, start a session, submit multiple DAGs serially and close the session when the process completes.

ii) The other option is to change the way the MR-on-Tez integration layer works today. MR jobs inherently assume that they are launching a new YARN application, with the command line tools having made implicit assumptions that a job id maps to a YARN application id. Have you tried running your current MR job with mapreduce.framework.name set to "yarn-tez" to see how much of a difference Tez makes when re-using containers? In this case, too, we would need to enhance Tez to support discovery of existing sessions to submit a job to. The main problem in this scenario may be that the existing command-line MR tools would not work, nor would the MR-specific job history be accessible.

Let us know if you have any more questions. If you are willing to try out option (i), folks on this list will be able to help you migrate your MR job into a Tez native DAG. Also, could you file a jira for this? It is a good use-case which should be addressed in Tez.

thanks
-- Hitesh 


Re: Use of Tez sessions in traditional MR jobs

Posted by Siddharth Seth <ss...@apache.org>.

Hi Rohit

For starters, you could try running your MapReduce job using Tez to
ensure that it works correctly. MapReduce on Tez is not 100% compatible
with traditional MapReduce; for example, the functionality available on the
JobClient to track individual tasks is missing.
Trying this out should be fairly straightforward: once you have Tez set up,
you just need to change “mapreduce.framework.name” (mapred-site or
command-line) to “yarn-tez”.
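
For the command-line route, a launcher could assemble the override per run, along the lines of the sketch below. It is plain Java with no Hadoop dependency. The jar name, main class, and paths are hypothetical placeholders, and it assumes the job's driver parses Hadoop generic options (for example by running through ToolRunner); otherwise the -D override is not picked up.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a launcher that builds a "hadoop jar" command line which
// switches a single run to the yarn-tez framework.
class YarnTezLauncherSketch {

    // The -D generic option must come before the job's own arguments and
    // only takes effect if the driver parses generic options.
    static List<String> buildCommand(String jar, String mainClass,
                                     List<String> jobArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("hadoop");
        cmd.add("jar");
        cmd.add(jar);
        cmd.add(mainClass);
        cmd.add("-Dmapreduce.framework.name=yarn-tez");
        cmd.addAll(jobArgs);
        return cmd;
    }

    public static void main(String[] args) {
        // Hypothetical jar, class, and paths; substitute your own.
        List<String> cmd = buildCommand("my-job.jar", "com.example.MyJob",
                List.of("/input/batch-001", "/output/batch-001"));
        System.out.println(String.join(" ", cmd));
        // To actually launch: new ProcessBuilder(cmd).inheritIO().start();
    }
}
```

Setting the property per invocation keeps mapred-site.xml untouched, so the same job can be run on both frameworks for a side-by-side comparison.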

Sessions currently do not work with MapReduce jobs on Tez (when using
JobClient). Getting sessions to work is possible, but fairly complicated
and prone to errors because of the initialization that is already done by
MapReduce before Tez gets control.

It should be possible to run your jobs using the TezClient directly;
there are several helper methods which can be used to perform the
functionality that is otherwise done as part of the MapReduce client.

We will need some more information to answer the question about whether
sessions will be usable for your use case or not.
How many of the jobs run in parallel, and do they end up utilizing all
capacity on your cluster, or part of it?
If individual jobs run for 30 seconds, container re-use via sessions should
buy you a lot.

HTH
- Sid
