Posted to dev@tez.apache.org by Shiri Marron <Sh...@amdocs.com> on 2015/08/25 16:29:12 UTC

Problem when running our code with tez

Hi,

We are trying to run our existing workflows, which contain Pig scripts, on Tez (version 0.5.2.2.2.6.0-2800, HDP 2.2), but we are facing some problems when we run our code with Tez.

In our code, we write to and read from a temp directory whose name we create based on the jobID:
     Part 1 - We extend org.apache.hadoop.mapreduce.RecordWriter, and in close() we take the jobID from the TaskAttemptContext. That is, each task writes a file to
           this directory in the close() method according to the jobID from the context.
    Part 2 - At the end of the whole job (after all the tasks have completed), we have our custom OutputCommitter (which extends
               org.apache.hadoop.mapreduce.OutputCommitter), and in commitJob() it looks for that job's directory and handles all the files under it - the jobID is taken from the JobContext via context.getJobID().toString()
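
Roughly, the mechanism looks like the following sketch (simplified, with hypothetical class names and paths; the actual record-writing logic is omitted):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    // Part 1: each task writes its file under a directory named after the jobID.
    class JobIdDirRecordWriter extends RecordWriter<NullWritable, NullWritable> {
        @Override
        public void write(NullWritable key, NullWritable value) {
            // buffer/handle records here
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException {
            String jobId = context.getJobID().toString();  // under Tez this includes the vertex id
            Path taskFile = new Path("/tmp/ourapp/" + jobId,
                    context.getTaskAttemptID().getTaskID().toString());
            FileSystem fs = taskFile.getFileSystem(context.getConfiguration());
            fs.create(taskFile).close();                   // write the task's output here
        }
    }

    // Part 2: after all tasks complete, the committer processes that same directory.
    class JobIdDirCommitter extends OutputCommitter {
        @Override
        public void commitJob(JobContext context) throws IOException {
            String jobId = context.getJobID().toString();  // plain job id - differs from part 1 under Tez
            Path jobDir = new Path("/tmp/ourapp/" + jobId);
            FileSystem fs = jobDir.getFileSystem(context.getConfiguration());
            for (FileStatus file : fs.listStatus(jobDir)) {
                // handle each task's file here
            }
        }

        // Remaining abstract methods stubbed for brevity.
        @Override public void setupJob(JobContext c) {}
        @Override public void setupTask(TaskAttemptContext c) {}
        @Override public boolean needsTaskCommit(TaskAttemptContext c) { return false; }
        @Override public void commitTask(TaskAttemptContext c) {}
        @Override public void abortTask(TaskAttemptContext c) {}
    }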



We noticed that when we use Tez, this mechanism doesn't work, since the jobID seen in the Tez task (part 1) is composed of the original job id + vertex id, for example 14404914675610 instead of 1440491467561. So the directory name in part 2 is different from the one in part 1.


We looked for a way to retrieve only the vertex id or only the job id, but didn't find one - in the configuration, the property mapreduce.job.id also had the vertex id appended, and no other property value was equal to the original job id.

Can you please advise how we can solve this issue? Is there a way to get the original jobID when we're in part 1?

Regards,
Shiri Marron
Amdocs


RE: Problem when running our code with tez

Posted by Hersh Shafer <He...@amdocs.com>.
+Shiri

-----Original Message-----
From: Hitesh Shah [mailto:hitesh@apache.org] 
Sent: Wednesday, August 26, 2015 12:09 AM
To: dev@tez.apache.org; dev@pig.apache.org
Cc: Hersh Shafer; Almog Shunim
Subject: Re: Problem when running our code with tez

+dev@pig as this might be a question better answered by Pig developers. 

This probably won't answer your question, but it should give you some background info. When Pig uses Tez, it may end up running multiple DAGs within the same YARN application, therefore the "jobId" (in the MR case, the job id maps to the YARN application id) may not be unique. Furthermore, there are cases where multiple vertices within the same DAG could write to HDFS, hence both the dagId and the vertexId are required to guarantee uniqueness when writing to a common location.
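
For illustration, a location that stays unique under Tez would have to carry all three components. A hypothetical sketch (these names and this layout are assumptions, not an actual Tez API):

    // Hypothetical: dropping any component can collide, either across DAGs
    // submitted to the same YARN application or across vertices of one DAG.
    String appId  = "application_1440491467561_0001"; // YARN application id
    String dag    = "dag_1";                          // DAG within the application
    String vertex = "vertex_0";                       // vertex within the DAG
    String safeDir = "/tmp/ourapp/" + appId + "_" + dag + "_" + vertex;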
 
thanks
- Hitesh


On Aug 25, 2015, at 7:29 AM, Shiri Marron <Sh...@amdocs.com> wrote:

> Hi,
> 
> We are trying to run our existing workflows, which contain Pig scripts, on Tez (version 0.5.2.2.2.6.0-2800, HDP 2.2), but we are facing some problems when we run our code with Tez.
> 
> In our code, we write to and read from a temp directory whose name we create based on the jobID:
>     Part 1 - We extend org.apache.hadoop.mapreduce.RecordWriter, and in close() we take the jobID from the TaskAttemptContext. That is, each task writes a file to this directory in the close() method according to the jobID from the context.
>     Part 2 - At the end of the whole job (after all the tasks have completed), we have our custom OutputCommitter (which extends org.apache.hadoop.mapreduce.OutputCommitter), and in commitJob() it looks for that job's directory and handles all the files under it - the jobID is taken from the JobContext via context.getJobID().toString()
> 
> We noticed that when we use Tez, this mechanism doesn't work, since the jobID seen in the Tez task (part 1) is composed of the original job id + vertex id, for example 14404914675610 instead of 1440491467561. So the directory name in part 2 is different from the one in part 1.
> 
> We looked for a way to retrieve only the vertex id or only the job id, but didn't find one - in the configuration, the property mapreduce.job.id also had the vertex id appended, and no other property value was equal to the original job id.
> 
> Can you please advise how we can solve this issue? Is there a way to get the original jobID when we're in part 1?
> 
> Regards,
> Shiri Marron
> Amdocs



Re: Problem when running our code with tez

Posted by Rohini Palaniswamy <ro...@gmail.com>.
> A possible solution is to use conf.get("mapreduce.workflow.id") +
> conf.get("mapreduce.workflow.node.name")

  Daniel, currently those are set only in the vertex conf and will not be
available to MROutput.

Shiri,
   Can you tell us the actual use case that led you to implement a
RecordWriter which writes to a jobID directory? It looks like you want to
write to a temporary directory and do some custom processing before
committing the files. Are you committing to some external directory, other
than the actual output directory, that requires you to use a jobID
directory instead of the _temporary directory MapReduce uses in general?

Regards,
Rohini

On Sun, Aug 30, 2015 at 2:20 AM, Shiri Marron <Sh...@amdocs.com>
wrote:

> +Nir
>
> -----Original Message-----
> From: Hersh Shafer
> Sent: Thursday, August 27, 2015 11:45 AM
> To: Daniel Dai; dev@tez.apache.org; dev@pig.apache.org; Shiri Marron
> Cc: Almog Shunim
> Subject: RE: Problem when running our code with tez
>
> +Shiri
>
> -----Original Message-----
> From: Daniel Dai [mailto:daijy@hortonworks.com]
> Sent: Wednesday, August 26, 2015 1:57 AM
> To: dev@tez.apache.org; dev@pig.apache.org
> Cc: Hersh Shafer; Almog Shunim
> Subject: Re: Problem when running our code with tez
>
> JobID is vague in Tez; you should use the dagId instead. However, I don't see a
> way to get the dagId within a RecordWriter/OutputCommitter. A possible
> solution is to use conf.get("mapreduce.workflow.id") +
> conf.get("mapreduce.workflow.node.name"). Note both are Pig-specific
> configuration and only applicable if you run with Pig.
>
> Daniel
>
>
>
>
> On 8/25/15, 2:08 PM, "Hitesh Shah" <hi...@apache.org> wrote:
>
> >+dev@pig as this might be a question better answered by Pig developers.
> >
> >This probably won't answer your question, but it should give you some
> >background info. When Pig uses Tez, it may end up running multiple DAGs
> >within the same YARN application, therefore the "jobId" (in the MR case,
> >the job id maps to the YARN application id) may not be unique.
> >Furthermore, there are cases where multiple vertices within the same
> >DAG could write to HDFS, hence both the dagId and the vertexId are
> >required to guarantee uniqueness when writing to a common location.
> >
> >thanks
> >— Hitesh
> >
> >
> >On Aug 25, 2015, at 7:29 AM, Shiri Marron <Sh...@amdocs.com>
> wrote:
> >
> >> Hi,
> >>
> >> We are trying to run our existing workflows, which contain Pig scripts, on Tez
> >> (version 0.5.2.2.2.6.0-2800, HDP 2.2), but we are facing some problems when we
> >> run our code with Tez.
> >>
> >> In our code, we write to and read from a temp directory whose name we create
> >> based on the jobID:
> >>     Part 1 - We extend org.apache.hadoop.mapreduce.RecordWriter, and in
> >> close() we take the jobID from the TaskAttemptContext. That is, each task
> >> writes a file to this directory in the close() method according to the jobID
> >> from the context.
> >>     Part 2 - At the end of the whole job (after all the tasks have completed),
> >> we have our custom OutputCommitter (which extends
> >> org.apache.hadoop.mapreduce.OutputCommitter), and in commitJob() it looks for
> >> that job's directory and handles all the files under it - the jobID is taken
> >> from the JobContext via context.getJobID().toString()
> >>
> >> We noticed that when we use Tez, this mechanism doesn't work, since the jobID
> >> seen in the Tez task (part 1) is composed of the original job id + vertex id,
> >> for example 14404914675610 instead of 1440491467561. So the directory name in
> >> part 2 is different from the one in part 1.
> >>
> >> We looked for a way to retrieve only the vertex id or only the job id, but
> >> didn't find one - in the configuration, the property mapreduce.job.id also had
> >> the vertex id appended, and no other property value was equal to the original
> >> job id.
> >>
> >> Can you please advise how we can solve this issue? Is there a way to get the
> >> original jobID when we're in part 1?
> >>
> >> Regards,
> >> Shiri Marron
> >> Amdocs
> >
> >
>
>
>


RE: Problem when running our code with tez

Posted by Shiri Marron <Sh...@amdocs.com>.
+Nir

-----Original Message-----
From: Hersh Shafer 
Sent: Thursday, August 27, 2015 11:45 AM
To: Daniel Dai; dev@tez.apache.org; dev@pig.apache.org; Shiri Marron
Cc: Almog Shunim
Subject: RE: Problem when running our code with tez

+Shiri

-----Original Message-----
From: Daniel Dai [mailto:daijy@hortonworks.com]
Sent: Wednesday, August 26, 2015 1:57 AM
To: dev@tez.apache.org; dev@pig.apache.org
Cc: Hersh Shafer; Almog Shunim
Subject: Re: Problem when running our code with tez

JobID is vague in Tez; you should use the dagId instead. However, I don't see a way to get the dagId within a RecordWriter/OutputCommitter. A possible solution is to use conf.get("mapreduce.workflow.id") + conf.get("mapreduce.workflow.node.name"). Note both are Pig-specific configuration and only applicable if you run with Pig.

Daniel




On 8/25/15, 2:08 PM, "Hitesh Shah" <hi...@apache.org> wrote:

>+dev@pig as this might be a question better answered by Pig developers.
>
>This probably won't answer your question, but it should give you some
>background info. When Pig uses Tez, it may end up running multiple DAGs
>within the same YARN application, therefore the "jobId" (in the MR case,
>the job id maps to the YARN application id) may not be unique.
>Furthermore, there are cases where multiple vertices within the same
>DAG could write to HDFS, hence both the dagId and the vertexId are
>required to guarantee uniqueness when writing to a common location.
>
>thanks
>— Hitesh
>
>
>On Aug 25, 2015, at 7:29 AM, Shiri Marron <Sh...@amdocs.com> wrote:
>
>> Hi,
>> 
>> We are trying to run our existing workflows, which contain Pig scripts, on Tez (version 0.5.2.2.2.6.0-2800, HDP 2.2), but we are facing some problems when we run our code with Tez.
>> 
>> In our code, we write to and read from a temp directory whose name we create based on the jobID:
>>     Part 1 - We extend org.apache.hadoop.mapreduce.RecordWriter, and in close() we take the jobID from the TaskAttemptContext. That is, each task writes a file to this directory in the close() method according to the jobID from the context.
>>     Part 2 - At the end of the whole job (after all the tasks have completed), we have our custom OutputCommitter (which extends org.apache.hadoop.mapreduce.OutputCommitter), and in commitJob() it looks for that job's directory and handles all the files under it - the jobID is taken from the JobContext via context.getJobID().toString()
>> 
>> We noticed that when we use Tez, this mechanism doesn't work, since the jobID seen in the Tez task (part 1) is composed of the original job id + vertex id, for example 14404914675610 instead of 1440491467561. So the directory name in part 2 is different from the one in part 1.
>> 
>> We looked for a way to retrieve only the vertex id or only the job id, but didn't find one - in the configuration, the property mapreduce.job.id also had the vertex id appended, and no other property value was equal to the original job id.
>> 
>> Can you please advise how we can solve this issue? Is there a way to get the original jobID when we're in part 1?
>> 
>> Regards,
>> Shiri Marron
>> Amdocs
>
>




RE: Problem when running our code with tez

Posted by Hersh Shafer <He...@amdocs.com>.
+Shiri

-----Original Message-----
From: Daniel Dai [mailto:daijy@hortonworks.com] 
Sent: Wednesday, August 26, 2015 1:57 AM
To: dev@tez.apache.org; dev@pig.apache.org
Cc: Hersh Shafer; Almog Shunim
Subject: Re: Problem when running our code with tez

JobID is vague in Tez; you should use the dagId instead. However, I don't see a way to get the dagId within a RecordWriter/OutputCommitter. A possible solution is to use conf.get("mapreduce.workflow.id") + conf.get("mapreduce.workflow.node.name"). Note both are Pig-specific configuration and only applicable if you run with Pig.

Daniel




On 8/25/15, 2:08 PM, "Hitesh Shah" <hi...@apache.org> wrote:

>+dev@pig as this might be a question better answered by Pig developers.
>
>This probably won't answer your question, but it should give you some
>background info. When Pig uses Tez, it may end up running multiple DAGs
>within the same YARN application, therefore the "jobId" (in the MR case,
>the job id maps to the YARN application id) may not be unique.
>Furthermore, there are cases where multiple vertices within the same
>DAG could write to HDFS, hence both the dagId and the vertexId are
>required to guarantee uniqueness when writing to a common location.
>
>thanks
>— Hitesh
>
>
>On Aug 25, 2015, at 7:29 AM, Shiri Marron <Sh...@amdocs.com> wrote:
>
>> Hi,
>> 
>> We are trying to run our existing workflows, which contain Pig scripts, on Tez (version 0.5.2.2.2.6.0-2800, HDP 2.2), but we are facing some problems when we run our code with Tez.
>> 
>> In our code, we write to and read from a temp directory whose name we create based on the jobID:
>>     Part 1 - We extend org.apache.hadoop.mapreduce.RecordWriter, and in close() we take the jobID from the TaskAttemptContext. That is, each task writes a file to this directory in the close() method according to the jobID from the context.
>>     Part 2 - At the end of the whole job (after all the tasks have completed), we have our custom OutputCommitter (which extends org.apache.hadoop.mapreduce.OutputCommitter), and in commitJob() it looks for that job's directory and handles all the files under it - the jobID is taken from the JobContext via context.getJobID().toString()
>> 
>> We noticed that when we use Tez, this mechanism doesn't work, since the jobID seen in the Tez task (part 1) is composed of the original job id + vertex id, for example 14404914675610 instead of 1440491467561. So the directory name in part 2 is different from the one in part 1.
>> 
>> We looked for a way to retrieve only the vertex id or only the job id, but didn't find one - in the configuration, the property mapreduce.job.id also had the vertex id appended, and no other property value was equal to the original job id.
>> 
>> Can you please advise how we can solve this issue? Is there a way to get the original jobID when we're in part 1?
>> 
>> Regards,
>> Shiri Marron
>> Amdocs
>
>




Re: Problem when running our code with tez

Posted by Daniel Dai <da...@hortonworks.com>.
JobID is vague in Tez; you should use the dagId instead. However, I don't see a
way to get the dagId within a RecordWriter/OutputCommitter. A possible
solution is to use conf.get("mapreduce.workflow.id") +
conf.get("mapreduce.workflow.node.name"). Note both are Pig-specific
configuration and only applicable if you run with Pig.
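
A minimal sketch of that workaround (a hypothetical helper; note Rohini's caveat elsewhere in this thread that these two properties are set only in the vertex conf and are not available to MROutput):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    // Hypothetical helper: derive a directory name that stays the same in the
    // RecordWriter and in the OutputCommitter when running under Pig on Tez.
    final class WorkflowDirs {
        static Path workflowTempDir(Configuration conf) throws IOException {
            String workflowId = conf.get("mapreduce.workflow.id");
            String nodeName = conf.get("mapreduce.workflow.node.name");
            if (workflowId == null || nodeName == null) {
                // Both properties are Pig-specific; outside Pig they are absent.
                throw new IOException("workflow properties not set; not running under Pig?");
            }
            return new Path("/tmp/ourapp", workflowId + "_" + nodeName);
        }
    }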

Daniel




On 8/25/15, 2:08 PM, "Hitesh Shah" <hi...@apache.org> wrote:

>+dev@pig as this might be a question better answered by Pig developers.
>
>This probably won't answer your question, but it should give you some
>background info. When Pig uses Tez, it may end up running multiple DAGs
>within the same YARN application, therefore the "jobId" (in the MR case,
>the job id maps to the YARN application id) may not be unique.
>Furthermore, there are cases where multiple vertices within the same
>DAG could write to HDFS, hence both the dagId and the vertexId are
>required to guarantee uniqueness when writing to a common location.
>
>thanks
>— Hitesh
>
>
>On Aug 25, 2015, at 7:29 AM, Shiri Marron <Sh...@amdocs.com> wrote:
>
>> Hi,
>> 
>> We are trying to run our existing workflows, which contain Pig scripts, on Tez
>> (version 0.5.2.2.2.6.0-2800, HDP 2.2), but we are facing some problems when we
>> run our code with Tez.
>> 
>> In our code, we write to and read from a temp directory whose name we create
>> based on the jobID:
>>     Part 1 - We extend org.apache.hadoop.mapreduce.RecordWriter, and in close()
>> we take the jobID from the TaskAttemptContext. That is, each task writes a file
>> to this directory in the close() method according to the jobID from the context.
>>     Part 2 - At the end of the whole job (after all the tasks have completed),
>> we have our custom OutputCommitter (which extends
>> org.apache.hadoop.mapreduce.OutputCommitter), and in commitJob() it looks for
>> that job's directory and handles all the files under it - the jobID is taken
>> from the JobContext via context.getJobID().toString()
>> 
>> We noticed that when we use Tez, this mechanism doesn't work, since the jobID
>> seen in the Tez task (part 1) is composed of the original job id + vertex id,
>> for example 14404914675610 instead of 1440491467561. So the directory name in
>> part 2 is different from the one in part 1.
>> 
>> We looked for a way to retrieve only the vertex id or only the job id, but
>> didn't find one - in the configuration, the property mapreduce.job.id also had
>> the vertex id appended, and no other property value was equal to the original
>> job id.
>> 
>> Can you please advise how we can solve this issue? Is there a way to get the
>> original jobID when we're in part 1?
>> 
>> Regards,
>> Shiri Marron
>> Amdocs
>
>



Re: Problem when running our code with tez

Posted by Hitesh Shah <hi...@apache.org>.
+dev@pig as this might be a question better answered by Pig developers. 

This probably won't answer your question, but it should give you some background info. When Pig uses Tez, it may end up running multiple DAGs within the same YARN application, therefore the "jobId" (in the MR case, the job id maps to the YARN application id) may not be unique. Furthermore, there are cases where multiple vertices within the same DAG could write to HDFS, hence both the dagId and the vertexId are required to guarantee uniqueness when writing to a common location.
 
thanks
— Hitesh


On Aug 25, 2015, at 7:29 AM, Shiri Marron <Sh...@amdocs.com> wrote:

> Hi,
> 
> We are trying to run our existing workflows, which contain Pig scripts, on Tez (version 0.5.2.2.2.6.0-2800, HDP 2.2), but we are facing some problems when we run our code with Tez.
> 
> In our code, we write to and read from a temp directory whose name we create based on the jobID:
>     Part 1 - We extend org.apache.hadoop.mapreduce.RecordWriter, and in close() we take the jobID from the TaskAttemptContext. That is, each task writes a file to this directory in the close() method according to the jobID from the context.
>     Part 2 - At the end of the whole job (after all the tasks have completed), we have our custom OutputCommitter (which extends org.apache.hadoop.mapreduce.OutputCommitter), and in commitJob() it looks for that job's directory and handles all the files under it - the jobID is taken from the JobContext via context.getJobID().toString()
> 
> We noticed that when we use Tez, this mechanism doesn't work, since the jobID seen in the Tez task (part 1) is composed of the original job id + vertex id, for example 14404914675610 instead of 1440491467561. So the directory name in part 2 is different from the one in part 1.
> 
> We looked for a way to retrieve only the vertex id or only the job id, but didn't find one - in the configuration, the property mapreduce.job.id also had the vertex id appended, and no other property value was equal to the original job id.
> 
> Can you please advise how we can solve this issue? Is there a way to get the original jobID when we're in part 1?
> 
> Regards,
> Shiri Marron
> Amdocs

