Posted to dev@oozie.apache.org by "Andras Piros (JIRA)" <ji...@apache.org> on 2017/03/28 13:49:41 UTC

[jira] [Commented] (OOZIE-2847) Oozie Ha timing issue

    [ https://issues.apache.org/jira/browse/OOZIE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15945204#comment-15945204 ] 

Andras Piros commented on OOZIE-2847:
-------------------------------------

[~bpgergo] thanks for the contribution!

My thoughts:
* In the original stack trace I cannot see the log message {{Hadoop job id mismatch}}, which would indicate that the code path affected by the fix is actually being run.
* When a job Id file is present but empty, how can Oozie be sure that the {{JobConf}} entry {{mapred.job.id}} is the job Id that belongs in that file? The patch would always overwrite the empty file with the current Id.
* From the unit test method name alone I cannot tell what the prerequisites are or what the observable behavior is. If possible, please use more descriptive names and split the test into multiple methods, each covering a single piece of functionality or use case.
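
To make the second point concrete, here is a minimal sketch of the states a recovery check could distinguish. It is not the Oozie code and not the attached patch: the class {{JobIdFileCheck}}, the {{State}} enum and the {{check}} method are hypothetical names, and a local temporary file stands in for the action file on HDFS.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; a local file stands in for the HDFS action file.
public class JobIdFileCheck {

    /** Possible outcomes when comparing the recorded job id with the current one. */
    enum State { MISSING, EMPTY, MATCH, MISMATCH }

    /** Classifies the id file instead of treating "empty" like "declares a different id". */
    static State check(Path idFile, String currentJobId) throws IOException {
        if (!Files.exists(idFile)) {
            return State.MISSING;          // first attempt: nothing was recorded yet
        }
        String recorded = new String(Files.readAllBytes(idFile), StandardCharsets.UTF_8).trim();
        if (recorded.isEmpty()) {
            return State.EMPTY;            // file created, but the id was never written
        }
        return recorded.equals(currentJobId) ? State.MATCH : State.MISMATCH;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("oozie-2847-sketch");
        Path idFile = dir.resolve("action-id");
        String current = "job_1436507526035_0001";

        System.out.println(check(idFile, current));   // file does not exist yet
        Files.createFile(idFile);
        System.out.println(check(idFile, current));   // file exists but is empty
        Files.write(idFile, current.getBytes(StandardCharsets.UTF_8));
        System.out.println(check(idFile, current));   // recorded id equals current id
        System.out.println(check(idFile, "job_1436507526035_0002")); // recorded id differs
    }
}
```

The open question is what to do in the {{EMPTY}} case: the sketch only shows that it can be told apart from a real mismatch, not that the current Id is safe to write into the file.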

> Oozie Ha timing issue
> ---------------------
>
>                 Key: OOZIE-2847
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2847
>             Project: Oozie
>          Issue Type: Bug
>          Components: HA
>    Affects Versions: 4.3.0
>            Reporter: Péter Gergő Barna
>            Priority: Minor
>         Attachments: OOZIE-2847.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Oozie Ha timing issue
> When Oozie launches the mapper, it writes a job id into a file on HDFS. Assume the ApplicationMaster is killed and Oozie makes a second attempt during recovery. On the second attempt, Oozie checks whether the job id previously written to HDFS matches the current job id. In most cases it does. However, if the Oozie launcher is killed while it is in the middle of writing the id, the file has already been created on HDFS but the id has not yet been written to it. During the next recovery, Oozie mistakenly assumes that the file contains the id although it is actually empty, and therefore throws this exception:
> {noformat}
> 2015-07-10 05:56:58,137|beaver.machine|INFO|5208|1344|MainThread|------------------------------------------------------------------------------------------------------------------------------------
> 2015-07-10 05:56:58,137|beaver.machine|INFO|5208|1344|MainThread|Console URL       : http://dal-ha21:8088/proxy/application_1436507526035_0001/
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Error Code        : JA018
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Error Message     : Hadoop job Id mismatch, action file [hdfs://hdp2-ha2/user/hadoopqa/oozie-hado/0000003-150710041341636-oozie-hado-W/pig-node--pig/0000003-150710041341636-oozie-hado-W@pig-node@0] declares Id [null] current Id [job_1436507526035_0001]
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|External ID       : job_1436507526035_0001
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|External Status   : FAILED/KILLED
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Name              : pig-node
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Retries           : 0
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Tracker URI       : dal-ha21:8032
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Type              : pig
> 2015-07-10 05:56:58,158|beaver.machine|INFO|5208|1344|MainThread|Started           : 2015-07-10 05:55:19 GMT
> 2015-07-10 05:56:58,160|beaver.machine|INFO|5208|1344|MainThread|Status            : ERROR
> 2015-07-10 05:56:58,161|beaver.machine|INFO|5208|1344|MainThread|Ended             : 2015-07-10 05:56:42 GMT
> 2015-07-10 05:56:58,161|beaver.machine|INFO|5208|1344|MainThread|External Stats    : null
> 2015-07-10 05:56:58,161|beaver.machine|INFO|5208|1344|MainThread|External ChildIDs : null
> 2015-07-10 05:56:58,161|beaver.machine|INFO|5208|1344|MainThread|------------------------------------------------------------------------------------------------------------------------------------
> Exception:
> 2015-07-10 05:56:18,658 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://hdp2-ha2:8020]
> 2015-07-10 05:56:18,665 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Previous history file is at hdfs://hdp2-ha2:8020/user/hadoopqa/.staging/job_1436507526035_0001/job_1436507526035_0001_1.jhist
> 2015-07-10 05:56:18,693 WARN [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Unable to parse prior job history, aborting recovery
> java.io.IOException: Incompatible event log version: null
> 	at org.apache.hadoop.mapreduce.jobhistory.EventReader.<init>(EventReader.java:71)
> 	at org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser.parse(JobHistoryParser.java:139)
> 	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.parsePreviousJobHistory(MRAppMaster.java:1206)
> 	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.processRecovery(MRAppMaster.java:1175)
> 	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1039)
> 	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> 	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$4.run(MRAppMaster.java:1519)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> 	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1515)
> 	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1448)
> 2015-07-10 05:56:18,737 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://hdp2-ha2:8020]
> 2015-07-10 05:56:18,745 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Previous history file is at hdfs://hdp2-ha2:8020/user/hadoopqa/.staging/job_1436507526035_0001/job_1436507526035_0001_1.jhist
> {noformat}


