Posted to mapreduce-issues@hadoop.apache.org by "Ivan A. Veselovsky (JIRA)" <ji...@apache.org> on 2012/11/06 12:30:12 UTC
[jira] [Created] (MAPREDUCE-4774) repair test org.apache.hadoop.mapred.TestClusterMRNotification.testMR
Ivan A. Veselovsky created MAPREDUCE-4774:
---------------------------------------------
Summary: repair test org.apache.hadoop.mapred.TestClusterMRNotification.testMR
Key: MAPREDUCE-4774
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4774
Project: Hadoop Map/Reduce
Issue Type: Bug
Reporter: Ivan A. Veselovsky
The test org.apache.hadoop.mapred.TestClusterMRNotification.testMR frequently fails in the mapred build (e.g. see https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2988/testReport/junit/org.apache.hadoop.mapred/TestClusterMRNotification/testMR/ or
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2982//testReport/org.apache.hadoop.mapred/TestClusterMRNotification/testMR/).
The test aims to check job status notifications received through an HTTP servlet. It runs 3 jobs: one successful, one killed, and one failed.
The test expects the servlet to receive specific notifications in a specific order. It also tries to test the retry-on-failure notification functionality, so on each 1st notification attempt the servlet answers "400 forcing error", and on each 2nd notification attempt it answers "ok".
In general, the test fails because the actual number and/or type of the notifications differs from what is expected.
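The retry behavior described above can be illustrated with a minimal, self-contained sketch. This is not the actual test code: the class names FlakyNotificationEndpoint and NotificationRetryDemo, the method notifyWithRetry, and the status strings "SUCCEEDED" and "KILLED" are assumptions made for illustration, and the endpoint is a plain object rather than a real HTTP servlet.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the test servlet: rejects every 1st attempt, accepts every 2nd. */
final class FlakyNotificationEndpoint {
    private int attempts = 0;
    final List<String> accepted = new ArrayList<>();

    /** Returns an HTTP-like status code: 400 on odd-numbered attempts, 200 on even-numbered ones. */
    int receive(String jobStatus) {
        attempts++;
        if (attempts % 2 == 1) {
            return 400; // corresponds to "400 forcing error"
        }
        accepted.add(jobStatus);
        return 200; // corresponds to "ok"
    }
}

final class NotificationRetryDemo {
    /** Hypothetical sender: retries until the endpoint accepts or the attempt budget is used up. */
    static boolean notifyWithRetry(FlakyNotificationEndpoint endpoint, String jobStatus, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            if (endpoint.receive(jobStatus) == 200) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        FlakyNotificationEndpoint endpoint = new FlakyNotificationEndpoint();
        // One notification per job; the status names other than FAILED are assumed.
        for (String status : new String[] {"SUCCEEDED", "KILLED", "FAILED"}) {
            notifyWithRetry(endpoint, status, 2);
        }
        System.out.println(endpoint.accepted); // prints [SUCCEEDED, KILLED, FAILED]
    }
}
```

With this scheme, any unexpected extra notification changes both the count and the sequence of accepted statuses, which is the kind of mismatch the test reports.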
Investigation shows that the actual root cause of the problem is an incorrect job state transition: a task of the 3rd job fails (because of an intentionally thrown RuntimeException, see UtilsForTests#runJobFail()), and the state of the task changes from RUNNING to FAILED.
At this point a JobEventType.JOB_TASK_ATTEMPT_COMPLETED event is submitted (in method org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.handleTaskAttemptCompletion(TaskAttemptId, TaskAttemptCompletionEventStatus)), and this event gets processed in AsyncDispatcher, but the job is already in the FAILED state, and the transition map defines no transition for this event in that state (see org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl#stateMachineFactory). This causes the following exception to be thrown when the event is processed:
2012-11-06 12:22:02,335 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: JOB_TASK_ATTEMPT_COMPLETED at FAILED
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:309)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$3(StateMachineFactory.java:290)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:454)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:716)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:1)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:917)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:79)
at java.lang.Thread.run(Thread.java:662)
So the job gets into the "INTERNAL_ERROR" state, and a job end notification like this is sent:
http://localhost:48656/notification/mapred?jobId=job_1352199715842_0002&jobStatus=ERROR
(here we can see the "ERROR" status instead of "FAILED")
After that the notification servlet receives either only an "ERROR" notification, or one more "ERROR" notification after "FAILED", which ultimately causes the test to fail. (The test behavior varies because of race conditions among the many asynchronous operations involved, so the test is in fact flaky.)
In any case, it looks like the root cause of the problem is that the forbidden transition "Invalid event: JOB_TASK_ATTEMPT_COMPLETED at FAILED" can occur.
Expert advice is needed on how this should be fixed.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4774) JobImpl does not handle asynchronous task events in FAILED state
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494674#comment-13494674 ]
Hudson commented on MAPREDUCE-4774:
-----------------------------------
Integrated in Hadoop-Mapreduce-trunk #1253 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1253/])
MAPREDUCE-4774. JobImpl does not handle asynchronous task events in FAILED state (jlowe via bobby) (Revision 1407679)
Result = FAILURE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407679
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestJobImpl.java
> JobImpl does not handle asynchronous task events in FAILED state
> ----------------------------------------------------------------
>
> Key: MAPREDUCE-4774
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4774
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Ivan A. Veselovsky
> Assignee: Jason Lowe
> Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
>
> Attachments: MAPREDUCE-4774.patch
[jira] [Commented] (MAPREDUCE-4774) JobImpl does not handle asynchronous task events in FAILED state
Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494377#comment-13494377 ]
Robert Joseph Evans commented on MAPREDUCE-4774:
------------------------------------------------
The change looks simple enough and does fix the failing test. I am +1 pending Jenkins approval.
> JobImpl does not handle asynchronous task events in FAILED state
> ----------------------------------------------------------------
>
> Key: MAPREDUCE-4774
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4774
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Ivan A. Veselovsky
> Assignee: Jason Lowe
> Attachments: MAPREDUCE-4774.patch
[jira] [Commented] (MAPREDUCE-4774) JobImpl does not handle asynchronous task events in FAILED state
Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494392#comment-13494392 ]
Robert Joseph Evans commented on MAPREDUCE-4774:
------------------------------------------------
I ran TestRecovery manually and it looks like it is a spurious failure. We should file a JIRA to fix it. Checking in the patch now.
[jira] [Commented] (MAPREDUCE-4774) JobImpl does not handle asynchronous task events in FAILED state
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494612#comment-13494612 ]
Hudson commented on MAPREDUCE-4774:
-----------------------------------
Integrated in Hadoop-Yarn-trunk #32 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/32/])
MAPREDUCE-4774. JobImpl does not handle asynchronous task events in FAILED state (jlowe via bobby) (Revision 1407679)
Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407679
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestJobImpl.java
[jira] [Updated] (MAPREDUCE-4774) JobImpl does not handle asynchronous task events in FAILED state
Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe updated MAPREDUCE-4774:
----------------------------------
Assignee: Jason Lowe
Target Version/s: 2.0.3-alpha, 0.23.5
Status: Patch Available (was: Open)
> JobImpl does not handle asynchronous task events in FAILED state
> ----------------------------------------------------------------
>
> Key: MAPREDUCE-4774
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4774
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 2.0.1-alpha, 0.23.3
> Reporter: Ivan A. Veselovsky
> Assignee: Jason Lowe
> Attachments: MAPREDUCE-4774.patch
[jira] [Updated] (MAPREDUCE-4774) repair test org.apache.hadoop.mapred.TestClusterMRNotification.testMR
Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe updated MAPREDUCE-4774:
----------------------------------
Attachment: MAPREDUCE-4774.patch
This test failure is pretty pervasive and annoying, so I am taking this to get it fixed quickly. The patch ignores some asynchronous task events in the FAILED state, much like we do in the ERROR state, and adds corresponding unit tests to verify we're handling them properly.
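To make the idea concrete, here is a toy sketch of a job state machine in which a late event arriving in the FAILED state is either treated as an internal error or ignored. It is an illustration only and not the contents of MAPREDUCE-4774.patch: the class ToyJobStateMachine, its State and Event enums, and the JOB_FAILURE_DETECTED event are hypothetical, and the real JobImpl builds its transitions through StateMachineFactory rather than a switch statement.

```java
import java.util.EnumSet;
import java.util.Set;

/** Toy model (hypothetical) of a job whose FAILED state may or may not tolerate late events. */
final class ToyJobStateMachine {
    enum State { RUNNING, FAILED, ERROR }

    // JOB_TASK_ATTEMPT_COMPLETED comes from the issue; JOB_FAILURE_DETECTED is invented here.
    enum Event { JOB_FAILURE_DETECTED, JOB_TASK_ATTEMPT_COMPLETED }

    private final Set<Event> ignoredInFailed;
    private State state = State.RUNNING;

    ToyJobStateMachine(Set<Event> ignoredInFailed) {
        this.ignoredInFailed = ignoredInFailed;
    }

    /** Applies one event and returns the resulting state. */
    State handle(Event event) {
        switch (state) {
            case RUNNING:
                if (event == Event.JOB_FAILURE_DETECTED) {
                    state = State.FAILED;
                }
                break;
            case FAILED:
                // An event with no registered transition is modeled as an internal error.
                if (!ignoredInFailed.contains(event)) {
                    state = State.ERROR;
                }
                break;
            case ERROR:
                break; // terminal: every event is ignored
        }
        return state;
    }

    public static void main(String[] args) {
        // Before: nothing is ignored in FAILED, so the late event flips the job to ERROR.
        ToyJobStateMachine strict = new ToyJobStateMachine(EnumSet.noneOf(Event.class));
        strict.handle(Event.JOB_FAILURE_DETECTED);
        System.out.println(strict.handle(Event.JOB_TASK_ATTEMPT_COMPLETED)); // prints ERROR

        // After: the late event is ignored in FAILED, so the job stays FAILED.
        ToyJobStateMachine tolerant =
            new ToyJobStateMachine(EnumSet.of(Event.JOB_TASK_ATTEMPT_COMPLETED));
        tolerant.handle(Event.JOB_FAILURE_DETECTED);
        System.out.println(tolerant.handle(Event.JOB_TASK_ATTEMPT_COMPLETED)); // prints FAILED
    }
}
```

In the tolerant configuration the final status stays FAILED, so a job end notification derived from it would carry FAILED rather than ERROR.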
> repair test org.apache.hadoop.mapred.TestClusterMRNotification.testMR
> ---------------------------------------------------------------------
>
> Key: MAPREDUCE-4774
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4774
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Ivan A. Veselovsky
> Attachments: MAPREDUCE-4774.patch
[jira] [Commented] (MAPREDUCE-4774) JobImpl does not handle
asynchronous task events in FAILED state
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494459#comment-13494459 ]
Hudson commented on MAPREDUCE-4774:
-----------------------------------
Integrated in Hadoop-trunk-Commit #2997 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/2997/])
MAPREDUCE-4774. JobImpl does not handle asynchronous task events in FAILED state (jlowe via bobby) (Revision 1407679)
Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407679
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestJobImpl.java
> JobImpl does not handle asynchronous task events in FAILED state
> ----------------------------------------------------------------
>
> Key: MAPREDUCE-4774
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4774
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Ivan A. Veselovsky
> Assignee: Jason Lowe
> Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
>
> Attachments: MAPREDUCE-4774.patch
[jira] [Commented] (MAPREDUCE-4774) JobImpl does not handle
asynchronous task events in FAILED state
Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494389#comment-13494389 ]
Hadoop QA commented on MAPREDUCE-4774:
--------------------------------------
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12552903/MAPREDUCE-4774.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app:
org.apache.hadoop.mapreduce.v2.app.TestRecovery
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3006//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3006//console
This message is automatically generated.
> JobImpl does not handle asynchronous task events in FAILED state
> ----------------------------------------------------------------
>
> Key: MAPREDUCE-4774
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4774
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Ivan A. Veselovsky
> Assignee: Jason Lowe
> Attachments: MAPREDUCE-4774.patch
[jira] [Commented] (MAPREDUCE-4774) repair test
org.apache.hadoop.mapred.TestClusterMRNotification.testMR
Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493279#comment-13493279 ]
Jason Lowe commented on MAPREDUCE-4774:
---------------------------------------
Thanks for the analysis, Ivan! JobImpl's state machine is missing a number of events in the FAILED state. Because the job, task, and task attempt state machines run asynchronously, tasks and task attempts can complete even after the job as a whole has decided to fail for other reasons. We therefore need to ignore the following additional events in the FAILED state, so that their asynchronous arrival does not knock the job out of the FAILED state and into the ERROR state:
JOB_TASK_COMPLETED
JOB_TASK_ATTEMPT_COMPLETED
JOB_MAP_TASK_RESCHEDULED
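The idea in the comment above can be illustrated with a minimal sketch, assuming a simplified table-driven state machine. The class JobStateSketch, the JOB_FAIL event, and the buildTable and handle helpers are hypothetical names made up for this illustration; this is not Hadoop's JobImpl or its StateMachineFactory API, and it is not the attached patch. It only shows why registering self-transitions for late task events keeps the job in FAILED, while an unregistered event falls through to ERROR.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

// Hypothetical, simplified illustration; this is NOT Hadoop's JobImpl or StateMachineFactory.
class JobStateSketch {
    enum State { RUNNING, FAILED, ERROR }

    enum Event {
        JOB_TASK_COMPLETED,
        JOB_TASK_ATTEMPT_COMPLETED,
        JOB_MAP_TASK_RESCHEDULED,
        JOB_FAIL // hypothetical event name, used only to reach FAILED in this sketch
    }

    // Task-level events that can still arrive after the job has already failed.
    static final EnumSet<Event> LATE_TASK_EVENTS = EnumSet.of(
        Event.JOB_TASK_COMPLETED,
        Event.JOB_TASK_ATTEMPT_COMPLETED,
        Event.JOB_MAP_TASK_RESCHEDULED);

    // Builds the transition table: state -> (event -> next state).
    static Map<State, Map<Event, State>> buildTable(boolean ignoreLateEventsInFailed) {
        Map<State, Map<Event, State>> table =
            new EnumMap<State, Map<Event, State>>(State.class);
        for (State s : State.values()) {
            table.put(s, new EnumMap<Event, State>(Event.class));
        }
        for (Event e : LATE_TASK_EVENTS) {
            table.get(State.RUNNING).put(e, State.RUNNING);
        }
        table.get(State.RUNNING).put(Event.JOB_FAIL, State.FAILED);
        if (ignoreLateEventsInFailed) {
            // Self-transitions: the event is accepted and the job stays FAILED.
            for (Event e : LATE_TASK_EVENTS) {
                table.get(State.FAILED).put(e, State.FAILED);
            }
        }
        return table;
    }

    // An event with no registered transition is treated as an internal error.
    static State handle(Map<State, Map<Event, State>> table, State current, Event event) {
        State next = table.get(current).get(event);
        return next != null ? next : State.ERROR;
    }

    public static void main(String[] args) {
        for (boolean fixed : new boolean[] {false, true}) {
            Map<State, Map<Event, State>> table = buildTable(fixed);
            State s = handle(table, State.RUNNING, Event.JOB_FAIL);
            s = handle(table, s, Event.JOB_TASK_ATTEMPT_COMPLETED);
            System.out.println("ignoreLateEventsInFailed=" + fixed + " -> " + s);
        }
    }
}
```

Without the self-transitions, the late JOB_TASK_ATTEMPT_COMPLETED event moves the sketch from FAILED to ERROR, which mirrors the "ERROR" notification described in the report; with them, the job stays in FAILED.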
[jira] [Updated] (MAPREDUCE-4774) JobImpl does not handle
asynchronous task events in FAILED state
Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe updated MAPREDUCE-4774:
----------------------------------
Component/s: mrv2, applicationmaster
Affects Version/s: 0.23.3, 2.0.1-alpha
Summary: JobImpl does not handle asynchronous task events in FAILED state (was: repair test org.apache.hadoop.mapred.TestClusterMRNotification.testMR)
Editing headline to more accurately reflect the root cause.
> JobImpl does not handle asynchronous task events in FAILED state
> ----------------------------------------------------------------
>
> Key: MAPREDUCE-4774
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4774
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Ivan A. Veselovsky
> Attachments: MAPREDUCE-4774.patch
[jira] [Commented] (MAPREDUCE-4774) JobImpl does not handle
asynchronous task events in FAILED state
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494647#comment-13494647 ]
Hudson commented on MAPREDUCE-4774:
-----------------------------------
Integrated in Hadoop-Hdfs-0.23-Build #431 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/431/])
svn merge -c 1407679 FIXES: MAPREDUCE-4774. JobImpl does not handle asynchronous task events in FAILED state (jlowe via bobby) (Revision 1407689)
Result = UNSTABLE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407689
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestJobImpl.java
[jira] [Updated] (MAPREDUCE-4774) JobImpl does not handle
asynchronous task events in FAILED state
Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Joseph Evans updated MAPREDUCE-4774:
-------------------------------------------
Resolution: Fixed
Fix Version/s: 0.23.5, 2.0.3-alpha, 3.0.0
Status: Resolved (was: Patch Available)
Thanks, Jason.
I put this into trunk, branch-2, and branch-0.23.
> JobImpl does not handle asynchronous task events in FAILED state
> ----------------------------------------------------------------
>
> Key: MAPREDUCE-4774
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4774
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Ivan A. Veselovsky
> Assignee: Jason Lowe
> Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
>
> Attachments: MAPREDUCE-4774.patch
>
>
> The test org.apache.hadoop.mapred.TestClusterMRNotification.testMR frequently fails in mapred build (e.g. see https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2988/testReport/junit/org.apache.hadoop.mapred/TestClusterMRNotification/testMR/ , or
> https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2982//testReport/org.apache.hadoop.mapred/TestClusterMRNotification/testMR/).
> The test aims to check Job status notifications received through HTTP Servlet. It runs 3 jobs: successfull, killed, and failed.
> The test expects the servlet to receive some expected notifications in some expected order. It also tries to test the retry-on-failure notification functionality, so on each 1st notification the servlet answers "400 forcing error", and on each 2nd notification attempt it answers "ok".
> In general, the test fails because the actual number and/or type of the notifications differs from the expected.
> Investigation shows that actual root cause of the problem is an incorrect job state transition: the 3rd job mapred task fails (by intentionally thrown RuntimeException, see UtilsForTests#runJobFail()), and the state of the task changes from RUNNING to FAILED.
> At this point JobEventType.JOB_TASK_ATTEMPT_COMPLETED event is submitted (in method org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.handleTaskAttemptCompletion(TaskAttemptId, TaskAttemptCompletionEventStatus)), and this event gets processed in AsyncDispatcher, but this transition is impossible according to the event transition map (see org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl#stateMachineFactory). This causes the following exception to be thrown upon the event processing:
> 2012-11-06 12:22:02,335 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: JOB_TASK_ATTEMPT_COMPLETED at FAILED
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:309)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$3(StateMachineFactory.java:290)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:454)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:716)
> at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:1)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:917)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:79)
> at java.lang.Thread.run(Thread.java:662)
> So the job enters the "INTERNAL_ERROR" state, and a job end notification like this is sent:
> http://localhost:48656/notification/mapred?jobId=job_1352199715842_0002&jobStatus=ERROR
> (here we can see "ERROR" status instead of "FAILED")
> After that the notification servlet receives either only an "ERROR" notification, or an additional "ERROR" notification after "FAILED", which ultimately causes the test to fail. (The behavior varies between runs because of race conditions: much of the processing is asynchronous, so the test is in fact flaky.)
> In any case, the root cause appears to be that the JOB_TASK_ATTEMPT_COMPLETED event can arrive while the job is already in the FAILED state, where no transition is defined for it ("Invalid event: JOB_TASK_ATTEMPT_COMPLETED at FAILED").
> Expert advice is needed on how this should be fixed.
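The failure mode described above can be illustrated with a minimal, self-contained sketch. It does not use Hadoop's StateMachineFactory and does not reproduce the MAPREDUCE-4774 patch: the class LateEventSketch, its enums, and the build(boolean) helper are hypothetical stand-ins. It assumes one common remedy for this kind of race, namely registering a self-transition so that a terminal state ignores events that arrive late.

```java
import java.util.EnumMap;
import java.util.Map;

final class LateEventSketch {
    enum State { RUNNING, FAILED, ERROR }
    enum Event { TASK_ATTEMPT_COMPLETED, TASK_FAILED }

    /** Tiny table-driven state machine (hypothetical; not the YARN implementation). */
    static final class TinyStateMachine {
        private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
        private State current;

        TinyStateMachine(State initial) {
            this.current = initial;
        }

        TinyStateMachine addTransition(State from, State to, Event on) {
            table.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
            return this;
        }

        /** Applies the event; an unregistered (state, event) pair escalates to ERROR. */
        State handle(Event event) {
            Map<Event, State> row = table.get(current);
            State next = (row == null) ? null : row.get(event);
            // Mirrors the report: an invalid event pushes the job into an error state.
            current = (next == null) ? State.ERROR : next;
            return current;
        }
    }

    /** Builds the machine with or without a self-transition for late events in FAILED. */
    static TinyStateMachine build(boolean tolerateLateEvents) {
        TinyStateMachine sm = new TinyStateMachine(State.RUNNING)
            .addTransition(State.RUNNING, State.RUNNING, Event.TASK_ATTEMPT_COMPLETED)
            .addTransition(State.RUNNING, State.FAILED, Event.TASK_FAILED);
        if (tolerateLateEvents) {
            // Assumed remedy: FAILED absorbs attempt-completion events that arrive late.
            sm.addTransition(State.FAILED, State.FAILED, Event.TASK_ATTEMPT_COMPLETED);
        }
        return sm;
    }

    public static void main(String[] args) {
        for (boolean tolerant : new boolean[] {false, true}) {
            TinyStateMachine sm = build(tolerant);
            sm.handle(Event.TASK_FAILED);                          // RUNNING -> FAILED
            State end = sm.handle(Event.TASK_ATTEMPT_COMPLETED);   // late event
            System.out.println((tolerant ? "tolerant: " : "strict: ") + end);
        }
        // prints "strict: ERROR" then "tolerant: FAILED"
    }
}
```

In the strict variant the late event turns a FAILED job into an ERROR job, which matches the "ERROR" notification seen by the servlet. In the tolerant variant the job stays FAILED, so only the expected notification would be sent.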
[jira] [Commented] (MAPREDUCE-4774) JobImpl does not handle
asynchronous task events in FAILED state
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAPREDUCE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494662#comment-13494662 ]
Hudson commented on MAPREDUCE-4774:
-----------------------------------
Integrated in Hadoop-Hdfs-trunk #1222 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1222/])
MAPREDUCE-4774. JobImpl does not handle asynchronous task events in FAILED state (jlowe via bobby) (Revision 1407679)
Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407679
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestJobImpl.java
> JobImpl does not handle asynchronous task events in FAILED state
> ----------------------------------------------------------------
>
> Key: MAPREDUCE-4774
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4774
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Ivan A. Veselovsky
> Assignee: Jason Lowe
> Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
>
> Attachments: MAPREDUCE-4774.patch