You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-issues@hadoop.apache.org by "Jim Brennan (JIRA)" <ji...@apache.org> on 2017/12/15 22:39:00 UTC

[jira] [Commented] (MAPREDUCE-6948) TestJobImpl.testUnusableNodeTransition failed

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293374#comment-16293374 ] 

Jim Brennan commented on MAPREDUCE-6948:
----------------------------------------

I have been unable to reproduce this problem in trunk, nor in branch-2.8.   I have been able to repro in branch-2.7, but only by adding a sleep to exacerbate the race condition.

Analysis:
The key point in the failure case is here:
{{2017-08-30 10:12:22,000 INFO  [Thread-49] impl.JobImpl (JobImpl.java:transition(1953)) - Num completed Tasks: 1
2017-08-30 10:12:22,029 INFO  [Thread-49] impl.JobImpl (JobImpl.java:transition(1953)) - Num completed Tasks: 2
2017-08-30 10:12:22,032 INFO  [Thread-49] impl.JobImpl (JobImpl.java:actOnUnusableNode(1354)) - TaskAttempt killed because it ran on unusable node Mock for NodeId, hashCode: 1280187896. AttemptId:attempt_1234567890000_0001_m_000000_0
2017-08-30 10:12:22,032 INFO  [Thread-49] impl.JobImpl (JobImpl.java:transition(1953)) - Num completed Tasks: 3
}}
At this point Num completed tasks should be 2.  Since it is 3, we start moving to the COMMITTED state too early and trip the failure.
In the successful case, the log looks like this:
{{2017-12-15 16:16:54,253 INFO  [Thread-0] impl.JobImpl (JobImpl.java:transition(1979)) - Num completed Tasks: 1
2017-12-15 16:16:54,258 INFO  [Thread-0] impl.JobImpl (JobImpl.java:transition(1979)) - Num completed Tasks: 2
2017-12-15 16:16:54,260 INFO  [Thread-0] impl.JobImpl (JobImpl.java:actOnUnusableNode(1359)) - TaskAttempt killed because it ran on unusable node Mock for NodeId, hashCode: 131679889. AttemptId:attempt_1234567890000_0001_m_000000_0
2017-12-15 16:16:54,261 INFO  [Thread-0] impl.JobImpl (JobImpl.java:transition(1979)) - Num completed Tasks: 2
2017-12-15 16:16:54,262 INFO  [Thread-0] impl.JobImpl (JobImpl.java:checkReadyForCompletionWhenAllReducersDone(2103)) - Killing map task task_1234567890000_0001_m_000000
2017-12-15 16:16:54,263 INFO  [Thread-0] impl.JobImpl (JobImpl.java:checkReadyForCompletionWhenAllReducersDone(2103)) - Killing map task task_1234567890000_0001_m_000001
2017-12-15 16:16:54,263 INFO  [Thread-0] impl.JobImpl (JobImpl.java:transition(1979)) - Num completed Tasks: 3}}

The second Num Completed Tasks:2 line corresponds to when we mark the Reducer task as SUCCEEDED.  At this point, the count of succeeded map tasks should be 1, because it was just decremented due to the unusable node.  It is incremented to 2 before printing.

The difference between branch-2.7, which fails, and trunk/branch-2.8 is the fix in MAPREDUCE-6675, which switched it to use a DrainDispatcher and added a dispatcher.await() call before we complete the reducer.

Another possible factor is YARN-5436, which fixed a very similar race in DrainDispatcher.  That one is present in trunk, but not in branch-2.8.  So it may account for intermittent failures in branch-2.8, but I was not able to reproduce it.

So as far as I can tell, this appears to be fixed already.

[~haibo.chen], can you provide any insight?  Any chance this failure was seen on branch-2.8 or branch-2.7?



> TestJobImpl.testUnusableNodeTransition failed
> ---------------------------------------------
>
>                 Key: MAPREDUCE-6948
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6948
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 3.0.0-alpha4
>            Reporter: Haibo Chen
>            Assignee: Jim Brennan
>              Labels: unit-test
>
> *Error Message*
> expected:<SUCCEEDED> but was:<ERROR>
> *Stacktrace*
> java.lang.AssertionError: expected:<SUCCEEDED> but was:<ERROR>
> 	at org.junit.Assert.fail(Assert.java:88)
> 	at org.junit.Assert.failNotEquals(Assert.java:743)
> 	at org.junit.Assert.assertEquals(Assert.java:118)
> 	at org.junit.Assert.assertEquals(Assert.java:144)
> 	at org.apache.hadoop.mapreduce.v2.app.job.impl.TestJobImpl.assertJobState(TestJobImpl.java:1041)
> 	at org.apache.hadoop.mapreduce.v2.app.job.impl.TestJobImpl.testUnusableNodeTransition(TestJobImpl.java:615)
> *Standard out*
> {code}
> 2017-08-30 10:12:21,928 INFO  [Thread-49] event.AsyncDispatcher (AsyncDispatcher.java:register(209)) - Registering class org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventType for class org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler
> 2017-08-30 10:12:21,939 INFO  [Thread-49] event.AsyncDispatcher (AsyncDispatcher.java:register(209)) - Registering class org.apache.hadoop.mapreduce.v2.app.job.event.JobEventType for class org.apache.hadoop.mapreduce.v2.app.job.impl.TestJobImpl$StubbedJob
> 2017-08-30 10:12:21,940 INFO  [Thread-49] event.AsyncDispatcher (AsyncDispatcher.java:register(209)) - Registering class org.apache.hadoop.mapreduce.v2.app.job.event.TaskEventType for class org.apache.hadoop.yarn.event.EventHandler$$EnhancerByMockitoWithCGLIB$$79f96ebf
> 2017-08-30 10:12:21,940 INFO  [Thread-49] event.AsyncDispatcher (AsyncDispatcher.java:register(209)) - Registering class org.apache.hadoop.mapreduce.jobhistory.EventType for class org.apache.hadoop.yarn.event.EventHandler$$EnhancerByMockitoWithCGLIB$$79f96ebf
> 2017-08-30 10:12:21,940 INFO  [Thread-49] event.AsyncDispatcher (AsyncDispatcher.java:register(209)) - Registering class org.apache.hadoop.mapreduce.v2.app.job.event.JobFinishEvent$Type for class org.apache.hadoop.yarn.event.EventHandler$$EnhancerByMockitoWithCGLIB$$79f96ebf
> 2017-08-30 10:12:21,941 INFO  [Thread-49] impl.JobImpl (JobImpl.java:setup(1534)) - Adding job token for job_1234567890000_0001 to jobTokenSecretManager
> 2017-08-30 10:12:21,941 WARN  [Thread-49] impl.JobImpl (JobImpl.java:setup(1540)) - Shuffle secret key missing from job credentials. Using job token secret as shuffle secret.
> 2017-08-30 10:12:21,944 INFO  [Thread-49] impl.JobImpl (JobImpl.java:makeUberDecision(1305)) - Not uberizing job_1234567890000_0001 because: not enabled;
> 2017-08-30 10:12:21,944 INFO  [Thread-49] impl.JobImpl (JobImpl.java:createMapTasks(1562)) - Input size for job job_1234567890000_0001 = 0. Number of splits = 2
> 2017-08-30 10:12:21,945 INFO  [Thread-49] impl.JobImpl (JobImpl.java:createReduceTasks(1579)) - Number of reduces for job job_1234567890000_0001 = 1
> 2017-08-30 10:12:21,945 INFO  [Thread-49] impl.JobImpl (JobImpl.java:handle(1017)) - job_1234567890000_0001Job Transitioned from NEW to INITED
> 2017-08-30 10:12:21,946 INFO  [Thread-49] impl.JobImpl (JobImpl.java:handle(1017)) - job_1234567890000_0001Job Transitioned from INITED to SETUP
> 2017-08-30 10:12:21,954 INFO  [CommitterEvent Processor #0] commit.CommitterEventHandler (CommitterEventHandler.java:run(231)) - Processing the event EventType: JOB_SETUP
> 2017-08-30 10:12:21,978 INFO  [AsyncDispatcher event handler] impl.JobImpl (JobImpl.java:handle(1017)) - job_1234567890000_0001Job Transitioned from SETUP to RUNNING
> 2017-08-30 10:12:21,983 INFO  [Thread-49] event.AsyncDispatcher (AsyncDispatcher.java:register(209)) - Registering class org.apache.hadoop.mapreduce.v2.app.job.event.TaskAttemptEventType for class org.apache.hadoop.mapreduce.v2.app.job.impl.TestJobImpl$5
> 2017-08-30 10:12:22,000 INFO  [Thread-49] impl.JobImpl (JobImpl.java:transition(1953)) - Num completed Tasks: 1
> 2017-08-30 10:12:22,029 INFO  [Thread-49] impl.JobImpl (JobImpl.java:transition(1953)) - Num completed Tasks: 2
> 2017-08-30 10:12:22,032 INFO  [Thread-49] impl.JobImpl (JobImpl.java:actOnUnusableNode(1354)) - TaskAttempt killed because it ran on unusable node Mock for NodeId, hashCode: 1280187896. AttemptId:attempt_1234567890000_0001_m_000000_0
> 2017-08-30 10:12:22,032 INFO  [Thread-49] impl.JobImpl (JobImpl.java:transition(1953)) - Num completed Tasks: 3
> 2017-08-30 10:12:22,032 INFO  [Thread-49] impl.JobImpl (JobImpl.java:handle(1017)) - job_1234567890000_0001Job Transitioned from RUNNING to COMMITTING
> 2017-08-30 10:12:22,032 ERROR [Thread-49] impl.JobImpl (JobImpl.java:handle(1009)) - Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: JOB_TASK_COMPLETED at COMMITTING
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> 	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:1007)
> 	at org.apache.hadoop.mapreduce.v2.app.job.impl.TestJobImpl.testUnusableNodeTransition(TestJobImpl.java:609)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 	at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> 2017-08-30 10:12:22,035 INFO  [AsyncDispatcher event handler] impl.JobImpl (JobImpl.java:handle(1017)) - job_1234567890000_0001Job Transitioned from COMMITTING to ERROR
> 2017-08-30 10:12:22,039 INFO  [CommitterEvent Processor #1] commit.CommitterEventHandler (CommitterEventHandler.java:run(231)) - Processing the event EventType: JOB_COMMIT
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org