Posted to mapreduce-issues@hadoop.apache.org by "Robert Joseph Evans (JIRA)" <ji...@apache.org> on 2012/07/31 18:17:35 UTC
[jira] [Commented] (MAPREDUCE-4457) mr job invalid transition TA_TOO_MANY_FETCH_FAILURE at FAILED

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425885#comment-13425885 ] 

Robert Joseph Evans commented on MAPREDUCE-4457:
------------------------------------------------

When a Job gets a JOB_TASK_ATTEMPT_FETCH_FAILURE it checks whether the failure rate is too high.  If 50% or more of the running reducers are complaining about the map attempt, and at least 3 reducer attempts have tried to get the data and failed, it sends a TA_TOO_MANY_FETCH_FAILURE event to the map task attempt.  The JobImpl stores the fetch failure count for each map attempt as it sees the failures.  If the failures get high enough to send the TA_TOO_MANY_FETCH_FAILURE event, the JobImpl then deletes the failure state for the map attempt.

The only scenario in which I can see this happening is a derivative of the following.  Assume we have a single map task and 6 reduce tasks.  All six reduce tasks try to fetch from the map task at almost exactly the same time.  They all fail and report back that they failed.  After the first three report back, the failure percentage of running reducers is exactly 50% and the failure count is 3, so both thresholds are met: the TA_TOO_MANY_FETCH_FAILURE event is sent and the state is reset.  The next three reducer fetch failures are then processed, and we again have 3 failures, which is again exactly 50% and 3 or more total failures, resulting in another TA_TOO_MANY_FETCH_FAILURE event being sent.
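
To make the arithmetic in this scenario concrete, here is a minimal standalone sketch of the count-then-reset logic described above.  It is not the actual JobImpl code: the class name, method names, and the two threshold constants are hypothetical, chosen only to mirror the "50% or more" and "at least 3" conditions from this comment.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the fetch-failure bookkeeping described above.
// None of these names come from Hadoop; they only illustrate the scenario.
class FetchFailureSketch {
    static final int MIN_FAILURES = 3;       // assumed: "at least 3 reducer attempts"
    static final double MIN_FRACTION = 0.5;  // assumed: "50% or more of running reducers"

    private final Map<String, Integer> failureCounts = new HashMap<>();
    private final int runningReducers;
    int eventsSent = 0;

    FetchFailureSketch(int runningReducers) {
        this.runningReducers = runningReducers;
    }

    // Called once per reported fetch failure against the given map attempt.
    void onFetchFailure(String mapAttemptId) {
        int count = failureCounts.merge(mapAttemptId, 1, Integer::sum);
        double fraction = (double) count / runningReducers;
        if (count >= MIN_FAILURES && fraction >= MIN_FRACTION) {
            eventsSent++;                        // stands in for sending TA_TOO_MANY_FETCH_FAILURE
            failureCounts.remove(mapAttemptId);  // the state is reset after sending
        }
    }

    // One map attempt, N reducers, every reducer reports exactly one failure.
    static int simulate(int reducers) {
        FetchFailureSketch job = new FetchFailureSketch(reducers);
        for (int i = 0; i < reducers; i++) {
            job.onFetchFailure("map_attempt_0");
        }
        return job.eventsSent;
    }

    public static void main(String[] args) {
        System.out.println("events sent with 6 reducers: " + simulate(6));
    }
}
```

With 6 reducers the threshold is crossed on the third report and again on the sixth, so the sketch sends the event twice, which matches the double delivery suspected in the log.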

There may be other situations where the logic gets confused and sends the event twice, but I cannot think of any.  We could either update the JobImpl to not send the event twice, or we could update the TaskAttemptImpl to handle getting it twice.
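
The two options above could look roughly like the following standalone sketch.  It does not use the real JobImpl, TaskAttemptImpl, or YARN state machine classes: every class, field, and method name here is hypothetical, and the thresholds are the same assumed values as in the description (3 failures, 50%).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of both candidate fixes; not Hadoop code.
class DuplicateEventSketch {
    enum AttemptState { SUCCEEDED, FAILED }

    // Option 1: the sender remembers which map attempts it has already notified.
    static class Job {
        static final int MIN_FAILURES = 3;       // assumed threshold
        static final double MIN_FRACTION = 0.5;  // assumed threshold
        private final Map<String, Integer> counts = new HashMap<>();
        private final Set<String> alreadyNotified = new HashSet<>();
        private final int runningReducers;
        int eventsSent = 0;

        Job(int runningReducers) {
            this.runningReducers = runningReducers;
        }

        void onFetchFailure(String mapAttemptId) {
            if (alreadyNotified.contains(mapAttemptId)) {
                return; // event already sent once; later reports are dropped
            }
            int count = counts.merge(mapAttemptId, 1, Integer::sum);
            if (count >= MIN_FAILURES && (double) count / runningReducers >= MIN_FRACTION) {
                alreadyNotified.add(mapAttemptId);
                counts.remove(mapAttemptId);
                eventsSent++;
            }
        }
    }

    // Option 2: the receiver treats a repeated event in FAILED as a no-op.
    static class Attempt {
        AttemptState state = AttemptState.SUCCEEDED;
        int ignoredDuplicates = 0;

        void onTooManyFetchFailures() {
            if (state == AttemptState.FAILED) {
                ignoredDuplicates++; // tolerated instead of raising an invalid transition
                return;
            }
            state = AttemptState.FAILED;
        }
    }

    static int eventsSentWithGuard(int reducers) {
        Job job = new Job(reducers);
        for (int i = 0; i < reducers; i++) {
            job.onFetchFailure("map_attempt_0");
        }
        return job.eventsSent;
    }

    static int duplicatesIgnored(int events) {
        Attempt attempt = new Attempt();
        for (int i = 0; i < events; i++) {
            attempt.onTooManyFetchFailures();
        }
        return attempt.ignoredDuplicates;
    }

    public static void main(String[] args) {
        System.out.println("events sent with guard: " + eventsSentWithGuard(6));
        System.out.println("duplicates ignored: " + duplicatesIgnored(2));
    }
}
```

The sender-side guard keeps the duplicate from ever being dispatched, while the receiver-side no-op also covers any other path that might deliver the event twice; the two are not mutually exclusive.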
                
> mr job invalid transition TA_TOO_MANY_FETCH_FAILURE at FAILED
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-4457
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4457
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3
>            Reporter: Thomas Graves
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>
> we saw a job go into the ERROR state from an invalid state transition.
> 3,600 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
> attempt_1342238829791_2501_m_007743_0 TaskAttempt Transitioned from SUCCEEDED
> to FAILED
> 2012-07-16 08:49:53,600 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
> attempt_1342238829791_2501_m_008850_0 TaskAttempt Transitioned from SUCCEEDED
> to FAILED
> 2012-07-16 08:49:53,600 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
> attempt_1342238829791_2501_m_017344_1000 TaskAttempt Transitioned from RUNNING
> to SUCCESS_CONTAINER_CLEANUP
> 2012-07-16 08:49:53,601 ERROR [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Can't handle this
> event at current state for attempt_1342238829791_2501_m_000027_0
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event:
> TA_TOO_MANY_FETCH_FAILURE at FAILED
>     at
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
>     at
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>     at
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>     at
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:954)
>     at
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:133)
>     at
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:913)
>     at
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:905)
>     at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
>     at
> org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.realDispatch(RecoveryService.java:285)
>     at
> org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:281)
>     at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>     at java.lang.Thread.run(Thread.java:619)
> 2012-07-16 08:49:53,601 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
> attempt_1342238829791_2501_m_029091_1000 TaskAttempt Transitioned from RUNNING
> to SUCCESS_CONTAINER_CLEANUP
> 2012-07-16 08:49:53,601 INFO [IPC Server handler 17 on 47153]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
> attempt_1342238829791_2501_r_000461_1000
> It looks like we possibly got 2 TA_TOO_MANY_FETCH_FAILURE events. The first one moved the attempt to FAILED, and then the second one failed because there is no valid transition for that event from FAILED.
