You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-issues@hadoop.apache.org by "Zhijie Shen (JIRA)" <ji...@apache.org> on 2014/06/13 07:15:02 UTC

[jira] [Commented] (MAPREDUCE-5924) Windows: Sort Job failed due to 'Invalid event: TA_COMMIT_PENDING at COMMIT_PENDING'

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030268#comment-14030268 ] 

Zhijie Shen commented on MAPREDUCE-5924:
----------------------------------------

Some investigation based our logs:

TA_COMMIT_PENDING at COMMIT_PENDING happens because there're two TA_COMMIT_PENDING that have been sent to TaskAttemptListenerImpl. The first one causes the state transition from RUNNING to COMMIT_PENDING, and the second one causes the invalid transition.

Here's the code that notifies AM of committing the pending task.
{code}
// say the task tracker that task is commit pending
      while (true) {
        try {
          umbilical.commitPending(taskId, taskStatus);
          break;
        } catch (InterruptedException ie) {
          // ignore
        } catch (IOException ie) {
          LOG.warn("Failure sending commit pending: " + 
                    StringUtils.stringifyException(ie));
          if (--retries == 0) {
            System.exit(67);
          }
        }
      }
{code}

The loop has been iterated twice, because the following exception happens:

{code}
2014-06-09 09:15:38,539 WARN [main] org.apache.hadoop.mapred.Task: Failure sending commit pending: java.io.IOException: Failed on local exception: java.io.IOException: An existing connection was forcibly closed by the remote host; Host Details : local host is: "sdevin-clster53/10.215.16.72"; destination host is: "sdevin-clster54":63415; 
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
	at org.apache.hadoop.ipc.Client.call(Client.java:1414)
	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
	at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231)
	at com.sun.proxy.$Proxy9.commitPending(Unknown Source)
	at org.apache.hadoop.mapred.Task.done(Task.java:1006)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:397)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.io.IOException: An existing connection was forcibly closed by the remote host
	at sun.nio.ch.SocketDispatcher.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:225)
	at sun.nio.ch.IOUtil.read(IOUtil.java:198)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359)
	at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at java.io.FilterInputStream.read(FilterInputStream.java:133)
	at java.io.FilterInputStream.read(FilterInputStream.java:133)
	at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:510)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
	at java.io.DataInputStream.readInt(DataInputStream.java:387)
	at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1054)
	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:949)
{code}

Throughout the log, I can see a lot of "IOException: An existing connection was forcibly closed by the remote host" (This issue seems to have been raised and solved in HADOOP-317, but it still occurs on Windows), and this particular exception instance results in re-sending committing message to AM.

In addition, it seem to have a logic problem about re-sending the committing message. Given the service side has done everything, but IOException happens on the response. The second should not to be sent by the client, or should be ignored by the service. This is something we need to fix at the MR side. There're two options to fix the invalid transition:

1. At the state machine, we ignore TA_COMMIT_PENDING when a task attempt is at COMMIT_PENDING. This is just a one-line fix.

2. We can add RetryCache for TaskAttemptListenerImpl, when commitPending is called, we check whether we have cache for a particular attempt Id. If we have, it means the transition has been executed before, and we can skip the transition.





> Windows: Sort Job failed due to 'Invalid event: TA_COMMIT_PENDING at COMMIT_PENDING'
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5924
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5924
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>
> Post the issue on behalf of [~yeshavora]:
> The Sort job over 1GB data failed with below error
> {code}
> 2014-06-09 09:15:38,746 INFO [Socket Reader #1 for port 63415] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1402304714683_0002 (auth:SIMPLE)
> 2014-06-09 09:15:38,750 INFO [IPC Server handler 13 on 63415] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit-pending state update from attempt_1402304714683_0002_r_000015_1000
> 2014-06-09 09:15:38,751 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Can't handle this event at current state for attempt_1402304714683_0002_r_000015_1000
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: TA_COMMIT_PENDING at COMMIT_PENDING
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:1058)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:145)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1271)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1263)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>         at java.lang.Thread.run(Thread.java:722)
> 2014-06-09 09:15:38,753 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1402304714683_0002Job Transitioned from RUNNING to ERROR
> {code}
> The JobHistory Url prints job state = ERROR



--
This message was sent by Atlassian JIRA
(v6.2#6252)