Posted to user@hadoop.apache.org by Andrew Johnson <aj...@etsy.com> on 2015/03/14 14:56:17 UTC

ApplicationMaster Retrying Connection to Dead Node

Hey everyone,

I have encountered a troubling issue caused by a node in my cluster dying.
A node died due to a hardware failure while several MR jobs were running
on the cluster, which runs YARN.  I noticed that these jobs took over
four hours longer than expected to finish.  After investigating, I found
that the ApplicationMaster for these jobs had spent those four hours
retrying the connection to the node that had died.  I see this repeated
in the AM logs for that entire period:

2015-03-14 07:07:28,435 INFO [ContainerLauncher #293]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 16 time(s); maxRetries=20
2015-03-14 07:07:28,545 INFO [ContainerLauncher #235]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
2015-03-14 07:07:29,202 INFO [ContainerLauncher #261]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
2015-03-14 07:07:31,074 INFO [ContainerLauncher #283]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
2015-03-14 07:07:31,110 INFO [ContainerLauncher #278]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
2015-03-14 07:07:46,093 INFO [ContainerLauncher #167]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
2015-03-14 07:07:48,455 INFO [ContainerLauncher #293]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
2015-03-14 07:07:49,223 INFO [ContainerLauncher #261]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
2015-03-14 07:07:51,095 INFO [ContainerLauncher #283]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
2015-03-14 07:07:51,116 INFO [ContainerLauncher #278]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
2015-03-14 07:08:06,097 INFO [ContainerLauncher #167]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
2015-03-14 07:08:08,476 INFO [ContainerLauncher #293]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
2015-03-14 07:08:09,243 INFO [ContainerLauncher #261]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
2015-03-14 07:08:11,115 INFO [ContainerLauncher #283]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20
2015-03-14 07:08:11,120 INFO [ContainerLauncher #278]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
2015-03-14 07:08:18,569 INFO [ContainerLauncher #235]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
2015-03-14 07:08:26,118 INFO [ContainerLauncher #167]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
2015-03-14 07:08:28,495 INFO [ContainerLauncher #293]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
2015-03-14 07:08:29,264 INFO [ContainerLauncher #261]
org.apache.hadoop.ipc.Client: Retrying connect to server:
dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20

Eventually the following exception appeared in the AM logs and the job
completed successfully:
2015-03-14 07:23:09,239 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
report from attempt_1423675803126_127109_m_001910_0: cleanup failed for
container container_1423675803126_127109_01_004637 :
org.apache.hadoop.net.ConnectTimeoutException: Call From
am.node.host/am.node.ip to dead.node.host:47936 failed on socket timeout
exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
timeout while waiting for channel to be ready for connect. ch :
java.nio.channels.SocketChannel[connection-pending
remote=dead.node.host/dead.node.ip:47936]; For more details see:
http://wiki.apache.org/hadoop/SocketTimeout
at sun.reflect.GeneratedConstructorAccessor60.newInstance(Unknown Source)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:749)
at org.apache.hadoop.ipc.Client.call(Client.java:1415)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy39.stopContainers(Unknown Source)
at
org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:110)
at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy40.stopContainers(Unknown Source)
at
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:206)
at
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:373)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
timeout while waiting for channel to be ready for connect. ch :
java.nio.channels.SocketChannel[connection-pending
remote=dead.node.host/dead.node.ip:47936]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
at org.apache.hadoop.ipc.Client.call(Client.java:1382)
... 15 more

It looks to me like the tasks that had been running on the dead node were
restarted, and the AM was attempting to clean up those tasks.  However,
since the node was dead, the AM was never able to connect.

I have yarn.client.nodemanager-connect.max-wait-ms set to 300000 (5
minutes) and ipc.client.connect.max.retries.on.timeouts set to 20.  I see
it retry the connection 20 times in the logs, but then it starts retrying
from 0 again.  Also, I would expect the AM to give up the attempt to
connect much sooner.  For instance, the ResourceManager recognized the node
as dead after 10 minutes as expected.  I'd like to see the AM do the same.
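
For concreteness, here is a minimal sketch of those two settings as I have
them, loaded the way any Hadoop client would (the fallback values passed to
the getters below are placeholders for the sketch, not the real defaults):

import org.apache.hadoop.conf.Configuration;

// Sketch only: the two retry-related settings as configured on my cluster.
// In practice they live in core-site.xml / yarn-site.xml.
public class ShowRetrySettings {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInt("ipc.client.connect.max.retries.on.timeouts", 20);
    conf.setLong("yarn.client.nodemanager-connect.max-wait-ms", 300000L); // 5 minutes

    System.out.println("IPC connect retries on timeout: "
        + conf.getInt("ipc.client.connect.max.retries.on.timeouts", 0));
    System.out.println("NM connect max wait (ms): "
        + conf.getLong("yarn.client.nodemanager-connect.max-wait-ms", 0L));
  }
}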

Has anyone encountered this behavior before?

Thanks!

-- 
Andrew Johnson

Re: ApplicationMaster Retrying Connection to Dead Node

Posted by Andrew Johnson <aj...@etsy.com>.
I've tracked down the cause of the problem I was experiencing.

There are two levels of retries that were coming into play here.  The first
is controlled by the setting ipc.client.connect.max.retries.on.timeouts.  I
have this set to 20.  This is used by org.apache.hadoop.ipc.Client when it
is attempting to connect to the dead node.  I observed about twenty seconds
between each of these retries, giving a total of about 7 minutes of time
spent attempting to connect.

When that retry limit is reached, the IPC client throws a
ConnectTimeoutException.  This propagates up to the
RetryInvocationHandler, which uses a different retry policy, created by
the NMProxy class.  That retry policy is controlled by two properties:
yarn.client.nodemanager-connect.max-wait-ms and
yarn.client.nodemanager-connect.retry-interval-ms.
I had these set to 300000 and 10000, respectively.  Both the names and the
code suggest that these set an upper bound on the time spent retrying.
When a ConnectTimeoutException is thrown, a
RetryUpToMaximumTimeWithFixedSleep policy is used.  However, there is not
actually a maximum time limit.  Instead, the value of
yarn.client.nodemanager-connect.max-wait-ms is divided by the value of
yarn.client.nodemanager-connect.retry-interval-ms to compute a total number
of retries, regardless of how long each retry takes.  In my case this
produced 30 total retries, with 10 seconds between each.  Since each retry
takes about 7 minutes (a full IPC-level retry cycle), the AM would spend
around 3.5 hours total attempting to connect to the dead node, which lines
up well with the observed behavior.
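
To make the arithmetic concrete, here is a back-of-the-envelope sketch of
both retry levels (the ~20 second spacing is what I observed between IPC
retries, not a configured constant):

// Back-of-the-envelope sketch of the two retry levels described above.
// Values are from my configuration plus observed timing; nothing here is
// an authoritative Hadoop constant.
public class RetryMath {
  public static void main(String[] args) {
    int ipcRetries = 20;           // ipc.client.connect.max.retries.on.timeouts
    long ipcSpacingSec = 20;       // observed spacing between IPC retries

    long maxWaitMs = 300000L;      // yarn.client.nodemanager-connect.max-wait-ms
    long retryIntervalMs = 10000L; // yarn.client.nodemanager-connect.retry-interval-ms

    // Time for one full IPC-level retry cycle, i.e. one "attempt" as seen
    // by the higher-level policy:
    long secPerOuterAttempt = ipcRetries * ipcSpacingSec;    // ~400 s, ~7 min

    // RetryUpToMaximumTimeWithFixedSleep turns the "max wait" into a retry
    // count rather than a wall-clock limit:
    long outerRetries = maxWaitMs / retryIntervalMs;         // 30

    long totalSec = outerRetries * (secPerOuterAttempt + retryIntervalMs / 1000);
    System.out.printf("outer retries: %d, total: ~%.1f hours%n",
        outerRetries, totalSec / 3600.0);                    // ~3.4 hours
  }
}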

I fixed this by changing yarn.client.nodemanager-connect.max-wait-ms to
20000, so there are only two retries at the higher level.  This brings
the total time the AM spends attempting to connect to a dead node down to
around 15 minutes.
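
In configuration terms the fix just shrinks the numerator in that division
(a sketch, same caveats as above):

import org.apache.hadoop.conf.Configuration;

// Sketch of the fix: with max-wait-ms at 20000 and the interval at 10000,
// the NMProxy policy computes 20000 / 10000 = 2 outer retries, and
// 2 * (~7 min + 10 s) comes to roughly 15 minutes total.
public class ApplyFix {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setLong("yarn.client.nodemanager-connect.max-wait-ms", 20000L);
    conf.setLong("yarn.client.nodemanager-connect.retry-interval-ms", 10000L);

    long outerRetries =
        conf.getLong("yarn.client.nodemanager-connect.max-wait-ms", 0L)
            / conf.getLong("yarn.client.nodemanager-connect.retry-interval-ms", 1L);
    System.out.println("outer retries: " + outerRetries);  // 2
  }
}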

There is also a yarn.resourcemanager.connect.max-wait.ms property that
appears to behave the same way.  I've opened a JIRA to clarify the naming
and documentation of these configuration properties:
https://issues.apache.org/jira/browse/YARN-3364

On Tue, Mar 17, 2015 at 11:05 AM, Andrew Johnson <aj...@etsy.com> wrote:

> I had tried applying the patch from
> https://issues.apache.org/jira/browse/HADOOP-6221, as that seemed
> somewhat relevant.  Unfortunately that did not fix my issue.
>
> Does anyone have any other suggestions for how to resolve this?
>
> On Sat, Mar 14, 2015 at 9:56 AM, Andrew Johnson <aj...@etsy.com> wrote:
>
>> Hey everyone,
>>
>> I have encountered a troubling issue caused by a node in my cluster
>> dying.  I had a node die due to a hardware issue while several MR jobs were
>> running on the cluster, which is running YARN.  I noticed that these jobs
>> took over four hours longer than expected to finish.  After investigating I
>> found that the ApplicationMaster for these jobs was retrying to connect to
>> the node that had died for those four hours.  I see this repeated in the AM
>> logs for that entire period:
>>
>> 2015-03-14 07:07:28,435 INFO [ContainerLauncher #293]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 16 time(s); maxRetries=20
>> 2015-03-14 07:07:28,545 INFO [ContainerLauncher #235]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
>> 2015-03-14 07:07:29,202 INFO [ContainerLauncher #261]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
>> 2015-03-14 07:07:31,074 INFO [ContainerLauncher #283]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
>> 2015-03-14 07:07:31,110 INFO [ContainerLauncher #278]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
>> 2015-03-14 07:07:46,093 INFO [ContainerLauncher #167]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
>> 2015-03-14 07:07:48,455 INFO [ContainerLauncher #293]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
>> 2015-03-14 07:07:49,223 INFO [ContainerLauncher #261]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
>> 2015-03-14 07:07:51,095 INFO [ContainerLauncher #283]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
>> 2015-03-14 07:07:51,116 INFO [ContainerLauncher #278]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
>> 2015-03-14 07:08:06,097 INFO [ContainerLauncher #167]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
>> 2015-03-14 07:08:08,476 INFO [ContainerLauncher #293]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
>> 2015-03-14 07:08:09,243 INFO [ContainerLauncher #261]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
>> 2015-03-14 07:08:11,115 INFO [ContainerLauncher #283]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20
>> 2015-03-14 07:08:11,120 INFO [ContainerLauncher #278]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
>> 2015-03-14 07:08:18,569 INFO [ContainerLauncher #235]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
>> 2015-03-14 07:08:26,118 INFO [ContainerLauncher #167]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
>> 2015-03-14 07:08:28,495 INFO [ContainerLauncher #293]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
>> 2015-03-14 07:08:29,264 INFO [ContainerLauncher #261]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20
>>
>> Eventually the following exception appeared in the AM logs and the job
>> completed successfully:
>> 2015-03-14 07:23:09,239 INFO [AsyncDispatcher event handler]
>> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
>> report from attempt_1423675803126_127109_m_001910_0: cleanup failed for
>> container container_1423675803126_127109_01_004637 :
>> org.apache.hadoop.net.ConnectTimeoutException: Call From
>> am.node.host/am.node.ip to dead.node.host:47936 failed on socket timeout
>> exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
>> timeout while waiting for channel to be ready for connect. ch :
>> java.nio.channels.SocketChannel[connection-pending
>> remote=dead.node.host/dead.node.ip:47936]; For more details see:
>> http://wiki.apache.org/hadoop/SocketTimeout
>> at sun.reflect.GeneratedConstructorAccessor60.newInstance(Unknown Source)
>> at
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:749)
>> at org.apache.hadoop.ipc.Client.call(Client.java:1415)
>> at org.apache.hadoop.ipc.Client.call(Client.java:1364)
>> at
>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>> at com.sun.proxy.$Proxy39.stopContainers(Unknown Source)
>> at
>> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:110)
>> at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>> at
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>> at com.sun.proxy.$Proxy40.stopContainers(Unknown Source)
>> at
>> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:206)
>> at
>> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:373)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
>> timeout while waiting for channel to be ready for connect. ch :
>> java.nio.channels.SocketChannel[connection-pending
>> remote=dead.node.host/dead.node.ip:47936]
>> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
>> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
>> at
>> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
>> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
>> at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
>> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
>> at org.apache.hadoop.ipc.Client.call(Client.java:1382)
>> ... 15 more
>>
>> It looks to me that the tasks that had been running on the dead node were
>> restarted, and the AM attempting to clean up those tasks.  However, since
>> the node was dead it would not be able to connect.
>>
>> I have yarn.client.nodemanager-connect.max-wait-ms set to 300000 (5
>> minutes) and ipc.client.connect.max.retries.on.timeouts set to 20.  I see
>> it retry the connection 20 times in the logs, but then it starts retrying
>> from 0 again.  Also, I would expect the AM to give up the attempt to
>> connect much sooner.  For instance, the ResourceManager recognized the node
>> as dead after 10 minutes as expected.  I'd like to see the AM do the same.
>>
>> Has anyone encountered this behavior before?
>>
>> Thanks!
>>
>> --
>> Andrew Johnson
>>
>
>
>
> --
> Andrew Johnson
> Software Engineer, Etsy
>



-- 
Andrew Johnson
Software Engineer, Etsy

Re: ApplicationMaster Retrying Connection to Dead Node

Posted by Andrew Johnson <aj...@etsy.com>.
I had tried applying the patch from
https://issues.apache.org/jira/browse/HADOOP-6221, as that seemed somewhat
relevant.  Unfortunately that did not fix my issue.

Does anyone have any other suggestions for how to resolve this?

On Sat, Mar 14, 2015 at 9:56 AM, Andrew Johnson <aj...@etsy.com> wrote:

> Hey everyone,
>
> I have encountered a troubling issue caused by a node in my cluster
> dying.  I had a node die due to a hardware issue while several MR jobs were
> running on the cluster, which is running YARN.  I noticed that these jobs
> took over four hours longer than expected to finish.  After investigating I
> found that the ApplicationMaster for these jobs was retrying to connect to
> the node that had died for those four hours.  I see this repeated in the AM
> logs for that entire period:
>
> 2015-03-14 07:07:28,435 INFO [ContainerLauncher #293]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 16 time(s); maxRetries=20
> 2015-03-14 07:07:28,545 INFO [ContainerLauncher #235]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
> 2015-03-14 07:07:29,202 INFO [ContainerLauncher #261]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
> 2015-03-14 07:07:31,074 INFO [ContainerLauncher #283]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
> 2015-03-14 07:07:31,110 INFO [ContainerLauncher #278]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
> 2015-03-14 07:07:46,093 INFO [ContainerLauncher #167]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
> 2015-03-14 07:07:48,455 INFO [ContainerLauncher #293]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
> 2015-03-14 07:07:49,223 INFO [ContainerLauncher #261]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
> 2015-03-14 07:07:51,095 INFO [ContainerLauncher #283]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
> 2015-03-14 07:07:51,116 INFO [ContainerLauncher #278]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
> 2015-03-14 07:08:06,097 INFO [ContainerLauncher #167]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
> 2015-03-14 07:08:08,476 INFO [ContainerLauncher #293]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
> 2015-03-14 07:08:09,243 INFO [ContainerLauncher #261]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
> 2015-03-14 07:08:11,115 INFO [ContainerLauncher #283]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20
> 2015-03-14 07:08:11,120 INFO [ContainerLauncher #278]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
> 2015-03-14 07:08:18,569 INFO [ContainerLauncher #235]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
> 2015-03-14 07:08:26,118 INFO [ContainerLauncher #167]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
> 2015-03-14 07:08:28,495 INFO [ContainerLauncher #293]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
> 2015-03-14 07:08:29,264 INFO [ContainerLauncher #261]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20
>
> Eventually the following exception appeared in the AM logs and the job
> completed successfully:
> 2015-03-14 07:23:09,239 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
> report from attempt_1423675803126_127109_m_001910_0: cleanup failed for
> container container_1423675803126_127109_01_004637 :
> org.apache.hadoop.net.ConnectTimeoutException: Call From
> am.node.host/am.node.ip to dead.node.host:47936 failed on socket timeout
> exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
> timeout while waiting for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending
> remote=dead.node.host/dead.node.ip:47936]; For more details see:
> http://wiki.apache.org/hadoop/SocketTimeout
> at sun.reflect.GeneratedConstructorAccessor60.newInstance(Unknown Source)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:749)
> at org.apache.hadoop.ipc.Client.call(Client.java:1415)
> at org.apache.hadoop.ipc.Client.call(Client.java:1364)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> at com.sun.proxy.$Proxy39.stopContainers(Unknown Source)
> at
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:110)
> at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy40.stopContainers(Unknown Source)
> at
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:206)
> at
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:373)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
> timeout while waiting for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending
> remote=dead.node.host/dead.node.ip:47936]
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
> at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
> at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
> at org.apache.hadoop.ipc.Client.call(Client.java:1382)
> ... 15 more
>
> It looks to me like the tasks that had been running on the dead node
> were restarted, and the AM was attempting to clean up those tasks.
> However, since the node was dead, the AM was never able to connect.
>
> I have yarn.client.nodemanager-connect.max-wait-ms set to 300000 (5
> minutes) and ipc.client.connect.max.retries.on.timeouts set to 20.  In
> the logs I see the connection retried 20 times, but then the retry
> count starts over from 0.  I would also expect the AM to give up on the
> connection much sooner.  For comparison, the ResourceManager recognized
> the node as dead after 10 minutes, as expected.  I'd like to see the AM
> do the same.
>
> Has anyone encountered this behavior before?
>
> Thanks!
>
> --
> Andrew Johnson
>
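
For what it's worth, the two settings quoted above appear to belong to
different retry layers, which I believe is why the counter keeps
resetting to 0: ipc.client.connect.timeout (20000 ms by default, which
matches the ConnectTimeoutException above) bounds each individual
connect attempt, ipc.client.connect.max.retries.on.timeouts bounds the
IPC-level retries, and the NodeManager client proxy then retries the
whole RPC until yarn.client.nodemanager-connect.max-wait-ms elapses,
starting a fresh IPC retry counter each time.  A minimal sketch of the
relevant properties (placed inside the <configuration> element of the
named files); the values are illustrative, not a tested recommendation:

  <!-- core-site.xml: per-attempt connect timeout and IPC-level retries -->
  <property>
    <name>ipc.client.connect.timeout</name>
    <value>20000</value>
  </property>
  <property>
    <name>ipc.client.connect.max.retries.on.timeouts</name>
    <value>5</value>
  </property>

  <!-- yarn-site.xml: outer NM-client retry window; each retry at this
       layer appears to restart the IPC counter ("tried 0 time(s)") -->
  <property>
    <name>yarn.client.nodemanager-connect.max-wait-ms</name>
    <value>60000</value>
  </property>
  <property>
    <name>yarn.client.nodemanager-connect.retry-interval-ms</name>
    <value>10000</value>
  </property>

With the quoted values, a single outer attempt can already take roughly
20 retries x 20 s = ~400 s, so lowering either the IPC retry count or
the outer max-wait should shrink the total stall.  The RM's 10-minute
detection corresponds to yarn.nm.liveness-monitor.expiry-interval-ms
(600000 ms by default); as far as I can tell, the AM's ContainerLauncher
does not consult that setting.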



-- 
Andrew Johnson
Software Engineer, Etsy
