Posted to hdfs-dev@hadoop.apache.org by "Ming Ma (JIRA)" <ji...@apache.org> on 2014/07/22 07:58:39 UTC

[jira] [Created] (HDFS-6721) Handle the situation where SBN is in zombie state

Ming Ma created HDFS-6721:
-----------------------------

             Summary: Handle the situation where SBN is in zombie state
                 Key: HDFS-6721
                 URL: https://issues.apache.org/jira/browse/HDFS-6721
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Ming Ma


Issue:

In an HA setup, when the first NN in the service list is the standby NameNode (SBN), the RPC client always tries the first NN, gets a StandbyException, and then fails over to the second NN in the service list, which is the active NN.

This works well when the SBN is healthy. It also works well when the SBN isn't running, for example during a rolling upgrade, in which case the client gets "java.net.ConnectException: Connection refused" right away.

When the SBN is in a zombie state, for example when the machine is low on memory and the SBN process is still running but can barely make progress, the client gets a ConnectTimeoutException.

{noformat}
14/07/21 04:12:42 DEBUG ipc.Client: Connecting to hadoop-foo-nn1/a.b.c.d:8020
14/07/21 04:13:02 DEBUG ipc.Client: closing ipc connection to hadoop-foo-nn1/a.b.c.d:8020: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=hadoop-foo-nn1/a.b.c.d:8020]
{noformat}

When this happens, each RPC client connection wastes 20 seconds before failing over, which slows down MR jobs significantly.
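
To put the cost in perspective, here is a minimal toy model (not Hadoop code) that compares the time a client spends on the first NN in each of the three cases above. The class, enum, and method names are hypothetical. The 20000 ms figure comes from the log above; the few-millisecond delays for the healthy and not-running cases, and the count of 100 connections, are illustrative assumptions only.

```java
/** Toy model of time lost on the first NN before failover. Not Hadoop code; all names are hypothetical. */
final class FailoverCostSketch {

    /** How the first NameNode in the service list responds to a connection attempt. */
    enum FirstNnState {
        STANDBY_HEALTHY(5),   // assumption: StandbyException comes back within a few ms
        NOT_RUNNING(1),       // assumption: "Connection refused" is near-instant
        ZOMBIE(20_000);       // from the log above: 20000 ms connect timeout

        final long delayMillis;

        FirstNnState(long delayMillis) {
            this.delayMillis = delayMillis;
        }
    }

    /**
     * Aggregate time spent on the first NN, summed over all connections.
     * This is a sum across clients, not wall-clock time, since clients run in parallel.
     */
    static long wastedMillis(FirstNnState state, int connections) {
        return state.delayMillis * connections;
    }

    public static void main(String[] args) {
        int connections = 100; // illustrative: e.g. 100 task JVMs, one connection each
        for (FirstNnState state : FirstNnState.values()) {
            System.out.printf("%-16s %d ms%n", state, wastedMillis(state, connections));
        }
    }
}
```

Under these assumptions the zombie case costs several thousand times more per connection than either of the other two, which is why it shows up at the job level.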


Solution:
 
Perhaps this is the responsibility of an external monitoring service for HDFS, which can detect a machine in a zombie state and restart it.
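
As a rough illustration of what such an external check might look like, the sketch below attempts a plain TCP connect with a short timeout and separates "refused" from "timed out", which are the two failure modes described above. It uses only java.net and is not an HDFS API; the class name, enum, and timeout value are assumptions. A successful TCP connect only shows that the port accepts connections, so a real monitor would likely need an RPC-level health check as well.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical TCP-level probe for a NameNode RPC port. Not part of Hadoop. */
final class NnConnectProbe {

    enum Result { ACCEPTING, REFUSED, TIMED_OUT, OTHER_ERROR }

    /** Tries to open a TCP connection within timeoutMillis and classifies the outcome. */
    static Result probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return Result.ACCEPTING;           // port is open; says nothing about RPC health
        } catch (SocketTimeoutException e) {
            return Result.TIMED_OUT;           // the zombie-like case: connect never completes
        } catch (ConnectException e) {
            return Result.REFUSED;             // process not listening, e.g. during an upgrade
        } catch (IOException e) {
            return Result.OTHER_ERROR;         // unresolved host, no route, and so on
        }
    }

    public static void main(String[] args) {
        // Usage: java NnConnectProbe <host> <port>
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8020;
        System.out.println(host + ":" + port + " -> " + probe(host, port, 2000));
    }
}
```

A monitor could treat repeated TIMED_OUT results as a signal to alert or restart the machine, while REFUSED is the benign case that clients already handle quickly.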

Can HDFS handle this automatically? The state in ZK and on the DNs already points to the correct active NN. For example, a task JVM could get a hint for the active NN from the DN on the local machine.
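
One way the hint idea could work on the client side, sketched under assumptions: if the client obtains a hint naming the active NN (from the local DN, ZK, or elsewhere), it tries that NN first and keeps the configured order for the rest. How the hint is obtained is left open here. The class and method below are hypothetical and do not correspond to any existing HDFS failover proxy provider.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper that reorders the NN service list using an active-NN hint. Not Hadoop code. */
final class ActiveHintOrdering {

    /**
     * Returns a copy of configuredNns with the hinted NN moved to the front.
     * A missing hint, or a hint that is not in the configured list, leaves the order unchanged.
     */
    static List<String> orderByHint(List<String> configuredNns, Optional<String> activeHint) {
        List<String> ordered = new ArrayList<>(configuredNns);
        // remove(Object) returns true only if the hinted NN was in the list.
        if (activeHint.isPresent() && ordered.remove(activeHint.get())) {
            ordered.add(0, activeHint.get());
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> nns = Arrays.asList("nn1", "nn2"); // nn1 is first in the service list
        System.out.println(orderByHint(nns, Optional.of("nn2")));
        System.out.println(orderByHint(nns, Optional.empty()));
    }
}
```

Because the hint only changes the order in which NNs are tried, a stale or wrong hint would degrade to the existing behavior rather than break failover.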



--
This message was sent by Atlassian JIRA
(v6.2#6252)