Posted to common-user@hadoop.apache.org by Darpan Patel <da...@gmail.com> on 2015/12/02 18:19:28 UTC

HDFS can't be connected in the HA mode when one machine is down

Hi folks,

I am facing a strange issue.
I have a kerberized 5-node cluster with two HDFS NameNodes running in HA
mode, say master-1 and master-2. Master-2 also hosts the YARN
ResourceManager, the History Server, etc. Last week we shut down all the
machines. Since then, one master node (master-2) has not been starting, for
"unknown" reasons.

Today, on the edge node, I tried issuing an HDFS command (hadoop fs -ls /),
but it could not list anything and only printed exceptions: Exception while
invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB

Moreover, I see that the NN running on master-1 also keeps shutting down
automatically. I started it again, but it went down again. This is rather
strange.
Here is the output of the HDFS command and a few logs.

I would be grateful if someone could help with this.

Thanks,
DP

*[root@edgenode ~]# hadoop fs -ls /*
15/12/02 16:52:32 INFO retry.RetryInvocationHandler: Exception while
invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over
<Host-NameNodeHost-2>/<IP_NameNode_Host-2>:8020 after 1 fail over attempts.
Trying to fail over after sleeping for 933ms.
org.apache.hadoop.net.ConnectTimeoutException: Call From
<Host-NameNodeHost-1> /<NameNodeHost1-IP> to <Host-NameNodeHost-2>:8020
failed on socket timeout exception:
org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while
waiting for channel to be ready for connect. ch :
java.nio.channels.SocketChannel[connection-pending
remote=<Host-NameNodeHost-2>/<IP_NameNode_Host-2>:8020]; For more details
see:  http://wiki.apache.org/hadoop/SocketTimeout
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
        at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

..........
^C15/12/02 16:52:38 INFO retry.RetryInvocationHandler: Exception while
invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over
<Host-NameNodeHost-1> /IP-Namenode-1:8020 after 2 fail over attempts.
Trying to fail over after sleeping for 1527ms.
java.net.ConnectException: Call From
<Host-NameNodeHost-1>/<NameNodeHost1-IP> to <Host-NameNodeHost-1> :8020
failed on connection exception: java.net.ConnectException: Connection
refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
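
The client output above shows two different failures: a connect timeout
towards NameNode host 2 (the machine that is down) and "Connection refused"
on NameNode host 1 (the machine is up, but nothing is listening on 8020).
A minimal sketch for telling these cases apart from the edge node is given
below. It only attempts a plain TCP connect, so it says nothing about
Kerberos or HA state. The function name probe and the script name in the
usage note are hypothetical; ports 8020 and 8485 are taken from the logs in
this message.

```python
import socket
import sys


def probe(host, port, timeout=5.0):
    """Classify a TCP connect attempt as 'open', 'refused', 'timeout' or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"          # something accepted the connection
    except ConnectionRefusedError:
        return "refused"           # host reachable, no listener on this port
    except socket.timeout:
        return "timeout"           # no answer at all, e.g. the host is down
    except OSError:
        return "error"             # anything else, e.g. name resolution failure


if __name__ == "__main__":
    # Hosts come from the command line; ports are the ones seen in the logs
    # (8020: NameNode RPC, 8485: the port the NameNode keeps retrying).
    for host in sys.argv[1:]:
        for port in (8020, 8485):
            print(host, port, probe(host, port))
```

Saved as, say, probe_ports.py, it could be run as
"python probe_ports.py master-1 master-2".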


*vi /var/log/hadoop/hdfs/hadoop-hdfs-namenode-HostNameNode-1.log*

com/IP-Namenode-2:8485. Already tried 35 time(s); maxRetries=45
2015-12-02 16:27:08,639 INFO  ipc.Server (Server.java:saslProcess(1383)) -
Auth successful for nn/<Hostname>@KDCRealm (auth:KERBEROS)
2015-12-02 16:27:08,703 INFO  authorize.ServiceAuthorizationManager
(ServiceAuthorizationManager.java:authorize(135)) - Authorization
successful for nn/<Host-NameNodeHost-1> @KDCRealm (auth:KERBEROS) for
protocol=interface org.apache.hadoop.ha.HAServiceProtocol
2015-12-02 16:27:15,223 INFO  ipc.Client
(Client.java:handleConnectionTimeout(835)) - Retrying connect to server:
<Host-NameNodeHost-2>/<IP_NameNode_Host-2>:8485. Already tried 36 time(s);
maxRetries=45
2015-12-02 16:27:17,201 INFO  queue.AuditFileSpool
(AuditFileSpool.java:runDoAs(780)) - Destination is down. sleeping for
30000 milli seconds. indexQueue=0,
queueName=hdfs.async.summary.multi_dest.batch,
consumer=hdfs.async.summary.multi_dest.batch.hdfs
2015-12-02 16:27:17,230 INFO  queue.AuditFileSpool
(AuditFileSpool.java:runDoAs(780)) - Destination is down. sleeping for
30000 milli seconds. indexQueue=0,
queueName=hdfs.async.summary.multi_dest.batch,
consumer=hdfs.async.summary.multi_dest.batch.db
2015-12-02 16:27:21,996 INFO  ipc.Server (Server.java:saslProcess(1383)) -
Auth successful for admin/admin@KDCRealm (auth:KERBEROS)
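
The NameNode log above keeps retrying port 8485 on NameNode host 2 and is
at 36 of 45 attempts. To see how those retry counters progress over the
whole log, a rough sketch like the following could be used. It assumes the
"Retrying connect to server" message format shown above; the function name
retry_attempts and the regular expression are my own, not part of Hadoop.

```python
import re
import sys

# Matches the ipc.Client retry message as it appears in the log above.
RETRY = re.compile(
    r"Retrying connect to server: (\S+?)\. "
    r"Already tried (\d+) time\(s\); maxRetries=(\d+)"
)


def retry_attempts(log_text):
    """Return (server, tried, max_retries) for every retry message found."""
    # Collapse line breaks first, since the message may be wrapped.
    flat = " ".join(log_text.split())
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in RETRY.finditer(flat)
    ]


if __name__ == "__main__":
    # Log file paths are given on the command line.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for server, tried, limit in retry_attempts(fh.read()):
                print(f"{server}: {tried}/{limit}")
```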

*vi /var/log/hadoop/hdfs/hadoop-hdfs-zkfc-HostNameNode-1.log*

2015-12-02 16:18:40,726 WARN  ha.HealthMonitor
(HealthMonitor.java:doHealthChecks(211)) - Transport-level exception trying
to monitor health of NameNode at <HostNameNode-1>/<IPofNameNode-1>:8020:
java.net.SocketTimeoutException: 45000 millis timeout while waiting for
channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/<IPofNameNode-1>:58279
remote=<HostNameNode-1>/<IPofNameNode-1>:8020] Call From
<HostNameNode-1>/<IPofNameNode-1> to <HostNameNode-1>:8020 failed on socket
timeout exception: java.net.SocketTimeoutException: 45000 millis timeout
while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/<IPofNameNode-1>:58279
remote=<HostNameNode-1>/<IPofNameNode-1>:8020]; For more details see:
http://wiki.apache.org/hadoop/SocketTimeout