Posted to dev@ambari.apache.org by "Jayush Luniya (JIRA)" <ji...@apache.org> on 2015/10/13 03:31:05 UTC

[jira] [Created] (AMBARI-13396) RU: Handle Namenode being down scenarios

Jayush Luniya created AMBARI-13396:
--------------------------------------

             Summary: RU: Handle Namenode being down scenarios
                 Key: AMBARI-13396
                 URL: https://issues.apache.org/jira/browse/AMBARI-13396
             Project: Ambari
          Issue Type: Bug
          Components: ambari-server
    Affects Versions: 2.1.2
            Reporter: Jayush Luniya
            Assignee: Jayush Luniya
             Fix For: 2.1.3


There are two scenarios that need to be handled during a Rolling Upgrade (RU):

*Setup:*
* host1: namenode1, host2: namenode2
* namenode1 on host1 is down

*Scenario 1: During RU, namenode1 on host1 is going to be upgraded before namenode2 on host2*
Since namenode1 on host1 is already down, namenode2 is the active NameNode. We should fix the logic to simply restart namenode1, as namenode2 will remain active (see the sketch below).
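A minimal sketch of the proposed Scenario 1 logic, written in the style of Ambari's resource_management scripts (the same shell.call API that appears in the log below). The helper names get_service_state and prepare_namenode_restart are illustrative only, not the actual functions in the Ambari codebase.

{code:python}
from resource_management.core import shell

def get_service_state(namenode_id, user='hdfs'):
  """Return 'active', 'standby', or None if the NameNode is unreachable."""
  code, out = shell.call(
      "hdfs haadmin -getServiceState {0}".format(namenode_id),
      user=user, logoutput=True)
  if code != 0:
    return None  # the NameNode (or its RPC endpoint) is down
  return out.strip()

def prepare_namenode_restart(this_nn, other_nn):
  # Scenario 1: the peer NameNode is already active because this one is
  # down, so no failover is needed; restart this NameNode directly.
  if get_service_state(other_nn) == 'active':
    return
  # Otherwise a graceful failover is required first (see Scenario 2).
{code}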

*Scenario 2: During RU, namenode2 on host2 is going to be upgraded before namenode1 on host1*
Since namenode2 on host2 is active, we should fail: there isn't another NameNode instance that can become active. However, today we do the following:
# Call "hdfs haadmin -failover nn2 nn1" which will fail since nn1 is not healthy.
# When this command fails, we kill ZKFC on this host and then we wait for this instance to come back as standby which will never happen because this instance will come back as active. 

We should simply fail when the "haadmin -failover" command fails, instead of killing ZKFC, as sketched below.
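A hedged sketch of that proposed behavior, assuming the same shell.call API; failover_namenode is an illustrative name, and the actual fix would live in Ambari's NameNode upgrade scripts.

{code:python}
from resource_management.core import shell
from resource_management.core.exceptions import Fail

def failover_namenode(from_nn, to_nn, user='hdfs'):
  code, out = shell.call(
      "hdfs haadmin -failover {0} {1}".format(from_nn, to_nn),
      user=user, logoutput=True)
  if code != 0:
    # Scenario 2: the failover target is not healthy, so there is no other
    # NameNode that can take over. Abort the upgrade step right here rather
    # than killing ZKFC; killing ZKFC brings this NameNode back as active,
    # and the wait-for-standby loop (see the log below) then retries forever.
    raise Fail("Rolling Upgrade - failover from {0} to {1} failed with "
               "return code {2}".format(from_nn, to_nn, code))
{code}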

{noformat}
2015-10-12 22:35:15,307 - Rolling Upgrade - Initiating a ZKFC failover on active NameNode host jay-ams-2.c.pramod-thangali.internal.
2015-10-12 22:35:15,308 - call['hdfs haadmin -failover nn2 nn1'] {'logoutput': True, 'user': 'hdfs'}
Operation failed: NameNode at jay-ams-1.c.pramod-thangali.internal/10.240.0.178:8020 is not currently healthy. Cannot be failover target
	at org.apache.hadoop.ha.ZKFailoverController.checkEligibleForFailover(ZKFailoverController.java:698)
	at org.apache.hadoop.ha.ZKFailoverController.doGracefulFailover(ZKFailoverController.java:632)
	at org.apache.hadoop.ha.ZKFailoverController.access$400(ZKFailoverController.java:61)
	at org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:604)
	at org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:601)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ha.ZKFailoverController.gracefulFailoverToYou(ZKFailoverController.java:601)
	at org.apache.hadoop.ha.ZKFCRpcServer.gracefulFailover(ZKFCRpcServer.java:94)
	at org.apache.hadoop.ha.protocolPB.ZKFCProtocolServerSideTranslatorPB.gracefulFailover(ZKFCProtocolServerSideTranslatorPB.java:61)
	at org.apache.hadoop.ha.proto.ZKFCProtocolProtos$ZKFCProtocolService$2.callBlockingMethod(ZKFCProtocolProtos.java:1548)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)

2015-10-12 22:35:17,748 - call returned (255, 'Operation failed: NameNode at jay-ams-1.c.pramod-thangali.internal/10.240.0.178:8020 is not currently healthy. Cannot be failover target\n\tat org.apache.hadoop.ha.ZKFailoverController.checkEligibleForFailover(ZKFailoverController.java:698)\n\tat org.apache.hadoop.ha.ZKFailoverController.doGracefulFailover(ZKFailoverController.java:632)\n\tat org.apache.hadoop.ha.ZKFailoverController.access$400(ZKFailoverController.java:61)\n\tat org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:604)\n\tat org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:601)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)\n\tat org.apache.hadoop.ha.ZKFailoverController.gracefulFailoverToYou(ZKFailoverController.java:601)\n\tat org.apache.hadoop.ha.ZKFCRpcServer.gracefulFailover(ZKFCRpcServer.java:94)\n\tat org.apache.hadoop.ha.protocolPB.ZKFCProtocolServerSideTranslatorPB.gracefulFailover(ZKFCProtocolServerSideTranslatorPB.java:61)\n\tat org.apache.hadoop.ha.proto.ZKFCProtocolProtos$ZKFCProtocolService$2.callBlockingMethod(ZKFCProtocolProtos.java:1548)\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)')
2015-10-12 22:35:17,748 - Rolling Upgrade - failover command returned 255
2015-10-12 22:35:17,749 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ls /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid > /dev/null 2>&1 && ps -p `cat /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid` > /dev/null 2>&1''] {}
2015-10-12 22:35:17,777 - call returned (0, '')
2015-10-12 22:35:17,778 - Execute['kill -15 `cat /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid`'] {'user': 'hdfs'}
2015-10-12 22:35:17,803 - File['/var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid'] {'action': ['delete']}
2015-10-12 22:35:17,803 - Deleting File['/var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid']
2015-10-12 22:35:17,803 - call['hdfs haadmin -getServiceState nn2 | grep standby'] {'logoutput': True, 'user': 'hdfs'}
2015-10-12 22:35:20,922 - call returned (1, '')
2015-10-12 22:35:20,923 - Rolling Upgrade - check for standby returned 1
2015-10-12 22:35:20,923 - Waiting for this NameNode to become the standby one.
2015-10-12 22:35:20,923 - Execute['hdfs haadmin -getServiceState nn2 | grep standby'] {'logoutput': True, 'tries': 50, 'user': 'hdfs', 'try_sleep': 6}
2015-10-12 22:35:23,135 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1. 
2015-10-12 22:35:31,388 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1. 
2015-10-12 22:35:39,709 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1. 
2015-10-12 22:35:47,992 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1. 
2015-10-12 22:35:56,289 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1. 
2015-10-12 22:36:04,627 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1. 
{noformat}
