Posted to dev@ambari.apache.org by "Hari Sekhon (JIRA)" <ji...@apache.org> on 2015/04/16 12:50:58 UTC

[jira] [Updated] (AMBARI-10518) Ambari 2.0 stack upgrade HDP 2.2.0.0 => 2.2.4.0 breaks on safe mode check due to not initializing hdfs krb cache properly

     [ https://issues.apache.org/jira/browse/AMBARI-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated AMBARI-10518:
---------------------------------
    Summary: Ambari 2.0 stack upgrade HDP 2.2.0.0 => 2.2.4.0 breaks on safe mode check due to not initializing hdfs krb cache properly  (was: Ambari 2.0 stack upgrade HDP 2.2.0.0 => 2.2.4.0 breaks on safe mode check due to not kinit'd hdfs krb cache properly)

> Ambari 2.0 stack upgrade HDP 2.2.0.0 => 2.2.4.0 breaks on safe mode check due to not initializing hdfs krb cache properly
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMBARI-10518
>                 URL: https://issues.apache.org/jira/browse/AMBARI-10518
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server, stacks
>    Affects Versions: 2.0.0
>         Environment: HDP 2.2.0.0 => 2.2.4.0
>            Reporter: Hari Sekhon
>         Attachments: errors-5543.txt, output-5543.txt
>
>
> After deploying the new HDP 2.2.4.0 stack to all nodes successfully in Ambari 2.0, the "perform upgrade" procedure fails on the first step:
> {code}Fail: 2015-04-16 11:36:32,623 - Performing a(n) upgrade of HDFS
> 2015-04-16 11:36:32,624 - u"Execute['/usr/bin/kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs']" {}
> 2015-04-16 11:36:32,811 - Prepare to transition into safemode state OFF
> 2015-04-16 11:36:32,812 - call['su - hdfs -c 'hdfs dfsadmin -safemode get''] {}
> 2015-04-16 11:36:36,481 - Command: su - hdfs -c 'hdfs dfsadmin -safemode get'
> Code: 255.
> 2015-04-16 11:36:36,481 - Error while executing command 'prepare_rolling_upgrade':
> Traceback (most recent call last):
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 214, in execute
>     method(env)
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 67, in prepare_rolling_upgrade
>     namenode_upgrade.prepare_rolling_upgrade()
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode_upgrade.py", line 100, in prepare_rolling_upgrade
>     raise Fail("Could not transition to safemode state %s. Please check logs to make sure namenode is up." % str(SafeMode.OFF))
> Fail: Could not transition to safemode state OFF. Please check logs to make sure namenode is up.
> 2015-04-16 11:36:36,481 - Error while executing command 'prepare_rolling_upgrade':
> Traceback (most recent call last):
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 214, in execute
>     method(env)
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 67, in prepare_rolling_upgrade
>     namenode_upgrade.prepare_rolling_upgrade()
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode_upgrade.py", line 100, in prepare_rolling_upgrade
>     raise Fail("Could not transition to safemode state %s. Please check logs to make sure namenode is up." % str(SafeMode.OFF))
> Fail: Could not transition to safemode state OFF. Please check logs to make sure namenode is up.{code}
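> For context, the check in namenode_upgrade.py shells out to hdfs dfsadmin as the hdfs user and treats any non-zero exit code as a failed safemode transition, which is why the Kerberos problem shown below surfaces as the generic "Could not transition to safemode state OFF" message. A standalone sketch of that behaviour (an approximation for illustration, not the actual Ambari source):
> {code}
> # Simplified sketch of the safemode check performed by prepare_rolling_upgrade
> # (approximation only, not the actual Ambari code)
> import subprocess
>
> class Fail(Exception):
>     pass
>
> def safemode_is_off(user="hdfs"):
>     # Run the query as the given user; a Kerberos/GSS failure makes hdfs exit 255
>     cmd = "su - %s -c 'hdfs dfsadmin -safemode get'" % user
>     proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
>     out, err = proc.communicate()
>     return proc.returncode == 0 and "Safe mode is OFF" in out
>
> if not safemode_is_off():
>     raise Fail("Could not transition to safemode state OFF. Please check logs to make sure namenode is up.")
> {code}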
> It looks like this is because the Kerberos cache was not properly initialized, as I can see an old expired cache:
> {code}
> # su - hdfs -c 'hdfs dfsadmin -safemode get'
> 15/04/16 11:42:23 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
> safemode: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "<host>/<ip>"; destination host is: "<host>":8020;
> # echo $?
> 255
> # su - hdfs
> [hdfs@<host> ~]$ klist
> Ticket cache: FILE:/tmp/krb5cc_1008
> Default principal: hdfs@LOCALDOMAIN
> Valid starting     Expires            Service principal
> 04/13/15 16:10:59  04/14/15 16:10:59  krbtgt/LOCALDOMAIN@LOCALDOMAIN
>         renew until 04/20/15 16:10:59
> [hdfs@<host> ~]$ /usr/bin/kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs
> [hdfs@<host> ~]$ logout
> # su - hdfs -c 'hdfs dfsadmin -safemode get'
> Safe mode is OFF in <nn1>/<ip1>:8020
> Safe mode is OFF in <nn2>/<ip2>:8020
> {code}
> It looks like the Kerberos cache was initialized for root instead of the hdfs user, since the kinit command was not run under su - hdfs (or otherwise as the hdfs user).
> I retried once and got the same error (captured above for this JIRA). After logging in as hdfs, manually kinit'ing the hdfs user's Kerberos cache, and retrying in Ambari, the step succeeded, so that is the workaround for now.
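> On the Ambari side the fix would presumably be to run that kinit as the hdfs user rather than as root, e.g. by passing a user to the Execute resource in the upgrade script. A sketch only (the parameter usage follows the usual Ambari service-script convention; this is not the actual patch):
> {code}
> # Illustrative sketch, not the actual Ambari patch: run the kinit under the hdfs
> # account so the ticket cache used by 'hdfs dfsadmin' is the hdfs user's, not root's.
> from resource_management.core.resources.system import Execute
>
> Execute("/usr/bin/kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs",
>         user="hdfs")  # without user=..., the cache is initialized for root
> {code}
> That is effectively what the manual workaround above does by hand.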
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)