Posted to common-issues@hadoop.apache.org by "Wei-Chiu Chuang (JIRA)" <ji...@apache.org> on 2018/04/10 16:08:00 UTC

[jira] [Commented] (HADOOP-15378) Hadoop client unable to relogin because a remote DataNode has an incorrect krb5.conf

    [ https://issues.apache.org/jira/browse/HADOOP-15378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432530#comment-16432530 ] 

Wei-Chiu Chuang commented on HADOOP-15378:
------------------------------------------

[~Apache9] appreciate if you could also look at this one since you were the author of HADOOP-13433.

> Hadoop client unable to relogin because a remote DataNode has an incorrect krb5.conf
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-15378
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15378
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: security
>         Environment: CDH5.8.3, Kerberized, Impala
>            Reporter: Wei-Chiu Chuang
>            Priority: Critical
>
> This is a very weird bug.
> We received a report where a Hadoop client (the Impala Catalog server) failed to relogin and crashed every few hours. Initial indications suggested the symptom matched HADOOP-13433.
> But even after we applied HADOOP-13433 (as well as HADOOP-15143), the Impala Catalog server kept crashing.
>  
> {noformat}
> W0114 05:49:24.676743 41444 UserGroupInformation.java:1838] PriviledgedActionException as:impala/host1.example.com@EXAMPLE.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): Failure to initialize security context
> W0114 05:49:24.680363 41444 UserGroupInformation.java:1137] The first kerberos ticket is not TGT(the server principal is hdfs/host2.example.com@EXAMPLE.COM), remove and destroy it.
> W0114 05:49:24.680501 41444 UserGroupInformation.java:1137] The first kerberos ticket is not TGT(the server principal is hdfs/host3.example.com@EXAMPLE.COM), remove and destroy it.
> W0114 05:49:24.680593 41444 UserGroupInformation.java:1153] Warning, no kerberos ticket found while attempting to renew ticket{noformat}
> The error “Failure to initialize security context” is suspicious here: catalogd was unable to log in because of a Kerberos issue. The JDK expects the first Kerberos ticket of a principal to be a TGT; however, after this error, because the login did not complete successfully, the first ticket in the cache was no longer a TGT. The cleanup added by HADOOP-13433 removes the principal's other tickets on the assumption that a TGT is present among them, which is untrue in this case. So in the end it removed all of the tickets.
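> The cleanup behavior described above can be sketched as follows. This is a hypothetical simplification for illustration only (plain strings stand in for KerberosTicket objects, and the class and method names are made up), not the actual UserGroupInformation code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplification of the HADOOP-13433 ticket cleanup: plain
// strings stand in for KerberosTicket objects, and a ticket counts as a
// TGT when its server principal starts with "krbtgt/".
public class TicketOrderSketch {
    static List<String> fixTicketOrder(List<String> tickets) {
        List<String> cache = new ArrayList<>(tickets);
        // Mirrors the behavior described above: while the first ticket is
        // not a TGT, remove (and destroy) it. If no TGT is present at all,
        // this drains the entire cache.
        while (!cache.isEmpty() && !cache.get(0).startsWith("krbtgt/")) {
            cache.remove(0);
        }
        return cache;
    }

    public static void main(String[] args) {
        // The situation in this issue: service tickets only, no TGT left
        // after the failed login, so every ticket gets removed.
        System.out.println(fixTicketOrder(List.of(
            "hdfs/host2.example.com@EXAMPLE.COM",
            "hdfs/host3.example.com@EXAMPLE.COM")));   // prints []

        // The case the cleanup was written for: a TGT is present but not
        // first; the cleanup leaves it at the head of the cache.
        System.out.println(fixTicketOrder(List.of(
            "hdfs/host2.example.com@EXAMPLE.COM",
            "krbtgt/EXAMPLE.COM@EXAMPLE.COM")));
    }
}
```

> With no TGT anywhere in the cache, the loop drains it completely, which matches the “no kerberos ticket found” warning above.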
> And then
> {noformat}
> W0114 05:49:24.681946 41443 UserGroupInformation.java:1838] PriviledgedActionException as:impala/host1.example.com@EXAMPLE.COM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
> {noformat}
> The error “Failed to find any Kerberos tgt” typically indicates that the user's Kerberos ticket has expired. However, that's definitely not the case here, since the ticket was only a little over 8 hours old.
> After we applied HADOOP-13433, the error-handling code threw a NullPointerException, as reported in HADOOP-15143.
>  
> {code:java}
> I0114 05:50:26.758565 6384 RetryInvocationHandler.java:148] Exception while invoking listCachePools of class ClientNamenodeProtocolTranslatorPB over host4.example.com/10.0.121.66:8020 after 2 fail over attempts. Trying to fail over immediately. Java exception follows:
> java.io.IOException: Failed on local exception: java.io.IOException: Couldn't set up IO streams; Host Details : local host is: "host1.example.com/10.0.121.45"; destination host is: "host4.example.com":8020;
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1506)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1439)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>     at com.sun.proxy.$Proxy9.listCachePools(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.listCachePools(ClientNamenodeProtocolTranslatorPB.java:1261)
>     at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>     at com.sun.proxy.$Proxy10.listCachePools(Unknown Source)
>     at org.apache.hadoop.hdfs.protocol.CachePoolIterator.makeRequest(CachePoolIterator.java:55)
>     at org.apache.hadoop.hdfs.protocol.CachePoolIterator.makeRequest(CachePoolIterator.java:33)
>     at org.apache.hadoop.fs.BatchedRemoteIterator.makeRequest(BatchedRemoteIterator.java:77)
>     at org.apache.hadoop.fs.BatchedRemoteIterator.makeRequestIfNeeded(BatchedRemoteIterator.java:85)
>     at org.apache.hadoop.fs.BatchedRemoteIterator.hasNext(BatchedRemoteIterator.java:99)
>     at com.cloudera.impala.catalog.CatalogServiceCatalog$CachePoolReader.run(CatalogServiceCatalog.java:193)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Couldn't set up IO streams
>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:826)
>     at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:396)
>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1555)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1478)
>     ... 23 more
> Caused by: java.lang.NullPointerException
>     at org.apache.hadoop.security.UserGroupInformation.fixKerberosTicketOrder(UserGroupInformation.java:1136)
>     at org.apache.hadoop.security.UserGroupInformation.reloginFromTicketCache(UserGroupInformation.java:1272)
>     at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:697)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1835)
>     at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:681)
>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:769)
>     ... 26 more
> {code}
>  
> In any case, HADOOP-15143 does not fix the problem, since the valid credentials had already been removed from the user.
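> For context, an NPE fix of that kind implies a defensive pattern along these lines: check for a missing TGT instead of assuming one exists. This is again a hypothetical sketch with stand-in types, not the actual patch:

```java
import java.util.List;

// Hypothetical sketch of a null-safe TGT lookup. Strings stand in for
// KerberosTicket objects; findTgt models the lookup whose result the
// pre-HADOOP-15143 code dereferenced without a null check.
public class TgtGuardSketch {
    static String findTgt(List<String> tickets) {
        for (String t : tickets) {
            if (t.startsWith("krbtgt/")) {
                return t;       // first TGT-looking entry
            }
        }
        return null;            // cache holds no TGT at all
    }

    public static void main(String[] args) {
        // After the cleanup has drained the cache, the lookup returns
        // null; a guard must handle that instead of throwing an NPE.
        String tgt = findTgt(List.of());
        if (tgt == null) {
            System.out.println("no TGT found; cannot relogin");
        }
    }
}
```

> But a guard like this only avoids the NPE; once the credentials are gone, the relogin still cannot succeed, which is why the underlying problem remains.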
> Also, even after HADOOP-13433 was applied, I still saw the following error:
> {noformat}
> W0113 14:33:44.254727 1255277 UserGroupInformation.java:1838] PriviledgedActionException as:impala/host1.example.com@EXAMPLE.COM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: The ticket isn't for us (35) - BAD TGS SERVER NAME)]
> {noformat}
>  
> We traced the SaslException back to a particular remote DataNode. It turned out that the DataNode's host had an incorrect krb5.conf, which pointed to a decommissioned KDC, inconsistent with the rest of the hosts in the cluster. Upon authentication, the error handling for the SaslException appears to have a race condition that occasionally removes valid credentials from the UGI. When we corrected the krb5.conf on that host, this error disappeared.
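> For illustration, the kind of inconsistency described above would look like this in the bad host's /etc/krb5.conf (the KDC hostnames here are hypothetical):

```
[libdefaults]
    default_realm = EXAMPLE.COM

[realms]
    EXAMPLE.COM = {
        # Every other host in the cluster points at the live KDC;
        # this host still points at a decommissioned one, so its
        # tickets come from the wrong server.
        kdc = old-kdc.example.com
        admin_server = old-kdc.example.com
    }
```

> Aligning this file with the rest of the cluster is what made the “BAD TGS SERVER NAME” error disappear.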
> This is not easy to reproduce, but we are told the issue seems to occur after an Impala user runs a "refresh" command. Presumably this command forces the Impala catalog server to connect to the problematic DataNode and trigger the buggy code path.
>  
> I don't have a patch to fix it, but I want to raise this issue for discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org