Posted to hdfs-dev@hadoop.apache.org by "Daniel Osvath (Jira)" <ji...@apache.org> on 2021/08/12 17:56:00 UTC

[jira] [Created] (HDFS-16165) Backport the Hadoop 3.x Kerberos synchronization fix to Hadoop 2.x

Daniel Osvath created HDFS-16165:
------------------------------------

             Summary: Backport the Hadoop 3.x Kerberos synchronization fix to Hadoop 2.x
                 Key: HDFS-16165
                 URL: https://issues.apache.org/jira/browse/HDFS-16165
             Project: Hadoop HDFS
          Issue Type: Wish
         Environment: Can be reproduced in docker HDFS environment with Kerberos https://github.com/vdesabou/kafka-docker-playground/blob/93a93de293ad2f9bb22afb244f2d8729a178296e/connect/connect-hdfs2-sink/hdfs2-sink-ha-kerberos-repro-gss-exception.sh
            Reporter: Daniel Osvath


*Problem Description*

For more than a year, Apache Kafka Connect users have been running into a Kerberos renewal issue that causes our HDFS2 connectors to fail.

We have been able to consistently reproduce the issue under high load with 40 connectors (threads) that use the library. When we use an alternate workaround that relies on the Kerberos keytab on the system, the connector operates without issues.

We identified the root cause to be a race condition bug in the Hadoop 2.x library that causes the ticket renewal to fail with the error below:


{code:java}
Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
 at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
{code}
We reached this conclusion about the root cause after running the same environment (40 connectors) against Hadoop 3.x with our HDFS3 connectors, which operated without renewal issues. Additionally, identifying that the synchronization issue has been fixed in the newer Hadoop 3.x releases confirmed our hypothesis about the root cause.

There are many changes in the Hadoop 3.x [UserGroupInformation.java|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java] related to UGI synchronization, made as part of https://issues.apache.org/jira/browse/HADOOP-9747. Those changes suggest race conditions were present in the older versions, i.e. Hadoop 2.x, which would explain why we can reproduce the problem with HDFS2.
For example (among others):
{code:java}
  private void relogin(HadoopLoginContext login, boolean ignoreLastLoginTime)
      throws IOException {
    // ensure the relogin is atomic to avoid leaving credentials in an
    // inconsistent state.  prevents other ugi instances, SASL, and SPNEGO
    // from accessing or altering credentials during the relogin.
    synchronized(login.getSubjectLock()) {
      // another racing thread may have beat us to the relogin.
      if (login == getLogin()) {
        unprotectedRelogin(login, ignoreLastLoginTime);
      }
    }
  }
{code}
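The guarded pattern above can be illustrated in isolation. The sketch below is not the actual Hadoop code; the class and field names are hypothetical. It shows the key idea from HADOOP-9747: under the subject lock, a racing thread rechecks whether the login it originally observed is still current, so only the first thread actually performs the relogin and the others skip it instead of clobbering freshly obtained credentials.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the UGI relogin guard; not Hadoop's real classes.
public class ReloginSketch {
    private final Object subjectLock = new Object();
    private volatile Object currentLogin = new Object();
    private final AtomicInteger relogins = new AtomicInteger();

    Object getLogin() {
        return currentLogin;
    }

    // Each caller passes the login context it observed before deciding to
    // relogin; only the first racing thread replaces the credentials.
    void relogin(Object observedLogin) {
        synchronized (subjectLock) {
            // another racing thread may have beat us to the relogin
            if (observedLogin == currentLogin) {
                currentLogin = new Object(); // swap credentials atomically
                relogins.incrementAndGet();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ReloginSketch ugi = new ReloginSketch();
        Object stale = ugi.getLogin();
        // 40 racing "connectors", mirroring the reproduction scenario.
        Thread[] threads = new Thread[40];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> ugi.relogin(stale));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        // All 40 threads raced, but the relogin ran exactly once.
        System.out.println("relogins=" + ugi.relogins.get());
    }
}
```

Without the lock and the recheck, several of the 40 threads could each tear down and rebuild the shared credentials concurrently, which is the inconsistent state the Hadoop 2.x code path allows.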
None of those changes were backported to Hadoop 2.x (our HDFS2 connector uses 2.10.1), on which several CDH distributions are based.

*Request*
We would like to ask for the synchronization fix to be backported to Hadoop 2.x so that our users can operate without issues. 

*Impact*
The older Hadoop 2.x line is used by our HDFS connector, which our community runs in production. Currently, the issue causes the connector to fail, as it is unable to recover and renew the ticket at a later point. A backported fix would spare our users manual intervention every week (or every few days in some cases). The only workaround available to the community is to run a command or restart their workers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-help@hadoop.apache.org