Posted to dev@zeppelin.apache.org by gss2002 <gi...@git.apache.org> on 2018/03/21 17:24:58 UTC

[GitHub] zeppelin pull request #2886: ZEPPELIN-3356: Zeppelin FileSystemStorage relog...

GitHub user gss2002 opened a pull request:

    https://github.com/apache/zeppelin/pull/2886

    ZEPPELIN-3356: Zeppelin FileSystemStorage reloginFromKeytab needed

    What is this PR for?
    During long runs of Apache Zeppelin with HDFS as the backing store for configuration and notebooks, we noticed that once the Zeppelin Server ticket reached 7 days (our maximum renewal time), the server did not log in again from the keytab, leaving the Zeppelin Server in an unusable state. The solution is to call reloginFromKeytab before any operation, as it checks whether the ticket needs to be refreshed by logging in again.
    
    What type of PR is it?
    [Bug Fix]
    
    Todos
    
    What is the Jira issue?
    https://issues.apache.org/jira/browse/ZEPPELIN-3356
    
    How should this be tested?
    Run Zeppelin Server for longer than the maximum Kerberos renewal time.
    
    Screenshots (if appropriate)
    
    Questions:
    Do the license files need an update? No
    Are there breaking changes for older versions? No
    Does this need documentation? No
    Author: Greg Senia gsenia@apache.org

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gss2002/zeppelin ZEPPELIN-3356

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zeppelin/pull/2886.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2886
    
----
commit dc190e5979ffaca2ae36cdbc5a171624ce5868d5
Author: gss2002 <gr...@...>
Date:   2018-03-21T16:33:34Z

    ZEPPELIN-3356: Zeppelin FileSystem Storage reloginFromKeytab needed

----


---

[GitHub] zeppelin issue #2886: ZEPPELIN-3356: Zeppelin FileSystemStorage reloginFromK...

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on the issue:

    https://github.com/apache/zeppelin/pull/2886
  
    @gss2002 I could not reproduce the issue. I changed krb5.conf as follows:
    ```
      renew_lifetime = 7min
      ticket_lifetime = 3min
    ```
    
    And after one day, Zeppelin still works properly.


---

[GitHub] zeppelin issue #2886: ZEPPELIN-3356: Zeppelin FileSystemStorage reloginFromK...

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on the issue:

    https://github.com/apache/zeppelin/pull/2886
  
    @gss2002 Do you mean I have to change /var/kerberos/krb5kdc/kdc.conf to reproduce this issue?


---

[GitHub] zeppelin issue #2886: ZEPPELIN-3356: Zeppelin FileSystemStorage reloginFromK...

Posted by gss2002 <gi...@git.apache.org>.
Github user gss2002 commented on the issue:

    https://github.com/apache/zeppelin/pull/2886
  
    @prabhjyotsingh @zjffdu could you help review whether you feel this is a valid fix?
    Thanks again


---

[GitHub] zeppelin issue #2886: ZEPPELIN-3356: Zeppelin FileSystemStorage reloginFromK...

Posted by gss2002 <gi...@git.apache.org>.
Github user gss2002 commented on the issue:

    https://github.com/apache/zeppelin/pull/2886
  
    @prabhjyotsingh I just read that same Stack Overflow post. Part of me says to use checkTGTAndReloginFromKeytab to be lighter on the KDC. Thoughts? I will dig a bit deeper in the morning, but the auto-renewal thread that exists in UGI cannot go beyond the maximum renewal time.
    @felixcheung I think you are right: calling UserGroupInformation.getCurrentUser().checkTGTAndReloginFromKeytab() would work too.
    
    private void reloginFromKeytab(boolean checkTGT) throws IOException {
      if (!shouldRelogin() || !isFromKeytab()) {
        return;
      }
      HadoopLoginContext login = getLogin();
      if (login == null) {
        throw new KerberosAuthException(MUST_FIRST_LOGIN_FROM_KEYTAB);
      }
      if (checkTGT) {
        KerberosTicket tgt = getTGT();
        if (tgt != null && !shouldRenewImmediatelyForTests &&
            Time.now() < getRefreshTime(tgt)) {
          return;
        }
      }
      relogin(login);
    }
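
The checkTGT branch in the method above skips the relogin while the current time is still before the ticket's refresh time. A minimal sketch of that gate in plain Java, with no Hadoop dependency, might look like the following. The class name TgtRefreshSketch, its method names, and the 80% refresh window are illustrative assumptions; the 80% figure reflects how Hadoop's getRefreshTime is commonly understood to work and should be checked against the Hadoop version in use.

```java
// Sketch only: models the checkTGT gate without any Hadoop dependency.
// RENEW_WINDOW_PERCENT = 80 is an assumption about Hadoop's refresh window.
final class TgtRefreshSketch {
  static final long RENEW_WINDOW_PERCENT = 80;

  /** Time (epoch millis) after which a relogin should be attempted. */
  static long refreshTimeMillis(long startMillis, long endMillis) {
    return startMillis + (endMillis - startMillis) * RENEW_WINDOW_PERCENT / 100;
  }

  /** True when "now" has reached the refresh time, so the relogin should proceed. */
  static boolean shouldRelogin(long nowMillis, long startMillis, long endMillis) {
    return nowMillis >= refreshTimeMillis(startMillis, endMillis);
  }

  public static void main(String[] args) {
    long hour = 60L * 60L * 1000L;
    long start = 0L;
    long end = 10 * hour; // a ticket valid for 10 hours
    System.out.println(shouldRelogin(7 * hour, start, end)); // before the refresh time
    System.out.println(shouldRelogin(9 * hour, start, end)); // past the refresh time
  }
}
```

Because the check is a pure comparison of timestamps, it is cheap enough to run before every storage operation; only the calls that fall past the refresh time would go on to contact the KDC.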


---

[GitHub] zeppelin issue #2886: ZEPPELIN-3356: Zeppelin FileSystemStorage reloginFromK...

Posted by gss2002 <gi...@git.apache.org>.
Github user gss2002 commented on the issue:

    https://github.com/apache/zeppelin/pull/2886
  
    https://stackoverflow.com/questions/38555244/how-do-you-set-the-kerberos-ticket-lifetime-from-java
    https://bugs.openjdk.java.net/browse/JDK-8044500


---

[GitHub] zeppelin pull request #2886: ZEPPELIN-3356: Zeppelin FileSystemStorage relog...

Posted by gss2002 <gi...@git.apache.org>.
Github user gss2002 closed the pull request at:

    https://github.com/apache/zeppelin/pull/2886


---

[GitHub] zeppelin issue #2886: ZEPPELIN-3356: Zeppelin FileSystemStorage reloginFromK...

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on the issue:

    https://github.com/apache/zeppelin/pull/2886
  
    Thanks @gss2002, I will review it soon.


---

[GitHub] zeppelin issue #2886: ZEPPELIN-3356: Zeppelin FileSystemStorage reloginFromK...

Posted by gss2002 <gi...@git.apache.org>.
Github user gss2002 commented on the issue:

    https://github.com/apache/zeppelin/pull/2886
  
    @zjffdu you cannot just update krb5.conf; those settings are only recommendations on the client side. With both MIT Kerberos and Active Directory, the KDC controls the maximum renewable lifetime: via max_renewable_life in /var/kerberos/krb5kdc/kdc.conf for MIT, and via settings in the Windows registry for Active Directory. My co-worker and I tested this today, and the ticket is still renewable because the KDC controls the maximum time, and it looks as if Java takes that information from the KDC. With the CLI tools (kinit/klist and hadoop fs), the ticket expires. But from the looks of it, when logging in with a keytab via UGI, which Zeppelin does for its HDFS calls, Java takes the settings from the KDC.
    
    See below:
    JDK - KRB5 DEBUG OUTPUT from Zeppelin JVM:
     
    Native config name: /etc/krb5.conf
    Loaded from native config
    >>> KdcAccessibility: reset
    >>> KdcAccessibility: reset
    >>> KeyTabInputStream, readName(): UNIT.HDP.EXAMPLE.COM
    >>> KeyTabInputStream, readName(): zeppelin-unit
    >>> KeyTab: load() entry length: 88; type: 18
    >>> KeyTabInputStream, readName(): UNIT.HDP.EXAMPLE.COM
    >>> KeyTabInputStream, readName(): zeppelin-unit
    >>> KeyTab: load() entry length: 72; type: 17
    >>> KeyTabInputStream, readName(): UNIT.HDP.EXAMPLE.COM
    >>> KeyTabInputStream, readName(): zeppelin-unit
    >>> KeyTab: load() entry length: 72; type: 23
    Looking for keys for: zeppelin-unit@UNIT.HDP.EXAMPLE.COM
    Added key: 23version: 2
    Added key: 17version: 2
    Added key: 18version: 2
    Looking for keys for: zeppelin-unit@UNIT.HDP.EXAMPLE.COM
    Added key: 23version: 2
    Added key: 17version: 2
    Added key: 18version: 2
    Using builtin default etypes for default_tkt_enctypes
    default etypes for default_tkt_enctypes: 18 17 16 23.
    >>> KrbAsReq creating message
    >>> KrbKdcReq send: kdc=ha21d51kd.unit.hdp.example.com TCP:88, timeout=30000, number of retries =3, #bytes=174
    >>> KDCCommunication: kdc=ha21d51kd.unit.hdp.example.com TCP:88, timeout=30000,Attempt =1, #bytes=174
    >>>DEBUG: TCPClient reading 769 bytes
    >>> KrbKdcReq send: #bytes read=769
    >>> KdcAccessibility: remove ha21d51kd.unit.hdp.example.com
    Looking for keys for: zeppelin-unit@UNIT.HDP.EXAMPLE.COM
    Added key: 23version: 2
    Added key: 17version: 2
    Added key: 18version: 2
    >>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
    >>> KrbAsRep cons in KrbAsReq.getReply zeppelin-unit
    Found ticket for zeppelin-unit@UNIT.HDP.EXAMPLE.COM to go to krbtgt/UNIT.HDP.EXAMPLE.COM@UNIT.HDP.EXAMPLE.COM expiring on Wed Mar 28 23:28:46 EDT 2018
    Entered Krb5Context.initSecContext with state=STATE_NEW
    Found ticket for zeppelin-unit@UNIT.HDP.EXAMPLE.COM to go to krbtgt/UNIT.HDP.EXAMPLE.COM@UNIT.HDP.EXAMPLE.COM expiring on Wed Mar 28 23:28:46 EDT 2018
    Service ticket not found in the subject
    >>> Credentials acquireServiceCreds: same realm
    Using builtin default etypes for default_tgs_enctypes
    default etypes for default_tgs_enctypes: 18 17 16 23.
     
     
    The following shows that Zeppelin was started after /etc/krb5.conf was modified to ticket_lifetime = 2m and renew_lifetime = 5m:
     
    [root@ha21d55en zeppelin]# ps guaxww | grep -i zeppelin
    zeppelin  89982  2.4  3.6 6872888 601888 ?      Sl   13:28   0:30 /usr/jdk64/jdk1.8.0_102/bin/java -Dsun.security.krb5.debug=true -Dhdp.version=2.5.3.18-5 -Dspark.executor.memory=512m -Dspark.yarn.queue=default -Dfile.encoding=UTF-8 -Xms1024m -Xmx1024m -XX:MaxPermSize=512m -Dlog4j.configuration=file:///usr/local/zeppelin/current/conf/log4j.properties -Dzeppelin.log.file=/var/log/zeppelin/zeppelin-zeppelin-ha21d55en.unit.hdp.example.com.log -cp ::/usr/local/zeppelin/current/lib/interpreter/*:/usr/local/zeppelin/current/lib/*:/usr/local/zeppelin/current/*::/usr/local/zeppelin/current/conf:/etc/hadoop/conf org.apache.zeppelin.server.ZeppelinServer
    zeppelin  90439  0.0  0.0 113124  1524 ?        S    13:30   0:00 /bin/bash /usr/local/zeppelin/current/bin/interpreter.sh -d /usr/local/zeppelin/current/interpreter/livy -c 10.70.57.5 -p 41478 -r : -l /usr/local/zeppelin/current/local-repo/livy1 -g livy1
    zeppelin  90454  0.0  0.0 113120   836 ?        S    13:30   0:00 /bin/bash /usr/local/zeppelin/current/bin/interpreter.sh -d /usr/local/zeppelin/current/interpreter/livy -c 10.70.57.5 -p 41478 -r : -l /usr/local/zeppelin/current/local-repo/livy1 -g livy1
    zeppelin  90455  0.3  1.3 5198944 214228 ?      Sl   13:30   0:04 /usr/jdk64/jdk1.8.0_102/bin/java -Dfile.encoding=UTF-8 -Dlog4j.configuration=file:///usr/local/zeppelin/current/conf/log4j.properties -Dzeppelin.log.file=/var/log/zeppelin/zeppelin-interpreter-livy1-zeppelin-ha21d55en.unit.hdp.example.com.log -Xms1024m -Xmx1024m -XX:MaxPermSize=512m -cp :/usr/local/zeppelin/current/interpreter/livy/*:/usr/local/zeppelin/current/lib/interpreter/*: org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer 10.70.57.5 41478 :
    zeppelin  91409  0.0  0.0 113124  1528 ?        S    13:35   0:00 /bin/bash /usr/local/zeppelin/current/bin/interpreter.sh -d /usr/local/zeppelin/current/interpreter/livy -c 10.70.57.5 -p 46276 -r : -l /usr/local/zeppelin/current/local-repo/livy -g livy
    zeppelin  91424  0.0  0.0 113120   836 ?        S    13:35   0:00 /bin/bash /usr/local/zeppelin/current/bin/interpreter.sh -d /usr/local/zeppelin/current/interpreter/livy -c 10.70.57.5 -p 46276 -r : -l /usr/local/zeppelin/current/local-repo/livy -g livy
    zeppelin  91425  0.3  1.0 4400176 167268 ?      Sl   13:35   0:03 /usr/jdk64/jdk1.8.0_102/bin/java -Dfile.encoding=UTF-8 -Dlog4j.configuration=file:///usr/local/zeppelin/current/conf/log4j.properties -Dzeppelin.log.file=/var/log/zeppelin/zeppelin-interpreter-livy-zeppelin-ha21d55en.unit.hdp.example.com.log -Xms1024m -Xmx1024m -XX:MaxPermSize=512m -cp :/usr/local/zeppelin/current/interpreter/livy/*:/usr/local/zeppelin/current/lib/interpreter/*: org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer 10.70.57.5 46276 :
    root      93403  0.0  0.0 112652   992 pts/0    S+   13:49   0:00 grep --color=auto -i zeppelin
    [root@ha21d55en zeppelin]# ls -latr /etc/krb5.conf
    -r--r--r--. 1 root root 6048 Mar 28 13:27 /etc/krb5.conf
    [root@ha21d55en zeppelin]# cat /etc/krb5.conf
    [logging]
        default = FILE:/var/log/krb5libs.log
        admin_server = FILE:/var/log/kadmind.log
        kdc = FILE:/var/log/krb5kdc.log
     
    [libdefaults]
        default_realm = UNIT.HDP.EXAMPLE.COM
        dns_lookup_kdc = true
        dns_lookup_realm = true
        udp_preference_limit = 1
        ticket_lifetime = 2m
        renew_lifetime = 5m
        forwardable = true
        canonicalize = false
        rdns = false
     


---

[GitHub] zeppelin issue #2886: ZEPPELIN-3356: Zeppelin FileSystemStorage reloginFromK...

Posted by gss2002 <gi...@git.apache.org>.
Github user gss2002 commented on the issue:

    https://github.com/apache/zeppelin/pull/2886
  
    @zjffdu I will test ASAP. Thanks for the help with this one. I'll grab the latest commit and let you know!



---

[GitHub] zeppelin issue #2886: ZEPPELIN-3356: Zeppelin FileSystemStorage reloginFromK...

Posted by gss2002 <gi...@git.apache.org>.
Github user gss2002 commented on the issue:

    https://github.com/apache/zeppelin/pull/2886
  
    @zjffdu I am going to cut a new, improved fix based on the original feedback. But yes, you will have to adjust the KDC to test this, because Java does not use ticket_lifetime or renew_lifetime from krb5.conf. Per the links below, this was not fixed until Java 9.
    
    https://stackoverflow.com/questions/38555244/how-do-you-set-the-kerberos-ticket-lifetime-from-java
    https://bugs.openjdk.java.net/browse/JDK-8044500
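
The reasoning in this comment can be illustrated with a deliberately simplified model of who decides the ticket lifetime. The sketch below is not Kerberos or JDK code: the class TicketLifetimeSketch and its granted method are hypothetical, and the rule that a client sending no request receives the KDC-side maximum is an assumption made for illustration.

```java
import java.time.Duration;
import java.util.Optional;

// Sketch only: a simplified model of who decides the ticket lifetime.
final class TicketLifetimeSketch {
  /**
   * If the client asks for a lifetime, the KDC grants at most its own maximum;
   * if the client asks for nothing, the KDC-side maximum applies (assumed here).
   */
  static Duration granted(Optional<Duration> requestedByClient, Duration kdcMaximum) {
    return requestedByClient
        .filter(requested -> requested.compareTo(kdcMaximum) < 0)
        .orElse(kdcMaximum);
  }

  public static void main(String[] args) {
    Duration kdcMax = Duration.ofDays(7);
    // A client that honors krb5.conf (for example kinit) requesting 5 minutes:
    System.out.println(granted(Optional.of(Duration.ofMinutes(5)), kdcMax));
    // A client that sends no request (the pre-Java 9 behavior described above):
    System.out.println(granted(Optional.empty(), kdcMax));
  }
}
```

Under this model, shortening the lifetimes in krb5.conf changes the result only for clients that actually send the request, which is why the test has to shorten the KDC-side maximum instead.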


---

[GitHub] zeppelin issue #2886: ZEPPELIN-3356: Zeppelin FileSystemStorage reloginFromK...

Posted by gss2002 <gi...@git.apache.org>.
Github user gss2002 commented on the issue:

    https://github.com/apache/zeppelin/pull/2886
  
    @prabhjyotsingh @zjffdu I made changes to check whether security is enabled and whether the login came from a keytab; if so, I relogin with the checkTGT method instead of relogging in every time, which would cause excess load on the KDC.
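
The load argument in this comment can be made concrete with a small simulation. The sketch below is hypothetical and is not the code in this pull request: ReloginGuardSketch, its injected clock, and the 80% refresh window are all assumptions chosen for illustration. It counts how often a simulated KDC would be contacted when a check runs before every storage operation.

```java
import java.util.function.LongSupplier;

// Sketch only: a hypothetical guard, not Zeppelin's or Hadoop's actual code.
final class ReloginGuardSketch {
  private final LongSupplier clock;        // supplies "now" in millis (injectable for tests)
  private final long ticketLifetimeMillis; // lifetime of each simulated ticket
  private long ticketStartMillis;          // when the current simulated ticket was issued
  private int kdcLogins;                   // how many times the simulated KDC was contacted

  ReloginGuardSketch(LongSupplier clock, long ticketLifetimeMillis) {
    this.clock = clock;
    this.ticketLifetimeMillis = ticketLifetimeMillis;
    this.ticketStartMillis = clock.getAsLong();
    this.kdcLogins = 1;                    // the initial keytab login
  }

  /** Call before every storage operation; contacts the KDC only when needed. */
  void beforeOperation() {
    long now = clock.getAsLong();
    long refreshAt = ticketStartMillis + ticketLifetimeMillis * 80 / 100; // assumed 80% window
    if (now < refreshAt) {
      return;                              // ticket still fresh: no KDC traffic
    }
    ticketStartMillis = now;               // simulated relogin from keytab
    kdcLogins++;
  }

  int kdcLogins() {
    return kdcLogins;
  }

  public static void main(String[] args) {
    long[] now = {0L};
    ReloginGuardSketch guard = new ReloginGuardSketch(() -> now[0], 1000L);
    for (int i = 0; i < 100; i++) {
      guard.beforeOperation();             // 100 operations while the ticket is fresh
    }
    System.out.println(guard.kdcLogins());
    now[0] = 900L;                         // move past the refresh point
    guard.beforeOperation();
    System.out.println(guard.kdcLogins());
  }
}
```

An unconditional relogin before each operation would have contacted the KDC once per call; with the time check, the count grows only when a ticket nears the end of its lifetime.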


---

[GitHub] zeppelin issue #2886: ZEPPELIN-3356: Zeppelin FileSystemStorage reloginFromK...

Posted by gss2002 <gi...@git.apache.org>.
Github user gss2002 commented on the issue:

    https://github.com/apache/zeppelin/pull/2886
  
    @zjffdu here is a patch that I think will fix this issue. I will know in 7 days whether the issue comes back; it has plagued our 4 different environments running Zeppelin over the last few days, since their tickets reached the maximum renewal time. Let me know your thoughts on this patch. Also, the CI failures look to be unrelated.


---

[GitHub] zeppelin issue #2886: ZEPPELIN-3356: Zeppelin FileSystemStorage reloginFromK...

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on the issue:

    https://github.com/apache/zeppelin/pull/2886
  
    @gss2002 I just found that the root cause is that UserGroupInformation.loginUserFromKeytab is called multiple times, and I created PR #2924 to fix it. I have verified the fix; could you help verify it if you have time?
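
One common way to make sure a login routine runs only once per process is a one-time guard. The sketch below shows that general technique in plain Java; it is not the change made in PR #2924, and the class OneTimeLoginSketch and its loginOnce method are hypothetical names introduced for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only: a hypothetical one-time guard, not the code from PR #2924.
final class OneTimeLoginSketch {
  private final AtomicBoolean loggedIn = new AtomicBoolean(false);

  /**
   * Runs the login action at most once, even when called from several places
   * or threads. Returns true only for the call that performed the login.
   */
  boolean loginOnce(Runnable loginAction) {
    if (!loggedIn.compareAndSet(false, true)) {
      return false;                        // a login was already started; do nothing
    }
    loginAction.run();                     // in a real server: the single keytab login
    return true;
  }

  public static void main(String[] args) {
    OneTimeLoginSketch guard = new OneTimeLoginSketch();
    int[] logins = {0};
    for (int i = 0; i < 3; i++) {
      guard.loginOnce(() -> logins[0]++);
    }
    System.out.println(logins[0]);
  }
}
```

This guard only prevents repeated logins. It does not make later callers wait for the first login to finish, and it does not retry if the login action throws, so a real server would need to decide how to handle both cases.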


---