Posted to dev@twill.apache.org by "Alvin Wang (JIRA)" <ji...@apache.org> on 2014/11/04 02:26:33 UTC

[jira] [Commented] (TWILL-106) HDFS delegation token is not being refreshed properly

    [ https://issues.apache.org/jira/browse/TWILL-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195525#comment-14195525 ] 

Alvin Wang commented on TWILL-106:
----------------------------------

Tested a simple Twill app that writes to an HDFS file every ~10 seconds and was able to reproduce this issue. According to UserGroupInformation.getCurrentUser().getTokens(), the HDFS delegation token is properly updated every 5 minutes, as expected (Twill schedules the update interval to be 5 minutes less than dfs.namenode.delegation.token.renew-interval).
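
For reference, the token check was along the lines of the sketch below (a minimal sketch, not the actual test app; the class and method names are illustrative only):

{code}
// Minimal sketch, not the actual test app: list the delegation tokens
// visible to the current UGI so the periodic updates show up in the logs.
import java.io.IOException;
import java.util.Collection;

import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

public class TokenDump {
  public static void dumpCurrentTokens() throws IOException {
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    Collection<Token<? extends TokenIdentifier>> tokens = ugi.getTokens();
    for (Token<? extends TokenIdentifier> token : tokens) {
      // Token#toString includes the kind, service and identifier bytes.
      System.out.println("Current token: " + token);
    }
  }
}
{code}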

* After running for ~12 hours, the Twill app prints "WARN  o.a.h.security.UserGroupInformation - Exception encountered while running the renewal command. Aborting renew thread. org.apache.hadoop.util.Shell$ExitCodeException: kinit: Ticket expired while renewing credentials". 
* After running for < 24 hours, the Twill app repeatedly prints "ERROR examples.HelloWorld - Error org.apache.hadoop.ipc.RemoteException: token (HDFS_DELEGATION_TOKEN token XX for yarn) is expired".
* After running for ~24 hours, the Twill app repeatedly prints "ERROR examples.HelloWorld - Error org.apache.hadoop.ipc.RemoteException: token (HDFS_DELEGATION_TOKEN token XX for yarn) can't be found in cache".

The first WARN ("Ticket expired while renewing credentials") is likely due to the Kerberos ticket's maximum renewable life being 0. Hadoop spawns a Kerberos ticket renewal thread (UserGroupInformation.spawnAutoRenewalThreadForUserCreds()) that renews the Kerberos ticket via "kinit -R" and then calls reloginFromTicketCache(). I tried kinit with the relevant principal/keytab myself and could not renew via "kinit -R", apparently because the maximum renewable life is 0.
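
That manual check was roughly the following (the keytab path is a placeholder, and the exact error text may vary with the Kerberos client version):

{code}
# Placeholder keytab path; substitute the actual yarn keytab for the host.
kinit -kt /etc/security/keytabs/yarn.keytab yarn/cdap-secure120-1000.dev.continuuity.net@CONTINUUITY.NET

# With "Maximum renewable life: 0 days", klist typically shows no "renew until"
# time, and an explicit renewal fails (here with "Ticket expired while renewing
# credentials").
klist
kinit -R
{code}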

With the same simple Twill app and a 7-day renewable life, the app runs for longer than 24 hours without hitting either the "expired" or the "can't be found in cache" error.
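
The exact commands used to raise the renewable life aren't shown here, but it amounts to something like the following in kadmin (both the TGT principal and the service principal need a non-zero maximum renewable life, and existing tickets have to be re-obtained with kinit afterwards):

{code}
kadmin.local:  modprinc -maxrenewlife "7 days" krbtgt/CONTINUUITY.NET@CONTINUUITY.NET
kadmin.local:  modprinc -maxrenewlife "7 days" yarn/cdap-secure120-1000.dev.continuuity.net@CONTINUUITY.NET
{code}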

Cluster configuration:
{code}
HDP 2.0
Hadoop 2.2.0.2.0.11.0-1 (source with checksum 4e0bbc06297bf19ac5705dc7ffcdb)
dfs.namenode.delegation.key.update-interval: 86400000 (1 day, default)
dfs.namenode.delegation.token.max-lifetime: 604800000 (1 week, default)
dfs.namenode.delegation.token.renew-interval: 600000 (10 minutes)
{code}

kadmin.local:  getprinc krbtgt/CONTINUUITY.NET@CONTINUUITY.NET
{code}
Principal: krbtgt/CONTINUUITY.NET@CONTINUUITY.NET
Expiration date: [never]
Last password change: [never]
Password expiration date: [none]
Maximum ticket life: 1 day 00:00:00
Maximum renewable life: 0 days 00:00:00
Last modified: Sun Nov 02 08:14:12 UTC 2014 (root/admin@CONTINUUITY.NET)
Last successful authentication: [never]
Last failed authentication: [never]
Failed password attempts: 0
Number of keys: 4
Key: vno 1, aes256-cts-hmac-sha1-96, no salt
Key: vno 1, aes128-cts-hmac-sha1-96, no salt
Key: vno 1, des3-cbc-sha1, no salt
Key: vno 1, arcfour-hmac, no salt
MKey: vno 1
Attributes:
Policy: [none]
{code}

kadmin.local:  getprinc yarn/cdap-secure120-1000.dev.continuuity.net@CONTINUUITY.NET
{code}
Principal: yarn/cdap-secure120-1000.dev.continuuity.net@CONTINUUITY.NET
Expiration date: [never]
Last password change: Tue Sep 23 04:50:48 UTC 2014
Password expiration date: [none]
Maximum ticket life: 1 day 00:00:00
Maximum renewable life: 0 days 00:00:00
Last modified: Sun Nov 02 08:25:12 UTC 2014 (root/admin@CONTINUUITY.NET)
Last successful authentication: [never]
Last failed authentication: [never]
Failed password attempts: 0
Number of keys: 4
Key: vno 2, aes256-cts-hmac-sha1-96, no salt
Key: vno 2, aes128-cts-hmac-sha1-96, no salt
Key: vno 2, des3-cbc-sha1, no salt
Key: vno 2, arcfour-hmac, no salt
MKey: vno 1
Attributes:
Policy: [none]
{code}

/etc/krb5.conf
{code}
# Generated by Chef for cdap-secure120-1000.dev.continuuity.net
# Local modifications will be overwritten.
[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = CONTINUUITY.NET
 dns_lookup_realm = false
 dns_lookup_kdc = true
 forwardable = true
 renew_lifetime = 1825d
 ticket_lifetime = 24h

[realms]
 CONTINUUITY.NET = {
  kdc = test-kdc481-1000.dev.continuuity.net
  admin_server = test-kdc481-1000.dev.continuuity.net
 }

[domain_realm]
 continuuity.net = CONTINUUITY.NET
 .continuuity.net = CONTINUUITY.NET

[appdefaults]
 pam = {
   debug = false
   forwardable = true
   renew_lifetime = 1825d
   ticket_lifetime = 24h
   krb4_convert = false
 }
{code}

> HDFS delegation token is not being refreshed properly
> -----------------------------------------------------
>
>                 Key: TWILL-106
>                 URL: https://issues.apache.org/jira/browse/TWILL-106
>             Project: Apache Twill
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.4.0-incubating
>            Reporter: Poorna Chandra
>
> We have a Twill app that runs in a secure Hadoop cluster. The app starts up fine and runs for a day. I can see messages in the logs saying that the secure store was updated regularly. However, after a day I see exceptions that say "token (HDFS_DELEGATION_TOKEN token 4287 for yarn) can't be found in cache".
> Exception:
> -------------
> 2014-10-23T04:12:42,101Z ERROR c.c.t.TransactionManager [cdap-secure120-1000.dev.continuuity.net] [tx-snapshot] TransactionManager:abortService(TransactionManager.java:594) - Aborting transaction manager due to: Snapshot (timestamp 1414037562088) failed due to: token (HDFS_DELEGATION_TOKEN token 4287 for yarn) can't be found in cache
> org.apache.hadoop.ipc.RemoteException: token (HDFS_DELEGATION_TOKEN token 4287 for yarn) can't be found in cache
>         at org.apache.hadoop.ipc.Client.call(Client.java:1347)
> ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)