Posted to common-issues@hadoop.apache.org by "Daryn Sharp (JIRA)" <ji...@apache.org> on 2019/03/20 20:53:00 UTC

[jira] [Commented] (HADOOP-16130) Support delegation token operations in KMS Benchmark

    [ https://issues.apache.org/jira/browse/HADOOP-16130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797558#comment-16797558 ] 

Daryn Sharp commented on HADOOP-16130:
--------------------------------------

The problem was an accumulation of tokens that were never cancelled.  Jobs would erroneously set an option telling the RM to never cancel tokens after job completion.  The unfounded worry was that tokens would be prematurely cancelled if a job launched sub-jobs and exited before the sub-jobs completed.  Many years ago I added reference counting to tokens to avoid that very problem.
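
For illustration only, the reference-counting idea can be sketched roughly as follows. This is a toy model under stated assumptions, not the RM's actual code: the class name TokenRefCounter, its methods, and the cancel_fn callback are all hypothetical.

```python
class TokenRefCounter:
    """Toy registry: cancel a token only when no running application still uses it."""

    def __init__(self, cancel_fn):
        # cancel_fn is a hypothetical callback that performs the real cancellation.
        self._cancel_fn = cancel_fn
        # Maps token id -> set of application ids that still reference the token.
        self._users = {}

    def register(self, token_id, app_id):
        """Record that app_id (a job or one of its sub-jobs) uses token_id."""
        self._users.setdefault(token_id, set()).add(app_id)

    def app_finished(self, app_id):
        """Drop app_id's references; cancel and return tokens left unreferenced."""
        cancelled = []
        for token_id in list(self._users):
            apps = self._users[token_id]
            apps.discard(app_id)
            if not apps:
                del self._users[token_id]
                self._cancel_fn(token_id)
                cancelled.append(token_id)
        return cancelled
```

In this model a parent job that exits while a sub-job is still registered does not trigger cancellation; the token is cancelled only when the last registered application finishes, so nothing is left behind to accumulate.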

The Curator child recipes watch for and fetch/cache new secrets & tokens.  As the number of uncancelled tokens grew, so did the number of node watches and the size of the node listings used to detect changes (we had to increase the response buffer!); ZK CPU load increased, quorum consistency suffered severe latency, etc.  The tipping point came when the propagation time for the quorum exceeded the time to: request a KMS token, submit the job, have the RM get a Kerberos TGS, and have the RM authenticate to the KMS.  Once the quorum is hundreds of milliseconds or more out of sync, one KMS rejects tokens issued by another KMS in the bank.
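
As a rough way to picture the tipping point, the condition can be written as a comparison between the quorum's propagation time and the sum of the client-side steps listed above. The function name and all latency values below are assumptions made up for illustration; they are not measurements from the incident.

```python
def token_may_be_rejected(propagation_ms, client_path_ms):
    """True if the token can reach another KMS before the quorum has propagated it."""
    return propagation_ms > sum(client_path_ms.values())


# Illustrative latencies in milliseconds (assumed values, not real measurements).
client_path_ms = {
    "request_kms_token": 20,
    "submit_job": 50,
    "rm_gets_kerberos_tgs": 30,
    "rm_authenticates_to_kms": 20,
}

# Quorum far behind: the RM can present the token to a KMS that has not seen it yet.
slow_quorum = token_may_be_rejected(300, client_path_ms)

# Quorum keeping up: the token is visible everywhere before the RM uses it.
healthy_quorum = token_may_be_rejected(50, client_path_ms)
```

With these assumed numbers the client-side path takes 120 ms, so a 300 ms propagation delay crosses the threshold while a 50 ms delay does not.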

That took 4 KMS servers and many hundreds of thousands of tokens.  The internal mitigation was completely disabling the RM's "don't cancel tokens" setting.

 

> Support delegation token operations in KMS Benchmark
> ----------------------------------------------------
>
>                 Key: HADOOP-16130
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16130
>             Project: Hadoop Common
>          Issue Type: Sub-task
>    Affects Versions: 3.3.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: George Huang
>            Priority: Major
>
> At the last Hadoop Contributors Meetup, [~daryn] shared that another KMS throughput bottleneck is ZooKeeper -- KMS uses ZK to store delegation tokens. ZK would be brought to a halt when expired delegation tokens are purged. That sounds critical, especially given that in most deployments KMS shares the same ZK quorum as HDFS, so it would cause NameNode failover.
> The current KMS benchmark does not support delegation token operations (addDelegationTokens, cancelDelegationToken, renewDelegationToken) so it's hard to understand how bad it is, and hard to quantify the improvement of a fix.
> Filing this jira to support those operations before we move on to the fix for the ZK issue.



