Posted to issues@spark.apache.org by "Wei-Chiu Chuang (Jira)" <ji...@apache.org> on 2021/11/15 07:46:00 UTC

[jira] [Created] (SPARK-37329) File system delegation tokens are leaked

Wei-Chiu Chuang created SPARK-37329:
---------------------------------------

             Summary: File system delegation tokens are leaked
                 Key: SPARK-37329
                 URL: https://issues.apache.org/jira/browse/SPARK-37329
             Project: Spark
          Issue Type: Bug
          Components: Security, YARN
    Affects Versions: 2.4.0
            Reporter: Wei-Chiu Chuang


On a very busy Hadoop cluster (with HDFS at-rest encryption), we found that KMS accumulated millions of delegation tokens that were not cancelled even after the jobs had finished, and KMS ran out of memory within a day because of the delegation token leak.

We were able to reproduce the bug in a smaller test cluster and found that when a Spark job starts, it acquires two delegation tokens, and only one is cancelled properly after the job finishes. The other one is left over and lingers for up to 7 days (the default Hadoop delegation token lifetime).

YARN handles the lifecycle of a delegation token properly if its renewer is 'yarn'. However, Spark intentionally (a hack?) acquires a second delegation token with the job issuer as the renewer, simply to get the token renewal interval. The token is then ignored but not cancelled.

Proposal: cancel the second delegation token immediately after the token renewal interval is obtained.
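
To make the leak and the proposed fix concrete, here is a minimal, self-contained simulation. It is only a sketch and does not use the real Spark or Hadoop code: `TokenService`, `run_job`, and the `cancel_probe` flag are hypothetical names, and the 24-hour renewal interval is an assumed value chosen for illustration. It models a job that obtains one token for YARN and a second "probe" token used only to learn the renewal interval.

```python
import itertools


class TokenService:
    """Hypothetical stand-in for a token issuer such as KMS."""

    def __init__(self, renew_interval_ms):
        self.renew_interval_ms = renew_interval_ms
        self.live = {}  # token id -> (renewer, issue time in ms)
        self._ids = itertools.count(1)

    def issue(self, renewer, now_ms):
        token_id = next(self._ids)
        self.live[token_id] = (renewer, now_ms)
        return token_id

    def renew(self, token_id, now_ms):
        # Returns the new expiration time of the token.
        if token_id not in self.live:
            raise KeyError("unknown or cancelled token")
        return now_ms + self.renew_interval_ms

    def cancel(self, token_id):
        self.live.pop(token_id, None)


def run_job(service, user, now_ms, cancel_probe):
    """Simulate one job; return the renewal interval it discovered."""
    # Token whose renewer is 'yarn': YARN manages its lifecycle.
    yarn_token = service.issue("yarn", now_ms)
    # Second token, with the user as renewer, fetched only to
    # discover the renewal interval.
    probe = service.issue(user, now_ms)
    issue_ms = service.live[probe][1]
    try:
        interval = service.renew(probe, now_ms) - issue_ms
    finally:
        if cancel_probe:
            # Proposed fix: cancel the probe token as soon as the
            # interval is known, even if renew() failed.
            service.cancel(probe)
    # Job finishes: YARN cancels the token it is responsible for.
    service.cancel(yarn_token)
    return interval


if __name__ == "__main__":
    leaky = TokenService(renew_interval_ms=24 * 60 * 60 * 1000)
    for _ in range(1000):
        run_job(leaky, "alice", now_ms=0, cancel_probe=False)
    print("tokens left without the fix:", len(leaky.live))

    fixed = TokenService(renew_interval_ms=24 * 60 * 60 * 1000)
    for _ in range(1000):
        run_job(fixed, "alice", now_ms=0, cancel_probe=True)
    print("tokens left with the fix:", len(fixed.live))
```

In this model, every job run without the fix leaves exactly one token behind, which matches the observed behaviour of one leaked token per job. Cancelling in a `finally` block is a design choice of the sketch so that a failed renewal does not leak the probe token either.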

Environment: CDH 6.3.2 (based on Apache Spark 2.4.0), but the bug has probably been present since day 1.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
