Posted to reviews@spark.apache.org by vanzin <gi...@git.apache.org> on 2017/10/17 00:04:19 UTC

[GitHub] spark pull request #19509: [SPARK-22290][core] Avoid creating Hive delegatio...

GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/19509

    [SPARK-22290][core] Avoid creating Hive delegation tokens when not necessary.

    Hive delegation tokens are only needed when the Spark driver has no access
    to the Kerberos TGT. That happens only in two situations:
    
    - when using a proxy user
    - when using cluster mode without a keytab
    
    This change modifies the Hive provider so that it only generates delegation
    tokens in those situations, and tweaks the YARN AM so that it makes the proper
    user visible to the Hive code when running with keytabs, so that the TGT
    can be used instead of a delegation token.
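
    As a rough, hedged sketch of that rule (the object, method, and parameter
    names below are hypothetical illustrations, not the code added by this
    patch):

        object HiveTokenRule {
          // A Hive delegation token is only required when the driver cannot
          // present a TGT itself: when running as a proxy user, or in cluster
          // mode without a keytab.
          def delegationTokenNeeded(
              isProxyUser: Boolean,
              isClusterMode: Boolean,
              hasKeytab: Boolean): Boolean = {
            isProxyUser || (isClusterMode && !hasKeytab)
          }
        }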
    
    The effect of this change is that now it's possible to initialize multiple,
    non-concurrent SparkContext instances in the same JVM. Before, the second
    invocation would fail to fetch a new Hive delegation token, which then could
    make the second (or third or...) application fail once the token expired.
    With this change, the TGT will be used to authenticate to the HMS instead.
    
    This change also avoids polluting the currently logged-in user's credentials
    when launching applications. The credentials are copied only when running
    applications as a proxy user. This makes it possible to implement SPARK-11035
    later, where multiple threads might be launching applications, and each app
    should have its own set of credentials.
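
    As a hedged illustration of that proxy-user path (the variable names and the
    literal user name are placeholders, not the patch's actual code), the idea
    is to hand the proxy user a copy of the credentials so the login user's own
    set is never modified:

        import org.apache.hadoop.security.{Credentials, UserGroupInformation}

        val realUser = UserGroupInformation.getCurrentUser
        // Copy, not a shared reference, so tokens added later do not leak back.
        val creds = new Credentials(realUser.getCredentials)
        val proxyUgi = UserGroupInformation.createProxyUser("someProxyUser", realUser)
        proxyUgi.addCredentials(creds)
        // Tokens fetched for the launched application are then added to
        // proxyUgi, leaving realUser's credentials untouched.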
    
    Tested by verifying HDFS and Hive access in the following scenarios:
    - client and cluster mode
    - client and cluster mode with proxy user
    - client and cluster mode with principal / keytab
    - long-running cluster app with principal / keytab
    - pyspark app that creates (and stops) multiple SparkContext instances
      through its lifetime


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-22290

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19509.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19509
    
----
commit 95a9658043c86187cd9143923d0c1307df449004
Author: Marcelo Vanzin <va...@cloudera.com>
Date:   2017-10-16T22:28:58Z

    [SPARK-22290][core] Avoid creating Hive delegation tokens when not necessary.
    
    Hive delegation tokens are only needed when the Spark driver has no access
    to the Kerberos TGT. That happens only in two situations:
    
    - when using a proxy user
    - when using cluster mode without a keytab
    
    This change modifies the Hive provider so that it only generates delegation
    tokens in those situations, and tweaks the YARN AM so that it makes the proper
    user visible to the Hive code when running with keytabs, so that the TGT
    can be used instead of a delegation token.
    
    The effect of this change is that now it's possible to initialize multiple,
    non-concurrent SparkContext instances in the same JVM. Before, the second
    invocation would fail to fetch a new Hive delegation token, which then could
    make the second (or third or...) application fail once the token expired.
    With this change, the TGT will be used to authenticate to the HMS instead.
    
    This change also avoids polluting the currently logged-in user's credentials
    when launching applications. The credentials are copied only when running
    applications as a proxy user. This makes it possible to implement SPARK-11035
    later, where multiple threads might be launching applications, and each app
    should have its own set of credentials.
    
    Tested by verifying HDFS and Hive access in the following scenarios:
    - client and cluster mode
    - client and cluster mode with proxy user
    - client and cluster mode with principal / keytab
    - long-running cluster app with principal / keytab
    - pyspark app that creates (and stops) multiple SparkContext instances
      through its lifetime

----


---



[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/19509
  
    LGTM, just one minor comment.


---



[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19509
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19509
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82888/
    Test PASSed.


---



[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19509
  
    **[Test build #82888 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82888/testReport)** for PR 19509 at commit [`94223be`](https://github.com/apache/spark/commit/94223beaeffa9793fc1529bafd8a65b4b3185d7a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/19509
  
    >The effect of this change is that now it's possible to initialize multiple,
    non-concurrent SparkContext instances in the same JVM.
    
    @vanzin, do we support it now? As I remember, it was not supported before.


---



[GitHub] spark pull request #19509: [SPARK-22290][core] Avoid creating Hive delegatio...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19509#discussion_r145484282
  
    --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala ---
    @@ -347,6 +347,10 @@ package object config {
         .timeConf(TimeUnit.MILLISECONDS)
         .createWithDefault(Long.MaxValue)
     
    +  private[spark] val KERBEROS_RELOGIN_PERIOD = ConfigBuilder("spark.yarn.kerberos.relogin.period")
    +    .timeConf(TimeUnit.SECONDS)
    +    .createWithDefaultString("1m")
    --- End diff --
    
    The call to `checkTGTAndReloginFromKeytab` is a no-op if the TGT is still valid, so it's ok to call it often.
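
    For illustration, a minimal sketch of what calling it on this period could
    look like (the scheduling code below is illustrative, not the patch's
    actual AM code):

        import java.util.concurrent.{Executors, TimeUnit}
        import org.apache.hadoop.security.UserGroupInformation

        // Value of spark.yarn.kerberos.relogin.period (default "1m").
        val reloginPeriodSeconds = 60L
        val scheduler = Executors.newSingleThreadScheduledExecutor()
        scheduler.scheduleAtFixedRate(new Runnable {
          override def run(): Unit = {
            // Cheap when the TGT is still valid; relogins from the keytab
            // only when the ticket is close to expiring.
            UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab()
          }
        }, reloginPeriodSeconds, reloginPeriodSeconds, TimeUnit.SECONDS)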


---



[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/19509
  
    It's always been supported, as long as they're not running at the same time. The only thing is that it was kinda broken with Kerberos.
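
    For example, something like this has always worked as long as the contexts
    don't overlap (a minimal sketch; master and deploy mode are assumed to come
    from spark-submit rather than being set here):

        import org.apache.spark.{SparkConf, SparkContext}

        val sc1 = new SparkContext(new SparkConf().setAppName("first-app"))
        // ... run the first application ...
        sc1.stop()

        val sc2 = new SparkContext(new SparkConf().setAppName("second-app"))
        // Before this change, this second context could be stuck with an
        // expired Hive delegation token; now it authenticates with the TGT.
        sc2.stop()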


---



[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/19509
  
    I see, thanks for the explanation. I didn't think about such a scenario.


---



[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19509
  
    **[Test build #82825 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82825/testReport)** for PR 19509 at commit [`95a9658`](https://github.com/apache/spark/commit/95a9658043c86187cd9143923d0c1307df449004).


---



[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19509
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19509: [SPARK-22290][core] Avoid creating Hive delegatio...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19509


---



[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/19509
  
    LGTM, merging to master.


---



[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19509
  
    **[Test build #82888 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82888/testReport)** for PR 19509 at commit [`94223be`](https://github.com/apache/spark/commit/94223beaeffa9793fc1529bafd8a65b4b3185d7a).


---



[GitHub] spark pull request #19509: [SPARK-22290][core] Avoid creating Hive delegatio...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19509#discussion_r145329972
  
    --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala ---
    @@ -347,6 +347,10 @@ package object config {
         .timeConf(TimeUnit.MILLISECONDS)
         .createWithDefault(Long.MaxValue)
     
    +  private[spark] val KERBEROS_RELOGIN_PERIOD = ConfigBuilder("spark.yarn.kerberos.relogin.period")
    +    .timeConf(TimeUnit.SECONDS)
    +    .createWithDefaultString("1m")
    --- End diff --
    
    I think we should put this in the docs. Also, is it too frequent to call it every minute?


---



[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19509
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82825/
    Test PASSed.


---



[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19509
  
    **[Test build #82825 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82825/testReport)** for PR 19509 at commit [`95a9658`](https://github.com/apache/spark/commit/95a9658043c86187cd9143923d0c1307df449004).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org