Posted to reviews@spark.apache.org by tgravescs <gi...@git.apache.org> on 2014/05/02 22:49:07 UTC

[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

GitHub user tgravescs opened a pull request:

    https://github.com/apache/spark/pull/621

    [WIP] SPARK-1676: Cache Hadoop UGIs by default to prevent FileSystem leak

    Move the doAs in Executor higher up so that we only have one UGI per process and aren't leaking FileSystem instances.
    Fix Spark on YARN to work when the cluster is running as user "yarn" but the clients are launched as the submitting user and want to read/write to HDFS as that user.

    Note this hasn't been fully tested yet. It still needs testing in standalone mode, verification that it doesn't leak FileSystems, and a look at the local mode backend. One specific thing I need to look at in standalone mode: I don't think SPARK_USER is set when the CoarseGrainedExecutorBackend runs the doAs, so that will have to change slightly, or we need to make sure it's set when this is called.

    Putting this up for people to look at and possibly test. I don't have access to a Mesos cluster.

    This is an alternative to https://github.com/apache/spark/pull/607
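
For context, the pattern the patch moves toward — creating one Hadoop UGI per executor process and wrapping all of the executor's work in a single doAs — looks roughly like the following sketch (not the PR's exact code; `func` stands in for the executor's main body, and the credential transfer mentioned later in the thread is omitted here):

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation

    def runAsUser(user: String)(func: () => Unit) {
      // Create the UGI once per process: creating a fresh UGI for every
      // Hadoop call leaks cached FileSystem instances (see HDFS-3545).
      val ugi = UserGroupInformation.createRemoteUser(user)
      ugi.doAs(new PrivilegedExceptionAction[Unit] {
        def run(): Unit = func()
      })
    }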

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tgravescs/spark SPARK-1676

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/621.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #621
    
----
commit 93988531d7ab4a4e902949744f80121593ba1f52
Author: Thomas Graves <tg...@apache.org>
Date:   2014-05-02T20:43:34Z

    change to have doAs in executor higher up.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/621#discussion_r12252315
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -24,25 +24,30 @@ import org.apache.hadoop.mapred.JobConf
     import org.apache.hadoop.security.Credentials
     import org.apache.hadoop.security.UserGroupInformation
     
    -import org.apache.spark.{SparkContext, SparkException}
    +import org.apache.spark.{Logging, SparkContext, SparkException}
     
     import scala.collection.JavaConversions._
     
     /**
      * Contains util methods to interact with Hadoop from Spark.
      */
    -class SparkHadoopUtil {
    +class SparkHadoopUtil extends Logging {
       val conf: Configuration = newConfiguration()
       UserGroupInformation.setConfiguration(conf)
     
    +  // IMPORTANT NOTE: If this function is going to be called repeated in the same process
    +  // you need to look https://issues.apache.org/jira/browse/HDFS-3545 and possibly
    +  // do a FileSystem.closeAllForUGI in order to avoid leaking Filesystems
       def runAsUser(user: String)(func: () => Unit) {
    --- End diff --
    
    Since the usage of this is now distributed to many places throughout Spark, can we add a comment for people who have no clue why it's there? Just something like "Runs the given function with a Hadoop UserGroupInformation as a thread local variable (distributed to child threads), used for authenticating HDFS and YARN calls." 
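
Applied to the method in question, the suggested documentation might look like this (a sketch only; the first sentence is taken verbatim from the comment above, and the note from the quoted diff):

    /**
     * Runs the given function with a Hadoop UserGroupInformation as a thread local
     * variable (distributed to child threads), used for authenticating HDFS and YARN calls.
     *
     * IMPORTANT NOTE: If this function is going to be called repeatedly in the same
     * process you need to look at https://issues.apache.org/jira/browse/HDFS-3545 and
     * possibly do a FileSystem.closeAllForUGI in order to avoid leaking FileSystems.
     */
    def runAsUser(user: String)(func: () => Unit) { ... }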



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42096413
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14631/



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42094123
  
    Merged build finished. 



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42078000
  
    Merged build finished. 



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42094099
  
    Merged build started. 



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42094848
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14630/



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/621#discussion_r12251357
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/MesosExecutorBackend.scala ---
    @@ -95,9 +95,13 @@ private[spark] class MesosExecutorBackend
      */
     private[spark] object MesosExecutorBackend {
       def main(args: Array[String]) {
    -    MesosNativeLibrary.load()
    -    // Create a new Executor and start it running
    -    val runner = new MesosExecutorBackend()
    -    new MesosExecutorDriver(runner).run()
    +    val sparkUser = Option(System.getenv("SPARK_USER")).getOrElse(SparkContext.SPARK_UNKNOWN_USER)
    --- End diff --
    
    We should probably add a Utils function for this, something like `Utils.getSparkUser`
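
Such a helper (hypothetical; named here only as the reviewer suggests) would centralize the env-var lookup currently duplicated across the executor backends:

    def getSparkUser: String =
      Option(System.getenv("SPARK_USER")).getOrElse(SparkContext.SPARK_UNKNOWN_USER)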



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42077867
  
     Merged build triggered. 



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42094847
  
    Merged build finished. 



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42095787
  
    Merged build started. 



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42096412
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/621#discussion_r12251478
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala ---
    @@ -94,25 +95,32 @@ private[spark] class CoarseGrainedExecutorBackend(
     
     private[spark] object CoarseGrainedExecutorBackend {
       def run(driverUrl: String, executorId: String, hostname: String, cores: Int,
    -          workerUrl: Option[String]) {
    -    // Debug code
    -    Utils.checkHost(hostname)
    -
    -    val conf = new SparkConf
    -    // Create a new ActorSystem to run the backend, because we can't create a SparkEnv / Executor
    -    // before getting started with all our system properties, etc
    -    val (actorSystem, boundPort) = AkkaUtils.createActorSystem("sparkExecutor", hostname, 0,
    -      indestructible = true, conf = conf, new SecurityManager(conf))
    -    // set it
    -    val sparkHostPort = hostname + ":" + boundPort
    -    actorSystem.actorOf(
    -      Props(classOf[CoarseGrainedExecutorBackend], driverUrl, executorId,
    -        sparkHostPort, cores),
    -      name = "Executor")
    -    workerUrl.foreach{ url =>
    -      actorSystem.actorOf(Props(classOf[WorkerWatcher], url), name = "WorkerWatcher")
    +    workerUrl: Option[String]) {
    +
    +    val sparkUser = Option(System.getenv("SPARK_USER")).getOrElse(SparkContext.SPARK_UNKNOWN_USER)
    --- End diff --
    
    We could potentially just put this resolution of SPARK_USER inside runAsUser (maybe calling it runAsSparkUser), to avoid duplicating this logic and the weird SPARK_UNKNOWN_USER value.
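
A sketch of that alternative, with the SPARK_USER resolution folded into the helper so callers never touch SPARK_UNKNOWN_USER (this illustrates the reviewer's proposal, not necessarily the code that was merged):

    def runAsSparkUser(func: () => Unit) {
      // Resolve the user once, here, instead of in every executor backend.
      val user = Option(System.getenv("SPARK_USER"))
        .getOrElse(SparkContext.SPARK_UNKNOWN_USER)
      runAsUser(user)(func)
    }

Each backend's main/run method could then simply wrap its body in runAsSparkUser { () => ... }.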



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42094271
  
     Merged build triggered. 



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42086088
  
    This seems pretty reasonable to me, but it assumes that there is no value in recreating the user and re-transferring the current user's credentials. Is this the case?



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/621#discussion_r12252251
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala ---
    @@ -94,25 +95,32 @@ private[spark] class CoarseGrainedExecutorBackend(
     
     private[spark] object CoarseGrainedExecutorBackend {
       def run(driverUrl: String, executorId: String, hostname: String, cores: Int,
    -          workerUrl: Option[String]) {
    -    // Debug code
    -    Utils.checkHost(hostname)
    -
    -    val conf = new SparkConf
    -    // Create a new ActorSystem to run the backend, because we can't create a SparkEnv / Executor
    -    // before getting started with all our system properties, etc
    -    val (actorSystem, boundPort) = AkkaUtils.createActorSystem("sparkExecutor", hostname, 0,
    -      indestructible = true, conf = conf, new SecurityManager(conf))
    -    // set it
    -    val sparkHostPort = hostname + ":" + boundPort
    -    actorSystem.actorOf(
    -      Props(classOf[CoarseGrainedExecutorBackend], driverUrl, executorId,
    -        sparkHostPort, cores),
    -      name = "Executor")
    -    workerUrl.foreach{ url =>
    -      actorSystem.actorOf(Props(classOf[WorkerWatcher], url), name = "WorkerWatcher")
    +    workerUrl: Option[String]) {
    +
    +    val sparkUser = Option(System.getenv("SPARK_USER")).getOrElse(SparkContext.SPARK_UNKNOWN_USER)
    --- End diff --
    
    Yes, that's probably better than having `Utils.getSparkUser`. Maybe have a `runAsSparkUser` that calls this with the user `Option(System.getenv(...))...`



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42077873
  
    Merged build started. 



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42092592
  
    There is no reason to recreate the user and repopulate the credentials/tokens unless the credentials/tokens are being updated in the ExecutorBackend process. On YARN this definitely doesn't happen: once you start an executor it keeps the same credentials/tokens, and the YARN ResourceManager handles renewing the tokens. As far as I know there isn't support for this built into Spark for Mesos and standalone, but perhaps there is something I'm not aware of. Is there anything you know of that does that which I might have missed? The only other case where it's useful to create a separate UGI is if we add support to run tasks as different users.

    Thanks for the comments and for doing the standalone testing. I'll update.




[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42094097
  
     Merged build triggered. 



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/621#discussion_r12252364
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala ---
    @@ -94,25 +95,32 @@ private[spark] class CoarseGrainedExecutorBackend(
     
     private[spark] object CoarseGrainedExecutorBackend {
       def run(driverUrl: String, executorId: String, hostname: String, cores: Int,
    -          workerUrl: Option[String]) {
    -    // Debug code
    -    Utils.checkHost(hostname)
    -
    -    val conf = new SparkConf
    -    // Create a new ActorSystem to run the backend, because we can't create a SparkEnv / Executor
    -    // before getting started with all our system properties, etc
    -    val (actorSystem, boundPort) = AkkaUtils.createActorSystem("sparkExecutor", hostname, 0,
    -      indestructible = true, conf = conf, new SecurityManager(conf))
    -    // set it
    -    val sparkHostPort = hostname + ":" + boundPort
    -    actorSystem.actorOf(
    -      Props(classOf[CoarseGrainedExecutorBackend], driverUrl, executorId,
    -        sparkHostPort, cores),
    -      name = "Executor")
    -    workerUrl.foreach{ url =>
    -      actorSystem.actorOf(Props(classOf[WorkerWatcher], url), name = "WorkerWatcher")
    +    workerUrl: Option[String]) {
    +
    +    val sparkUser = Option(System.getenv("SPARK_USER")).getOrElse(SparkContext.SPARK_UNKNOWN_USER)
    --- End diff --
    
    I believe runAsUser is not actually used anywhere else. We can probably just scope it to this exact use case, to avoid confusing users with its over-generality until that's necessary.



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42111962
  
    LGTM too. Thanks for the clarifications, guys. Merging into master, branch-1.0, and branch-0.9.



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42097557
  
    @sryza so this looks good to you?



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42078001
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14625/



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42111008
  
    This does look good to me.



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42094276
  
    Merged build started. 



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42092767
  
    To add to what Tom said, there's a distinction between "renewing" tokens and "repopulating" them. Renewing means extending the lifespan of existing tokens. Repopulating with new tokens is not something that YARN currently does.
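
In Hadoop API terms the distinction looks roughly like this (a sketch; `token`, `fs`, `creds`, and `conf` are assumed to be in scope, and the calls shown are standard Hadoop 2.x APIs):

    // Renewing: extend the expiry of an existing token, which is what the YARN
    // ResourceManager does on the application's behalf. The token itself is unchanged.
    val newExpiry: Long = token.renew(conf)

    // Repopulating: fetch brand-new tokens and add them to the credentials, e.g.
    // via FileSystem#addDelegationTokens. YARN does not do this for a running
    // application, which is why a single cached UGI remains valid.
    fs.addDelegationTokens("yarn", creds)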



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42094124
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14629/



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/621



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42095785
  
     Merged build triggered. 



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42091219
  
    I have tested this on standalone mode and confirmed that the file handles do not leak.



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/621#issuecomment-42095752
  
    Jenkins, retest this please.



[GitHub] spark pull request: [WIP] SPARK-1676: Cache Hadoop UGIs by default...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/621#discussion_r12253872
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -24,25 +24,30 @@ import org.apache.hadoop.mapred.JobConf
     import org.apache.hadoop.security.Credentials
     import org.apache.hadoop.security.UserGroupInformation
     
    -import org.apache.spark.{SparkContext, SparkException}
    +import org.apache.spark.{Logging, SparkContext, SparkException}
     
     import scala.collection.JavaConversions._
     
     /**
      * Contains util methods to interact with Hadoop from Spark.
      */
    -class SparkHadoopUtil {
    +class SparkHadoopUtil extends Logging {
       val conf: Configuration = newConfiguration()
       UserGroupInformation.setConfiguration(conf)
     
    +  // IMPORTANT NOTE: If this function is going to be called repeated in the same process
    +  // you need to look https://issues.apache.org/jira/browse/HDFS-3545 and possibly
    +  // do a FileSystem.closeAllForUGI in order to avoid leaking Filesystems
       def runAsUser(user: String)(func: () => Unit) {
    --- End diff --
    
    Nit: "repeated" should be "repeatedly"

