Posted to reviews@spark.apache.org by liyinan926 <gi...@git.apache.org> on 2014/03/28 22:35:12 UTC

[GitHub] spark pull request: Added support for accessing secured HDFS

GitHub user liyinan926 opened a pull request:

    https://github.com/apache/spark/pull/265

    Added support for accessing secured HDFS

    Also changed the way tasks run so that they always run as the user who submitted them. This replaces the old approach of using an environment variable, SPARK_USER, to specify the user, which is far less flexible. This eases security management, since users no longer need to open access to HDFS files under their home directory to the user who starts the Spark cluster.
    
    Signed-off-by: Yinan Li <li...@gmail.com>

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liyinan926/spark secure-hdfs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/265.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #265
    
----
commit 1e89e78d18b87acdeb15afe1ccaa92887ee15b74
Author: Yinan Li <li...@gmail.com>
Date:   2014-03-28T21:32:33Z

    Added support for accessing secured HDFS
    
    Also changed the way tasks run so that they always run as the user who submitted them. This replaces the old approach of using an environment variable, SPARK_USER, to specify the user, which is far less flexible. This eases security management, since users no longer need to open access to HDFS files under their home directory to the user who starts the Spark cluster.
    
    Signed-off-by: Yinan Li <li...@gmail.com>

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-38970939
  
    Merged build started. Build is starting -or- tests failed to complete.



[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by huozhanfeng <gi...@git.apache.org>.
Github user huozhanfeng commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-54252237
  
    I find that it doesn't work well with the 'spark.eventLog.enabled' config, and I'm trying to solve this problem.



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-38970923
  
     Merged build triggered. Build is starting -or- tests failed to complete.



[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by huozhanfeng <gi...@git.apache.org>.
Github user huozhanfeng commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-54840248
  
    @pwendell @tgravescs I have made some improvements to it and created a new PR based on the latest master; you can work from that.
    
    PR:https://github.com/apache/spark/pull/2320
    JIRA:https://issues.apache.org/jira/browse/SPARK-3438
    
    I am using this patch now and I really hope it can be merged into the master so it can help others and I don't need to maintain the code.
    
    Thanks




[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-38971066
  
    Build is starting -or- tests failed to complete.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13560/



[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-38973056
  
    Merged build started. Build is starting -or- tests failed to complete.



[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-54873213
  
    I commented on the other PR.




[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/265




[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-56577212
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20720/




[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-54124179
  
    (I imagine part of the reason is that it doesn't merge into master and that the tests failed.)




[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by dkanoafry <gi...@git.apache.org>.
Github user dkanoafry commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-51365398
  
    Hi, whatever happened to this PR? I am interested in reading data from secure HDFS into Spark running on Mesos...




[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-38971063
  
    Merged build finished. Build is starting -or- tests failed to complete.



[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-38973045
  
     Merged build triggered. Build is starting -or- tests failed to complete.



[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by liyinan926 <gi...@git.apache.org>.
Github user liyinan926 commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-38970674
  
    This PR replaces https://github.com/apache/incubator-spark/pull/467.



[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-38971229
  
    This is failing because of a style error:
    error file=/root/workspace/SparkPullRequestBuilder/core/src/main/scala/org/apache/spark/executor/Executor.scala message=File line length exceeds 100 characters line=192




[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-54870372
  
    Hey @huozhanfeng - from what I can tell your PR also has the same issue with security I was mentioning above. I think it's worth seeing whether the `addFile` serving can be authenticated easily. I agree it would be great to get this patch merged in since I think a few different companies are working on this.




[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-38976772
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-38976773
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13562/



[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by SevadaAbraamyan <gi...@git.apache.org>.
Github user SevadaAbraamyan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/265#discussion_r13505208
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -75,6 +83,165 @@ class SparkHadoopUtil {
     
       def getSecretKeyFromUserCredentials(key: String): Array[Byte] = { null }
     
    +  /**
    +   * Return whether Hadoop security is enabled or not.
    +   *
    +   * @return Whether Hadoop security is enabled or not
    +   */
    +  def isSecurityEnabled(): Boolean = {
    +    UserGroupInformation.isSecurityEnabled
    +  }
    +
    +  /**
    +   * Do user authentication when Hadoop security is turned on. Used by the driver.
    +   *
    +   * @param sc Spark context
    +   */
    +  def doUserAuthentication(sc: SparkContext) {
    +    getAuthenticationType match {
    +      case "keytab" => {
    +        // Authentication through a Kerberos keytab file. Necessary for
    +        // long-running services like Shark/Spark Streaming.
    +        scheduleKerberosRenewTask(sc)
    +      }
    +      case _ => {
    +        // No authentication needed. Assuming authentication is already done
    +        // before Spark is launched, e.g., the user has authenticated with
    +        // Kerberos through kinit already.
    +        // Renew a Hadoop delegation token and store the token into a file.
    +        // Add the token file so it gets downloaded by every slave nodes.
    +        sc.addFile(initDelegationToken().toString)
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Get the user whom the task belongs to.
    +   *
    +   * @param userName Name of the user whom the task belongs to
    +   * @return The user whom the task belongs to
    +   */
    +  def getTaskUser(userName: String): UserGroupInformation = {
    +    val ugi = UserGroupInformation.createRemoteUser(userName)
    +    // Change the authentication method to Kerberos
    +    ugi.setAuthenticationMethod(
    +      UserGroupInformation.AuthenticationMethod.KERBEROS)
    +    // Get and add Hadoop delegation tokens for the user
    +    val iter = getDelegationTokens().iterator()
    +    while (iter.hasNext) {
    +      ugi.addToken(iter.next())
    +    }
    +
    +    ugi
    +  }
    +
    +  /**
    +   * Get the type of Hadoop security authentication.
    +   *
    +   * @return Type of Hadoop security authentication
    +   */
    +  private def getAuthenticationType: String = {
    +    sparkConf.get("spark.hadoop.security.authentication")
    --- End diff --
    
    Should this not have a default value? 
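
To illustrate the question, here is a minimal sketch of reading the setting with a fallback. It assumes that the one-argument `get(key)` fails when the key is unset while a two-argument `get(key, defaultValue)` falls back. `MiniConf` is a hypothetical stand-in used so the sketch runs without Spark on the classpath, and the default `"none"` is an illustrative choice that the patch itself does not define.

```scala
// Stand-in for a configuration object (hypothetical, for illustration only):
// get(key) throws if the key is unset, get(key, defaultValue) falls back.
final class MiniConf(settings: Map[String, String]) {
  def get(key: String): String =
    settings.getOrElse(key, throw new NoSuchElementException(key))

  def get(key: String, defaultValue: String): String =
    settings.getOrElse(key, defaultValue)
}

object AuthTypeSketch {
  val AuthKey = "spark.hadoop.security.authentication"

  // "none" is an assumed default; the patch under review does not define one.
  def authenticationType(conf: MiniConf): String =
    conf.get(AuthKey, "none")

  // Mirrors the match in doUserAuthentication: only "keytab" is special-cased.
  def usesKeytab(conf: MiniConf): Boolean =
    authenticationType(conf) == "keytab"
}
```

With a fallback in place, a job that never sets the key would take the `case _` branch of `doUserAuthentication` instead of failing on the lookup.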



[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-67596353
  
    I think we should close this issue for now, since there's another more-recent PR to add the same feature.




[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by huozhanfeng <gi...@git.apache.org>.
Github user huozhanfeng commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-54111302
  
    I would like to know why this pull request has not been merged. Does it go against the Spark roadmap?
    
    I think it is a useful feature that could help others a lot, so I intend to test it on Spark master and open a new pull request.




[GitHub] spark pull request: Added support for accessing secured HDFS

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/265#issuecomment-54778609
  
    @dkanoafry with this patch, the main issue I see is that it distributes the delegation tokens insecurely (through `sc.addFile`), so anyone could just read the tokens over the network and impersonate the user who is running the Spark job. In fact, we start an HTTP file server, so you wouldn't even need to observe the traffic; you could just make a request against it. I'm guessing this is fine for the company submitting the patch, but it's too weak a security model, IMO, to merge upstream.
    
    Since we've more recently added support for securing the HTTP file server through a shared secret, I think this might be okay to pull in now. @tgravescs would you mind taking a quick look? I think the idea here is that in standalone mode a user would just log in with a keytab and send delegation tokens to the executors, with the main goal being to provide access to a secured HDFS deployment. Is there a way now for them to set a shared secret to authenticate this HTTP request? (I think it's fine to assume that they just set something in a conf file on all of the worker nodes, i.e. we don't need to disseminate that secret.)
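
To make the shared-secret idea concrete, here is a minimal sketch of how a request for the token file could be authenticated. This is not Spark's actual implementation: the object name `SharedSecretSketch`, the function names, and the choice of HMAC-SHA256 are all assumptions made for illustration. The client signs the requested path with a secret that both sides read from local configuration, and the server recomputes the signature before serving the file.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

// Illustrative only: the names and the HMAC-SHA256 scheme are assumptions,
// not how Spark's HTTP file server authenticates requests.
object SharedSecretSketch {
  private val Algorithm = "HmacSHA256"

  // Client side: compute an HMAC over the requested path with the shared secret.
  def sign(secret: String, path: String): Array[Byte] = {
    val mac = Mac.getInstance(Algorithm)
    mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), Algorithm))
    mac.doFinal(path.getBytes(StandardCharsets.UTF_8))
  }

  // Server side: recompute the HMAC and compare it with the one presented.
  // MessageDigest.isEqual compares array contents; == on arrays would
  // compare references instead.
  def isAuthorized(secret: String, path: String, presented: Array[Byte]): Boolean =
    MessageDigest.isEqual(sign(secret, path), presented)
}
```

A scheme like this only shows that the requester knows the secret. The token file itself would still cross the network in the clear unless the transfer is also encrypted, and a captured signature could be replayed for the same path.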

