You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by kellyzly <gi...@git.apache.org> on 2015/04/01 09:38:06 UTC

[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

GitHub user kellyzly opened a pull request:

    https://github.com/apache/spark/pull/5307

    [Spark-5682] Add spark encrypted shuffle by using chimera lib

    [Chimera](https://github.com/intel-hadoop/chimera) is a project which strips code related to CryptoInputStream/CryptoOutputStream from Hadoop to facilitate AES-NI based data encryption in other projects. It provides JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec. JceAesCtrypoCodec uses encrypted algorithms  jdk provides while OpensslAesCtrCryptoCodec uses encrypted algorithms  openssl provides. We can directly use Chimera to implement spark encrypted shuffle.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kellyzly/spark use_chimera

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5307.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5307
    
----
commit 1b9baa8f77ffa9ad04a98ad39910bffe7a340dbd
Author: kellyzly <ke...@126.com>
Date:   2015-04-01T07:23:09Z

    [Spark-5682] Add spark encrypted shuffle by using chimera lib

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/5307#issuecomment-142137975
  
    @winningsix at the very least this PR needs to be updated to resolve conflicts.
    
    I need to take a look at the library being pulled; my main concern is that being a new library, it may not be super stable, and having it outside the project would make it more difficult to quickly address bugs (because you have to go thorough another project's release cycle before you can fix Spark).
    
    If it's not a big library, it might be worth it to pull it into Spark for the first release of this code, and later pull it out so that other Hadoop projects can also use it. Unless other projects are already using it, in which case my concerns are probably outdated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27930474
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
    @@ -123,12 +130,34 @@ private[spark] class DiskBlockObjectWriter(
        */
       private var numRecordsWritten = 0
     
    +  private var sparkConf:SparkConf = null
    +
       override def open(): BlockObjectWriter = {
         if (hasBeenClosed) {
           throw new IllegalStateException("Writer already closed. Cannot be reopened.")
         }
         fos = new FileOutputStream(file, true)
    -    ts = new TimeTrackingOutputStream(fos)
    +    var isEncryptedShuffle: Boolean = if (sparkConf != null) {
    +      sparkConf.getBoolean("spark.encrypted.shuffle", false)
    +    }
    +    else {
    --- End diff --
    
    nit: should be in previous line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27930454
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
    @@ -20,10 +20,17 @@ package org.apache.spark.storage
     import java.io.{BufferedOutputStream, FileOutputStream, File, OutputStream}
     import java.nio.channels.FileChannel
     
    -import org.apache.spark.Logging
    -import org.apache.spark.serializer.{SerializationStream, Serializer}
    -import org.apache.spark.executor.ShuffleWriteMetrics
    +import org.apache.hadoop.mapreduce.security.TokenCache
     
    +import com.intel.chimera.{CipherSuite, CryptoOutputStream, CryptoCodec}
    +
    +import org.apache.spark.crypto.CommonConfigurationKeys.SPARK_ENCRYPTED_INTERMEDIATE_DATA_BUFFER_KB
    +import org.apache.spark.crypto.CommonConfigurationKeys.SPARK_SHUFFLE_TOKEN
    +import org.apache.spark.crypto.CommonConfigurationKeys.DEFAULT_SPARK_ENCRYPTED_INTERMEDIATE_DATA_BUFFER_KB
    +import org.apache.spark.deploy.SparkHadoopUtil
    +import org.apache.spark.executor.ShuffleWriteMetrics
    +import org.apache.spark.serializer.{SerializationStream, Serializer}
    +import org.apache.spark.{Logging,SparkConf}
    --- End diff --
    
    nit: this is out of order (should be at the top of this group), also needs an empty line afterwards.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27930413
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
    @@ -20,10 +20,17 @@ package org.apache.spark.storage
     import java.io.{BufferedOutputStream, FileOutputStream, File, OutputStream}
     import java.nio.channels.FileChannel
     
    -import org.apache.spark.Logging
    -import org.apache.spark.serializer.{SerializationStream, Serializer}
    -import org.apache.spark.executor.ShuffleWriteMetrics
    +import org.apache.hadoop.mapreduce.security.TokenCache
     
    +import com.intel.chimera.{CipherSuite, CryptoOutputStream, CryptoCodec}
    +
    +import org.apache.spark.crypto.CommonConfigurationKeys.SPARK_ENCRYPTED_INTERMEDIATE_DATA_BUFFER_KB
    --- End diff --
    
    nit: either `import org.apache.spark.crypto.CommonConfigurationKeys._` or `import org.apache.spark.crypto.CommonConfigurationKeys.{FOO, BAR}`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/5307


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/5307#issuecomment-141255329
  
    @vanzin, could you help to clarify whether Spark currently includes encryption of shuffle data in transit? I believe that your SASL patch added this encryption? If so, I think that we should close this PR and the following JIRAs:
    
    - https://issues.apache.org/jira/browse/SPARK-5682
    - https://issues.apache.org/jira/browse/SPARK-6460
    
    I don't see any mention of SASL on https://spark.apache.org/docs/1.5.0/security.html, so if we do indeed have encryption then we should update the docs to say so. Right now, the docs claim that "SSL is not supported yet for WebUI and block transfer service", making it sound like we have no encryption support for data in-transit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27930917
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala ---
    @@ -296,15 +301,39 @@ final class ShuffleBlockFetcherIterator(
             // There is a chance that createInputStream can fail (e.g. fetching a local file that does
             // not exist, SPARK-4085). In that case, we should propagate the right exception so
             // the scheduler gets a FetchFailedException.
    -        Try(buf.createInputStream()).map { is0 =>
    -          val is = blockManager.wrapForCompression(blockId, is0)
    -          val iter = serializer.newInstance().deserializeStream(is).asIterator
    -          CompletionIterator[Any, Iterator[Any]](iter, {
    -            // Once the iterator is exhausted, release the buffer and set currentResult to null
    -            // so we don't release it again in cleanup.
    -            currentResult = null
    -            buf.release()
    -          })
    +        // is0:InputStream
    +        Try(buf.createInputStream()).map {
    +          is0 =>
    +            var is: InputStream = null
    +            val sparkConf = blockManager.conf
    +            val isEncryptedShuffle = if (sparkConf != null) {
    +              sparkConf.getBoolean("spark.encrypted.shuffle", false)
    +            } else {
    +              false
    +            }
    +            if (isEncryptedShuffle) {
    +              val cryptoCodec: CryptoCodec = CryptoCodec.getInstance(CipherSuite.AES_CTR_NOPADDING)
    +              val bufferSize: Int = sparkConf.getInt(
    +                SPARK_ENCRYPTED_INTERMEDIATE_DATA_BUFFER_KB,
    +                DEFAULT_SPARK_ENCRYPTED_INTERMEDIATE_DATA_BUFFER_KB) * 1024
    +              val iv: Array[Byte] = Array[Byte](0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    --- End diff --
    
    `new Array[Byte](someLength)`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/5307#issuecomment-141261657
  
    Ah, gotcha. We should update this PR / JIRA title to better reflect this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by winningsix <gi...@git.apache.org>.
Github user winningsix commented on the pull request:

    https://github.com/apache/spark/pull/5307#issuecomment-141892023
  
    Hi @JoshRosen and @vanzin, do you have any further comments about this PR? The library Chimera is maintained by Intel. One way to address your comments is to move the library into Spark project. I'd like to move this jira forwards. So I am seeking for your suggestions and thoughts. Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5307#issuecomment-156535938
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by winningsix <gi...@git.apache.org>.
Github user winningsix commented on the pull request:

    https://github.com/apache/spark/pull/5307#issuecomment-142151286
  
    Thanks @vanzin for your prompt reply. In the current patch it consists of two parts. One is the encryption framework using JCE encryption and another is performance acceleration using Openssl library. Maybe we can get the first part in since it doesn't require the Chimera library. Then we could move the second part forwards either using Chimera or pull them into Spark. Any thoughts? Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27930996
  
    --- Diff: core/src/test/scala/org/apache/spark/crypto/JceAesCtrCryptoCodecSuite.scala ---
    @@ -0,0 +1,70 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.crypto
    +
    +import java.io.{ByteArrayInputStream, BufferedOutputStream, ByteArrayOutputStream}
    +import java.security.SecureRandom
    +
    +import org.apache.spark.{SparkConf, Logging}
    +import org.scalatest.FunSuite
    +import com.intel.chimera.{CryptoInputStream, CryptoOutputStream, JceAesCtrCryptoCodec, CryptoCodec}
    +
    +/**
    + * test JceAesCtrCryptoCodec
    + */
    +class JceAesCtrCryptoCodecSuite extends FunSuite with Logging {
    +
    +  test("TestCryptoCodecSuite"){
    +    val random: SecureRandom = new SecureRandom
    +    val dataLen: Int = 10000000
    +    val inputData: Array[Byte] = new Array[Byte](dataLen)
    +    val outputData: Array[Byte] = new Array[Byte](dataLen)
    +    random.nextBytes(inputData)
    +    // encrypt
    +    val sparkConf:SparkConf = new SparkConf()
    +    val codec: CryptoCodec =  new JceAesCtrCryptoCodec()
    +    val aos: ByteArrayOutputStream = new ByteArrayOutputStream
    +    val bos: BufferedOutputStream = new BufferedOutputStream(aos)
    +    val key: Array[Byte] = new Array[Byte](16)
    +    val iv: Array[Byte] = new Array[Byte](16)
    +    random.nextBytes(key)
    +    random.nextBytes(iv)
    +
    +    val cos: CryptoOutputStream = new CryptoOutputStream(bos, codec, 1024, key, iv)
    +    cos.write(inputData, 0, inputData.length)
    +    cos.flush
    +    // decrypt
    +    val cis: CryptoInputStream = new CryptoInputStream(new ByteArrayInputStream(aos.toByteArray),
    +      codec, 1024, key, iv)
    +    var readLen: Int = 0
    +    var outOffset: Int = 0
    +    while (readLen < dataLen) {
    +      val n: Int = cis.read(outputData, outOffset, outputData.length - outOffset)
    +      if (n >= 0) {
    +        readLen += n
    +        outOffset += n
    +      }
    +    }
    +    var i: Int = 0
    +    for(i <- 0 until dataLen )
    +    {
    +      if (inputData(i) != outputData(i)) {
    +        logInfo(s"decrypt failed:$i")
    +      }
    +    }
    +  }
    +}
    --- End diff --
    
    nit: need empty line at end of file (scalastyle will complain about this)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5307#issuecomment-88381961
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27930690
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
    @@ -123,12 +130,34 @@ private[spark] class DiskBlockObjectWriter(
        */
       private var numRecordsWritten = 0
     
    +  private var sparkConf:SparkConf = null
    +
       override def open(): BlockObjectWriter = {
         if (hasBeenClosed) {
           throw new IllegalStateException("Writer already closed. Cannot be reopened.")
         }
         fos = new FileOutputStream(file, true)
    -    ts = new TimeTrackingOutputStream(fos)
    +    var isEncryptedShuffle: Boolean = if (sparkConf != null) {
    +      sparkConf.getBoolean("spark.encrypted.shuffle", false)
    +    }
    +    else {
    +      false
    +    }
    +    if (isEncryptedShuffle) {
    --- End diff --
    
    In fact you could even inline the `conf.getBoolean` call here.
    
    (BTW you have this code repeated later in the file, same comments apply there.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/5307#issuecomment-90760535
  
    Left some initial comments. I like the idea of using a library for this, my main concern is about the stability of that library (and who'll be maintaining it going forward).
    
    Aside from that, there are some parts of the code that need cleaning up and some style updates, and some broken error handling, but nothing too major.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27930881
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala ---
    @@ -296,15 +301,39 @@ final class ShuffleBlockFetcherIterator(
             // There is a chance that createInputStream can fail (e.g. fetching a local file that does
             // not exist, SPARK-4085). In that case, we should propagate the right exception so
             // the scheduler gets a FetchFailedException.
    -        Try(buf.createInputStream()).map { is0 =>
    -          val is = blockManager.wrapForCompression(blockId, is0)
    -          val iter = serializer.newInstance().deserializeStream(is).asIterator
    -          CompletionIterator[Any, Iterator[Any]](iter, {
    -            // Once the iterator is exhausted, release the buffer and set currentResult to null
    -            // so we don't release it again in cleanup.
    -            currentResult = null
    -            buf.release()
    -          })
    +        // is0:InputStream
    +        Try(buf.createInputStream()).map {
    +          is0 =>
    --- End diff --
    
    Move this to previous line and save one indentation level in the rest of the code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27931210
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
    @@ -534,12 +538,51 @@ private[spark] class Client(
         // send the acl settings into YARN to control who has access via YARN interfaces
         val securityManager = new SecurityManager(sparkConf)
         amContainer.setApplicationACLs(YarnSparkHadoopUtil.getApplicationAclsForYarn(securityManager))
    +    val isEncryptedShuffle = if (sparkConf != null) {
    +      sparkConf.getBoolean("spark.encrypted.shuffle", false)
    +    } else {
    +      false
    +    }
    +    if (isEncryptedShuffle) {
    +      setSparkShuffleTokens(credentials, amContainer)
    +      // amContainer.setEnvironment()  set the LD_LIBRARY_PATH like "xxx.so" here
    +    }
         setupSecurityToken(amContainer)
         UserGroupInformation.getCurrentUser().addCredentials(credentials)
     
         amContainer
       }
     
    +  def setSparkShuffleTokens(credentials:Credentials,amContainer:ContainerLaunchContext){
    +    if (TokenCache.getShuffleSecretKey(credentials) == null) {
    +      var keyGen: KeyGenerator = null
    +      try {
    +        val SHUFFLE_KEY_LENGTH: Int = 64
    +        var keyLen: Int = if (sparkConf.getBoolean(SPARK_ENCRYPTED_INTERMEDIATE_DATA,
    +          DEFAULT_SPARK_ENCRYPTED_INTERMEDIATE_DATA) == true) {
    +          sparkConf.getInt(SPARK_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS,
    +            DEFAULT_SPARK_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS)
    +        }
    +        else {
    +          SHUFFLE_KEY_LENGTH
    +        }
    +        val SHUFFLE_KEYGEN_ALGORITHM = "HmacSHA1"
    +        keyGen = KeyGenerator.getInstance(SHUFFLE_KEYGEN_ALGORITHM)
    +        keyGen.init(keyLen)
    +      }
    +      catch {
    +        case e: NoSuchAlgorithmException => println("Error generating shuffle secret key")
    --- End diff --
    
    Another case of broken error handling.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27930313
  
    --- Diff: core/pom.xml ---
    @@ -359,6 +359,12 @@
           <artifactId>py4j</artifactId>
           <version>0.8.2.1</version>
         </dependency>
    +    <dependency>
    +      <groupId>com.intel.chimera</groupId>
    +      <artifactId>chimera</artifactId>
    +      <version>0.0.1</version>
    --- End diff --
    
    So, how stable is this project? It seems, from the github page, to be a re-packaging of the Hadoop code with maybe some more bells & whistles, but the version number doesn't scream stability...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27930722
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
    @@ -136,6 +165,17 @@ private[spark] class DiskBlockObjectWriter(
         this
       }
     
    +  def createIV(cryptoCodec: CryptoCodec): Array[Byte] = {
    --- End diff --
    
    IV = initial value? Could you spell it out?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by kellyzly <gi...@git.apache.org>.
Github user kellyzly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r28030065
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
    @@ -123,12 +130,34 @@ private[spark] class DiskBlockObjectWriter(
        */
       private var numRecordsWritten = 0
     
    +  private var sparkConf:SparkConf = null
    +
       override def open(): BlockObjectWriter = {
         if (hasBeenClosed) {
           throw new IllegalStateException("Writer already closed. Cannot be reopened.")
         }
         fos = new FileOutputStream(file, true)
    -    ts = new TimeTrackingOutputStream(fos)
    +    var isEncryptedShuffle: Boolean = if (sparkConf != null) {
    +      sparkConf.getBoolean("spark.encrypted.shuffle", false)
    +    }
    +    else {
    +      false
    +    }
    +    if (isEncryptedShuffle) {
    --- End diff --
    
    @vanzin : 
    > In fact you could even inline the conf.getBoolean call here.
    
    not understand, could you explain clearly?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27931187
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
    @@ -534,12 +538,51 @@ private[spark] class Client(
         // send the acl settings into YARN to control who has access via YARN interfaces
         val securityManager = new SecurityManager(sparkConf)
         amContainer.setApplicationACLs(YarnSparkHadoopUtil.getApplicationAclsForYarn(securityManager))
    +    val isEncryptedShuffle = if (sparkConf != null) {
    --- End diff --
    
    Another instance of code that can be cleaned up.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27930515
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
    @@ -123,12 +130,34 @@ private[spark] class DiskBlockObjectWriter(
        */
       private var numRecordsWritten = 0
     
    +  private var sparkConf:SparkConf = null
    +
       override def open(): BlockObjectWriter = {
         if (hasBeenClosed) {
           throw new IllegalStateException("Writer already closed. Cannot be reopened.")
         }
         fos = new FileOutputStream(file, true)
    -    ts = new TimeTrackingOutputStream(fos)
    +    var isEncryptedShuffle: Boolean = if (sparkConf != null) {
    --- End diff --
    
    `val`? You can also just say `sparkConf.getBoolean(...)` instead of the `if...else`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27930383
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
    @@ -20,10 +20,17 @@ package org.apache.spark.storage
     import java.io.{BufferedOutputStream, FileOutputStream, File, OutputStream}
     import java.nio.channels.FileChannel
     
    -import org.apache.spark.Logging
    -import org.apache.spark.serializer.{SerializationStream, Serializer}
    -import org.apache.spark.executor.ShuffleWriteMetrics
    +import org.apache.hadoop.mapreduce.security.TokenCache
     
    +import com.intel.chimera.{CipherSuite, CryptoOutputStream, CryptoCodec}
    --- End diff --
    
    nit: should be grouped with the `org.apache.hadoop` imports. See pattern in other files.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27931071
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
    @@ -552,8 +558,47 @@ private[spark] class ApplicationMaster(
         }
       }
     
    +  /** set  "MapReduceShuffleToken" before registerAM
    +    * @return
    +    */
    +  private def initJobCredentialsAndUGI() = {
    +    val sc = sparkContextRef.get()
    +    val conf = if (sc != null) sc.getConf else sparkConf
    +    val isEncryptedShuffle = if (conf != null) {
    +      conf.getBoolean("spark.encrypted.shuffle", false)
    +    } else {
    +      false
    +    }
    +    if (isEncryptedShuffle) {
    +      val credentials = SparkHadoopUtil.get.getCurrentUserCredentials
    +      if (credentials.getSecretKey(SPARK_SHUFFLE_TOKEN) == null) {
    +        var keyGen: KeyGenerator = null
    +        try {
    +          val SHUFFLE_KEY_LENGTH: Int = 64
    +          var keyLen: Int = if (conf.getBoolean(SPARK_ENCRYPTED_INTERMEDIATE_DATA,
    +            DEFAULT_SPARK_ENCRYPTED_INTERMEDIATE_DATA) == true) {
    +            conf.getInt(SPARK_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS,
    +              DEFAULT_SPARK_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS)
    +          }
    +          else {
    --- End diff --
    
    nit: move to previous line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by winningsix <gi...@git.apache.org>.
Github user winningsix commented on the pull request:

    https://github.com/apache/spark/pull/5307#issuecomment-142516961
  
    @vanzin I create a new PR(https://github.com/apache/spark/pull/8880) addressing the common part and JCE key provider support. Could you help me review it? Ticket is also filed in SPARK-10771. Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27930971
  
    --- Diff: core/src/test/scala/org/apache/spark/crypto/JceAesCtrCryptoCodecSuite.scala ---
    @@ -0,0 +1,70 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.crypto
    +
    +import java.io.{ByteArrayInputStream, BufferedOutputStream, ByteArrayOutputStream}
    --- End diff --
    
    nit: sort


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/5307#issuecomment-141264324
  
    re: SPARK-6373, it would just add another way of encrypting shuffle traffic; personally I don't see much value in it, but maybe others disagree.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27931006
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
    @@ -22,24 +22,28 @@ import scala.util.control.NonFatal
     import java.io.{File, IOException}
     import java.lang.reflect.InvocationTargetException
     import java.net.{Socket, URL}
    +import java.security.NoSuchAlgorithmException
     import java.util.concurrent.atomic.AtomicReference
    +import javax.crypto.{SecretKey, KeyGenerator}
    --- End diff --
    
    nit: sort


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27931166
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
    @@ -552,8 +558,47 @@ private[spark] class ApplicationMaster(
         }
       }
     
    +  /** set  "MapReduceShuffleToken" before registerAM
    +    * @return
    +    */
    +  private def initJobCredentialsAndUGI() = {
    +    val sc = sparkContextRef.get()
    +    val conf = if (sc != null) sc.getConf else sparkConf
    +    val isEncryptedShuffle = if (conf != null) {
    +      conf.getBoolean("spark.encrypted.shuffle", false)
    +    } else {
    +      false
    +    }
    +    if (isEncryptedShuffle) {
    +      val credentials = SparkHadoopUtil.get.getCurrentUserCredentials
    +      if (credentials.getSecretKey(SPARK_SHUFFLE_TOKEN) == null) {
    +        var keyGen: KeyGenerator = null
    +        try {
    +          val SHUFFLE_KEY_LENGTH: Int = 64
    +          var keyLen: Int = if (conf.getBoolean(SPARK_ENCRYPTED_INTERMEDIATE_DATA,
    +            DEFAULT_SPARK_ENCRYPTED_INTERMEDIATE_DATA) == true) {
    +            conf.getInt(SPARK_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS,
    +              DEFAULT_SPARK_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS)
    +          }
    +          else {
    +            SHUFFLE_KEY_LENGTH
    +          }
    +          val SHUFFLE_KEYGEN_ALGORITHM = "HmacSHA1";
    +          keyGen = KeyGenerator.getInstance(SHUFFLE_KEYGEN_ALGORITHM)
    +          keyGen.init(keyLen)
    +        }
    +        catch {
    +          case e: NoSuchAlgorithmException => println("Error generating shuffle secret key")
    --- End diff --
    
    Two things here. You shouldn't use `println` for errors. And if you actually hit an error here, you'll cause an NPE at line 593. So you probably should let this exception bubble up.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/5307#issuecomment-141262377
  
    And to be clear, I do think we can close https://issues.apache.org/jira/browse/SPARK-6373, no?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by kellyzly <gi...@git.apache.org>.
Github user kellyzly commented on the pull request:

    https://github.com/apache/spark/pull/5307#issuecomment-91146737
  
    @vanzin : 
    > Left some initial comments. I like the idea of using a library for this, my main concern is about the stability of that library (and who'll be maintaining it going forward).
    
     i  have updated the code according to your valuable suggestions.  [Chimera](https://github.com/intel-hadoop/chimera) is a project which is maintained by Intel team(My teammate dianfu is the contributor of it).  Chimera is a project which strips code related to the encryption and decryption part of   [Transparent Encryption in HDFS](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html) from hadoop.  Transparent Encryption is imported since hadoop 2.6 and the code is stable after hadoop 2.6 release. Although the current version of Chimera is 0.0.1, it is same stable as current hadoop because we just strip code from current trunk of hadoop. If someone reports new bugs to hadoop about the encryption and decryption part later, my teammate will  watch it and patch the patches of those bugs to Chimera. 
    
    More information about encrypted shuffle you can reference https://issues.apache.org/jira/browse/SPARK-5682. Any suggestions/advices/guidance are welcome.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by kellyzly <gi...@git.apache.org>.
Github user kellyzly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r28029243
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
    @@ -123,12 +130,34 @@ private[spark] class DiskBlockObjectWriter(
        */
       private var numRecordsWritten = 0
     
    +  private var sparkConf:SparkConf = null
    +
       override def open(): BlockObjectWriter = {
         if (hasBeenClosed) {
           throw new IllegalStateException("Writer already closed. Cannot be reopened.")
         }
         fos = new FileOutputStream(file, true)
    -    ts = new TimeTrackingOutputStream(fos)
    +    var isEncryptedShuffle: Boolean = if (sparkConf != null) {
    --- End diff --
    
    @vanzin : i will change "var" to "val".  Use if ... else here is to pass [ the unit test ("Reopening a closed block writer") of BlockObjectWriterSuite ](https://github.com/apache/spark/blob/dcd1e42d6b6ac08d2c0736bf61a15f515a1f222b/core/src/test/scala/org/apache/spark/storage/BlockObjectWriterSuite.scala#L80). SparkConf is null in that unit test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27930234
  
    --- Diff: core/pom.xml ---
    @@ -359,6 +359,12 @@
           <artifactId>py4j</artifactId>
           <version>0.8.2.1</version>
         </dependency>
    +    <dependency>
    +      <groupId>com.intel.chimera</groupId>
    +      <artifactId>chimera</artifactId>
    +      <version>0.0.1</version>
    +      <type>jar</type>
    --- End diff --
    
    This is not generally needed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27931080
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
    @@ -552,8 +558,47 @@ private[spark] class ApplicationMaster(
         }
       }
     
    +  /** set  "MapReduceShuffleToken" before registerAM
    +    * @return
    +    */
    +  private def initJobCredentialsAndUGI() = {
    +    val sc = sparkContextRef.get()
    +    val conf = if (sc != null) sc.getConf else sparkConf
    +    val isEncryptedShuffle = if (conf != null) {
    +      conf.getBoolean("spark.encrypted.shuffle", false)
    +    } else {
    +      false
    +    }
    +    if (isEncryptedShuffle) {
    +      val credentials = SparkHadoopUtil.get.getCurrentUserCredentials
    +      if (credentials.getSecretKey(SPARK_SHUFFLE_TOKEN) == null) {
    +        var keyGen: KeyGenerator = null
    +        try {
    +          val SHUFFLE_KEY_LENGTH: Int = 64
    +          var keyLen: Int = if (conf.getBoolean(SPARK_ENCRYPTED_INTERMEDIATE_DATA,
    +            DEFAULT_SPARK_ENCRYPTED_INTERMEDIATE_DATA) == true) {
    +            conf.getInt(SPARK_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS,
    +              DEFAULT_SPARK_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS)
    +          }
    +          else {
    +            SHUFFLE_KEY_LENGTH
    +          }
    +          val SHUFFLE_KEYGEN_ALGORITHM = "HmacSHA1";
    +          keyGen = KeyGenerator.getInstance(SHUFFLE_KEYGEN_ALGORITHM)
    +          keyGen.init(keyLen)
    +        }
    +        catch {
    --- End diff --
    
    nit: move to previous line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27930341
  
    --- Diff: core/src/main/scala/org/apache/spark/crypto/CommonConfigurationKeys.scala ---
    @@ -0,0 +1,35 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.crypto
    +
    +import org.apache.hadoop.io.Text
    +
    +/**
    + * Constant variables
    + */
    +object CommonConfigurationKeys {
    --- End diff --
    
    `private[spark]`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/5307#issuecomment-150726069
  
    @kellyzly given #8880, could you close this PR? I assume a new one will be built on to of that one to add native acceleration later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27931054
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
    @@ -552,8 +558,47 @@ private[spark] class ApplicationMaster(
         }
       }
     
    +  /** set  "MapReduceShuffleToken" before registerAM
    +    * @return
    +    */
    +  private def initJobCredentialsAndUGI() = {
    +    val sc = sparkContextRef.get()
    +    val conf = if (sc != null) sc.getConf else sparkConf
    +    val isEncryptedShuffle = if (conf != null) {
    +      conf.getBoolean("spark.encrypted.shuffle", false)
    +    } else {
    +      false
    +    }
    +    if (isEncryptedShuffle) {
    --- End diff --
    
    Another instance of this code that could use some cleaning up.
    
    I wonder if it would make sense to have a `CryptoConf` object that wraps a `SparkConf` (like `SSLOptions`).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27930846
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala ---
    @@ -17,19 +17,24 @@
     
     package org.apache.spark.storage
     
    -import java.io.{InputStream, IOException}
    +import java.io.InputStream
     import java.util.concurrent.LinkedBlockingQueue
     
     import scala.collection.mutable.{ArrayBuffer, HashSet, Queue}
     import scala.util.{Failure, Success, Try}
     
    -import org.apache.spark.{Logging, TaskContext}
    +import com.intel.chimera.{CipherSuite, CryptoInputStream, CryptoCodec}
    +
    +import org.apache.spark.crypto.CommonConfigurationKeys.DEFAULT_SPARK_ENCRYPTED_INTERMEDIATE_DATA_BUFFER_KB
    +import org.apache.spark.crypto.CommonConfigurationKeys.SPARK_ENCRYPTED_INTERMEDIATE_DATA_BUFFER_KB
    +import org.apache.spark.crypto.CommonConfigurationKeys.SPARK_SHUFFLE_TOKEN
    +import org.apache.spark.deploy.SparkHadoopUtil
     import org.apache.spark.network.BlockTransferService
     import org.apache.spark.network.shuffle.{BlockFetchingListener, ShuffleClient}
     import org.apache.spark.network.buffer.ManagedBuffer
     import org.apache.spark.serializer.Serializer
     import org.apache.spark.util.{CompletionIterator, Utils}
    -
    +import org.apache.spark.{Logging, TaskContext}
    --- End diff --
    
    nit: same thing about ordering, leaving empty line afterwards.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/5307#issuecomment-142066882
  
    @vanzin, given that you've been spending a lot of time on encryption-related things in Spark, I'll leave it to you to make a judgement call here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5307#discussion_r27931279
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
    @@ -534,12 +538,51 @@ private[spark] class Client(
         // send the acl settings into YARN to control who has access via YARN interfaces
         val securityManager = new SecurityManager(sparkConf)
         amContainer.setApplicationACLs(YarnSparkHadoopUtil.getApplicationAclsForYarn(securityManager))
    +    val isEncryptedShuffle = if (sparkConf != null) {
    +      sparkConf.getBoolean("spark.encrypted.shuffle", false)
    +    } else {
    +      false
    +    }
    +    if (isEncryptedShuffle) {
    +      setSparkShuffleTokens(credentials, amContainer)
    +      // amContainer.setEnvironment()  set the LD_LIBRARY_PATH like "xxx.so" here
    +    }
         setupSecurityToken(amContainer)
         UserGroupInformation.getCurrentUser().addCredentials(credentials)
     
         amContainer
       }
     
    +  def setSparkShuffleTokens(credentials:Credentials,amContainer:ContainerLaunchContext){
    --- End diff --
    
    A lot of the code here seems very similar to code in `ApplicationMaster.scala`. Could you move the similar parts to some helper function somewhere?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/5307#issuecomment-142155798
  
    @winningsix that would be awesome.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [Spark-5682] Add spark encrypted shuffle by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/5307#issuecomment-141255933
  
    Spark supports encrypted communication for shuffle data; we should fix the docs (I'll file a bug for that). But this PR is not about on-the-wire encryption, it's data at rest encryption (i.e. the shuffle files on disk).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org