Posted to reviews@spark.apache.org by bomeng <gi...@git.apache.org> on 2018/06/25 20:08:03 UTC

[GitHub] spark pull request #21638: [SPARK-22357][CORE] SparkContext.binaryFiles igno...

GitHub user bomeng opened a pull request:

    https://github.com/apache/spark/pull/21638

    [SPARK-22357][CORE] SparkContext.binaryFiles ignore minPartitions parameter

    ## What changes were proposed in this pull request?
    Fix the issue that minPartitions was not used in the method.
    
    ## How was this patch tested?
    I have not provided the additional test since the fix is very straightforward.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bomeng/spark 22357

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21638.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21638
    
----
commit b9eea4994c3ad151aa75ed03bbcf807bc3c4ded8
Author: Bo Meng <me...@...>
Date:   2018-06-25T20:02:43Z

    fix: SparkContext.binaryFiles ignore minPartitions parameter

commit 0fc35d4e0db34239cd3c52b0cf21445c59d2dede
Author: Bo Meng <me...@...>
Date:   2018-06-25T20:04:58Z

    should be max()

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    **[Test build #4290 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4290/testReport)** for PR 21638 at commit [`c24fbe5`](https://github.com/apache/spark/commit/c24fbe5cdf259814c30d9038fa3c35a2934ac39f).


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Except for `binaryFiles`, everything else that needs to change is private to Spark. I know it's public in the bytecode, but only Java callers could accidentally exploit that. Still I don't personally care too much either way, as long as all the unused args are documented, I guess, for completeness.


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by bomeng <gi...@git.apache.org>.

Github user bomeng commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Here is the test code; I am not sure whether it is right:
    ```
      // Assumes the enclosing suite lives under org.apache.spark (Utils is private[spark])
      // and provides the `sc` variable, e.g. via LocalSparkContext.
      import java.io.File
      import java.nio.charset.StandardCharsets
      import com.google.common.io.Files
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.util.Utils

      test("Number of partitions") {
        // A tiny max partition size and zero open cost keep the split size small.
        sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local")
          .set("spark.files.maxPartitionBytes", "10")
          .set("spark.files.openCostInBytes", "0")
          .set("spark.default.parallelism", "1"))

        val dir1 = Utils.createTempDir()
        val dirpath1 = dir1.getAbsolutePath

        // Two small input files in the same directory.
        val file1 = new File(dir1, "part-00000")
        val file2 = new File(dir1, "part-00001")

        Files.write("someline1 in file1\nsomeline2 in file1\nsomeline3 in file1", file1,
          StandardCharsets.UTF_8)
        Files.write("someline1 in file2\nsomeline2 in file2\nsomeline3 in file2", file2,
          StandardCharsets.UTF_8)

        assert(sc.binaryFiles(dirpath1, minPartitions = 1).getNumPartitions == 2)
        assert(sc.binaryFiles(dirpath1, minPartitions = 2).getNumPartitions == 2)
        assert(sc.binaryFiles(dirpath1, minPartitions = 50).getNumPartitions == 2)
      }
    ```


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    **[Test build #93067 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93067/testReport)** for PR 21638 at commit [`c24fbe5`](https://github.com/apache/spark/commit/c24fbe5cdf259814c30d9038fa3c35a2934ac39f).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark pull request #21638: [SPARK-22357][CORE] SparkContext.binaryFiles igno...

Posted by jiangxb1987 <gi...@git.apache.org>.

Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r203589083
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T]
       def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
         val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
         val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
    -    val defaultParallelism = sc.defaultParallelism
    +    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
    --- End diff --
    
    I mentioned `BinaryFileRDD`, not this method; you can check the code to see how it handles the default value.


---

[GitHub] spark pull request #21638: [SPARK-22357][CORE] SparkContext.binaryFiles igno...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r197988075
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -45,7 +45,8 @@ private[spark] abstract class StreamFileInputFormat[T]
        * which is set through setMaxSplitSize
        */
       def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
    -    val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
    +    val defaultMaxSplitBytes = Math.max(
    +      sc.getConf.get(config.FILES_MAX_PARTITION_BYTES), minPartitions)
         val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
         val defaultParallelism = sc.defaultParallelism
    --- End diff --
    
    hmm, shouldn't `minPartitions` be used like this?
    
    ```scala
    val defaultParallelism = Math.max(sc.defaultParallelism, if (minPartitions == 0) 1 else minPartitions)
    ```
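
To see why the parallelism value is the natural place for `minPartitions`, here is a minimal, Spark-free sketch. It assumes the rest of `setMinPartitions` (not shown in the diff above) derives the max split size as `min(maxPartitionBytes, max(openCost, totalBytes / parallelism))`; the object `SplitSizeSketch` and its parameter names are hypothetical and are not part of the patch.

```scala
object SplitSizeSketch {
  // Assumed formula: the diff above only shows the first lines of the method,
  // so the remainder here is an illustration, not the actual Spark source.
  def maxSplitSize(
      fileSizes: Seq[Long],
      maxPartitionBytes: Long,
      openCostInBytes: Long,
      defaultParallelism: Int,
      minPartitions: Int): Long = {
    // Mirrors the suggested expression: a minPartitions of 0 is treated as 1.
    val parallelism =
      math.max(defaultParallelism, if (minPartitions == 0) 1 else minPartitions)
    // Each file is charged its length plus the cost of opening it.
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / parallelism
    // Never exceed the configured maximum, never go below the open cost.
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }
}
```

Under these assumptions, with two 1000-byte files, no open cost, and a default parallelism of 1, raising `minPartitions` from 1 to 50 lowers the cap from 2000 bytes to 40 bytes, which lets the input format produce more, smaller splits.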


---

[GitHub] spark pull request #21638: [SPARK-22357][CORE] SparkContext.binaryFiles igno...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21638


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    **[Test build #93067 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93067/testReport)** for PR 21638 at commit [`c24fbe5`](https://github.com/apache/spark/commit/c24fbe5cdf259814c30d9038fa3c35a2934ac39f).


---

[GitHub] spark pull request #21638: [SPARK-22357][CORE] SparkContext.binaryFiles igno...

Posted by bomeng <gi...@git.apache.org>.

Github user bomeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r202907829
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T]
       def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
         val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
         val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
    -    val defaultParallelism = sc.defaultParallelism
    +    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
    --- End diff --
    
    You need to pass in `minPartitions` to use this method. What do you mean by `minPartitions` is not set?


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Because this method is internal to Spark, why not just take out the parameter? Yes it's superfluous now, but it's been this way for a while, and seems perhaps better to avoid a behavior change. In fact you can pull a `minPartitions` parameter out of several private methods then. You can't remove the parameter to `binaryFiles`, sure, but it can be documented as doing nothing.


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    **[Test build #92350 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92350/testReport)** for PR 21638 at commit [`c24fbe5`](https://github.com/apache/spark/commit/c24fbe5cdf259814c30d9038fa3c35a2934ac39f).


---

[GitHub] spark pull request #21638: [SPARK-22357][CORE] SparkContext.binaryFiles igno...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r214581248
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T]
       def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
         val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
         val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
    -    val defaultParallelism = sc.defaultParallelism
    +    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
    --- End diff --
    
    BTW, it is easy to add such a test case. We can even test the behaviors of the boundary cases. cc @srowen @HyukjinKwon @MaxGekk @jiangxb1987 


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2577/
    Test PASSed.


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    retest this please


---

[GitHub] spark pull request #21638: [SPARK-22357][CORE] SparkContext.binaryFiles igno...

Posted by bomeng <gi...@git.apache.org>.

Github user bomeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r204517923
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T]
       def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
         val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
         val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
    -    val defaultParallelism = sc.defaultParallelism
    +    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
    --- End diff --
    
    `BinaryFileRDD` will set `minPartitions`, which will be either `defaultMinPartitions` or the value you pass via the `binaryFiles(path, minPartitions)` method. Eventually, this `minPartitions` value is passed to the `setMinPartitions()` method.
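
The flow described in this comment can be sketched without Spark. This assumes `defaultMinPartitions` is defined as `min(defaultParallelism, 2)`, which is an assumption about `SparkContext` that the thread does not state; `MinPartitionsFlowSketch` and its methods are hypothetical stand-ins, not Spark APIs.

```scala
object MinPartitionsFlowSketch {
  // Assumption: mirrors how SparkContext is believed to define defaultMinPartitions.
  def defaultMinPartitions(defaultParallelism: Int): Int =
    math.min(defaultParallelism, 2)

  // Hypothetical stand-in for binaryFiles(path, minPartitions):
  // an explicit value from the caller wins, otherwise the default is used.
  def resolveMinPartitions(defaultParallelism: Int, requested: Option[Int]): Int =
    requested.getOrElse(defaultMinPartitions(defaultParallelism))

  // The value the patched setMinPartitions would use as its parallelism.
  def effectiveParallelism(defaultParallelism: Int, requested: Option[Int]): Int =
    math.max(defaultParallelism, resolveMinPartitions(defaultParallelism, requested))
}
```

Under these assumptions the default never raises the parallelism above `sc.defaultParallelism`; only an explicit, larger `minPartitions` changes the result.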


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    It seems there is similar code there: https://github.com/apache/spark/blob/e76b0124fbe463def00b1dffcfd8fd47e04772fe/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L424-L433 . Should it be changed in the same way?


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95295/
    Test PASSed.


---

[GitHub] spark pull request #21638: [SPARK-22357][CORE] SparkContext.binaryFiles igno...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r215030825
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T]
       def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
         val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
         val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
    -    val defaultParallelism = sc.defaultParallelism
    +    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
    --- End diff --
    
    ```
          sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local")
            .set(config.FILES_OPEN_COST_IN_BYTES.key, "0")
            .set("spark.default.parallelism", "1"))
    
          println(sc.binaryFiles(dirpath1, minPartitions = 50).getNumPartitions)
          println(sc.binaryFiles(dirpath1, minPartitions = 1).getNumPartitions)
    ```
    
    It is not hard to verify whether the parameter `minPartitions` takes effect. Currently, the description of this parameter is not clear. We need to document clearly which factors impact the actual number of partitions; otherwise, users will not understand how to use it.


---

[GitHub] spark pull request #21638: [SPARK-22357][CORE] SparkContext.binaryFiles igno...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r198120457
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -45,7 +45,8 @@ private[spark] abstract class StreamFileInputFormat[T]
        * which is set through setMaxSplitSize
        */
       def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
    -    val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
    +    val defaultMaxSplitBytes = Math.max(
    +      sc.getConf.get(config.FILES_MAX_PARTITION_BYTES), minPartitions)
         val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
         val defaultParallelism = sc.defaultParallelism
    --- End diff --
    
    Could you describe the use case in which you need to take `minPartitions` into account? By default, `FILES_MAX_PARTITION_BYTES` is 128MB. Let's say it is even set to 1000, and `minPartitions` equals 10 000. What is the reason to set the max size of splits in **bytes** to the min **number** of partitions? Why should a bigger number of partitions require a bigger split size? Could you add more details to the PR description, please?
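
A quick check of the arithmetic behind this objection, as a standalone sketch. The object name `UnitMismatchSketch` and the method `firstRevision` are hypothetical; the values are the ones from the comment above. The point is that taking the max of a byte count and a partition count mixes units.

```scala
object UnitMismatchSketch {
  // 128 MB, the default cited in the comment above.
  val defaultMaxPartitionBytes: Long = 128L * 1024 * 1024

  // The first revision of the patch: compares a size in bytes
  // against a number of partitions.
  def firstRevision(maxPartitionBytes: Long, minPartitions: Int): Long =
    math.max(maxPartitionBytes, minPartitions.toLong)
}
```

With the default limit, `firstRevision(defaultMaxPartitionBytes, 10000)` is still 134217728, so `minPartitions` has no effect. With the limit set to 1000 it becomes 10000, which raises the maximum split size and would tend to yield fewer partitions rather than more.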


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by bomeng <gi...@git.apache.org>.

Github user bomeng commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    @HyukjinKwon please review. thanks.


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/490/
    Test PASSed.


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Yea, it's internal to Spark. Might be good to keep it but that concern should be secondary IMHO.


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    **[Test build #95295 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95295/testReport)** for PR 21638 at commit [`5e46efb`](https://github.com/apache/spark/commit/5e46efb5f5ce86297c4aeb23bf934fd9942de3de).


---

[GitHub] spark pull request #21638: [SPARK-22357][CORE] SparkContext.binaryFiles igno...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r214685953
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T]
       def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
         val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
         val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
    -    val defaultParallelism = sc.defaultParallelism
    +    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
    --- End diff --
    
    I think it's hard to test, technically, because `setMinPartitions` is only a hint. In the case of `binaryFiles` we know it will put a hard limit on the number of partitions, but it isn't true of other implementations. We can still make a simple test for all of these, it just may be asserting behavior that could change in the future in Hadoop, though I strongly doubt it would.


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by bomeng <gi...@git.apache.org>.

Github user bomeng commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Either way works for me, but since this is not a private method, people may be using it in their own ways. A minimal change would be best.


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Merged build finished. Test FAILed.


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Not sure yet but let's leave that out of this PR.


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/464/
    Test PASSed.


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92350/
    Test PASSed.


---

[GitHub] spark pull request #21638: [SPARK-22357][CORE] SparkContext.binaryFiles igno...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r214581076
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T]
       def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
         val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
         val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
    -    val defaultParallelism = sc.defaultParallelism
    +    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
    --- End diff --
    
    We should have a test case; otherwise, we could hit the same issue again. 


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    **[Test build #92309 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92309/testReport)** for PR 21638 at commit [`0fc35d4`](https://github.com/apache/spark/commit/0fc35d4e0db34239cd3c52b0cf21445c59d2dede).


---

[GitHub] spark pull request #21638: [SPARK-22357][CORE] SparkContext.binaryFiles igno...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r198243125
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -45,10 +45,9 @@ private[spark] abstract class StreamFileInputFormat[T]
        * which is set through setMaxSplitSize
        */
       def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
    -    val defaultMaxSplitBytes = Math.max(
    -      sc.getConf.get(config.FILES_MAX_PARTITION_BYTES), minPartitions)
    +    val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
         val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
    -    val defaultParallelism = sc.defaultParallelism
    +    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
    --- End diff --
    
    Now it makes much more sense.


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    @bomeng Could you submit a follow-up PR to add a test case?


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Yea, let's add a regression test.


---

[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Merged to master


---


[GitHub] spark pull request #21638: [SPARK-22357][CORE] SparkContext.binaryFiles igno...

Posted by bomeng <gi...@git.apache.org>.

Github user bomeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r215022562
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T]
       def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
         val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
         val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
    -    val defaultParallelism = sc.defaultParallelism
    +    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
    --- End diff --
    
    From the code, you can see that the calculation is only an intermediate result and that this method does not return a value. Checking the split size does not make sense for this test case, because it depends on multiple variables and this is just one of them.


---


[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    **[Test build #4290 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4290/testReport)** for PR 21638 at commit [`c24fbe5`](https://github.com/apache/spark/commit/c24fbe5cdf259814c30d9038fa3c35a2934ac39f).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---


[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    **[Test build #92350 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92350/testReport)** for PR 21638 at commit [`c24fbe5`](https://github.com/apache/spark/commit/c24fbe5cdf259814c30d9038fa3c35a2934ac39f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    **[Test build #95295 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95295/testReport)** for PR 21638 at commit [`5e46efb`](https://github.com/apache/spark/commit/5e46efb5f5ce86297c4aeb23bf934fd9942de3de).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92309/
    Test PASSed.


---


[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Ideally the last test should have 50 partitions? Is it because we really need the test data to be at least 50 bytes? Ideally a multiple of 50, I guess.
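
One way to read the question above, sketched under the assumption that the per-partition byte budget comes from integer (`Long`) division of the total input size by the requested partition count. The object `BytesPerPartitionSketch` and its two helpers are hypothetical and exist only to show the arithmetic; they are not part of Spark.

```scala
// Hypothetical helpers, not part of Spark: they only illustrate the arithmetic.
object BytesPerPartitionSketch {
  // Integer division: any remainder is discarded.
  def bytesPerPartition(totalBytes: Long, requestedPartitions: Int): Long =
    totalBytes / requestedPartitions

  // True when the input size is an exact multiple of the partition count.
  def dividesEvenly(totalBytes: Long, requestedPartitions: Int): Boolean =
    totalBytes % requestedPartitions == 0
}
```

With 40 bytes of input and 50 requested partitions the quotient is 0, so there is no positive per-partition budget. 50 bytes gives a budget of 1 byte, and any multiple of 50 divides with no remainder.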


---


[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    **[Test build #92309 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92309/testReport)** for PR 21638 at commit [`0fc35d4`](https://github.com/apache/spark/commit/0fc35d4e0db34239cd3c52b0cf21445c59d2dede).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #21638: [SPARK-22357][CORE] SparkContext.binaryFiles igno...

Posted by bomeng <gi...@git.apache.org>.

Github user bomeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r215010040
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T]
       def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
         val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
         val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
    -    val defaultParallelism = sc.defaultParallelism
    +    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
    --- End diff --
    
    I agree it is hard to test. I would appreciate it if anyone could give me some hints on how to do this (how to verify the change and where to put my test cases).


---


[GitHub] spark pull request #21638: [SPARK-22357][CORE] SparkContext.binaryFiles igno...

Posted by jiangxb1987 <gi...@git.apache.org>.

Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r202737100
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T]
       def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
         val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
         val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
    -    val defaultParallelism = sc.defaultParallelism
    +    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
    --- End diff --
    
    If `sc.defaultParallelism` < 2 and `minPartitions` is not set in `BinaryFileRDD`, then previously `defaultParallelism` would be the same as `sc.defaultParallelism`, and after the change it will be `2`. Have you already considered this case, and do you feel it is the right behavior change to make?
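
The case raised above can be checked with a small sketch of the one line that changed. The object `ParallelismSketch` and its two functions are hypothetical names; the example also assumes, as the comment does, that `minPartitions` reaches the method with the value 2.

```scala
// Hypothetical names; each function mirrors one side of the diff above.
object ParallelismSketch {
  // Before the change: minPartitions is ignored at this step.
  def before(defaultParallelism: Int, minPartitions: Int): Int =
    defaultParallelism

  // After the change: the larger of the two values is used.
  def after(defaultParallelism: Int, minPartitions: Int): Int =
    math.max(defaultParallelism, minPartitions)
}
```

For a default parallelism of 1 and `minPartitions = 2`, `before` yields 1 and `after` yields 2. When the default parallelism is already at least `minPartitions`, for example 8 and 2, both versions yield the same value.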


---


[GitHub] spark pull request #21638: [SPARK-22357][CORE] SparkContext.binaryFiles igno...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r215016744
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T]
       def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
         val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
         val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
    -    val defaultParallelism = sc.defaultParallelism
    +    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
    --- End diff --
    
    Would you mind following up with a test that just asserts that asking for, say, 20 partitions results in 20 partitions? This is technically too specific as a test, but is probably fine for now.


---


[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21638
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/983/
    Test PASSed.


---
