Posted to reviews@spark.apache.org by vgankidi <gi...@git.apache.org> on 2017/11/01 20:13:27 UTC

[GitHub] spark pull request #19633: [SPARK-22411][SQL] Disable the heuristic to calcu...

GitHub user vgankidi opened a pull request:

    https://github.com/apache/spark/pull/19633

    [SPARK-22411][SQL] Disable the heuristic to calculate max partition size when dynamic allocation is enabled and use the value specified by the property spark.sql.files.maxPartitionBytes instead

    
    ## What changes were proposed in this pull request?
    
    The heuristic to calculate maxSplitBytes in DataSourceScanExec is as follows:
    https://github.com/apache/spark/blob/d28d5732ae205771f1f443b15b10e64dcffb5ff0/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L431
    Default parallelism in this case is the total number of cores across all executors registered for this application. This works well with static allocation, but with dynamic allocation enabled this value is usually one at the time of split calculation (with the default config of zero min and initial executors). This heuristic was introduced in SPARK-14582.
    When dynamic allocation is enabled, tuning the split size through this heuristic is confusing. It is better to ignore bytesPerCore and use the value of 'spark.sql.files.maxPartitionBytes' as the max split size.
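To make the discussion concrete, here is a self-contained sketch of the heuristic linked above (simplified from FileSourceScanExec; the data sizes are illustrative assumptions, and the 128 MB / 4 MB constants are the documented defaults of spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes):

```scala
// Simplified sketch of the split-size heuristic in FileSourceScanExec.
object SplitSizeHeuristic {
  def maxSplitBytes(
      defaultMaxSplitBytes: Long, // spark.sql.files.maxPartitionBytes (default 128 MB)
      openCostInBytes: Long,      // spark.sql.files.openCostInBytes (default 4 MB)
      totalBytes: Long,           // sum of file lengths, each padded with the open cost
      defaultParallelism: Int): Long = {
    val bytesPerCore = totalBytes / defaultParallelism
    math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024
    // Dynamic allocation with min/initial executors = 0: parallelism is often 1 at
    // planning time, so bytesPerCore is huge and the 128 MB cap always wins.
    assert(maxSplitBytes(128 * mb, 4 * mb, 10240 * mb, 1) == 128 * mb)
    // Static allocation with 400 registered cores: the same 10 GB scan gets
    // much smaller (~25.6 MB) splits from the bytesPerCore term.
    assert(maxSplitBytes(128 * mb, 4 * mb, 10240 * mb, 400) == 10240 * mb / 400)
    println("ok")
  }
}
```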
    
    ## How was this patch tested?
    Tested manually.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vgankidi/spark SPARK-22411

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19633.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19633
    
----
commit 4157771715a235fe5ffad970764b805fd74f45d5
Author: Vinitha Gankidi <vg...@netflix.com>
Date:   2017-11-01T20:09:44Z

    [SPARK-22411][SQL] Disable the heuristic to calculate max partition size when
    dynamic allocation is enabled and use the value specified by the property
    spark.sql.files.maxPartitionBytes instead

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    **[Test build #93053 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93053/testReport)** for PR 19633 at commit [`d3aff40`](https://github.com/apache/spark/commit/d3aff40637da077f4f9cee655d6fb24b0b204a5d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    ok to test


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    **[Test build #83756 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83756/testReport)** for PR 19633 at commit [`d3aff40`](https://github.com/apache/spark/commit/d3aff40637da077f4f9cee655d6fb24b0b204a5d).


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    **[Test build #93053 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93053/testReport)** for PR 19633 at commit [`d3aff40`](https://github.com/apache/spark/commit/d3aff40637da077f4f9cee655d6fb24b0b204a5d).


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by vgankidi <gi...@git.apache.org>.
Github user vgankidi commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    ping @davies 


---



[GitHub] spark pull request #19633: [SPARK-22411][SQL] Disable the heuristic to calcu...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19633#discussion_r151308496
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
    @@ -424,11 +424,19 @@ case class FileSourceScanExec(
         val defaultMaxSplitBytes =
           fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
         val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
    -    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
    -    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
    -    val bytesPerCore = totalBytes / defaultParallelism
     
    -    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
    +    // Ignore bytesPerCore when dynamic allocation is enabled. See SPARK-22411
    +    val maxSplitBytes =
    +      if (Utils.isDynamicAllocationEnabled(fsRelation.sparkSession.sparkContext.getConf)) {
    +        defaultMaxSplitBytes
    --- End diff --
    
    >For small data, sometimes users care about the number of output files generated as well.
    
    Can you please elaborate: do you want more output files or fewer? If fewer, I think I already addressed that in the comment above.
    
    Also, your patch already seems to ignore `bytesPerCore` when dynamic allocation is enabled; are you suggesting something different?


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by vgankidi <gi...@git.apache.org>.
Github user vgankidi commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    @gatorsmile Can you please take a look? I'd like to hear your thoughts on this.


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by vgankidi <gi...@git.apache.org>.
Github user vgankidi commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    How about using spark.dynamicAllocation.maxExecutors for calculating bytesPerCore when dynamic allocation is enabled?


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #19633: [SPARK-22411][SQL] Disable the heuristic to calcu...

Posted by vgankidi <gi...@git.apache.org>.
Github user vgankidi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19633#discussion_r150431620
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
    @@ -424,11 +424,19 @@ case class FileSourceScanExec(
         val defaultMaxSplitBytes =
           fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
         val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
    -    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
    -    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
    -    val bytesPerCore = totalBytes / defaultParallelism
     
    -    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
    +    // Ignore bytesPerCore when dynamic allocation is enabled. See SPARK-22411
    +    val maxSplitBytes =
    +      if (Utils.isDynamicAllocationEnabled(fsRelation.sparkSession.sparkContext.getConf)) {
    +        defaultMaxSplitBytes
    --- End diff --
    
    That is true. In that case, this change wouldn't help. That is why I asked whether using `spark.dynamicAllocation.maxExecutors` to calculate bytesPerCore would be better when dynamic allocation is enabled.
    
    For small tables (when `bytesPerCore` is less than `defaultMaxSplitBytes`) we have had use cases where we wanted control over the partition size without being limited by `bytesPerCore`. We internally added a flag to ignore `bytesPerCore`.
    
    Reasoning about split size in terms of the two parameters `defaultMaxSplitBytes` and `openCostInBytes` seems more intuitive. What do you think?



---



[GitHub] spark pull request #19633: [SPARK-22411][SQL] Disable the heuristic to calcu...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19633#discussion_r150489026
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
    @@ -424,11 +424,19 @@ case class FileSourceScanExec(
         val defaultMaxSplitBytes =
           fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
         val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
    -    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
    -    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
    -    val bytesPerCore = totalBytes / defaultParallelism
     
    -    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
    +    // Ignore bytesPerCore when dynamic allocation is enabled. See SPARK-22411
    +    val maxSplitBytes =
    +      if (Utils.isDynamicAllocationEnabled(fsRelation.sparkSession.sparkContext.getConf)) {
    +        defaultMaxSplitBytes
    --- End diff --
    
    cc @jerryshao 


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    **[Test build #83736 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83736/testReport)** for PR 19633 at commit [`4157771`](https://github.com/apache/spark/commit/4157771715a235fe5ffad970764b805fd74f45d5).


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93053/
    Test PASSed.


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    ok to test


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #19633: [SPARK-22411][SQL] Disable the heuristic to calcu...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19633#discussion_r150746876
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
    @@ -424,11 +424,19 @@ case class FileSourceScanExec(
         val defaultMaxSplitBytes =
           fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
         val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
    -    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
    -    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
    -    val bytesPerCore = totalBytes / defaultParallelism
     
    -    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
    +    // Ignore bytesPerCore when dynamic allocation is enabled. See SPARK-22411
    +    val maxSplitBytes =
    +      if (Utils.isDynamicAllocationEnabled(fsRelation.sparkSession.sparkContext.getConf)) {
    +        defaultMaxSplitBytes
    --- End diff --
    
    What if `spark.dynamicAllocation.maxExecutors` is not configured? It seems we cannot rely on this configuration; users may not always set it.
    
    My concern is the cost of ramping up new executors: by splitting partitions into small ones, Spark will ramp up more executors to execute the small tasks. When the cost of ramping up new executors is larger than that of executing the tasks, this is no longer a useful heuristic. Previously, all the executors were already available, so the heuristic was valid.
    
    For small data (calculated `bytesPerCore` < `defaultMaxSplitBytes`, i.e. less than 128 MB), I think using the available resources to schedule tasks would be enough, since the tasks are not so big. For big data (calculated `bytesPerCore` > `defaultMaxSplitBytes`, i.e. larger than 128 MB), I think 128 MB might be the proper value for issuing new executors and tasks. So IMHO the current solution seems sufficient for the dynamic allocation scenario. Please correct me if I'm wrong.
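    The two regimes described above can be checked with a small standalone sketch (pure arithmetic; the 128 MB / 4 MB constants are the Spark defaults, and the data sizes are made up):

```scala
// Toy numbers for the small-data and big-data regimes of the current heuristic.
object SmallVsBigData {
  private val mb = 1024L * 1024
  // Same formula as the current heuristic, with the 128 MB / 4 MB defaults baked in.
  def split(totalBytes: Long, cores: Int): Long =
    math.min(128 * mb, math.max(4 * mb, totalBytes / cores))

  def main(args: Array[String]): Unit = {
    // Small data: 2 GB across 100 cores -> ~20 MB splits, i.e. many smallish tasks.
    assert(split(2048 * mb, 100) < 128 * mb)
    // Big data: 100 GB across 100 cores -> bytesPerCore exceeds the cap, 128 MB splits.
    assert(split(102400 * mb, 100) == 128 * mb)
    println("ok")
  }
}
```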


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    Generally, this looks reasonable to me. cc @cloud-fan @jiangxb1987 


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83756/
    Test FAILed.


---



[GitHub] spark pull request #19633: [SPARK-22411][SQL] Disable the heuristic to calcu...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19633#discussion_r150403066
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
    @@ -424,11 +424,18 @@ case class FileSourceScanExec(
         val defaultMaxSplitBytes =
           fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
         val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
    -    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
    -    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
    -    val bytesPerCore = totalBytes / defaultParallelism
     
    -    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
    +    val maxSplitBytes =
    +      if (Utils.isDynamicAllocationEnabled(fsRelation.sparkSession.sparkContext.getConf)) {
    +      defaultMaxSplitBytes
    --- End diff --
    
    Also please add a comment here.


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #19633: [SPARK-22411][SQL] Disable the heuristic to calcu...

Posted by vgankidi <gi...@git.apache.org>.
Github user vgankidi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19633#discussion_r151236188
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
    @@ -424,11 +424,19 @@ case class FileSourceScanExec(
         val defaultMaxSplitBytes =
           fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
         val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
    -    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
    -    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
    -    val bytesPerCore = totalBytes / defaultParallelism
     
    -    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
    +    // Ignore bytesPerCore when dynamic allocation is enabled. See SPARK-22411
    +    val maxSplitBytes =
    +      if (Utils.isDynamicAllocationEnabled(fsRelation.sparkSession.sparkContext.getConf)) {
    +        defaultMaxSplitBytes
    --- End diff --
    
    @jerryshao That's true. We cannot rely on users to set it.
    
    For small data, users sometimes care about the number of output files generated as well. To give users that flexibility, how about adding a flag that, when set, ignores `bytesPerCore` regardless of whether dynamic allocation is in use? That way the default behavior stays as-is, and users with specific needs can set the flag.


---



[GitHub] spark pull request #19633: [SPARK-22411][SQL] Disable the heuristic to calcu...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19633#discussion_r150429971
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
    @@ -424,11 +424,19 @@ case class FileSourceScanExec(
         val defaultMaxSplitBytes =
           fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
         val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
    -    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
    -    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
    -    val bytesPerCore = totalBytes / defaultParallelism
     
    -    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
    +    // Ignore bytesPerCore when dynamic allocation is enabled. See SPARK-22411
    +    val maxSplitBytes =
    +      if (Utils.isDynamicAllocationEnabled(fsRelation.sparkSession.sparkContext.getConf)) {
    +        defaultMaxSplitBytes
    --- End diff --
    
    Does it really help? Without this change, `bytesPerCore` is usually very big, and `maxSplitBytes` will just be `defaultMaxSplitBytes` anyway.


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    **[Test build #83736 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83736/testReport)** for PR 19633 at commit [`4157771`](https://github.com/apache/spark/commit/4157771715a235fe5ffad970764b805fd74f45d5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83736/
    Test PASSed.


---



[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #19633: [SPARK-22411][SQL] Disable the heuristic to calcu...

Posted by vgankidi <gi...@git.apache.org>.
Github user vgankidi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19633#discussion_r153957431
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
    @@ -424,11 +424,19 @@ case class FileSourceScanExec(
         val defaultMaxSplitBytes =
           fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
         val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
    -    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
    -    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
    -    val bytesPerCore = totalBytes / defaultParallelism
     
    -    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
    +    // Ignore bytesPerCore when dynamic allocation is enabled. See SPARK-22411
    +    val maxSplitBytes =
    +      if (Utils.isDynamicAllocationEnabled(fsRelation.sparkSession.sparkContext.getConf)) {
    +        defaultMaxSplitBytes
    --- End diff --
    
    @jerryshao When `bytesPerCore` is less than `defaultMaxSplitBytes` and is chosen as `maxSplitBytes`, the number of output files is greater than it would be with `maxSplitBytes` = `defaultMaxSplitBytes`. So for use cases against small data where users want fewer output files, they should have the flexibility to get them.
    
    Instead of ignoring `bytesPerCore` when dynamic allocation is enabled, I am proposing a new flag, `overrideMaxPartitionSize`. When it is set, `maxSplitBytes` would equal `defaultMaxSplitBytes` (i.e., the current heuristic would not be used). The flag's default value will be false, so that the current behavior is not affected.
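    A hedged sketch of that proposal (the flag name `overrideMaxPartitionSize` comes from the comment above, but the config plumbing here is hypothetical, not Spark's actual SQLConf wiring):

```scala
// Hypothetical sketch of the proposed flag-gated split-size calculation.
object ProposedOverride {
  final case class SplitConf(
      filesMaxPartitionBytes: Long,
      filesOpenCostInBytes: Long,
      overrideMaxPartitionSize: Boolean = false) // default false: behavior unchanged

  def maxSplitBytes(conf: SplitConf, totalBytes: Long, parallelism: Int): Long =
    if (conf.overrideMaxPartitionSize) {
      conf.filesMaxPartitionBytes // heuristic disabled: ignore bytesPerCore
    } else {
      val bytesPerCore = totalBytes / parallelism
      math.min(conf.filesMaxPartitionBytes,
        math.max(conf.filesOpenCostInBytes, bytesPerCore))
    }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024
    val conf = SplitConf(128 * mb, 4 * mb)
    // Default: small data still shrinks the split size via bytesPerCore.
    assert(maxSplitBytes(conf, 2048 * mb, 100) == 2048 * mb / 100)
    // Flag set: split size is pinned to spark.sql.files.maxPartitionBytes.
    assert(maxSplitBytes(conf.copy(overrideMaxPartitionSize = true), 2048 * mb, 100) == 128 * mb)
    println("ok")
  }
}
```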


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19633
  
    **[Test build #83756 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83756/testReport)** for PR 19633 at commit [`d3aff40`](https://github.com/apache/spark/commit/d3aff40637da077f4f9cee655d6fb24b0b204a5d).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19633: [SPARK-22411][SQL] Disable the heuristic to calcu...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19633#discussion_r150403060
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
    @@ -424,11 +424,18 @@ case class FileSourceScanExec(
         val defaultMaxSplitBytes =
           fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
         val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
    -    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
    -    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
    -    val bytesPerCore = totalBytes / defaultParallelism
     
    -    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
    +    val maxSplitBytes =
    +      if (Utils.isDynamicAllocationEnabled(fsRelation.sparkSession.sparkContext.getConf)) {
    +      defaultMaxSplitBytes
    --- End diff --
    
    indents.


---
