Posted to reviews@spark.apache.org by jinxing64 <gi...@git.apache.org> on 2017/05/18 16:09:28 UTC

[GitHub] spark pull request #18031: Record accurate size of blocks in MapStatus when ...

GitHub user jinxing64 opened a pull request:

    https://github.com/apache/spark/pull/18031

    Record accurate size of blocks in MapStatus when it's above threshold.

    ## What changes were proposed in this pull request?
    
    Currently, when the number of reducers is above 2000, `HighlyCompressedMapStatus` is used to store the sizes of shuffle blocks. `HighlyCompressedMapStatus` stores only the average size of non-empty blocks, which makes memory control difficult when fetching shuffle blocks: a huge block can be badly underestimated. It makes sense to store the accurate size of a block when it is above a threshold.
    
    ## How was this patch tested?
    
    Added test in MapStatusSuite.
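    The idea can be sketched in a few lines of standalone Scala (a simplified illustration, not the actual patch; `compress` and `sizeOf` are hypothetical helpers):

```scala
// Sketch of the idea behind this PR (not the actual patch): store the
// average size for ordinary blocks, but remember the exact sizes of
// blocks above a threshold, keyed by reduce id.
object AccurateSizeSketch {
  // Returns (avg size over non-empty blocks, exact sizes of huge blocks).
  def compress(sizes: Array[Long], threshold: Long): (Long, Map[Int, Long]) = {
    val nonEmpty = sizes.filter(_ > 0)
    val avg = if (nonEmpty.nonEmpty) nonEmpty.sum / nonEmpty.length else 0L
    val huge = sizes.zipWithIndex
      .collect { case (s, i) if s > threshold => i -> s }
      .toMap
    (avg, huge)
  }

  // Reported size for a block: exact if huge, average otherwise.
  def sizeOf(reduceId: Int, avg: Long, huge: Map[Int, Long]): Long =
    huge.getOrElse(reduceId, avg)

  def main(args: Array[String]): Unit = {
    val sizes = Array(0L, 1000L, 1200L, 500000L) // one skewed block
    val (avg, huge) = compress(sizes, threshold = 10000L)
    println(avg)                  // 167400: average over the 3 non-empty blocks
    println(sizeOf(3, avg, huge)) // 500000: exact, not underestimated
  }
}
```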


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinxing64/spark SPARK-20801

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18031.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18031
    


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117460170
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
         } else {
           0
         }
    +    val threshold1 = Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
    +    val threshold2 = avgSize * Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
    +    val threshold = math.max(threshold1, threshold2)
    --- End diff --
    
    @wzhfy 
    Thanks for taking the time to review this :)
    This PR is based on the discussion in #16989. The idea is to avoid underestimating big blocks in `HighlyCompressedMapStatus` while controlling the size of `HighlyCompressedMapStatus` at the same time.
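    The two-threshold scheme in the diff can be illustrated standalone (config defaults here are hypothetical; only the `math.max` combination mirrors the diff):

```scala
// The effective threshold is the max of the fixed byte threshold and
// multiplier * avgSize, mirroring `math.max(threshold1, threshold2)`
// in the diff above. The concrete numbers below are hypothetical.
object ThresholdSketch {
  def effectiveThreshold(fixedBytes: Long, multiplier: Double, avgSize: Long): Long =
    math.max(fixedBytes, (multiplier * avgSize).toLong)

  def main(args: Array[String]): Unit = {
    println(effectiveThreshold(100L << 20, 2.0, 10L << 20)) // fixed 100 MiB wins
    println(effectiveThreshold(100L << 20, 2.0, 80L << 20)) // 2x avg (160 MiB) wins
  }
}
```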




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117510051
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
         } else {
           0
         }
    +    val threshold1 = Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
    +    val threshold2 = avgSize * Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
    +    val threshold = math.max(threshold1, threshold2)
    --- End diff --
    
    Yes, this is to avoid OOM. Adjusting the value of this config requires a sophisticated user, and I agree that these two configs are difficult to tune. But with the default settings we can really avoid some OOM situations (e.g. a super huge block when skew happens).




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77058/
    Test FAILed.




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Jenkins, retest this please




[GitHub] spark issue #18031: [WIP][SPARK-20801] Record accurate size of blocks in Map...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77182 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77182/testReport)** for PR 18031 at commit [`66aa56f`](https://github.com/apache/spark/commit/66aa56fd4d23645ec1d2f0e253ea4888b4882f9d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77069 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77069/testReport)** for PR 18031 at commit [`f6670d8`](https://github.com/apache/spark/commit/f6670d8a77be52ce510e26d326b9b57c7ff4cc9b).




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77072 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77072/testReport)** for PR 18031 at commit [`bfea9f5`](https://github.com/apache/spark/commit/bfea9f59fd7587b87de0ddb4601f76786671f38a).




[GitHub] spark issue #18031: [SPARK-20801] Record accurate size of blocks in MapStatu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77166 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77166/testReport)** for PR 18031 at commit [`a313744`](https://github.com/apache/spark/commit/a313744bc84e6d869f7e66a5d704e07106d02e64).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117491182
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
         } else {
           0
         }
    +    val threshold1 = Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
    +    val threshold2 = avgSize * Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
    +    val threshold = math.max(threshold1, threshold2)
    --- End diff --
    
    But I agree that with this PR, at least we have a workaround for the OOM problem.




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117461089
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
         } else {
           0
         }
    +    val threshold1 = Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
    +    val threshold2 = avgSize * Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
    +    val threshold = math.max(threshold1, threshold2)
    --- End diff --
    
    Yes, the case you mentioned above is a really good one. But setting `spark.shuffle.accurateBlockThreshold` means we accept sacrificing accuracy for blocks smaller than `spark.shuffle.accurateBlockThreshold`. If we want it to be more accurate, we can set it larger (in this case we can set it to 50M). Thus the size of the big bucket will be accurate.
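    A toy illustration of the trade-off being discussed (the block sizes are hypothetical): with only an average recorded, one skewed block is reported far below its real size.

```scala
// Hypothetical skew scenario: with only an average recorded, a 2 GiB
// block is reported as roughly 2 MiB, so the reducer under-budgets memory.
object SkewSketch {
  // Average over all blocks, as a pure-average MapStatus would report.
  def averageSize(sizes: Array[Long]): Long =
    if (sizes.isEmpty) 0L else sizes.sum / sizes.length

  def main(args: Array[String]): Unit = {
    val sizes = Array.fill(1999)(1L << 20) :+ (2L << 30)
    println(averageSize(sizes)) // 2121793 bytes (~2 MiB) reported for every block
    println(sizes.last)         // 2147483648 bytes actually fetched for the skewed one
  }
}
```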




[GitHub] spark issue #18031: [WIP][SPARK-20801] Record accurate size of blocks in Map...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18031: [WIP][SPARK-20801] Record accurate size of blocks in Map...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77182/
    Test PASSed.




[GitHub] spark issue #18031: [WIP][SPARK-20801] Record accurate size of blocks in Map...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77182/testReport)** for PR 18031 at commit [`66aa56f`](https://github.com/apache/spark/commit/66aa56fd4d23645ec1d2f0e253ea4888b4882f9d).




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18031: [SPARK-20801] Record accurate size of blocks in MapStatu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18031: [WIP][SPARK-20801] Record accurate size of blocks in Map...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77171/
    Test PASSed.




[GitHub] spark issue #18031: [WIP][SPARK-20801] Record accurate size of blocks in Map...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18031: [SPARK-20801] Record accurate size of blocks in MapStatu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18031: [WIP][SPARK-20801] Record accurate size of blocks in Map...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18031: [WIP][SPARK-20801] Record accurate size of blocks in Map...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77169 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77169/testReport)** for PR 18031 at commit [`46fba08`](https://github.com/apache/spark/commit/46fba08c2621e4f8f762a9bc87ce51c659b247bb).




[GitHub] spark issue #18031: [SPARK-20801] Record accurate size of blocks in MapStatu...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    @cloud-fan 
    Thanks for merging!
    @mridulm @JoshRosen @viirya @HyukjinKwon @wzhfy Thanks a lot for taking the time to review this PR!




[GitHub] spark issue #18031: [SPARK-20801] Record accurate size of blocks in MapStatu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77166/
    Test FAILed.




[GitHub] spark issue #18031: [WIP][SPARK-20801] Record accurate size of blocks in Map...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77171 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77171/testReport)** for PR 18031 at commit [`ca65544`](https://github.com/apache/spark/commit/ca65544fb9d31bcccf51bf360218af9e8eafdf94).




[GitHub] spark issue #18031: [SPARK-20801] Record accurate size of blocks in MapStatu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77166 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77166/testReport)** for PR 18031 at commit [`a313744`](https://github.com/apache/spark/commit/a313744bc84e6d869f7e66a5d704e07106d02e64).




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117440204
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -121,48 +126,69 @@ private[spark] class CompressedMapStatus(
     }
     
     /**
    - * A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
    - * plus a bitmap for tracking which blocks are empty.
    + * A [[MapStatus]] implementation that stores the accurate size of huge blocks, which are larger
    + * than both spark.shuffle.accurateBlockThreshold and
    + * spark.shuffle.accurateBlockThresholdByTimesAverage * averageSize. It stores the
    + * average size of other non-empty blocks, plus a bitmap for tracking which blocks are empty.
      *
      * @param loc location where the task is being executed
      * @param numNonEmptyBlocks the number of non-empty blocks
      * @param emptyBlocks a bitmap tracking which blocks are empty
      * @param avgSize average size of the non-empty blocks
    + * @param hugeBlockSizes sizes of huge blocks by their reduceId.
      */
     private[spark] class HighlyCompressedMapStatus private (
         private[this] var loc: BlockManagerId,
         private[this] var numNonEmptyBlocks: Int,
         private[this] var emptyBlocks: RoaringBitmap,
    -    private[this] var avgSize: Long)
    +    private[this] var avgSize: Long,
    +    @transient private var hugeBlockSizes: Map[Int, Byte])
    --- End diff --
    
    Can the size of this map become large?




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77070 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77070/testReport)** for PR 18031 at commit [`970421b`](https://github.com/apache/spark/commit/970421b2a5cb2278d60403f72dc165418e4faf87).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117625287
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -121,48 +126,69 @@ private[spark] class CompressedMapStatus(
     }
     
     /**
    - * A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
    - * plus a bitmap for tracking which blocks are empty.
    + * A [[MapStatus]] implementation that stores the accurate size of huge blocks, which are larger
    + * than both spark.shuffle.accurateBlockThreshold and
    + * spark.shuffle.accurateBlockThresholdByTimesAverage * averageSize. It stores the
    + * average size of other non-empty blocks, plus a bitmap for tracking which blocks are empty.
      *
      * @param loc location where the task is being executed
      * @param numNonEmptyBlocks the number of non-empty blocks
      * @param emptyBlocks a bitmap tracking which blocks are empty
      * @param avgSize average size of the non-empty blocks
    + * @param hugeBlockSizes sizes of huge blocks by their reduceId.
      */
     private[spark] class HighlyCompressedMapStatus private (
         private[this] var loc: BlockManagerId,
         private[this] var numNonEmptyBlocks: Int,
         private[this] var emptyBlocks: RoaringBitmap,
    -    private[this] var avgSize: Long)
    +    private[this] var avgSize: Long,
    +    @transient private var hugeBlockSizes: Map[Int, Byte])
    --- End diff --
    
    @viirya We had this discussion before in the earlier PR (which this is split from).
    maxSizeInFlight meant to control how much data can be fetched in parallel and tuned based on network throughput and not memory (though currently, they are directly dependent due to implementation detail).
    In reality, it is fairly small compared to what can be held in memory (48mb is default iirc) - since the memory and IO subsystems have different characteristics, using same config to control behavior in both will lead to suboptimal behavior (for example, large memory systems where large amounts can be held in memory, but network bandwidth is not propotionally higher).




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77069 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77069/testReport)** for PR 18031 at commit [`f6670d8`](https://github.com/apache/spark/commit/f6670d8a77be52ce510e26d326b9b57c7ff4cc9b).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77056 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77056/testReport)** for PR 18031 at commit [`d5b8a21`](https://github.com/apache/spark/commit/d5b8a211affc3cbe26c1920efe54b8407c863ada).




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77056/
    Test FAILed.




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117621035
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -121,48 +126,69 @@ private[spark] class CompressedMapStatus(
     }
     
     /**
    - * A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
    - * plus a bitmap for tracking which blocks are empty.
    + * A [[MapStatus]] implementation that stores the accurate size of huge blocks, which are larger
    + * than both spark.shuffle.accurateBlockThreshold and
    + * spark.shuffle.accurateBlockThresholdByTimesAverage * averageSize. It stores the
    + * average size of other non-empty blocks, plus a bitmap for tracking which blocks are empty.
      *
      * @param loc location where the task is being executed
      * @param numNonEmptyBlocks the number of non-empty blocks
      * @param emptyBlocks a bitmap tracking which blocks are empty
      * @param avgSize average size of the non-empty blocks
    + * @param hugeBlockSizes sizes of huge blocks by their reduceId.
      */
     private[spark] class HighlyCompressedMapStatus private (
         private[this] var loc: BlockManagerId,
         private[this] var numNonEmptyBlocks: Int,
         private[this] var emptyBlocks: RoaringBitmap,
    -    private[this] var avgSize: Long)
    +    private[this] var avgSize: Long,
    +    @transient private var hugeBlockSizes: Map[Int, Byte])
    --- End diff --
    
    The control of `spark.reducer.maxSizeInFlight` is not a big problem. It seems to me that any block considered huge already breaks `maxSizeInFlight` and can't be fetched in parallel. We don't actually need to know the accurate size of huge blocks; we just need to know a block is huge, i.e. larger than `maxSizeInFlight`.




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117605781
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
         } else {
           0
         }
    +    val threshold1 = Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
    +    val threshold2 = avgSize * Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
    +    val threshold = math.max(threshold1, threshold2)
    +    val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]()
    +    if (numNonEmptyBlocks > 0) {
    +      i = 0
    +      while (i < totalNumBlocks) {
    +        if (uncompressedSizes(i) > threshold) {
    +          hugeBlockSizesArray += Tuple2(i, MapStatus.compressSize(uncompressedSizes(i)))
    +
    +        }
    +        i += 1
    +      }
    +    }
         emptyBlocks.trim()
         emptyBlocks.runOptimize()
    -    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize)
    +    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize,
    --- End diff --
    
    It seems to me that this https://github.com/apache/spark/pull/16989/files#r117174623  is a good comment to have accurate size for the smaller blocks.




[GitHub] spark issue #18031: [WIP][SPARK-20801] Record accurate size of blocks in Map...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77169/
    Test FAILed.




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77069/
    Test FAILed.




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117610423
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -121,48 +126,69 @@ private[spark] class CompressedMapStatus(
     }
     
     /**
    - * A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
    - * plus a bitmap for tracking which blocks are empty.
    + * A [[MapStatus]] implementation that stores the accurate size of huge blocks, which are larger
    + * than both spark.shuffle.accurateBlockThreshold and
    + * spark.shuffle.accurateBlockThresholdByTimesAverage * averageSize. It stores the
    + * average size of other non-empty blocks, plus a bitmap for tracking which blocks are empty.
      *
      * @param loc location where the task is being executed
      * @param numNonEmptyBlocks the number of non-empty blocks
      * @param emptyBlocks a bitmap tracking which blocks are empty
      * @param avgSize average size of the non-empty blocks
    + * @param hugeBlockSizes sizes of huge blocks by their reduceId.
      */
     private[spark] class HighlyCompressedMapStatus private (
         private[this] var loc: BlockManagerId,
         private[this] var numNonEmptyBlocks: Int,
         private[this] var emptyBlocks: RoaringBitmap,
    -    private[this] var avgSize: Long)
    +    private[this] var avgSize: Long,
    +    @transient private var hugeBlockSizes: Map[Int, Byte])
    --- End diff --
    
    Yes, I think it makes sense to add a bitmap for huge blocks, but I'm a little hesitant. I still prefer to keep `hugeBlockSizes` independent of the upper-level logic. In addition, accurate block sizes also have a positive effect on pending requests (e.g. `spark.reducer.maxSizeInFlight` can control the size of pending requests better).
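
    The point about `spark.reducer.maxSizeInFlight` can be sketched as follows. This is not Spark's actual ShuffleBlockFetcherIterator; the batching function, block sizes, and target size are all hypothetical, and only illustrate why under-reported sizes defeat the in-flight budget.

    ```scala
    // Sketch: group fetch requests until an estimated-size target is reached.
    import scala.collection.mutable.ArrayBuffer

    def batchRequests(sizes: Seq[Long], targetSize: Long): Seq[Seq[Long]] = {
      val batches = ArrayBuffer[Seq[Long]]()
      var current = ArrayBuffer[Long]()
      var currentBytes = 0L
      for (size <- sizes) {
        current += size
        currentBytes += size
        if (currentBytes >= targetSize) { // close the batch once the estimate hits the target
          batches += current.toSeq
          current = ArrayBuffer[Long]()
          currentBytes = 0L
        }
      }
      if (current.nonEmpty) batches += current.toSeq
      batches.toSeq
    }

    val mb = 1L << 20
    // Five blocks, all reported at the 2 MB average; one is actually 200 MB.
    val reported = Seq(2 * mb, 2 * mb, 2 * mb, 2 * mb, 2 * mb)
    val actual   = Seq(2 * mb, 200 * mb, 2 * mb, 2 * mb, 2 * mb)
    val target   = 10 * mb
    val firstBatch = batchRequests(reported, target).head
    // Batching by the reported sizes puts all five blocks into one request,
    // but the true bytes in flight for that request far exceed the 10 MB target.
    println(firstBatch.length)   // 5
    println(actual.sum / mb)     // 208
    ```

    With accurate sizes for the huge block, the batching would isolate it into its own request instead.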




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117663622
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -178,11 +203,20 @@ private[spark] object HighlyCompressedMapStatus {
         // we expect that there will be far fewer of them, so we will perform fewer bitmap insertions.
         val emptyBlocks = new RoaringBitmap()
         val totalNumBlocks = uncompressedSizes.length
    +    val threshold = Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
    +    val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]()
         while (i < totalNumBlocks) {
           var size = uncompressedSizes(i)
           if (size > 0) {
             numNonEmptyBlocks += 1
    -        totalSize += size
    +        // Remove the huge blocks from the calculation for average size and have accurate size for
    +        // smaller blocks.
    +        if (size > threshold) {
    +          totalSize += size
    --- End diff --
    
    this should be put in the else branch
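
    The suggested fix can be sketched as below. The sizes and threshold are hypothetical stand-ins (not the actual patch): huge blocks get their accurate size recorded and are excluded from the average, so only the smaller blocks feed both the numerator and the denominator.

    ```scala
    // Sketch of accumulating totalSize in the else branch, per the review comment.
    import scala.collection.mutable

    val mb = 1L << 20
    val uncompressedSizes = Array(200 * mb, 1 * mb, 2 * mb, 0L, 150 * mb, 3 * mb)
    val threshold = 100 * mb  // stands in for spark.shuffle.accurateBlockThreshold

    var totalSmallSize = 0L
    var numSmallBlocks = 0
    val hugeBlockSizes = mutable.Map[Int, Long]()
    for ((size, i) <- uncompressedSizes.zipWithIndex if size > 0) {
      if (size > threshold) {
        hugeBlockSizes(i) = size   // accurate size kept for huge blocks
      } else {
        totalSmallSize += size     // the fix: accumulate only in the else branch
        numSmallBlocks += 1
      }
    }
    val avgSize = if (numSmallBlocks > 0) totalSmallSize / numSmallBlocks else 0L
    println(avgSize / mb)  // 2: no longer inflated by the 200 MB and 150 MB blocks
    ```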




[GitHub] spark issue #18031: [SPARK-20801] Record accurate size of blocks in MapStatu...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    In the current change:
    1. There is only one config: spark.shuffle.accurateBlockThreshold.
    2. Huge blocks are removed from the numerator in the calculation of the average size.




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117633506
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
         } else {
           0
         }
    +    val threshold1 = Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
    +    val threshold2 = avgSize * Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
    +    val threshold = math.max(threshold1, threshold2)
    +    val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]()
    +    if (numNonEmptyBlocks > 0) {
    +      i = 0
    +      while (i < totalNumBlocks) {
    +        if (uncompressedSizes(i) > threshold) {
    +          hugeBlockSizesArray += Tuple2(i, MapStatus.compressSize(uncompressedSizes(i)))
    +
    +        }
    +        i += 1
    +      }
    +    }
         emptyBlocks.trim()
         emptyBlocks.runOptimize()
    -    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize)
    +    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize,
    --- End diff --
    
    +1 for one flag, let's only keep `spark.shuffle.accurateBlockThreshold`
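
    The quoted diff stores each huge block's size as a single Byte via MapStatus.compressSize. A sketch of that log-scale encoding (assumed to follow MapStatus's log-base-1.1 scheme; treat the exact constants as an assumption):

    ```scala
    // Sketch: encode a Long size into one Byte with ~10% relative error.
    def compressSize(size: Long): Byte = {
      if (size == 0) 0
      else if (size <= 1L) 1
      else math.min(255, math.ceil(math.log(size.toDouble) / math.log(1.1)).toInt).toByte
    }

    def decompressSize(compressedSize: Byte): Long = {
      if (compressedSize == 0) 0
      else math.pow(1.1, compressedSize & 0xFF).toLong
    }

    val size = 150L * 1024 * 1024
    // Because the exponent is rounded up, the round trip never under-reports
    // (for sizes within the representable range), which is the safe direction
    // for memory accounting.
    println(decompressSize(compressSize(size)) >= size)  // true
    ```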




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77070 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77070/testReport)** for PR 18031 at commit [`970421b`](https://github.com/apache/spark/commit/970421b2a5cb2278d60403f72dc165418e4faf87).




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117620188
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
         } else {
           0
         }
    +    val threshold1 = Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
    +    val threshold2 = avgSize * Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
    +    val threshold = math.max(threshold1, threshold2)
    +    val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]()
    +    if (numNonEmptyBlocks > 0) {
    +      i = 0
    +      while (i < totalNumBlocks) {
    +        if (uncompressedSizes(i) > threshold) {
    +          hugeBlockSizesArray += Tuple2(i, MapStatus.compressSize(uncompressedSizes(i)))
    +
    +        }
    +        i += 1
    +      }
    +    }
         emptyBlocks.trim()
         emptyBlocks.runOptimize()
    -    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize)
    +    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize,
    --- End diff --
    
    With the default values (spark.shuffle.accurateBlockThreshold=100M and spark.shuffle.accurateBlockThresholdByTimesAverage=2), yes.
    But the user can make it stricter by setting spark.shuffle.accurateBlockThreshold=0 and spark.shuffle.accurateBlockThresholdByTimesAverage=1.




[GitHub] spark issue #18031: [SPARK-20801] Record accurate size of blocks in MapStatu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77072/
    Test PASSed.




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    @HyukjinKwon 
    Thank you so much! Really helpful 👍




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    I am trying to give the user a way to control memory strictly so that no blocks are underestimated (by setting spark.shuffle.accurateBlockThreshold=0 and spark.shuffle.accurateBlockThresholdByTimesAverage=1). I'm a little hesitant to remove the huge blocks from the numerator in the calculation of the average size.




[GitHub] spark issue #18031: [SPARK-20801] Record accurate size of blocks in MapStatu...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    thanks, merging to master/2.2!




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117443085
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
         } else {
           0
         }
    +    val threshold1 = Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
    +    val threshold2 = avgSize * Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
    +    val threshold = math.max(threshold1, threshold2)
    --- End diff --
    
    Suppose each map task produces one 90MB bucket and many small buckets (skewed data); then avgSize is very small and the threshold would be 100MB. If the number of map tasks is large, OOM can still happen, right?




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117663655
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -178,11 +203,20 @@ private[spark] object HighlyCompressedMapStatus {
         // we expect that there will be far fewer of them, so we will perform fewer bitmap insertions.
         val emptyBlocks = new RoaringBitmap()
         val totalNumBlocks = uncompressedSizes.length
    +    val threshold = Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
    +    val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]()
         while (i < totalNumBlocks) {
           var size = uncompressedSizes(i)
    --- End diff --
    
    Not related, but why is this a `var`?




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117610528
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
         } else {
           0
         }
    +    val threshold1 = Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
    +    val threshold2 = avgSize * Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
    +    val threshold = math.max(threshold1, threshold2)
    +    val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]()
    +    if (numNonEmptyBlocks > 0) {
    +      i = 0
    +      while (i < totalNumBlocks) {
    +        if (uncompressedSizes(i) > threshold) {
    +          hugeBlockSizesArray += Tuple2(i, MapStatus.compressSize(uncompressedSizes(i)))
    +
    +        }
    +        i += 1
    +      }
    +    }
         emptyBlocks.trim()
         emptyBlocks.runOptimize()
    -    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize)
    +    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize,
    --- End diff --
    
    In the current change, if almost all blocks are huge (that is to say, it is not a skew case), we won't mark the blocks as huge. Then we will still fetch them into memory?




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18031




[GitHub] spark issue #18031: [SPARK-20801] Record accurate size of blocks in MapStatu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77058 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77058/testReport)** for PR 18031 at commit [`d5b8a21`](https://github.com/apache/spark/commit/d5b8a211affc3cbe26c1920efe54b8407c863ada).




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77070/
    Test FAILed.




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117610285
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
         } else {
           0
         }
    +    val threshold1 = Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
    +    val threshold2 = avgSize * Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
    +    val threshold = math.max(threshold1, threshold2)
    +    val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]()
    +    if (numNonEmptyBlocks > 0) {
    +      i = 0
    +      while (i < totalNumBlocks) {
    +        if (uncompressedSizes(i) > threshold) {
    +          hugeBlockSizesArray += Tuple2(i, MapStatus.compressSize(uncompressedSizes(i)))
    +
    +        }
    +        i += 1
    +      }
    +    }
         emptyBlocks.trim()
         emptyBlocks.runOptimize()
    -    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize)
    +    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize,
    --- End diff --
    
    @viirya Thanks a lot for taking the time to look into this PR :)
    
    > remove the huge blocks from the numerator in that calculation so that you more accurately size the smaller blocks
    
    Yes, I think it is a really good idea to get an accurate size for the smaller blocks. But since I'm proposing two configs (`spark.shuffle.accurateBlockThreshold` and `spark.shuffle.accurateBlockThresholdByTimesAverage`) in the current change, I would have to compute the average twice: 1) the average including the huge blocks, so that I can filter them out; 2) the average excluding the huge blocks, so that I can get an accurate size for the smaller blocks. A little complicated, right? How about removing `spark.shuffle.accurateBlockThresholdByTimesAverage`? That would simplify the logic. @cloud-fan Any ideas about this?
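
    The two-pass computation described above could be sketched like this (plain Scala; the object and parameter names are illustrative only, not the actual Spark implementation):

    ```scala
    object TwoPassSketch {
      /**
       * Pass 1: average over all non-empty blocks, used only to find huge ones.
       * Pass 2: average over the remaining smaller blocks, for better accuracy.
       * `fixedThreshold` and `timesAvg` stand in for the two proposed configs.
       */
      def split(sizes: Array[Long], fixedThreshold: Long, timesAvg: Long): (Map[Int, Long], Long) = {
        val nonEmpty = sizes.filter(_ > 0)
        val avgAll = if (nonEmpty.nonEmpty) nonEmpty.sum / nonEmpty.length else 0L
        val threshold = math.max(fixedThreshold, avgAll * timesAvg)
        // Huge blocks keep their exact size, keyed by reduce id.
        val huge = sizes.zipWithIndex.collect { case (s, i) if s > threshold => i -> s }.toMap
        // The remaining non-empty blocks are averaged without the huge outliers.
        val small = sizes.zipWithIndex.collect { case (s, i) if s > 0 && !huge.contains(i) => s }
        val avgSmall = if (small.nonEmpty) small.sum / small.length else 0L
        (huge, avgSmall)
      }
    }
    ```

    For example, with `sizes = Array(0, 10, 10, 10, 970)`, `fixedThreshold = 100` and `timesAvg = 2`, pass 1 gives an average of 250 and a threshold of 500, so block 4 is recorded exactly and the average for the small blocks drops from 250 to 10.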




[GitHub] spark issue #18031: [SPARK-20801] Record accurate size of blocks in MapStatu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77165 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77165/testReport)** for PR 18031 at commit [`94fa7bb`](https://github.com/apache/spark/commit/94fa7bb3dbcf50126afbed5c38609b93ae91f96e).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117620949
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
         } else {
           0
         }
    +    val threshold1 = Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
    +    val threshold2 = avgSize * Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
    +    val threshold = math.max(threshold1, threshold2)
    +    val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]()
    +    if (numNonEmptyBlocks > 0) {
    +      i = 0
    +      while (i < totalNumBlocks) {
    +        if (uncompressedSizes(i) > threshold) {
    +          hugeBlockSizesArray += Tuple2(i, MapStatus.compressSize(uncompressedSizes(i)))
    +
    +        }
    +        i += 1
    +      }
    +    }
         emptyBlocks.trim()
         emptyBlocks.runOptimize()
    -    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize)
    +    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize,
    --- End diff --
    
    I'd tend to have just one flag and simplify the configuration.




[GitHub] spark pull request #18031: Record accurate size of blocks in MapStatus when ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117385321
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -121,48 +126,69 @@ private[spark] class CompressedMapStatus(
     }
     
     /**
    - * A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
    - * plus a bitmap for tracking which blocks are empty.
    + * A [[MapStatus]] implementation that stores the accurate size of huge blocks, which are larger
    + * than both [[config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD]] and
    + * [[config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE]] * averageSize. It stores the
    --- End diff --
    
    It looks like documentation generation for Javadoc 8 is failing due to these links -
    
    ```
    [error] /home/jenkins/workspace/SparkPullRequestBuilder@2/core/target/java/org/apache/spark/scheduler/HighlyCompressedMapStatus.java:4: error: reference not found
    [error]  * than both {@link config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD} and
    [error]                     ^
    [error] /home/jenkins/workspace/SparkPullRequestBuilder@2/core/target/java/org/apache/spark/scheduler/HighlyCompressedMapStatus.java:5: error: reference not found
    [error]  * {@link config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE} * averageSize. It stores the
    [error]           ^
    [error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/target/java/org/apache/spark/sql/functions.java:2996: error: invalid uri: "http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html Customizing Formats"
    [error]    * @see <a href="http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html Customizing Formats"/>
    [error]                                                                     ^
    ```
    
    Probably we should wrap it in `` `...` `` as I did before - https://github.com/apache/spark/pull/16013 - or find a way to make the link resolve properly.
    
    The other errors seem spurious. Please refer to my observation - https://github.com/apache/spark/pull/17389#issuecomment-288438704
    
    (I think we should fix it or document ^ somewhere at least).
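
    For reference, the wrapped form suggested here would look roughly like this (sketch only):

    ```scala
    /**
     * A [[MapStatus]] implementation that stores the accurate size of huge blocks,
     * which are larger than both `config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD` and
     * `config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE` * averageSize.
     * Backticks render as monospace in both Scaladoc and the generated Javadoc,
     * without producing a {@code @link} reference that Javadoc 8 cannot resolve.
     */
    ```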





[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117605849
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
         } else {
           0
         }
    +    val threshold1 = Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
    +    val threshold2 = avgSize * Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
    +    val threshold = math.max(threshold1, threshold2)
    +    val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]()
    +    if (numNonEmptyBlocks > 0) {
    +      i = 0
    +      while (i < totalNumBlocks) {
    +        if (uncompressedSizes(i) > threshold) {
    +          hugeBlockSizesArray += Tuple2(i, MapStatus.compressSize(uncompressedSizes(i)))
    +
    +        }
    +        i += 1
    +      }
    +    }
         emptyBlocks.trim()
         emptyBlocks.runOptimize()
    -    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize)
    +    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize,
    --- End diff --
    
    +1




[GitHub] spark issue #18031: [SPARK-20801] Record accurate size of blocks in MapStatu...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    For other reviewers: this is kind of a stability fix, so I backported it to branch-2.2




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Gentle ping to @JoshRosen @cloud-fan @mridulm




[GitHub] spark issue #18031: [SPARK-20801] Record accurate size of blocks in MapStatu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77165 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77165/testReport)** for PR 18031 at commit [`94fa7bb`](https://github.com/apache/spark/commit/94fa7bb3dbcf50126afbed5c38609b93ae91f96e).




[GitHub] spark pull request #18031: Record accurate size of blocks in MapStatus when ...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117293425
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
         } else {
           0
         }
    +    val threshold1 = Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
    +    val threshold2 = avgSize * Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
    +    val threshold = math.max(threshold1, threshold2)
    +    val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]()
    +    if (numNonEmptyBlocks > 0) {
    +      i = 0
    +      while (i < totalNumBlocks) {
    +        if (uncompressedSizes(i) > threshold) {
    +          hugeBlockSizesArray += Tuple2(i, MapStatus.compressSize(uncompressedSizes(i)))
    +
    +        }
    +        i += 1
    +      }
    --- End diff --
    
    Sorry for bringing in another while loop here. I have to calculate the average size first, then filter out the huge blocks. I don't see a better implementation that merges the two while loops into one :(




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117606196
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -121,48 +126,69 @@ private[spark] class CompressedMapStatus(
     }
     
     /**
    - * A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
    - * plus a bitmap for tracking which blocks are empty.
    + * A [[MapStatus]] implementation that stores the accurate size of huge blocks, which are larger
    + * than both spark.shuffle.accurateBlockThreshold and
    + * spark.shuffle.accurateBlockThresholdByTimesAverage * averageSize. It stores the
    + * average size of other non-empty blocks, plus a bitmap for tracking which blocks are empty.
      *
      * @param loc location where the task is being executed
      * @param numNonEmptyBlocks the number of non-empty blocks
      * @param emptyBlocks a bitmap tracking which blocks are empty
      * @param avgSize average size of the non-empty blocks
    + * @param hugeBlockSizes sizes of huge blocks by their reduceId.
      */
     private[spark] class HighlyCompressedMapStatus private (
         private[this] var loc: BlockManagerId,
         private[this] var numNonEmptyBlocks: Int,
         private[this] var emptyBlocks: RoaringBitmap,
    -    private[this] var avgSize: Long)
    +    private[this] var avgSize: Long,
    +    @transient private var hugeBlockSizes: Map[Int, Byte])
    --- End diff --
    
    I went through part of the code in #16989. It seems to me that if what we want is to know which shuffle requests should go to disk instead of memory, do we really need to record the mapping from block ids to accurate sizes?
    
    A simpler approach could be to add a bitmap for huge blocks, and then we could simply fetch those blocks to disk. Another benefit of doing this is that it avoids introducing another config, `REDUCER_MAX_REQ_SIZE_SHUFFLE_TO_MEM`, to decide which blocks go to disk.
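
    A minimal sketch of that bitmap idea (using `scala.collection.immutable.BitSet` in place of Spark's `RoaringBitmap`, with made-up names):

    ```scala
    import scala.collection.immutable.BitSet

    // Instead of a Map[Int, Byte] of accurate sizes, track only WHICH blocks
    // are huge; the reducer can then fetch exactly those blocks to disk.
    final case class BitmapSketchStatus(avgSize: Long, hugeBlocks: BitSet) {
      def fetchToDisk(reduceId: Int): Boolean = hugeBlocks.contains(reduceId)
    }

    object BitmapSketchStatus {
      def apply(sizes: Array[Long], threshold: Long): BitmapSketchStatus = {
        // Bitmap of reduce ids whose blocks exceed the threshold.
        val huge = sizes.indices.filter(i => sizes(i) > threshold).foldLeft(BitSet.empty)(_ + _)
        // Average only over the non-empty, non-huge blocks.
        val small = sizes.filter(s => s > 0 && s <= threshold)
        val avg = if (small.nonEmpty) small.sum / small.length else 0L
        BitmapSketchStatus(avg, huge)
      }
    }
    ```

    The tradeoff is that the reducer no longer knows how large each huge block actually is, only that it should go to disk.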




[GitHub] spark issue #18031: [WIP][SPARK-20801] Record accurate size of blocks in Map...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77171 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77171/testReport)** for PR 18031 at commit [`ca65544`](https://github.com/apache/spark/commit/ca65544fb9d31bcccf51bf360218af9e8eafdf94).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117440029
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
         } else {
           0
         }
    +    val threshold1 = Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
    +    val threshold2 = avgSize * Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
    +    val threshold = math.max(threshold1, threshold2)
    --- End diff --
    
    Just for curiosity: is there any reason we compute threshold in this way?




[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18031#discussion_r117489904
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
         } else {
           0
         }
    +    val threshold1 = Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
    +    val threshold2 = avgSize * Option(SparkEnv.get)
    +      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
    +      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
    +    val threshold = math.max(threshold1, threshold2)
    --- End diff --
    
    Yes, but my point is that these two configs are difficult for users to set. It seems we would still need to adjust them case by case.




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    To resolve the comments in https://github.com/apache/spark/pull/16989 :
    >minimum size before we consider something a large block : if average is 10kb, and some blocks are > 20kb, spilling them to disk would be highly suboptimal.
    
    >One edge-case to consider is the situation where every shuffle block is just over this threshold: in this case HighlyCompressedMapStatus won't really be doing any compression.
    
    I propose two configs:
    `spark.shuffle.accurateBlockThreshold` and `spark.shuffle.accurateBlockThresholdByTimesAverage`; the sizes of blocks above both thresholds will be recorded accurately.
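
    "Above both" means the effective threshold is the larger of the two values, e.g. (illustrative only):

    ```scala
    // A block's size is recorded accurately only if it exceeds BOTH the fixed
    // byte threshold and timesAvg * avgSize, i.e. the max of the two.
    def effectiveThreshold(fixedBytes: Long, timesAvg: Long, avgSize: Long): Long =
      math.max(fixedBytes, timesAvg * avgSize)
    ```

    So with a 100-unit fixed threshold and `timesAvg = 2`: if the average size is 10 the fixed threshold wins (100), and if the average size is 250 the multiple wins (500). This addresses both quoted concerns: slightly-above-average blocks are not all marked huge, and a uniformly large map output does not defeat the compression.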




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18031: [SPARK-20801] Record accurate size of blocks in MapStatu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77072 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77072/testReport)** for PR 18031 at commit [`bfea9f5`](https://github.com/apache/spark/commit/bfea9f59fd7587b87de0ddb4601f76786671f38a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77056 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77056/testReport)** for PR 18031 at commit [`d5b8a21`](https://github.com/apache/spark/commit/d5b8a211affc3cbe26c1920efe54b8407c863ada).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18031: [SPARK-20801] Record accurate size of blocks in MapStatu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77165/
    Test FAILed.




[GitHub] spark issue #18031: Record accurate size of blocks in MapStatus when it's ab...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18031
  
    **[Test build #77058 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77058/testReport)** for PR 18031 at commit [`d5b8a21`](https://github.com/apache/spark/commit/d5b8a211affc3cbe26c1920efe54b8407c863ada).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.

