Posted to reviews@spark.apache.org by jinxing64 <gi...@git.apache.org> on 2017/03/13 13:45:40 UTC

[GitHub] spark pull request #17276: [WIP][SPARK-19937] Collect metrics of block sizes...

GitHub user jinxing64 opened a pull request:

    https://github.com/apache/spark/pull/17276

    [WIP][SPARK-19937] Collect metrics of block sizes when shuffle.

    ## What changes were proposed in this pull request?
    
    Metrics of block sizes during shuffle should be collected for later analysis. They are helpful when diagnosing skew or OOM, which can still happen even though `maxBytesInFlight` is set.
    
    This pr proposes to:
    1. Store the distribution of sizes in `MapStatus` and count the block sizes in ranges
    [0, 1k), [1k, 10k), [10k, 100k), [100k, 1m), [1m, 10m), [10m, 100m), [100m, 1g), [1g, 10g), [10g, Long.MaxValue).
    2. Record the inaccuracy of block sizes, because `HighlyCompressedMapStatus` is returned and only the average size is recorded when the number of blocks exceeds 2000.
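
To make item 1 concrete, here is a minimal Java sketch of counting block sizes into the nine ranges listed above. It is not the code from this pull request: the class name `BlockSizeHistogram` and the method `count` are hypothetical, and the sketch assumes that "k", "m" and "g" mean powers of 1024, which the description does not state.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: counts block sizes per range.
final class BlockSizeHistogram {
  // Assumption: k/m/g are binary units (1024-based).
  private static final long K = 1024L;
  private static final long M = K * K;
  private static final long G = M * K;

  // Exclusive upper bounds of the first eight ranges.
  // The ninth range, [10g, Long.MaxValue), is open-ended.
  private static final long[] UPPER_BOUNDS = {
    K, 10 * K, 100 * K, M, 10 * M, 100 * M, G, 10 * G
  };

  /** Returns nine counters, one per range, in ascending order of size. */
  static long[] count(long[] blockSizes) {
    long[] counts = new long[UPPER_BOUNDS.length + 1];
    for (long size : blockSizes) {
      int bucket = 0;
      // Advance until the size is below the bucket's exclusive upper bound.
      while (bucket < UPPER_BOUNDS.length && size >= UPPER_BOUNDS[bucket]) {
        bucket++;
      }
      counts[bucket]++;
    }
    return counts;
  }

  public static void main(String[] args) {
    long[] sizes = {0L, 512L, 1024L, 5L * M, 20L * G};
    // prints [2, 1, 0, 0, 1, 0, 0, 0, 1]
    System.out.println(Arrays.toString(count(sizes)));
  }
}
```

A linear scan over eight bounds keeps the sketch short; with a fixed, small number of ranges a binary search would add little.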
    
    ## How was this patch tested?
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinxing64/spark SPARK-19937

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17276.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17276
    
----
commit 430ec95291393f49096fc07df98c15e700f23e8c
Author: jinxing <ji...@126.com>
Date:   2017-03-13T13:35:51Z

    [SPARK-19937] Collect metrics of block sizes when shuffle.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75145/
    Test FAILed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75220/
    Test PASSed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74515 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74515/testReport)** for PR 17276 at commit [`e2e56d3`](https://github.com/apache/spark/commit/e2e56d30319634c0307e4b916f79aa29ed0239a0).




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    @jinxing64 
    If the intent behind these metrics is to help with SPARK-19659, it would be good to add them either as part of SPARK-19659 or subsequently (once the feature is merged).
    This ensures that the metrics added are actually relevant to the existing Spark core, and not to a future expected evolution of the code. For example, the review of SPARK-19659 might significantly change its design or implementation, making some of these metrics irrelevant or requiring other, more informative metrics to be introduced.
    
    I am unclear about the intention, by the way: do you expect shuffle reads to be informed by metrics from the mapper side? I probably got that wrong.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74509 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74509/testReport)** for PR 17276 at commit [`d0932ed`](https://github.com/apache/spark/commit/d0932ed6ba0dfe463d1e77e9cae79351d233effb).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74509/
    Test FAILed.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75163 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75163/testReport)** for PR 17276 at commit [`0efa348`](https://github.com/apache/spark/commit/0efa348d1beccf76da702d06ee77c8aac2aebb12).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    @mridulm 
    Sorry for the late reply. I opened the PR for SPARK-19659 (https://github.com/apache/spark/pull/16989) and made these two PRs independent. Basically, this PR is meant to evaluate the performance (blocks are shuffled to disk) and stability (sizes in `MapStatus` are inaccurate and OOM can happen) of the implementation proposed in SPARK-19659.
    I'd be very thankful if you have time to comment on these two PRs.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74448/
    Test FAILed.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74938 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74938/testReport)** for PR 17276 at commit [`e6091b6`](https://github.com/apache/spark/commit/e6091b69271ade48a4eaaf3c770203f35ab06001).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #17276: [SPARK-19937] Collect metrics of block sizes when...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17276#discussion_r108103956
  
    --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java ---
    @@ -169,6 +173,36 @@ public void write(Iterator<Product2<K, V>> records) throws IOException {
           }
         }
         mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
    +    if (mapStatus instanceof HighlyCompressedMapStatus) {
    +      HighlyCompressedMapStatus hc = (HighlyCompressedMapStatus) mapStatus;
    +      long underestimatedBlocksSize = 0L;
    +      for (int i = 0; i < partitionLengths.length; i++) {
    +        if (partitionLengths[i] > mapStatus.getSizeForBlock(i)) {
    +          underestimatedBlocksSize += partitionLengths[i];
    +        }
    +      }
    +      writeMetrics.incUnderestimatedBlocksSize(underestimatedBlocksSize);
    +      if (logger.isDebugEnabled() && partitionLengths.length > 0) {
    +        int underestimatedBlocksNum = 0;
    +        // Distribution of sizes in MapStatus.
    +        double[] cp = new double[partitionLengths.length];
    +        for (int i = 0; i < partitionLengths.length; i++) {
    +          cp[i] = partitionLengths[i];
    +          if (partitionLengths[i] > mapStatus.getSizeForBlock(i)) {
    +            underestimatedBlocksNum++;
    +          }
    +        }
    +        Distribution distribution = new Distribution(cp, 0, cp.length);
    +        double[] probabilities = {0.0, 0.25, 0.5, 0.75, 1.0};
    +        String distributionStr = distribution.getQuantiles(probabilities).mkString(", ");
    +        logger.debug("For task {}.{} in stage {} (TID {}), the block sizes in MapStatus are " +
    +          "inaccurate (average is {}, {} blocks underestimated, size of underestimated is {})," +
    +          " distribution at the given probabilities(0, 0.25, 0.5, 0.75, 1.0) is {}.",
    +          taskContext.partitionId(), taskContext.attemptNumber(), taskContext.stageId(),
    +          taskContext.taskAttemptId(), hc.getAvgSize(),
    +          underestimatedBlocksNum, underestimatedBlocksSize, distributionStr);
    +      }
    +    }
    --- End diff --
    
    The value is not accurate: it is a log 1.1 'compression' which converts the long size to a byte and caps the value at 255.
    
    So there are two errors introduced: it over-estimates the actual block size when the compressed value is < 255 [1] (which is something this PR currently ignores), and when the block size goes above 34k MB or so, it under-estimates the block size (which is higher than what Spark currently supports due to the 2G limitation).
    
    
    [1] I did not realize it always over-estimates; if the current PR is targeting only blocks which are under-estimated, I would agree that not handling `CompressedMapStatus` for the time being might be OK, though it would be good to add a comment to that effect on 'why' we don't need to handle it.
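
The two errors described in this comment can be reproduced with a small Java sketch of a base-1.1 logarithmic size encoding. It is written from the description in the comment, not copied from Spark: the class name `LogSizeCodec` and the special cases for sizes 0 and 1 are assumptions made for illustration.

```java
// Hypothetical stand-in for a one-byte, base-1.1 logarithmic size encoding.
final class LogSizeCodec {
  // Base of the logarithm; 1.1 is the figure quoted in the review comment.
  private static final double LOG_BASE = 1.1;

  /** Encodes a size as one byte: ceil(log_1.1(size)), capped at 255. */
  static byte compress(long size) {
    if (size == 0) {
      return 0;
    }
    if (size <= 1L) {
      return 1; // assumption: the smallest non-empty block maps to 1
    }
    int exponent = (int) Math.ceil(Math.log(size) / Math.log(LOG_BASE));
    return (byte) Math.min(255, exponent);
  }

  /** Decodes the byte back to an approximate size. */
  static long decompress(byte compressed) {
    if (compressed == 0) {
      return 0;
    }
    // Mask with 0xFF so that the byte is read as an unsigned value 1..255.
    return (long) Math.pow(LOG_BASE, compressed & 0xFF);
  }

  public static void main(String[] args) {
    long small = 1000L;
    long huge = 100L * 1024 * 1024 * 1024; // 100 GiB, far above the cap
    System.out.println(small + " -> " + decompress(compress(small)));
    System.out.println(huge + " -> " + decompress(compress(huge)));
    System.out.println("cap -> " + decompress((byte) 255));
  }
}
```

Because the exponent is rounded up, a decoded size below the cap comes out at or above the real size, by up to roughly 10%. Once the exponent is clamped to 255, the decoded size stays at about 1.1^255 bytes (roughly 3.6e10, in line with the "34k MB or so" above), so anything larger is under-estimated.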





[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74607/
    Test FAILed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75220 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75220/testReport)** for PR 17276 at commit [`c26ea56`](https://github.com/apache/spark/commit/c26ea561579a76f3370fc2ebc07314e788dc32a1).




[GitHub] spark pull request #17276: [WIP][SPARK-19937] Collect metrics of block sizes...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 closed the pull request at:

    https://github.com/apache/spark/pull/17276




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74981 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74981/testReport)** for PR 17276 at commit [`d86f985`](https://github.com/apache/spark/commit/d86f985fb63204715189c1e1a964f127c396df59).




[GitHub] spark pull request #17276: [SPARK-19937] Collect metrics of block sizes when...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17276#discussion_r107157788
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -164,6 +164,8 @@ private[spark] class HighlyCompressedMapStatus private (
         emptyBlocks.readExternal(in)
         avgSize = in.readLong()
       }
    +
    +  def getAvgSize(): Long = avgSize
    --- End diff --
    
    scala convention -- simple getters don't have parens, `def getAvgSize: Long = avgSize`




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #17276: [SPARK-19937] Collect metrics of block sizes when...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17276#discussion_r108052758
  
    --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java ---
    @@ -228,6 +230,36 @@ void closeAndWriteOutput() throws IOException {
           }
         }
         mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
    +    if (mapStatus instanceof HighlyCompressedMapStatus) {
    +      HighlyCompressedMapStatus hc = (HighlyCompressedMapStatus) mapStatus;
    +      long underestimatedBlocksSize = 0L;
    +      for (int i = 0; i < partitionLengths.length; i++) {
    +        if (partitionLengths[i] > mapStatus.getSizeForBlock(i)) {
    +          underestimatedBlocksSize += partitionLengths[i];
    +        }
    +      }
    +      writeMetrics.incUnderestimatedBlocksSize(underestimatedBlocksSize);
    +      if (logger.isDebugEnabled() && partitionLengths.length > 0) {
    +        int underestimatedBlocksNum = 0;
    +        // Distribution of sizes in MapStatus.
    +        double[] cp = new double[partitionLengths.length];
    +        for (int i = 0; i < partitionLengths.length; i++) {
    +          cp[i] = partitionLengths[i];
    +          if (partitionLengths[i] > mapStatus.getSizeForBlock(i)) {
    +            underestimatedBlocksNum++;
    +          }
    +        }
    +        Distribution distribution = new Distribution(cp, 0, cp.length);
    +        double[] probabilities = {0.0, 0.25, 0.5, 0.75, 1.0};
    +        String distributionStr = distribution.getQuantiles(probabilities).mkString(", ");
    +        logger.debug("For task {}.{} in stage {} (TID {}), the block sizes in MapStatus are " +
    +          "inaccurate (average is {}, {} blocks underestimated, size of underestimated is {})," +
    +          " distribution at the given probabilities(0, 0.25, 0.5, 0.75, 1.0) is {}.",
    +          taskContext.partitionId(), taskContext.attemptNumber(), taskContext.stageId(),
    +          taskContext.taskAttemptId(), hc.getAvgSize(),
    +          underestimatedBlocksNum, underestimatedBlocksSize, distributionStr);
    +      }
    +    }
    --- End diff --
    
    This computation seems repeated - we should refactor it out into a method of its own and not duplicate it across classes.
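
One possible shape for such a shared method is sketched below in Java. The class `UnderestimateStats` and its `compute` method are hypothetical names, not part of Spark, and an `IntToLongFunction` stands in for `mapStatus.getSizeForBlock(i)` so that the sketch is self-contained.

```java
import java.util.function.IntToLongFunction;

// Hypothetical value class holding the two figures both writers compute.
final class UnderestimateStats {
  final long underestimatedBlocksSize;
  final int underestimatedBlocksNum;

  private UnderestimateStats(long size, int num) {
    this.underestimatedBlocksSize = size;
    this.underestimatedBlocksNum = num;
  }

  /**
   * Single pass over the real partition lengths.
   * reportedSize plays the role of mapStatus.getSizeForBlock(i).
   */
  static UnderestimateStats compute(long[] partitionLengths, IntToLongFunction reportedSize) {
    long size = 0L;
    int num = 0;
    for (int i = 0; i < partitionLengths.length; i++) {
      if (partitionLengths[i] > reportedSize.applyAsLong(i)) {
        size += partitionLengths[i];
        num++;
      }
    }
    return new UnderestimateStats(size, num);
  }

  public static void main(String[] args) {
    long[] lengths = {10L, 200L, 0L, 1000L};
    // Pretend every block is reported with an average size of 100 bytes.
    UnderestimateStats stats = compute(lengths, i -> 100L);
    // prints 2 blocks, 1200 bytes
    System.out.println(stats.underestimatedBlocksNum + " blocks, "
        + stats.underestimatedBlocksSize + " bytes");
  }
}
```

Each writer could then call the helper once and pass the result to both the metric update and the debug log, instead of looping over `partitionLengths` twice in two classes.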




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74931 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74931/testReport)** for PR 17276 at commit [`7ac639f`](https://github.com/apache/spark/commit/7ac639f9239ef943e55a1cc76dc7346bbe617bfd).




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75039 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75039/testReport)** for PR 17276 at commit [`c9276c2`](https://github.com/apache/spark/commit/c9276c2051436a1b27c844c4aae87110bc39c002).




[GitHub] spark pull request #17276: [SPARK-19937] Collect metrics of block sizes when...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17276#discussion_r107156339
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/ShuffleWriteMetrics.scala ---
    @@ -17,8 +17,12 @@
     
     package org.apache.spark.executor
     
    +import java.{lang => jl}
    --- End diff --
    
    The convention inside Spark is to rename the Java boxed classes with a "J" prefix, e.g. `JLong`: https://demo.fluentcode.com/source/spark/master/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala?squery=JLong#L20




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75034/
    Test PASSed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75098 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75098/testReport)** for PR 17276 at commit [`6a96c3b`](https://github.com/apache/spark/commit/6a96c3b2858ee53bad073040217182c19d5e7db0).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74938 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74938/testReport)** for PR 17276 at commit [`e6091b6`](https://github.com/apache/spark/commit/e6091b69271ade48a4eaaf3c770203f35ab06001).




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75098/
    Test FAILed.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75034 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75034/testReport)** for PR 17276 at commit [`1720b5e`](https://github.com/apache/spark/commit/1720b5e8e1d1c142cc414ae39cb620727bb92d5f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74917/
    Test FAILed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74607 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74607/testReport)** for PR 17276 at commit [`91e338b`](https://github.com/apache/spark/commit/91e338b52c911366b9e6cd209c8f09ad04a2acca).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    @mridulm @squito 
    Thanks a lot for taking the time to review this PR.
    I will close it for now and make another one if there is progress.




[GitHub] spark pull request #17276: [SPARK-19937] Collect metrics of block sizes when...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17276#discussion_r108052753
  
    --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java ---
    @@ -169,6 +173,36 @@ public void write(Iterator<Product2<K, V>> records) throws IOException {
           }
         }
         mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
    +    if (mapStatus instanceof HighlyCompressedMapStatus) {
    +      HighlyCompressedMapStatus hc = (HighlyCompressedMapStatus) mapStatus;
    +      long underestimatedBlocksSize = 0L;
    +      for (int i = 0; i < partitionLengths.length; i++) {
    +        if (partitionLengths[i] > mapStatus.getSizeForBlock(i)) {
    +          underestimatedBlocksSize += partitionLengths[i];
    +        }
    +      }
    +      writeMetrics.incUnderestimatedBlocksSize(underestimatedBlocksSize);
    +      if (logger.isDebugEnabled() && partitionLengths.length > 0) {
    +        int underestimatedBlocksNum = 0;
    +        // Distribution of sizes in MapStatus.
    +        double[] cp = new double[partitionLengths.length];
    +        for (int i = 0; i < partitionLengths.length; i++) {
    +          cp[i] = partitionLengths[i];
    +          if (partitionLengths[i] > mapStatus.getSizeForBlock(i)) {
    +            underestimatedBlocksNum++;
    +          }
    +        }
    +        Distribution distribution = new Distribution(cp, 0, cp.length);
    +        double[] probabilities = {0.0, 0.25, 0.5, 0.75, 1.0};
    +        String distributionStr = distribution.getQuantiles(probabilities).mkString(", ");
    +        logger.debug("For task {}.{} in stage {} (TID {}), the block sizes in MapStatus are " +
    +          "inaccurate (average is {}, {} blocks underestimated, size of underestimated is {})," +
    +          " distribution at the given probabilities(0, 0.25, 0.5, 0.75, 1.0) is {}.",
    +          taskContext.partitionId(), taskContext.attemptNumber(), taskContext.stageId(),
    +          taskContext.taskAttemptId(), hc.getAvgSize(),
    +          underestimatedBlocksNum, underestimatedBlocksSize, distributionStr);
    +      }
    +    }
    --- End diff --
    
    We also need to handle the case where mapStatus is not a HighlyCompressedMapStatus.
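
One possible reading of this comment is that the comparison against the reported size should not depend on the concrete status type. The sketch below is only an illustration under that assumption: `UnderestimateSketch`, `exact`, and `averaged` are hypothetical names, and the `IntToLongFunction` is a stand-in for `getSizeForBlock`, not Spark's actual `MapStatus` API.

```java
import java.util.function.IntToLongFunction;

/** Hypothetical sketch; none of these names exist in Spark. */
final class UnderestimateSketch {
    /** Sums the real sizes of blocks whose reported size is smaller than the real size. */
    static long underestimatedBytes(long[] actual, IntToLongFunction reported) {
        long total = 0L;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] > reported.applyAsLong(i)) {
                total += actual[i];
            }
        }
        return total;
    }

    /** Reports the exact size of every block (assumed behavior of a per-block status). */
    static IntToLongFunction exact(long[] actual) {
        return i -> actual[i];
    }

    /** Reports one shared integer average for every block (assumed behavior of an average-only status). */
    static IntToLongFunction averaged(long[] actual) {
        long sum = 0L;
        for (long size : actual) {
            sum += size;
        }
        long avg = actual.length == 0 ? 0L : sum / actual.length;
        return i -> avg;
    }
}
```

With this shape the same loop runs for both kinds of status: an exact estimator yields 0 underestimated bytes, while the averaged one yields a positive value whenever some block is larger than the average.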




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74938/
    Test PASSed.




[GitHub] spark pull request #17276: [SPARK-19937] Collect metrics of block sizes when...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17276#discussion_r108052747
  
    --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java ---
    @@ -169,6 +173,36 @@ public void write(Iterator<Product2<K, V>> records) throws IOException {
           }
         }
         mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
    +    if (mapStatus instanceof HighlyCompressedMapStatus) {
    +      HighlyCompressedMapStatus hc = (HighlyCompressedMapStatus) mapStatus;
    +      long underestimatedBlocksSize = 0L;
    +      for (int i = 0; i < partitionLengths.length; i++) {
    +        if (partitionLengths[i] > mapStatus.getSizeForBlock(i)) {
    +          underestimatedBlocksSize += partitionLengths[i];
    +        }
    +      }
    +      writeMetrics.incUnderestimatedBlocksSize(underestimatedBlocksSize);
    --- End diff --
    
    This will essentially be the sum of every block above the average size. How is this supposed to be leveraged?
    For example:
    1, 2, 3, 4, 5, 6 => 15 (4 + 5 + 6)
    1, 2, 3, 4, 5, 10 => 15 (5 + 10)
    (This ended up being a degenerate example, but in general I am curious what the value of this metric is.)
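
The arithmetic in the example can be checked with a small standalone sketch. It assumes the average is computed with integer division over all blocks, and the class and method names are hypothetical; it mirrors the comparison in the quoted diff rather than reproducing Spark's code.

```java
/** Hypothetical helper for checking the example; not part of the PR. */
final class AboveAverageSum {
    /** Sum of the sizes strictly above the integer average of all sizes. */
    static long sumAboveAverage(long[] sizes) {
        if (sizes.length == 0) {
            return 0L;
        }
        long total = 0L;
        for (long s : sizes) {
            total += s;
        }
        long avg = total / sizes.length; // assumption: integer average
        long above = 0L;
        for (long s : sizes) {
            if (s > avg) {
                above += s;
            }
        }
        return above;
    }

    public static void main(String[] args) {
        System.out.println(sumAboveAverage(new long[] {1, 2, 3, 4, 5, 6}));  // prints 15
        System.out.println(sumAboveAverage(new long[] {1, 2, 3, 4, 5, 10})); // prints 15
    }
}
```

Both inputs give the same value even though the largest block differs (6 versus 10), which is the concern raised about how informative the metric is.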




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74962 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74962/testReport)** for PR 17276 at commit [`f7ff868`](https://github.com/apache/spark/commit/f7ff868ccba47e3ef15051eb6da2dad43ec16c44).




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74449/
    Test FAILed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75146 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75146/testReport)** for PR 17276 at commit [`c58cb7e`](https://github.com/apache/spark/commit/c58cb7eefe8915c10090da0a43c74c2582e7d773).




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    @squito 
    Would you mind commenting on this when you have time? :)




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    @jinxing64 do you mind closing this pr for now (or marking as [WIP] at least)?




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #17276: [SPARK-19937] Collect metrics of block sizes when...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17276#discussion_r107161810
  
    --- Diff: core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala ---
    @@ -72,6 +72,18 @@ private[spark] class SortShuffleWriter[K, V, C](
           val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
           shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
           mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
    +      partitionLengths.foreach(writeMetrics.incBlockSizeDistribution(_))
    +      if (mapStatus.isInstanceOf[HighlyCompressedMapStatus]) {
    +        writeMetrics.setAverageBlockSize(
    +          mapStatus.asInstanceOf[HighlyCompressedMapStatus].getAvgSize());
    +        (0 until partitionLengths.length).foreach {
    +          case i =>
    +            if (partitionLengths(i) < mapStatus.getSizeForBlock(i)) {
    +              writeMetrics.incUnderestimatedBlocksNum()
    +              writeMetrics.incUnderestimatedBlocksSize(partitionLengths(i))
    --- End diff --
    
    Another metric that may be nice to capture here is the *maximum* underestimate -- `(0 until partitionLengths.length).map { i => partitionLengths(i) - mapStatus.getSizeForBlock(i) }.max`.  In fact, that alone might be enough to discover cases where the reduce side will OOM because of this underestimate.
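
As a rough illustration of the suggested metric, here is a minimal Java sketch. The `reported` array is a hypothetical stand-in for the values `getSizeForBlock` would return, and the class name is made up for this example.

```java
import java.util.stream.IntStream;

/** Hypothetical sketch of the "maximum underestimate" idea; not Spark code. */
final class MaxUnderestimate {
    /**
     * Largest value of actual[i] - reported[i] over all blocks.
     * Returns 0 when there are no blocks; the result is negative
     * if every block is overestimated. Both arrays must have the same length.
     */
    static long maxUnderestimate(long[] actual, long[] reported) {
        return IntStream.range(0, actual.length)
                .mapToLong(i -> actual[i] - reported[i])
                .max()
                .orElse(0L);
    }

    public static void main(String[] args) {
        // Every block is reported at an assumed integer average (3 and 4 respectively).
        System.out.println(maxUnderestimate(
                new long[] {1, 2, 3, 4, 5, 6}, new long[] {3, 3, 3, 3, 3, 3}));  // prints 3
        System.out.println(maxUnderestimate(
                new long[] {1, 2, 3, 4, 5, 10}, new long[] {4, 4, 4, 4, 4, 4})); // prints 6
    }
}
```

Unlike a sum over all underestimated blocks, this value grows with the single worst block, so one very large block stands out.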




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75229/
    Test PASSed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75163/
    Test FAILed.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74981/
    Test PASSed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75098 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75098/testReport)** for PR 17276 at commit [`6a96c3b`](https://github.com/apache/spark/commit/6a96c3b2858ee53bad073040217182c19d5e7db0).




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    @squito 
    Thanks a lot for your comments. I will think it over and test carefully :)




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74515 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74515/testReport)** for PR 17276 at commit [`e2e56d3`](https://github.com/apache/spark/commit/e2e56d30319634c0307e4b916f79aa29ed0239a0).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #76454 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76454/testReport)** for PR 17276 at commit [`873129f`](https://github.com/apache/spark/commit/873129f783d154c96803e13b94f8f16c2922cb68).


[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74665 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74665/testReport)** for PR 17276 at commit [`7cd290d`](https://github.com/apache/spark/commit/7cd290da23ce85440fb9bdef00ed87c317295a74).


[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75032 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75032/testReport)** for PR 17276 at commit [`b2ba192`](https://github.com/apache/spark/commit/b2ba1923f59f31117597e60be6c02c6477aca5de).


[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74605 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74605/testReport)** for PR 17276 at commit [`2ccde1f`](https://github.com/apache/spark/commit/2ccde1f45ffd20cab4760afe5ab43815ce2813f1).


[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75239 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75239/testReport)** for PR 17276 at commit [`873129f`](https://github.com/apache/spark/commit/873129f783d154c96803e13b94f8f16c2922cb68).


[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74939/
    Test FAILed.


[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74665/
    Test PASSed.


[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74607 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74607/testReport)** for PR 17276 at commit [`91e338b`](https://github.com/apache/spark/commit/91e338b52c911366b9e6cd209c8f09ad04a2acca).


[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76454/
    Test FAILed.


[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test FAILed.


[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75238/
    Test PASSed.


[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74665 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74665/testReport)** for PR 17276 at commit [`7cd290d`](https://github.com/apache/spark/commit/7cd290da23ce85440fb9bdef00ed87c317295a74).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test PASSed.


[GitHub] spark pull request #17276: [SPARK-19937] Collect metrics of block sizes when...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17276#discussion_r108052804
  
    --- Diff: core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala ---
    @@ -72,6 +72,27 @@ private[spark] class SortShuffleWriter[K, V, C](
           val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
           shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
           mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
    +
    +      mapStatus match {
    +        case hc: HighlyCompressedMapStatus =>
    +          val underestimatedLengths = partitionLengths.filter(_ > hc.getAvgSize)
    +          writeMetrics.incUnderestimatedBlocksSize(underestimatedLengths.sum)
    +          if (log.isDebugEnabled() && partitionLengths.length > 0) {
    +            // Distribution of sizes in MapStatus.
    +            Distribution(partitionLengths.map(_.toDouble)) match {
    +              case Some(distribution) =>
    +                val distributionStr = distribution.getQuantiles().mkString(", ")
    +                logDebug(s"For task ${context.partitionId()}.${context.attemptNumber()} in stage" +
    +                  s" ${context.stageId()} (TID ${context.taskAttemptId()}), the block sizes in" +
    +                  s" MapStatus are inaccurate (average is ${hc.getAvgSize}," +
    +                  s" ${underestimatedLengths.length} blocks underestimated, sum of sizes is" +
    +                  s" ${underestimatedLengths.sum}), distribution at the given probabilities" +
    +                  s" (0, 0.25, 0.5, 0.75, 1.0) is $distributionStr.")
    +              case None => // no-op
    +            }
    +          }
    --- End diff --
    
    Isn't this similar to what is in core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java, etc. above? Or is it different?
    The code looks the same, but is written differently (and is more expensive here).
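
Editorial sketch, not part of the review: one plausible reading of "more expensive" is that the quoted diff allocates a filtered array and then sums it more than once. Assuming that reading, the same two numbers (count and total size of blocks larger than the average) could be gathered in a single pass with no intermediate array. `SinglePassSketch` and `underestimated` are hypothetical names, not Spark APIs.

```scala
object SinglePassSketch {
  /**
   * Hypothetical helper: returns (number of blocks larger than avgSize,
   * sum of the lengths of those blocks), computed in one pass.
   */
  def underestimated(partitionLengths: Array[Long], avgSize: Long): (Int, Long) = {
    var count = 0
    var sum = 0L
    var i = 0
    // A plain while loop avoids allocating a filtered copy of the array.
    while (i < partitionLengths.length) {
      val len = partitionLengths(i)
      if (len > avgSize) {
        count += 1
        sum += len
      }
      i += 1
    }
    (count, sum)
  }
}
```

Both values could then feed the metric and the debug message without recomputing `sum` on each use.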


[GitHub] spark pull request #17276: [SPARK-19937] Collect metrics of block sizes when...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17276#discussion_r107158475
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/ShuffleReadMetrics.scala ---
    @@ -80,13 +92,17 @@ class ShuffleReadMetrics private[spark] () extends Serializable {
       private[spark] def incRemoteBlocksFetched(v: Long): Unit = _remoteBlocksFetched.add(v)
       private[spark] def incLocalBlocksFetched(v: Long): Unit = _localBlocksFetched.add(v)
       private[spark] def incRemoteBytesRead(v: Long): Unit = _remoteBytesRead.add(v)
    +  private[spark] def incRemoteBytesReadToMem(v: Long): Unit = _remoteBytesReadToMem.add(v)
    +  private[spark] def incRemoteBytesReadToDisk(v: Long): Unit = _remoteBytesReadToDisk.add(v)
    --- End diff --
    
    Since you are not using this yet, it shouldn't be part of this PR (I realize you may be using it as part of other changes in your own internal testing).


[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test FAILed.


[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75034 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75034/testReport)** for PR 17276 at commit [`1720b5e`](https://github.com/apache/spark/commit/1720b5e8e1d1c142cc414ae39cb620727bb92d5f).


[GitHub] spark pull request #17276: [SPARK-19937] Collect metrics of block sizes when...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17276#discussion_r108052849
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/ShuffleReadMetrics.scala ---
    @@ -80,13 +86,15 @@ class ShuffleReadMetrics private[spark] () extends Serializable {
       private[spark] def incRemoteBlocksFetched(v: Long): Unit = _remoteBlocksFetched.add(v)
       private[spark] def incLocalBlocksFetched(v: Long): Unit = _localBlocksFetched.add(v)
       private[spark] def incRemoteBytesRead(v: Long): Unit = _remoteBytesRead.add(v)
    +  private[spark] def incRemoteBytesReadToMem(v: Long): Unit = _remoteBytesReadToMem.add(v)
    --- End diff --
    
    The way it seems to be coded up, this will end up being everything fetched from shuffle, and we can already infer that: remote bytes read + local bytes read.
    Or did I miss something here?


[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75146/
    Test FAILed.


[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75238 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75238/testReport)** for PR 17276 at commit [`cf5de4a`](https://github.com/apache/spark/commit/cf5de4a19c28d50c828ac8648c250c8eaf717949).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    No worries, I'm just not sure when to look again, with all the notifications from your commits. Committers tend to think that something is ready to review if it's passing tests, so it's helpful to add those labels if that's not the case.


[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75238 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75238/testReport)** for PR 17276 at commit [`cf5de4a`](https://github.com/apache/spark/commit/cf5de4a19c28d50c828ac8648c250c8eaf717949).


[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74931/
    Test FAILed.


[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75039 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75039/testReport)** for PR 17276 at commit [`c9276c2`](https://github.com/apache/spark/commit/c9276c2051436a1b27c844c4aae87110bc39c002).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test FAILed.


[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    @squito
    Thanks a lot for taking the time to look into this PR.
    I updated the PR. It currently adds just two metrics: a) the total size of underestimated blocks, and b) the size of blocks shuffled to memory.
    For a), the executor uses `maxBytesInFlight` to control the speed of shuffle-read. I agree with your comment `another metric that may be nice to capture here is maximum underestimate`. But consider this scenario: the maximum is small, but thousands of blocks are underestimated, so `maxBytesInFlight` cannot help avoid an OOM during shuffle-read. That's why I proposed tracking the total size of underestimated blocks.
    For b), currently all data is shuffle-read into memory. If we add the feature of shuffling to disk when memory is short, we need to evaluate the performance. I think two more metrics need to be taken into account: the size of blocks shuffled to disk (to be added in another PR) and the task's running time (already exists). The more data shuffled to memory, the better the performance; the shorter the running time, the better the performance.
    
    I also added some debug logging in `ShuffleWriter`, including the number of underestimated blocks and the size distribution.
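
Editorial sketch, not part of the comment: the scenario in a) can be illustrated with a small standalone example. It assumes the compressed status records only the average size over non-empty blocks, counts a block as underestimated when its real length exceeds that average (as in the diff under review), and reads "maximum underestimate" as the largest gap between a real length and the average. `UnderestimateSketch`, `Underestimate`, and `summarize` are hypothetical names, not Spark APIs.

```scala
object UnderestimateSketch {
  // Hypothetical summary type for illustration only.
  final case class Underestimate(avgSize: Long, count: Int, totalBytes: Long, maxGap: Long)

  def summarize(partitionLengths: Array[Long]): Underestimate = {
    // Assumption: the average is taken over non-empty blocks only.
    val nonEmpty = partitionLengths.count(_ > 0)
    val avgSize = if (nonEmpty > 0) partitionLengths.sum / nonEmpty else 0L
    // Blocks whose real size is larger than the recorded average.
    val under = partitionLengths.filter(_ > avgSize)
    Underestimate(
      avgSize,
      under.length,
      under.sum,
      if (under.isEmpty) 0L else under.max - avgSize)
  }
}

// 1000 blocks of 100 bytes and 1000 blocks of 300 bytes.
val lengths = Array.fill(1000)(100L) ++ Array.fill(1000)(300L)
val summary = UnderestimateSketch.summarize(lengths)
println(summary)
```

Here the largest single gap is only 100 bytes, yet 1000 blocks totalling 300000 bytes are reported below their real size, which is the case a maximum-only metric would hide.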


[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75145 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75145/testReport)** for PR 17276 at commit [`4f992fc`](https://github.com/apache/spark/commit/4f992fcd363bc7632da71093dfc8f61477d0cd1e).


[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75239/
    Test PASSed.


[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74515/
    Test FAILed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75220 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75220/testReport)** for PR 17276 at commit [`c26ea56`](https://github.com/apache/spark/commit/c26ea561579a76f3370fc2ebc07314e788dc32a1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75163 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75163/testReport)** for PR 17276 at commit [`0efa348`](https://github.com/apache/spark/commit/0efa348d1beccf76da702d06ee77c8aac2aebb12).




[GitHub] spark pull request #17276: [SPARK-19937] Collect metrics of block sizes when...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17276#discussion_r108061417
  
    --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java ---
    @@ -169,6 +173,36 @@ public void write(Iterator<Product2<K, V>> records) throws IOException {
           }
         }
         mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
    +    if (mapStatus instanceof HighlyCompressedMapStatus) {
    +      HighlyCompressedMapStatus hc = (HighlyCompressedMapStatus) mapStatus;
    +      long underestimatedBlocksSize = 0L;
    +      for (int i = 0; i < partitionLengths.length; i++) {
    +        if (partitionLengths[i] > mapStatus.getSizeForBlock(i)) {
    +          underestimatedBlocksSize += partitionLengths[i];
    +        }
    +      }
    +      writeMetrics.incUnderestimatedBlocksSize(underestimatedBlocksSize);
    +      if (logger.isDebugEnabled() && partitionLengths.length > 0) {
    +        int underestimatedBlocksNum = 0;
    +        // Distribution of sizes in MapStatus.
    +        double[] cp = new double[partitionLengths.length];
    +        for (int i = 0; i < partitionLengths.length; i++) {
    +          cp[i] = partitionLengths[i];
    +          if (partitionLengths[i] > mapStatus.getSizeForBlock(i)) {
    +            underestimatedBlocksNum++;
    +          }
    +        }
    +        Distribution distribution = new Distribution(cp, 0, cp.length);
    +        double[] probabilities = {0.0, 0.25, 0.5, 0.75, 1.0};
    +        String distributionStr = distribution.getQuantiles(probabilities).mkString(", ");
    +        logger.debug("For task {}.{} in stage {} (TID {}), the block sizes in MapStatus are " +
    +          "inaccurate (average is {}, {} blocks underestimated, size of underestimated is {})," +
    +          " distribution at the given probabilities(0, 0.25, 0.5, 0.75, 1.0) is {}.",
    +          taskContext.partitionId(), taskContext.attemptNumber(), taskContext.stageId(),
    +          taskContext.taskAttemptId(), hc.getAvgSize(),
    +          underestimatedBlocksNum, underestimatedBlocksSize, distributionStr);
    +      }
    +    }
    --- End diff --
    
    In `CompressedMapStatus`, the block sizes are accurate, so I would hesitate to add that log for it.
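
For readers following the diff above, the debug log summarizes the true block sizes at the probabilities 0, 0.25, 0.5, 0.75 and 1.0. The sketch below shows one simple way such a summary could be computed. The class `BlockSizeQuantiles` and its index rule (`min(floor(p * n), n - 1)` over the sorted sizes) are assumptions made for illustration; this is not the `Distribution` class used in the patch.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Spark: summarizes block sizes by quantile. */
final class BlockSizeQuantiles {

  /**
   * Returns the block size found at each probability, using the index rule
   * index = min(floor(p * n), n - 1) over the sorted sizes.
   */
  static long[] quantiles(long[] sizes, double[] probabilities) {
    if (sizes.length == 0) {
      throw new IllegalArgumentException("sizes must be non-empty");
    }
    // Sort a copy so the caller's array is left untouched.
    long[] sorted = sizes.clone();
    Arrays.sort(sorted);
    long[] result = new long[probabilities.length];
    for (int i = 0; i < probabilities.length; i++) {
      int index = Math.min((int) (probabilities[i] * sorted.length), sorted.length - 1);
      result[i] = sorted[index];
    }
    return result;
  }

  public static void main(String[] args) {
    // Illustrative sizes only: one block is far larger than the rest.
    long[] partitionLengths = {40L, 10L, 1000L, 20L, 30L};
    double[] probabilities = {0.0, 0.25, 0.5, 0.75, 1.0};
    System.out.println(Arrays.toString(quantiles(partitionLengths, probabilities)));
  }
}
```

With skewed input like this, the top quantile stands far apart from the others, which is the kind of skew the log is meant to expose.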




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74931 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74931/testReport)** for PR 17276 at commit [`7ac639f`](https://github.com/apache/spark/commit/7ac639f9239ef943e55a1cc76dc7346bbe617bfd).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #17276: [SPARK-19937] Collect metrics of block sizes when...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17276#discussion_r107157471
  
    --- Diff: core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala ---
    @@ -72,6 +72,18 @@ private[spark] class SortShuffleWriter[K, V, C](
           val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
           shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
           mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
    +      partitionLengths.foreach(writeMetrics.incBlockSizeDistribution(_))
    +      if (mapStatus.isInstanceOf[HighlyCompressedMapStatus]) {
    +        writeMetrics.setAverageBlockSize(
    +          mapStatus.asInstanceOf[HighlyCompressedMapStatus].getAvgSize());
    --- End diff --
    
    In Scala, `if (isInstanceOf) { asInstanceOf }` can be replaced by pattern matching:
    
    ```scala
    mapStatus match {
      case hc: HighlyCompressedMapStatus => writeMetrics.setAverageBlockSize(hc.getAvgSize())
      case _ => // no-op
    }
    ```
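
The Java writer touched by the same pull request (`BypassMergeSortShuffleWriter.java`) cannot use Scala pattern matching, so the equivalent check there is an `instanceof` test followed by a cast. A minimal sketch of that shape follows; `Status`, `Compressed`, `HighlyCompressed` and `AverageSizeRecorder` are hypothetical stand-ins, not Spark's classes.

```java
/** Hypothetical stand-ins for the MapStatus hierarchy; not Spark's classes. */
interface Status {}

final class Compressed implements Status {}

final class HighlyCompressed implements Status {
  private final long avgSize;

  HighlyCompressed(long avgSize) {
    this.avgSize = avgSize;
  }

  long getAvgSize() {
    return avgSize;
  }
}

final class AverageSizeRecorder {

  /** Returns the stored average block size, or -1 when the status keeps no average. */
  static long averageSizeOrMinusOne(Status status) {
    if (status instanceof HighlyCompressed) {
      // The cast is safe because of the instanceof check above.
      HighlyCompressed hc = (HighlyCompressed) status;
      return hc.getAvgSize();
    }
    return -1L; // no-op branch, mirroring `case _` in the Scala version
  }

  public static void main(String[] args) {
    System.out.println(averageSizeOrMinusOne(new HighlyCompressed(128L)));
    System.out.println(averageSizeOrMinusOne(new Compressed()));
  }
}
```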




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75032/
    Test FAILed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #76454 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76454/testReport)** for PR 17276 at commit [`873129f`](https://github.com/apache/spark/commit/873129f783d154c96803e13b94f8f16c2922cb68).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    @squito Oh, I'm sorry if this is disruptive. I will mark it as WIP.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74448 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74448/testReport)** for PR 17276 at commit [`430ec95`](https://github.com/apache/spark/commit/430ec95291393f49096fc07df98c15e700f23e8c).




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75145 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75145/testReport)** for PR 17276 at commit [`4f992fc`](https://github.com/apache/spark/commit/4f992fcd363bc7632da71093dfc8f61477d0cd1e).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74509 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74509/testReport)** for PR 17276 at commit [`d0932ed`](https://github.com/apache/spark/commit/d0932ed6ba0dfe463d1e77e9cae79351d233effb).




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74605 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74605/testReport)** for PR 17276 at commit [`2ccde1f`](https://github.com/apache/spark/commit/2ccde1f45ffd20cab4760afe5ab43815ce2813f1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74917 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74917/testReport)** for PR 17276 at commit [`0e85332`](https://github.com/apache/spark/commit/0e853322a903018bd971f1983865fc2ab09246d7).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74962/
    Test PASSed.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74962 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74962/testReport)** for PR 17276 at commit [`f7ff868`](https://github.com/apache/spark/commit/f7ff868ccba47e3ef15051eb6da2dad43ec16c44).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74939 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74939/testReport)** for PR 17276 at commit [`a88e12e`](https://github.com/apache/spark/commit/a88e12e1c47a756336017e80295c13c525287981).




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    You are such a kind person.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75239 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75239/testReport)** for PR 17276 at commit [`873129f`](https://github.com/apache/spark/commit/873129f783d154c96803e13b94f8f16c2922cb68).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75146 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75146/testReport)** for PR 17276 at commit [`c58cb7e`](https://github.com/apache/spark/commit/c58cb7eefe8915c10090da0a43c74c2582e7d773).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74449 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74449/testReport)** for PR 17276 at commit [`648ceaa`](https://github.com/apache/spark/commit/648ceaafc977b442916834960122069237002e89).




[GitHub] spark pull request #17276: [SPARK-19937] Collect metrics of block sizes when...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17276#discussion_r107160710
  
    --- Diff: core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala ---
    @@ -72,6 +72,18 @@ private[spark] class SortShuffleWriter[K, V, C](
           val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
           shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
           mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
    +      partitionLengths.foreach(writeMetrics.incBlockSizeDistribution(_))
    +      if (mapStatus.isInstanceOf[HighlyCompressedMapStatus]) {
    +        writeMetrics.setAverageBlockSize(
    +          mapStatus.asInstanceOf[HighlyCompressedMapStatus].getAvgSize());
    +        (0 until partitionLengths.length).foreach {
    +          case i =>
    +            if (partitionLengths(i) < mapStatus.getSizeForBlock(i)) {
    --- End diff --
    
    Don't you want the condition reversed? `partitionLengths(i)` is the true size, so a block is underestimated when the `mapStatus` reports a smaller size, i.e. when `partitionLengths(i) > mapStatus.getSizeForBlock(i)`.
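
To make the direction of the comparison concrete, here is a small sketch of the accounting under discussion. It assumes, as a simplification, that the reported size of every block is the same average; `UnderestimateCheck` and its method are hypothetical names, not Spark code.

```java
/** Hypothetical stand-in for the accounting discussed above; not Spark code. */
final class UnderestimateCheck {

  /** Sums the true sizes of the blocks whose reported size is smaller than the true size. */
  static long underestimatedBytes(long[] trueSizes, long[] reportedSizes) {
    if (trueSizes.length != reportedSizes.length) {
      throw new IllegalArgumentException("arrays must have the same length");
    }
    long total = 0L;
    for (int i = 0; i < trueSizes.length; i++) {
      // Underestimated means the true size exceeds what was reported.
      if (trueSizes[i] > reportedSizes[i]) {
        total += trueSizes[i];
      }
    }
    return total;
  }

  public static void main(String[] args) {
    // Illustrative sizes: the average of 10, 20, 30 and 940 is 250.
    long[] trueSizes = {10L, 20L, 30L, 940L};
    long[] reportedSizes = {250L, 250L, 250L, 250L};
    System.out.println(underestimatedBytes(trueSizes, reportedSizes));
  }
}
```

Only the 940-byte block counts as underestimated here. With the comparison flipped to `<`, the three small blocks would be counted instead, which is the opposite of what the metric is meant to capture.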




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75229 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75229/testReport)** for PR 17276 at commit [`8801fc6`](https://github.com/apache/spark/commit/8801fc6454fc09d417dea563cd07cc255ae009b9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #75229 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75229/testReport)** for PR 17276 at commit [`8801fc6`](https://github.com/apache/spark/commit/8801fc6454fc09d417dea563cd07cc255ae009b9).




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74981 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74981/testReport)** for PR 17276 at commit [`d86f985`](https://github.com/apache/spark/commit/d86f985fb63204715189c1e1a964f127c396df59).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    **[Test build #74917 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74917/testReport)** for PR 17276 at commit [`0e85332`](https://github.com/apache/spark/commit/0e853322a903018bd971f1983865fc2ab09246d7).




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74605/
    Test FAILed.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75039/
    Test PASSed.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17276: [SPARK-19937] Collect metrics of block sizes when shuffl...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    @mridulm 
    Thanks a lot for taking the time to look into this, and thanks for the comments :)
    1) I changed the size of underestimated blocks to be `partitionLengths.filter(_ > hc.getAvgSize).map(_ - hc.getAvgSize).sum`.
    2) I added a method `genBlocksDistributionStr` and call it from `ShuffleWriters`, thus avoiding duplicated code.
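
For readers following the thread, the two points above can be illustrated with a minimal, standalone Scala sketch. It is not the code from the PR: `BlockSizeSketch`, `avgSize`, `underestimatedBytes`, `bucketCounts` and `distributionStr` are hypothetical names, the average is assumed to be taken over non-empty blocks only, and "1k" in the ranges from the PR description is assumed to mean 1024 bytes. The real `genBlocksDistributionStr` may format its output differently.

```scala
object BlockSizeSketch {
  // Assumption: the reported average is computed over non-empty blocks only.
  def avgSize(partitionLengths: Array[Long]): Long = {
    val nonEmpty = partitionLengths.filter(_ > 0)
    if (nonEmpty.isEmpty) 0L else nonEmpty.sum / nonEmpty.length
  }

  // Same expression as in the comment above: the total number of bytes by
  // which blocks larger than the average are underestimated when only the
  // average is reported.
  def underestimatedBytes(partitionLengths: Array[Long], avg: Long): Long =
    partitionLengths.filter(_ > avg).map(_ - avg).sum

  // Exclusive upper bounds of the ranges [0, 1k), [1k, 10k), ...,
  // [10g, Long.MaxValue). Assumption: 1k = 1024 bytes.
  val bounds: Array[Long] = Array(
    1L << 10, 10L << 10, 100L << 10,
    1L << 20, 10L << 20, 100L << 20,
    1L << 30, 10L << 30, Long.MaxValue)

  // Count how many block sizes fall into each range.
  def bucketCounts(partitionLengths: Array[Long]): Array[Int] = {
    val counts = new Array[Int](bounds.length)
    partitionLengths.foreach { len =>
      val found = bounds.indexWhere(len < _)
      // Long.MaxValue itself matches no bound; put it in the last range.
      val idx = if (found >= 0) found else bounds.length - 1
      counts(idx) += 1
    }
    counts
  }

  // Hypothetical rendering of the distribution as a single log-friendly line.
  def distributionStr(partitionLengths: Array[Long]): String = {
    val lowerBounds = 0L +: bounds.init
    lowerBounds.zip(bounds).zip(bucketCounts(partitionLengths))
      .map { case ((lo, hi), n) => s"[$lo, $hi): $n" }
      .mkString(", ")
  }
}
```

Given per-partition lengths, `underestimatedBytes(lengths, avgSize(lengths))` yields the inaccuracy figure, and `distributionStr(lengths)` yields one line that every shuffle writer could log through a shared helper instead of repeating the formatting logic.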




[GitHub] spark issue #17276: [WIP][SPARK-19937] Collect metrics of block sizes when s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17276
  
    Merged build finished. Test FAILed.

