You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2016/05/26 06:36:39 UTC

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/13318

    [SPARK-15391] [SQL] manage the temporary memory of timsort

    ## What changes were proposed in this pull request?
    
    Currently, the memory for temporary buffer used by TimSort is always allocated as on-heap without bookkeeping, it could cause OOM both in on-heap and off-heap mode.
    
    This PR will try to manage that by preallocate it together with the pointer array, same with RadixSort. It both works for on-heap and off-heap mode.
    
    This PR also change the loadFactor of BytesToBytesMap to 0.5 (it was 0.75), it enables use to radix sort also makes sure that we have enough memory for timsort. 
    
    ## How was this patch tested?
    
    Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark fix_timsort

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13318.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13318
    
----
commit 6d074f6e3ad41f427e6dcb9f5a72674798a40b5e
Author: Davies Liu <da...@databricks.com>
Date:   2016-05-26T06:29:09Z

    manage the temporary memory of timsort

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by sitalkedia <gi...@git.apache.org>.

Github user sitalkedia commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-221986302
  
    Thank you for making this fix 👍 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13318: [SPARK-15391] [SQL] manage the temporary memory of timso...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/13318
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-221993459
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59405/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13318: [SPARK-15391] [SQL] manage the temporary memory o...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13318


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-222026275
  
    **[Test build #3022 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3022/consoleFull)** for PR 13318 at commit [`c72cdea`](https://github.com/apache/spark/commit/c72cdea99972204899de52ed3ed4cc206a128081).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13318: [SPARK-15391] [SQL] manage the temporary memory o...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r65629505
  
    --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java ---
    @@ -70,9 +70,14 @@ public int compare(PackedRecordPointer left, PackedRecordPointer right) {
         assert (initialSize > 0);
         this.initialSize = initialSize;
         this.useRadixSort = useRadixSort;
    -    this.memoryAllocationFactor = useRadixSort ? 2 : 1;
         this.array = consumer.allocateArray(initialSize);
    -    this.sorter = new Sorter<>(ShuffleSortDataFormat.INSTANCE);
    +    this.usableCapacity = calcCapacity();
    +  }
    +
    +  private int calcCapacity() {
    +    // Radix sort requires same amount of used memory as buffer, Tim sort requires
    +    // half of the used memory as buffer.
    --- End diff --
    
    by the way, out of curiosity, what's the worst case scenario in TimSort that requires 0.5x more buffer space?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13318: [SPARK-15391] [SQL] manage the temporary memory of timso...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13318
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by ericl <gi...@git.apache.org>.

Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r64701761
  
    --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java ---
    @@ -101,14 +94,17 @@ public void expandPointerArray(LongArray newArray) {
           array.getBaseOffset(),
           newArray.getBaseObject(),
           newArray.getBaseOffset(),
    -      array.size() * (8 / memoryAllocationFactor)
    +      pos * 8
    --- End diff --
    
    I think we should keep the memoryAllocationFactor variable, to make it more clear we are overallocating on purpose.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-221998576
  
    **[Test build #59418 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59418/consoleFull)** for PR 13318 at commit [`c72cdea`](https://github.com/apache/spark/commit/c72cdea99972204899de52ed3ed4cc206a128081).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-222067144
  
    **[Test build #59465 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59465/consoleFull)** for PR 13318 at commit [`8b4e033`](https://github.com/apache/spark/commit/8b4e0337da43d65bc5b8b80862bc257a001fd16f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memory of timso...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13318
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59670/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by ericl <gi...@git.apache.org>.

Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r64842068
  
    --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java ---
    @@ -54,14 +54,14 @@ public int compare(PackedRecordPointer left, PackedRecordPointer right) {
       private final boolean useRadixSort;
     
       /**
    -   * Set to 2x for radix sort to reserve extra memory for sorting, otherwise 1x.
    +   * The position in the pointer array where new records can be inserted.
        */
    -  private final int memoryAllocationFactor;
    +  private int pos = 0;
     
       /**
    -   * The position in the pointer array where new records can be inserted.
    +   * How many records could be inserted, because part of the array should be left for sorting.
        */
    -  private int pos = 0;
    +  private int capacity = 0;
    --- End diff --
    
    nit: usableCapacity?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by sitalkedia <gi...@git.apache.org>.

Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r64782530
  
    --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java ---
    @@ -101,14 +94,17 @@ public void expandPointerArray(LongArray newArray) {
           array.getBaseOffset(),
           newArray.getBaseObject(),
           newArray.getBaseOffset(),
    -      array.size() * (8 / memoryAllocationFactor)
    +      pos * 8
    --- End diff --
    
    IMO, it will be better to use the memoryAllocationFactor. Because otherwise, we might see unnecessary spilling due to memory under utilization. If we change the memory allocation factor from 2 to 1.5 for TimSort, that is a significant amount of memory considering the size of the pointer array can be in the order of GBs for large workload. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13318: [SPARK-15391] [SQL] manage the temporary memory o...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r65628414
  
    --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java ---
    @@ -70,9 +70,14 @@ public int compare(PackedRecordPointer left, PackedRecordPointer right) {
         assert (initialSize > 0);
         this.initialSize = initialSize;
         this.useRadixSort = useRadixSort;
    -    this.memoryAllocationFactor = useRadixSort ? 2 : 1;
         this.array = consumer.allocateArray(initialSize);
    -    this.sorter = new Sorter<>(ShuffleSortDataFormat.INSTANCE);
    +    this.usableCapacity = calcCapacity();
    +  }
    +
    +  private int calcCapacity() {
    --- End diff --
    
    nit: `calculateUsableCapacity` or `getUsableCapacity`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by ericl <gi...@git.apache.org>.

Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r64701926
  
    --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSortDataFormat.java ---
    @@ -83,9 +84,8 @@ public void copyRange(LongArray src, int srcPos, LongArray dst, int dstPos, int
     
       @Override
       public LongArray allocate(int length) {
    -    assert (length < Integer.MAX_VALUE / 2) : "Length " + length + " is too large";
    -    // This is used as temporary buffer, it's fine to allocate from JVM heap.
    -    return new LongArray(MemoryBlock.fromLongArray(new long[length * 2]));
    +    assert (length * 2 <= buffer.size());
    --- End diff --
    
    Better error message for the assert?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13318: [SPARK-15391] [SQL] manage the temporary memory of timso...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13318
  
    **[Test build #59949 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59949/consoleFull)** for PR 13318 at commit [`a929a06`](https://github.com/apache/spark/commit/a929a0600b5cd78917989a5559275a4a4e9f1812).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memory of timso...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13318
  
    **[Test build #59670 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59670/consoleFull)** for PR 13318 at commit [`2ad8eb8`](https://github.com/apache/spark/commit/2ad8eb8f9c21c1309ab18cc4e81897bdf54afa86).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-221993227
  
    **[Test build #59405 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59405/consoleFull)** for PR 13318 at commit [`f651956`](https://github.com/apache/spark/commit/f651956723cc36465458ee3a802e331dfc108968).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13318: [SPARK-15391] [SQL] manage the temporary memory of timso...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/13318
  
    Looks great overall, just few questions and a couple of minor comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-221809547
  
    **[Test build #59350 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59350/consoleFull)** for PR 13318 at commit [`6d074f6`](https://github.com/apache/spark/commit/6d074f6e3ad41f427e6dcb9f5a72674798a40b5e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by sitalkedia <gi...@git.apache.org>.

Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r64771087
  
    --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java ---
    @@ -101,14 +94,17 @@ public void expandPointerArray(LongArray newArray) {
           array.getBaseOffset(),
           newArray.getBaseObject(),
           newArray.getBaseOffset(),
    -      array.size() * (8 / memoryAllocationFactor)
    +      pos * 8
    --- End diff --
    
    +1 on that. 
    Is there any particular reason for removing the memoryAllocationFactor? Without this we might highly underutilize memory for TimSort. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-222029566
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-221789635
  
    **[Test build #59350 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59350/consoleFull)** for PR 13318 at commit [`6d074f6`](https://github.com/apache/spark/commit/6d074f6e3ad41f427e6dcb9f5a72674798a40b5e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13318: [SPARK-15391] [SQL] manage the temporary memory o...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r65753944
  
    --- Diff: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java ---
    @@ -221,7 +221,8 @@ public BytesToBytesMap(
           SparkEnv.get() != null ? SparkEnv.get().blockManager() :  null,
           SparkEnv.get() != null ? SparkEnv.get().serializerManager() :  null,
           initialCapacity,
    -      0.70,
    +      // In order to re-use the longArray for sorting, the load factor cannot be larger than 0.5.
    +      0.5,
    --- End diff --
    
    As analized in the comment, this only have about 10% (or less) difference in practice, I'd like not to have this complexity.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13318: [SPARK-15391] [SQL] manage the temporary memory o...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r65629261
  
    --- Diff: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java ---
    @@ -221,7 +221,8 @@ public BytesToBytesMap(
           SparkEnv.get() != null ? SparkEnv.get().blockManager() :  null,
           SparkEnv.get() != null ? SparkEnv.get().serializerManager() :  null,
           initialCapacity,
    -      0.70,
    +      // In order to re-use the longArray for sorting, the load factor cannot be larger than 0.5.
    +      0.5,
    --- End diff --
    
    given that this may cause `BytesToBytesMap` to spill to disk more, do we have a sense of the performance implications of this change? Also, can we make this configurable (so that if `spark.sql.sort.enableRadixSort` = false, users can explicitly set the loadFactor to 0.7)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r64782151
  
    --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java ---
    @@ -101,14 +94,17 @@ public void expandPointerArray(LongArray newArray) {
           array.getBaseOffset(),
           newArray.getBaseObject(),
           newArray.getBaseOffset(),
    -      array.size() * (8 / memoryAllocationFactor)
    +      pos * 8
    --- End diff --
    
    Should we also use 1/1.5 in BytesToBytesMap as load factor when radix sort could not be used? I feel that's too complicated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memory of timso...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13318
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-222022579
  
    **[Test build #3021 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3021/consoleFull)** for PR 13318 at commit [`c72cdea`](https://github.com/apache/spark/commit/c72cdea99972204899de52ed3ed4cc206a128081).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-221998609
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59418/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-221791386
  
    Can we add a unit test for this behavior?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by ericl <gi...@git.apache.org>.

Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r64701881
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java ---
    @@ -86,14 +88,16 @@ public UnsafeKVExternalSorter(
             prefixComparator,
             /* initialSize */ 4096,
             pageSizeBytes,
    -        keySchema.length() == 1 && SortPrefixUtils.canSortFullyWithPrefix(keySchema.apply(0)));
    +        canUseRadixSort);
         } else {
    +      // The array will be used to do in-place sort, which require half of the space to be empty.
    +      assert(map.numKeys() <= map.getArray().size() / 2);
           // During spilling, the array in map will not be used, so we can borrow that and use it
           // as the underline array for in-memory sorter (it's always large enough).
           // Since we will not grow the array, it's fine to pass `null` as consumer.
           final UnsafeInMemorySorter inMemSorter = new UnsafeInMemorySorter(
             null, taskMemoryManager, recordComparator, prefixComparator, map.getArray(),
    -        false /* TODO(ekl) we can only radix sort if the BytesToBytes load factor is <= 0.5 */);
    +        canUseRadixSort);
    --- End diff --
    
    Can we assert the load factor is under 0.5 here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memory of timso...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/13318
  
    @ericl Comments addressed, could you take another look? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-222068319
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59465/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-221789570
  
    cc @ericl 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-221809818
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r64805207
  
    --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java ---
    @@ -101,14 +94,17 @@ public void expandPointerArray(LongArray newArray) {
           array.getBaseOffset(),
           newArray.getBaseObject(),
           newArray.getBaseOffset(),
    -      array.size() * (8 / memoryAllocationFactor)
    +      pos * 8
    --- End diff --
    
    Updated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memory of timso...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13318
  
    **[Test build #59670 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59670/consoleFull)** for PR 13318 at commit [`2ad8eb8`](https://github.com/apache/spark/commit/2ad8eb8f9c21c1309ab18cc4e81897bdf54afa86).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-221993456
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by ericl <gi...@git.apache.org>.

Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r64701888
  
    --- Diff: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java ---
    @@ -221,7 +221,7 @@ public BytesToBytesMap(
           SparkEnv.get() != null ? SparkEnv.get().blockManager() :  null,
           SparkEnv.get() != null ? SparkEnv.get().serializerManager() :  null,
           initialCapacity,
    -      0.70,
    +      0.5,
    --- End diff --
    
    Should we make this a constant and document why it was chosen (to enable radix sort)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-221998607
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13318: [SPARK-15391] [SQL] manage the temporary memory of timso...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13318
  
    **[Test build #59949 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59949/consoleFull)** for PR 13318 at commit [`a929a06`](https://github.com/apache/spark/commit/a929a0600b5cd78917989a5559275a4a4e9f1812).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r64780973
  
    --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java ---
    @@ -101,14 +94,17 @@ public void expandPointerArray(LongArray newArray) {
           array.getBaseOffset(),
           newArray.getBaseObject(),
           newArray.getBaseOffset(),
    -      array.size() * (8 / memoryAllocationFactor)
    +      pos * 8
    --- End diff --
    
    In theory, memoryAllocationFactor should be 2 for radix sort and 1.5 for tim sort. For simplicity, I'd like to trade a little bit memory efficiency for simplicity. memoryAllocationFactor is only used in hasSpaceForAnotherRecord(), if you feel strong on this tradeoff, I could update hasSpaceForAnotherRecord() to use 1.5 for timsort.
    
    For here, we don't need memoryAllocationFactor actually.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-221793398
  
    It was 0.70 (corrected), it's 30% lower after this patch. For the simplest aggregate (one integer key and one integer value), the key-value pair need 40 bytes, the pointer array need 22+ bytes, becomes 32 bytes after the patch, it will need 15% more memory. Before this patch, TimSort still allocate some memory from heap (could OOM), so the difference will be even smaller.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-222027875
  
    **[Test build #59432 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59432/consoleFull)** for PR 13318 at commit [`2ad469d`](https://github.com/apache/spark/commit/2ad469d20d3eda509fdf8e4245e09f5b5bf4ca46).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-222029556
  
    **[Test build #59432 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59432/consoleFull)** for PR 13318 at commit [`2ad469d`](https://github.com/apache/spark/commit/2ad469d20d3eda509fdf8e4245e09f5b5bf4ca46).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-221996288
  
    **[Test build #59418 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59418/consoleFull)** for PR 13318 at commit [`c72cdea`](https://github.com/apache/spark/commit/c72cdea99972204899de52ed3ed4cc206a128081).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13318: [SPARK-15391] [SQL] manage the temporary memory o...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r65753668
  
    --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java ---
    @@ -70,9 +70,14 @@ public int compare(PackedRecordPointer left, PackedRecordPointer right) {
         assert (initialSize > 0);
         this.initialSize = initialSize;
         this.useRadixSort = useRadixSort;
    -    this.memoryAllocationFactor = useRadixSort ? 2 : 1;
         this.array = consumer.allocateArray(initialSize);
    -    this.sorter = new Sorter<>(ShuffleSortDataFormat.INSTANCE);
    +    this.usableCapacity = calcCapacity();
    +  }
    +
    +  private int calcCapacity() {
    +    // Radix sort requires same amount of used memory as buffer, Tim sort requires
    +    // half of the used memory as buffer.
    --- End diff --
    
    @sameeragarwal I think you already had a test case for the worst case (there are two ordered part in the array, the shortest part will be copied into a temporary buffer, os it need 0.5 buffer space), it's doced in TimSort.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-221791831
  
    FYI "This PR also change the loadFactor of BytesToBytesMap to 0.5 (it was 0.75)" this is a pretty low load factor.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13318: [SPARK-15391] [SQL] manage the temporary memory of timso...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13318
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59949/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13318: [SPARK-15391] [SQL] manage the temporary memory o...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r65628155
  
    --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java ---
    @@ -70,9 +70,14 @@ public int compare(PackedRecordPointer left, PackedRecordPointer right) {
         assert (initialSize > 0);
         this.initialSize = initialSize;
         this.useRadixSort = useRadixSort;
    -    this.memoryAllocationFactor = useRadixSort ? 2 : 1;
         this.array = consumer.allocateArray(initialSize);
    -    this.sorter = new Sorter<>(ShuffleSortDataFormat.INSTANCE);
    +    this.usableCapacity = calcCapacity();
    +  }
    +
    +  private int calcCapacity() {
    +    // Radix sort requires same amount of used memory as buffer, Tim sort requires
    +    // half of the used memory as buffer.
    --- End diff --
    
    can we also update the comments above while instantiating `array` to better indicate the contents of the array (and talk about the additional buffer that's needed for sorting)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-222068302
  
    **[Test build #59465 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59465/consoleFull)** for PR 13318 at commit [`8b4e033`](https://github.com/apache/spark/commit/8b4e0337da43d65bc5b8b80862bc257a001fd16f).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-222024730
  
    **[Test build #3022 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3022/consoleFull)** for PR 13318 at commit [`c72cdea`](https://github.com/apache/spark/commit/c72cdea99972204899de52ed3ed4cc206a128081).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r64781946
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java ---
    @@ -86,14 +88,16 @@ public UnsafeKVExternalSorter(
             prefixComparator,
             /* initialSize */ 4096,
             pageSizeBytes,
    -        keySchema.length() == 1 && SortPrefixUtils.canSortFullyWithPrefix(keySchema.apply(0)));
    +        canUseRadixSort);
         } else {
    +      // The array will be used to do in-place sort, which require half of the space to be empty.
    +      assert(map.numKeys() <= map.getArray().size() / 2);
           // During spilling, the array in map will not be used, so we can borrow that and use it
           // as the underline array for in-memory sorter (it's always large enough).
           // Since we will not grow the array, it's fine to pass `null` as consumer.
           final UnsafeInMemorySorter inMemSorter = new UnsafeInMemorySorter(
             null, taskMemoryManager, recordComparator, prefixComparator, map.getArray(),
    -        false /* TODO(ekl) we can only radix sort if the BytesToBytes load factor is <= 0.5 */);
    +        canUseRadixSort);
    --- End diff --
    
    We only need to make sure that only half of array could be used. The load factor could be different. For example, we could move the check into TungstenAggregate without changing the load factor of BytesToBytesMap.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-222024003
  
    **[Test build #3021 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3021/consoleFull)** for PR 13318 at commit [`c72cdea`](https://github.com/apache/spark/commit/c72cdea99972204899de52ed3ed4cc206a128081).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-222029567
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59432/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13318: [SPARK-15391] [SQL] manage the temporary memory o...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r65754375
  
    --- Diff: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java ---
    @@ -221,7 +221,8 @@ public BytesToBytesMap(
           SparkEnv.get() != null ? SparkEnv.get().blockManager() :  null,
           SparkEnv.get() != null ? SparkEnv.get().serializerManager() :  null,
           initialCapacity,
    -      0.70,
    +      // In order to re-use the longArray for sorting, the load factor cannot be larger than 0.5.
    +      0.5,
    --- End diff --
    
    This change also us to use radix sort when switching happens, so it's not always regression of performance. It's hard to measure the impact for different workloads, I'd like to prefer for robustness instead of small performance improvements for some workload.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-222068317
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13318: [SPARK-15391] [SQL] manage the temporary memory o...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13318#discussion_r65757237
  
    --- Diff: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java ---
    @@ -221,7 +221,8 @@ public BytesToBytesMap(
           SparkEnv.get() != null ? SparkEnv.get().blockManager() :  null,
           SparkEnv.get() != null ? SparkEnv.get().serializerManager() :  null,
           initialCapacity,
    -      0.70,
    +      // In order to re-use the longArray for sorting, the load factor cannot be larger than 0.5.
    +      0.5,
    --- End diff --
    
    sure, that sounds fair


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-221969935
  
    **[Test build #59405 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59405/consoleFull)** for PR 13318 at commit [`f651956`](https://github.com/apache/spark/commit/f651956723cc36465458ee3a802e331dfc108968).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15391] [SQL] manage the temporary memor...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13318#issuecomment-221809821
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59350/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org