You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by aarondav <gi...@git.apache.org> on 2014/07/20 22:39:24 UTC

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

GitHub user aarondav opened a pull request:

    https://github.com/apache/spark/pull/1502

    SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage

    ### Why and what?
    Currently, the AppendOnlyMap performs an "in-place" sort by converting its array of [key, value, key, value] pairs into a an array of [(key, value), (key, value)] pairs. However, this causes us to allocate many Tuple2 objects, which come at a nontrivial overhead.
    
    This patch adds a Sorter API, intended for in memory sorts, which simply ports the Java OpenJDK 6 implementation of Arrays.sort() (which uses a merge sort) and abstracts the interface in a way which introduces no more than 1 virtual function invocation of overhead at each abstraction point.
    
    Please compare our port of the Java 6 sort with the original implementation: http://www.diffchecker.com/kh9ufcqo
    
    ### Memory implications
    An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements within its underlying array. Each of these elements is 4 bytes wide in a [compressed OOP](https://wikis.oracle.com/display/HotSpotInternals/CompressedOops) system, which is the default.
    
    Today's approach immediately allocates N Tuple2 objects, which take up 24N bytes in total (exposed via YourKit), and undergoes a Java sort. The Java 6 version immediately copies the entire array (4N bytes here), while the Java 7 version has a worst-case allocation of half the array (2N bytes).
    This results in a sorting overhead of 24N + 4N = 28N bytes (for Java 6).
    
    The Sorter does not require allocating any tuples, but since it uses the Java 6 merge sort algorithm, it does copy the entire array (and that is the entire array, not just the half needed for Tuples).
    This results in a sorting overhead of 8N bytes.
    
    Thus, we have reduced the overhead of the sort by roughly 20 bytes times the number of elements.
    
    ### Performance implications
    As the destructiveSortedIterator is used for spilling in an ExternalAppendOnlyMap, the purpose of this patch is to provide stability by reducing memory usage rather than improve performance.
    
    Indeed, this PR implements Java 6's merge sort rather than the Java 7 Timsort, which is much more performant. A future optimization is to port the Timsort over, which the SortDataFormat API should support with minimal changes.
    
    Nevertheless, here are the results of a microbenchmark that sorted 25 million, randomly distributed (Float, Int) pairs. The Java Arrays.sort() tests were run **only on the keys**, and thus moved less data. Our current implementation is called "Tuple-sort using Arrays.sort()". 
    
    <table>
    <tr><th>Test</th><th>First run (JDK6)</th><th>Average of 10 (JDK6)</th><th>First run (JDK7)</th><th>Average of 10 (JDK7)</th></tr>
    <tr><td>primitive Arrays.sort()</td><td>3216 ms</td><td>1190 ms</td><td>2724 ms</td><td>131 ms (!!)</td></tr>
    <tr><td>Arrays.sort()</td><td>18564 ms</td><td>2006 ms</td><td>13201 ms</td><td>878 ms</td></tr>
    <tr><td>Tuple-sort using Arrays.sort()</td><td>31813 ms</td><td>3550 ms</td><td>20990 ms</td><td>1919 ms</td></tr>
    <tr><td><b>KV-array using Sorter</b></td><td></td><td></td><td><b>18232 ms</b></td><td><b>2030 ms</b></td></tr>
    <tr><td>Microbenchmarks are stupid</td><td></td><td></td><td>25708 ms</td><td>2400 ms</td></tr>
    </table>
    
    Note that the final test was the same as "KV-array using Sorter", but with a second implementation of SortDataFormat loaded in the JVM (presumably causing a de-opt for the virtual function call shortcircuit).
    
    The results show that this Sorter performs exactly as expected -- it is about as fast as the Java 6 Arrays.sort() (which shares the same algorithm), but is significantly faster than the Tuple-sort on Java 6.
    
    The Java 7 Timsort provided a huge speedup in this benchmark, suggesting that using it instead would result in roughly a 2x speedup. However, the Tuple-based approach is still not significantly faster despite the much better algorithm.
    
    In short, this patch should significantly improve performance for users running Java 6, and provide a minor performance degradation for users running Java 7.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aarondav/spark sort

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1502.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1502
    
----
commit 6307338402e7872368bf61993b1be4ed7302cb6e
Author: Aaron Davidson <aa...@databricks.com>
Date:   2014-07-19T19:21:29Z

    SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage
    
    Currently, the AppendOnlyMap performs an "in-place" sort by converting
    its array of [key, value, key, value] pairs into a an array of
    [(key, value), (key, value)] pairs. However, this causes us to allocate
    many Tuple2 objects, which come at a nontrivial overhead.
    
    This patch adds a Sorter API, intended for in memory sorts, which simply
    ports the Java OpenJDK 6 implementation of Arrays.sort() (which uses
    a merge sort) and abstracts the interface in a way which introduces no more
    than 1 virtual function invocation of overhead at each abstraction point.
    
    Please compare our port of the Java 6 sort with the original implementation:
        http://www.diffchecker.com/kh9ufcqo
    
    === Memory implications ===
    An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements
    within its underlying array. Each of these elements is 4 bytes wide in a
    [compressed OOP](https://wikis.oracle.com/display/HotSpotInternals/CompressedOops) system, which is the default.
    
    Today's approach immediately allocates N Tuple2 objects, which take up 24N bytes
    in total (exposed via YourKit), and undergoes a Java sort. The Java 6 version
    immediately copies the entire array (4N bytes here), while the Java 7
    version has a worst-case allocation of half the array (2N bytes).
    This results in a sorting overhead of 24N + 4N = 28N bytes (for Java 6).
    
    The Sorter does not require allocating any tuples, but since it uses the Java 6
    merge sort algorithm, it does copy the entire array (and that is the entire array,
    not just the half needed for Tuples).
    This results in a sorting overhead of 8N bytes.
    
    Thus, we have reduced the overhead of the sort by roughly 20 bytes times the
    number of elements.
    
    === Performance implications ===
    As the destructiveSortedIterator is used for spilling in an ExternalAppendOnlyMap,
    the purpose of this patch is to provide stability by reducing memory usage rather
    than improve performance.
    
    Indeed, this PR implements Java 6's merge sort rather than the Java 7 Timsort, which
    is much more performant. A future optimization is to port the Timsort over, which the
    SortDataFormat API should support with minimal changes.
    
    Nevertheless, here are the results of a microbenchmark that sorted 25 million, randomly
    distributed (Float, Int) pairs. The Java Arrays.sort() tests were run **only on the keys**,
    and thus moved less data. Our current implementation is called "Tuple-sort using Arrays.sort()".
    
    <table>
    <tr><th>Java version</th><th>Test</th><th>First run</th><th>Average of 10</th></tr>
    <tr><td>6</td><td>primitive Arrays.sort()</td><td>3216 ms</td><td>1190 ms</td></tr>
    <tr><td>6</td><td>Arrays.sort()</td><td>18564 ms</td><td>2006 ms</td></tr>
    <tr><td>6</td><td>Tuple-sort using Arrays.sort()</td><td>31813 ms</td><td>3550 ms</td></tr>
    <tr><td>7</td><td>primitive Arrays.sort()</td><td>2724 ms</td><td>131 ms (!!)</td></tr>
    <tr><td>7</td><td>Arrays.sort()</td><td>13201 ms</td><td>878 ms</td></tr>
    <tr><td>7</td><td>Tuple-sort using Arrays.sort()</td><td>20990 ms</td><td>1919 ms</td></tr>
    <tr><td>7</td><td>**KV-sort using Sorter**</td><td>**18232 ms**</td><td>**2030 ms**</td></tr>
    <tr><td>7</td><td>Microbenchmarks are stupid</td><td>25708 ms</td><td>2400 ms</td></tr>
    </table>
    
    Note that the final test was the same as KV-sort using Sorter, but with a second
    impelementation of SortDataFormat loaded in the JVM (presumably causing a de-opt
    for the virtual function call shortcircuit).
    
    The results show that this Sorter performs exactly as expected -- it is about as fast as the
    Java 6 Arrays.sort() (which shares the same algorithm), but is significantly faster than
    the Tuple-sort on Java 6.
    
    The Java 7 Timsort provided a huge speedup in this benchmark, suggesting that using it instead
    would result in roughly a 2x speedup. However, the Tuple-based approach is still not
    significantly faster despite the much better algorithm.
    
    In short, this patch should significantly improve performance on users running Java 6, and
    provide a minor performance degradation for users running Java 7.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49769068
  
    QA tests have started for PR 1502. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16974/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49564530
  
    QA results for PR 1502:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).<br>class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {<br>class Sorter<K, Buffer> {<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16883/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49579153
  
    License updated, as well as performance numbers. Our memory overhead is now very low on certain workloads (4*N bytes worst case scenario, but experimental results have shown remarkably little scratch space allocated), and our performance is now 2-4 times better than the previous implementation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49575204
  
    QA tests have started for PR 1502. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16898/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49576105
  
    QA tests have started for PR 1502. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16902/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1502#discussion_r15214488
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala ---
    @@ -252,7 +251,7 @@ class ExternalAppendOnlyMap[K, V, C](
           if (it.hasNext) {
             var kc = it.next()
             kcPairs += kc
    -        val minHash = getKeyHashCode(kc)
    +        val minHash = hashKey(kc)
    --- End diff --
    
    Just curious, is this more efficient than calling ExternalAppendOnlyMap.hash directly as we did before? It was kind of weird that we were doing it in an inner class, maybe it created another pointer derefernece.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1502#discussion_r15241580
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala ---
    @@ -252,7 +251,7 @@ class ExternalAppendOnlyMap[K, V, C](
           if (it.hasNext) {
             var kc = it.next()
             kcPairs += kc
    -        val minHash = getKeyHashCode(kc)
    +        val minHash = hashKey(kc)
    --- End diff --
    
    Ah, got it, that makes a lot of sense. I actually ran into this problem before (calling == or hashCode on the (K, C) pair instead of the K).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1502#discussion_r15239395
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala ---
    @@ -252,7 +251,7 @@ class ExternalAppendOnlyMap[K, V, C](
           if (it.hasNext) {
             var kc = it.next()
             kcPairs += kc
    -        val minHash = getKeyHashCode(kc)
    +        val minHash = hashKey(kc)
    --- End diff --
    
    This isn't for efficiency, it's actually for type safety. I changed this because when I changed the meaning of getKeyHashCode to just hash(whatever), these calls were not compile errors, though they no longer did the right thing. Now we actually ensure you pass in a (K, C) pair rather than just a Tuple2 (or, now, any object).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49784327
  
    Looks like this actually passed unit tests but there's a binary compatibility thing introduced by a previous commit to MLlib.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49717790
  
    QA results for PR 1502:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).<br>class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {<br>class Sorter<K, Buffer> {<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16953/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1502#discussion_r15215866
  
    --- Diff: LICENSE ---
    @@ -483,6 +483,24 @@ SUCH DAMAGE.
     
     
     ========================================================================
    +For Timsort:
    --- End diff --
    
    Say which source file this is in (core/src/main/java/...)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49585784
  
    QA results for PR 1502:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).<br>class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {<br>class Sorter<K, Buffer> {<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16905/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49576500
  
    Hey Aaron, make sure you add a note in the LICENSE file saying this part of the code is from Android (similar to the other notes there).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49784482
  
    Actually Xiangrui said he fixed that, so I'm going to merge this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49561728
  
    QA results for PR 1502:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).<br>trait SortDataFormat[K, Buffer] extends Any {<br>class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {<br>class Sorter<K, Buffer> {<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16881/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49575156
  
    He did it!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49581041
  
    QA results for PR 1502:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).<br>class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {<br>class Sorter<K, Buffer> {<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16898/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49562265
  
    QA tests have started for PR 1502. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16883/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1502


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49577557
  
    QA tests have started for PR 1502. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16903/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49566043
  
    The "added public classes" are private[spark] and package-private to org.apache.spark.util, repsectively.
    
    @mateiz Please take a brief look at the API to make sure it's still suitable for your SizeTrackingCollection's destructiveSortedIterator. The only major change to that API was that the Comparator now only takes the key instead of a (K, C) pair.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49574913
  
    I spoke with @aarondav, but I'm not sure we can borrow this code from Java if it is LGPL licensed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49558941
  
    QA tests have started for PR 1502. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16881/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49582954
  
    QA results for PR 1502:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).<br>class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {<br>class Sorter<K, Buffer> {<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16902/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49575933
  
    In light of that minor issue, I have ported an Apache v2 Timsort (from the Android repos). It's a bit longer, but far more performant (roughly twice as fast!)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49584160
  
    QA results for PR 1502:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).<br>class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {<br>class Sorter<K, Buffer> {<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16903/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49578187
  
    QA tests have started for PR 1502. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16905/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49783108
  
    QA results for PR 1502:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>class Sorter<K, Buffer> {<br>* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).<br>class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16974/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49576760
  
    BTW the API looks good to me! Actually it might allow more efficient ways to keep track of the partition ID for each key in my case too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49586631
  
    QA results for PR 1502:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).<br>class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {<br>class Sorter<K, Buffer> {<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16909/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49708347
  
    QA tests have started for PR 1502. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16953/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1502#discussion_r15215883
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/Sorter.java ---
    @@ -0,0 +1,915 @@
    +/*
    --- End diff --
    
    This is a Java file, but you put it in the src/main/scala folder! That works with SBT but it's pretty confusing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49709588
  
    This looks good to me other than the question above. This will be pretty cool for sorting multiple columnar data structures.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49707787
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49559170
  
    Cool. What about P^3 sort? :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49634069
  
    (This PR passed Jenkins 3 times and then failed inside HiveContext -- it's probably OK. I submitted https://github.com/apache/spark/pull/1514 to fix the flakey test.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1502#issuecomment-49579335
  
    QA tests have started for PR 1502. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16909/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---