You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2015/10/30 20:12:47 UTC

[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/9383

    [SPARK-11425] Improve Hybrid aggregation

    After aggregation, the dataset could be smaller than inputs, so it's better to do hash based aggregation for all inputs, then using sort based aggregation to merge them.
    
    This PR include #9381, will rebase once that one is merged.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark fix_switch

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9383.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9383
    
----
commit 374ec3b8406c4da25382d65bb0118671da54cda1
Author: Davies Liu <da...@databricks.com>
Date:   2015-10-29T20:10:35Z

    remove prepare

commit fed4ab64d0865a1b6a0d8e3cc443a45e44fd7ad4
Author: Davies Liu <da...@databricks.com>
Date:   2015-10-30T17:53:47Z

    remove dead code

commit 258d27524ddbd0b8df62c27a269f74b2a819027c
Author: Davies Liu <da...@databricks.com>
Date:   2015-10-30T19:03:41Z

    fix mima

commit 5707f5b3421fdab0b01ee2a66acf50b59752152b
Author: Davies Liu <da...@databricks.com>
Date:   2015-10-30T19:08:29Z

    Do hash-based aggregation for all records before switch to sort-based

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43567962
  
    --- Diff: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java ---
    @@ -860,45 +844,48 @@ void growAndRehash() {
           resizeStartTime = System.nanoTime();
         }
         // Store references to the old data structures to be used when we re-hash
    -    final LongArray oldLongArray = longArray;
    -    final BitSet oldBitSet = bitset;
    -    final int oldCapacity = (int) oldBitSet.capacity();
    -
    -    // Allocate the new data structures
    -    try {
    -      allocate(Math.min(growthStrategy.nextCapacity(oldCapacity), MAX_CAPACITY));
    -    } catch (OutOfMemoryError oom) {
    -      longArray = oldLongArray;
    -      bitset = oldBitSet;
    -      throw oom;
    -    }
    +    final long[] oldLongArray = longArray;
    +    allocate(Math.min(growthStrategy.nextCapacity(longArray.length / 2), MAX_CAPACITY));
    --- End diff --
    
    Don't we need to catch `OutOfMemoryError` anymore?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153121661
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153496072
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152670962
  
    **[Test build #44714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44714/consoleFull)** for PR 9383 at commit [`0512f1e`](https://github.com/apache/spark/commit/0512f1e5f91b81af669e7fdbee26da764e9deb02).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153500615
  
    @davies, @yhuai: I tried taking two sample regression test queries that you listed here and putting them into AggregationQuerySuite, which should automatically trigger spilling after a small number of records, but I wasn't able to reproduce the exception here. How did you get this to fail?
    
    I want to make sure that the regression test that we add is actually capable of failing prior to this change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152683162
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43846549
  
    --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java ---
    @@ -386,24 +388,37 @@ public void insertKVRecord(Object keyBase, long keyOffset, int keyLen,
       }
     
       /**
    +   * Merges another UnsafeExternalSorters into this one, the other one will be emptied.
    +   *
    +   * @throws IOException
    +   */
    +  public void merge(UnsafeExternalSorter other) throws IOException {
    +    other.spill();
    +    spillWriters.addAll(other.spillWriters);
    +    // remove them from `spillWriters`, or the files will be deleted in `cleanupResources`.
    +    other.spillWriters.clear();
    +    other.cleanupResources();
    +  }
    +
    +  /**
        * Returns a sorted iterator. It is the caller's responsibility to call `cleanupResources()`
        * after consuming this iterator.
        */
       public UnsafeSorterIterator getSortedIterator() throws IOException {
    -    assert(inMemSorter != null);
    -    readingIterator = new SpillableIterator(inMemSorter.getSortedIterator());
    -    int numIteratorsToMerge = spillWriters.size() + (readingIterator.hasNext() ? 1 : 0);
         if (spillWriters.isEmpty()) {
    +      assert(inMemSorter != null);
    +      readingIterator = new SpillableIterator(inMemSorter.getSortedIterator());
           return readingIterator;
         } else {
           final UnsafeSorterSpillMerger spillMerger =
    -        new UnsafeSorterSpillMerger(recordComparator, prefixComparator, numIteratorsToMerge);
    +        new UnsafeSorterSpillMerger(recordComparator, prefixComparator, spillWriters.size());
           for (UnsafeSorterSpillWriter spillWriter : spillWriters) {
             spillMerger.addSpillIfNotEmpty(spillWriter.getReader(blockManager));
           }
    -      spillWriters.clear();
    --- End diff --
    
    @JoshRosen Do you remember why we need to clear this? Once cleared, how to delete the spilled files?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152622322
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153485327
  
    Maybe `sqlContext.range(100000).selectExpr("id", "repeat(id, 20) as s").groupBy("s").agg($"s")` is a better case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152697137
  
    **[Test build #44728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44728/consoleFull)** for PR 9383 at commit [`28f84e1`](https://github.com/apache/spark/commit/28f84e1f2703547a5102ee5ae277e5dfbd522403).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153926678
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45077/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152670739
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153573130
  
    **[Test build #44997 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44997/consoleFull)** for PR 9383 at commit [`fc5e052`](https://github.com/apache/spark/commit/fc5e052ff17560d02ef7cdeec91a4a30605c65f0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153902810
  
    LGTM pending jenkins.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153447782
  
    @JoshRosen When the grouping key is StringType, but no column in buffer, then got:
    ```
    java.lang.UnsupportedOperationException
    	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:193)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(generated.java:40)
    	at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:643)
    	at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:517)
    	at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:779)
    	at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:128)
    	at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$3.apply(TungstenAggregate.scala:137)
    	at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$3.apply(TungstenAggregate.scala:137)
    	at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    	at org.apache.spark.scheduler.Task.run(Task.scala:88)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    	at java.lang.Thread.run(Thread.java:745)
    15/10/28 23:25:08 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3, runningTasks: 0
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153496037
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153552008
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153189866
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44830/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153590796
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153201802
  
    ping @yhuai @JoshRosen 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153527227
  
    **[Test build #44959 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44959/consoleFull)** for PR 9383 at commit [`6f3bb15`](https://github.com/apache/spark/commit/6f3bb15b19cd326f677f15860cf215f57fd3671a).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153121113
  
    **[Test build #44820 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44820/consoleFull)** for PR 9383 at commit [`53dbdf2`](https://github.com/apache/spark/commit/53dbdf2d4c8c547e6bd50a589bf0223e7ce95e84).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152622296
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153127602
  
    **[Test build #44823 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44823/consoleFull)** for PR 9383 at commit [`2e341f5`](https://github.com/apache/spark/commit/2e341f50b656d0effe36004b6abc68898a119f35).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153125548
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43846718
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala ---
    @@ -589,6 +588,13 @@ abstract class AggregationQuerySuite extends QueryTest with SQLTestUtils with Te
         }
       }
     
    +  test("no aggregation function") {
    +    val df = sqlContext.range(20).selectExpr("id", "repeat(id, 1) as s")
    +      .groupBy("s").count()
    +      .groupBy().count()
    +    checkAnswer(df, Row(20) :: Nil)
    +  }
    --- End diff --
    
    Added JIRA number


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43784097
  
    --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java ---
    @@ -386,6 +386,18 @@ public void insertKVRecord(Object keyBase, long keyOffset, int keyLen,
       }
     
       /**
    +   * Merges another UnsafeExternalSorters into this one, the other one will be emptied.
    +   *
    +   * @throws IOException
    +   */
    +  public void merge(UnsafeExternalSorter other) throws IOException {
    +    other.spill();
    +    spillWriters.addAll(other.spillWriters);
    +    other.spillWriters.clear();
    +    other.cleanupResources();
    --- End diff --
    
    Ordinarily, this would end up deleting the spill files, but it doesn't because of the `spillWriters.clear()` call above. If you end up updating this patch, mind adding a one-line comment to explain this (since it's a subtle point)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153899733
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153165332
  
    **[Test build #1970 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1970/consoleFull)** for PR 9383 at commit [`df44fc6`](https://github.com/apache/spark/commit/df44fc64ed1495a1c0f6f51a7014327b6a8750b7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153551889
  
    **[Test build #44972 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44972/consoleFull)** for PR 9383 at commit [`6f3bb15`](https://github.com/apache/spark/commit/6f3bb15b19cd326f677f15860cf215f57fd3671a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153925290
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43567867
  
    --- Diff: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java ---
    @@ -860,45 +844,48 @@ void growAndRehash() {
           resizeStartTime = System.nanoTime();
         }
         // Store references to the old data structures to be used when we re-hash
    -    final LongArray oldLongArray = longArray;
    -    final BitSet oldBitSet = bitset;
    -    final int oldCapacity = (int) oldBitSet.capacity();
    -
    -    // Allocate the new data structures
    -    try {
    -      allocate(Math.min(growthStrategy.nextCapacity(oldCapacity), MAX_CAPACITY));
    -    } catch (OutOfMemoryError oom) {
    -      longArray = oldLongArray;
    -      bitset = oldBitSet;
    -      throw oom;
    -    }
    +    final long[] oldLongArray = longArray;
    +    allocate(Math.min(growthStrategy.nextCapacity(longArray.length / 2), MAX_CAPACITY));
     
         // Re-mask (we don't recompute the hashcode because we stored all 32 bits of it)
    -    for (int pos = oldBitSet.nextSetBit(0); pos >= 0; pos = oldBitSet.nextSetBit(pos + 1)) {
    -      final long keyPointer = oldLongArray.get(pos * 2);
    -      final int hashcode = (int) oldLongArray.get(pos * 2 + 1);
    +    for (int pos = 0; pos < oldLongArray.length; pos += 2) {
    +      final long keyPointer = oldLongArray[pos];
    +      final int hashcode = (int) oldLongArray[pos + 1];
           int newPos = hashcode & mask;
           int step = 1;
    -      boolean keepGoing = true;
    -
    -      // No need to check for equality here when we insert so this has one less if branch than
    -      // the similar code path in addWithoutResize.
    -      while (keepGoing) {
    -        if (!bitset.isSet(newPos)) {
    -          bitset.set(newPos);
    -          longArray.set(newPos * 2, keyPointer);
    -          longArray.set(newPos * 2 + 1, hashcode);
    -          keepGoing = false;
    -        } else {
    -          newPos = (newPos + step) & mask;
    -          step++;
    -        }
    +      while (longArray[newPos * 2] != 0) {
    +        newPos = (newPos + step) & mask;
    +        step++;
           }
    +      longArray[newPos * 2] = keyPointer;
    +      longArray[newPos * 2 + 1] = hashcode;
         }
    -    releaseMemory(oldLongArray.memoryBlock().size());
    +
    +    releaseMemory(oldLongArray.length * 8L);
     
         if (enablePerfMetrics) {
           timeSpentResizingNs += System.nanoTime() - resizeStartTime;
         }
       }
    +
    +  /**
    +   * Return the compacted long array.
    +   */
    +  public long[] getCompactArray() {
    --- End diff --
    
    No, after this, the map is broken, should be freed later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153926511
  
    **[Test build #45078 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45078/consoleFull)** for PR 9383 at commit [`b1f8a99`](https://github.com/apache/spark/commit/b1f8a99b27727212b29fbd3e7316fa1da3b5aa23).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43955709
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMapSuite.scala ---
    @@ -183,31 +180,30 @@ class UnsafeFixedWidthAggregationMapSuite
           false // disable perf metrics
         )
     
    +    var map = createMap()
         val keys = randomStrings(1024).take(512)
         keys.foreach { keyString =>
           val buf = map.getAggregationBuffer(InternalRow(UTF8String.fromString(keyString)))
           buf.setInt(0, keyString.length)
           assert(buf != null)
         }
    -
    -    // Convert the map into a sorter
         val sorter = map.destructAndCreateExternalSorter()
     
         // Add more keys to the sorter and make sure the results come out sorted.
         val additionalKeys = randomStrings(1024)
    -    val keyConverter = UnsafeProjection.create(groupKeySchema)
    -    val valueConverter = UnsafeProjection.create(aggBufferSchema)
    -
    +    map = createMap()
         additionalKeys.zipWithIndex.foreach { case (str, i) =>
    -      val k = InternalRow(UTF8String.fromString(str))
    -      val v = InternalRow(str.length)
    -      sorter.insertKV(keyConverter.apply(k), valueConverter.apply(v))
    +      val buf = map.getAggregationBuffer(InternalRow(UTF8String.fromString(str)))
    +      buf.setInt(0, str.length)
     
           if ((i % 100) == 0) {
    -        memoryManager.markExecutionAsOutOfMemoryOnce()
    -        sorter.closeCurrentPage()
    +        val sorter2 = map.destructAndCreateExternalSorter()
    +        sorter.merge(sorter2)
    +        map = createMap()
           }
         }
    +    val sorter2 = map.destructAndCreateExternalSorter()
    +    sorter.merge(sorter2)
    --- End diff --
    
    The old behavior will be used anymore, so all the tests are updated (to reflect the real use case).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153596433
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45000/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153618311
  
    **[Test build #44997 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44997/consoleFull)** for PR 9383 at commit [`fc5e052`](https://github.com/apache/spark/commit/fc5e052ff17560d02ef7cdeec91a4a30605c65f0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152628640
  
    **[Test build #44700 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44700/consoleFull)** for PR 9383 at commit [`764f540`](https://github.com/apache/spark/commit/764f5406b05606a4c42c414357e43cac77f30511).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153902056
  
    **[Test build #45062 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45062/consoleFull)** for PR 9383 at commit [`10d7169`](https://github.com/apache/spark/commit/10d71694ae07af68265bb36a957b4ff5320d8e72).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153643853
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45001/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152671951
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152697243
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44728/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152663282
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44708/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153643851
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153902064
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45062/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43786023
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMapSuite.scala ---
    @@ -291,28 +278,27 @@ class UnsafeFixedWidthAggregationMapSuite
         val sorter = map.destructAndCreateExternalSorter()
     
         // Add more keys to the sorter and make sure the results come out sorted.
    +    map = createMap()
         (1 to 4096).foreach { i =>
    -      sorter.insertKV(UnsafeRow.createFromByteArray(0, 0), UnsafeRow.createFromByteArray(0, 0))
    +      map.getAggregationBufferFromUnsafeRow(UnsafeRow.createFromByteArray(0, 0))
     
           if ((i % 100) == 0) {
    -        memoryManager.markExecutionAsOutOfMemoryOnce()
    -        sorter.closeCurrentPage()
    +        val sorter2 = map.destructAndCreateExternalSorter()
    +        sorter.merge(sorter2)
    +        map = createMap()
           }
         }
    +    val sorter2 = map.destructAndCreateExternalSorter()
    +    sorter.merge(sorter2)
     
         var count = 0
         val iter = sorter.sortedIterator()
         while (iter.next()) {
    -      // At here, we also test if copy is correct.
    -      iter.getKey.copy()
    --- End diff --
    
    Why remove this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43951003
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMap.java ---
    @@ -243,9 +243,8 @@ public void printPerfMetrics() {
        * Note that this destroys the map, and as a result, the map cannot be used anymore after this.
        */
       public UnsafeKVExternalSorter destructAndCreateExternalSorter() throws IOException {
    -    UnsafeKVExternalSorter sorter = new UnsafeKVExternalSorter(
    +    return new UnsafeKVExternalSorter(
           groupingKeySchema, aggregationBufferSchema,
           SparkEnv.get().blockManager(), map.getPageSizeBytes(), map);
    --- End diff --
    
    Also, I think we need to make it clear that `map` will not be freed at here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153164208
  
    **[Test build #44823 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44823/consoleFull)** for PR 9383 at commit [`2e341f5`](https://github.com/apache/spark/commit/2e341f50b656d0effe36004b6abc68898a119f35).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152662959
  
    **[Test build #44708 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44708/consoleFull)** for PR 9383 at commit [`55c47ed`](https://github.com/apache/spark/commit/55c47edac60f308f359cd98c00ae210ea65e2852).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153496764
  
    **[Test build #44959 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44959/consoleFull)** for PR 9383 at commit [`6f3bb15`](https://github.com/apache/spark/commit/6f3bb15b19cd326f677f15860cf215f57fd3671a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153121656
  
    **[Test build #44820 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44820/consoleFull)** for PR 9383 at commit [`53dbdf2`](https://github.com/apache/spark/commit/53dbdf2d4c8c547e6bd50a589bf0223e7ce95e84).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152661708
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152623885
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44698/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152630319
  
    **[Test build #44700 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44700/consoleFull)** for PR 9383 at commit [`764f540`](https://github.com/apache/spark/commit/764f5406b05606a4c42c414357e43cac77f30511).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152663276
  
    **[Test build #44708 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44708/consoleFull)** for PR 9383 at commit [`55c47ed`](https://github.com/apache/spark/commit/55c47edac60f308f359cd98c00ae210ea65e2852).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43827776
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala ---
    @@ -502,44 +511,37 @@ class TungstenAggregationIterator(
             processRow(buffer, newInput)
           }
         } else {
    -      while (!sortBased && inputIter.hasNext) {
    +      var i = 0
    +      while (inputIter.hasNext) {
             val newInput = inputIter.next()
             numInputRows += 1
             val groupingKey = groupProjection.apply(newInput)
    -        val buffer: UnsafeRow = hashMap.getAggregationBufferFromUnsafeRow(groupingKey)
    +        var buffer: UnsafeRow = null
    +        if (i < fallbackStartsAt) {
    +          buffer = hashMap.getAggregationBufferFromUnsafeRow(groupingKey)
    +        }
             if (buffer == null) {
    -          // buffer == null means that we could not allocate more memory.
    -          // Now, we need to spill the map and switch to sort-based aggregation.
    -          switchToSortBasedAggregation(groupingKey, newInput)
    -        } else {
    -          processRow(buffer, newInput)
    +          val sorter = hashMap.destructAndCreateExternalSorter()
    +          if (externalSorter == null) {
    +            externalSorter = sorter
    +          } else {
    +            externalSorter.merge(sorter)
    +          }
    +          i = 0
    +          hashMap = createHashMap()
    --- End diff --
    
    I mean from this part of code, it is not obvious when we free the map.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153643706
  
    **[Test build #45001 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45001/consoleFull)** for PR 9383 at commit [`1c0c6c3`](https://github.com/apache/spark/commit/1c0c6c36a5a16c33ceb4cd43534ce02ec3c2b286).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153466959
  
    This failure could be reproduce by 
    ```
    sqlContext.range(N).selectExpr("id", "repeat(id, 20) as s").groupBy("s").max("id").count()
    ```
    
    Maybe related to this
    https://github.com/apache/spark/pull/9038/files#diff-b83a65c9afcef83e053ecc0afecba1a4R633



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153201546
  
    **[Test build #1970 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1970/consoleFull)** for PR 9383 at commit [`df44fc6`](https://github.com/apache/spark/commit/df44fc64ed1495a1c0f6f51a7014327b6a8750b7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43950528
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMap.java ---
    @@ -243,9 +243,8 @@ public void printPerfMetrics() {
        * Note that this destroys the map, and as a result, the map cannot be used anymore after this.
        */
       public UnsafeKVExternalSorter destructAndCreateExternalSorter() throws IOException {
    -    UnsafeKVExternalSorter sorter = new UnsafeKVExternalSorter(
    +    return new UnsafeKVExternalSorter(
           groupingKeySchema, aggregationBufferSchema,
           SparkEnv.get().blockManager(), map.getPageSizeBytes(), map);
    --- End diff --
    
    Is this map usable after this call?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153528327
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43786236
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala ---
    @@ -762,15 +679,7 @@ class TungstenAggregationIterator(
       /**
        * Start processing input rows.
        */
    -  testFallbackStartsAt match {
    -    case None =>
    -      processInputs()
    -    case Some(fallbackStartsAt) =>
    -      // This is the testing path. processInputsWithControlledFallback is same as processInputs
    -      // except that it switches to sort-based aggregation after `fallbackStartsAt` input rows
    -      // have been processed.
    -      processInputsWithControlledFallback(fallbackStartsAt)
    -  }
    +  processInputs(testFallbackStartsAt.getOrElse(Int.MaxValue))
    --- End diff --
    
    The reason I'd like to have a really high limit is to avoid the branch, like `testFallbackStartsAt.isDefined() && i < testFallbackStartsAt.get()`. It's a common trick, I think.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153618536
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43954148
  
    --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java ---
    @@ -386,24 +388,37 @@ public void insertKVRecord(Object keyBase, long keyOffset, int keyLen,
       }
     
       /**
    +   * Merges another UnsafeExternalSorters into this one, the other one will be emptied.
    +   *
    +   * @throws IOException
    +   */
    +  public void merge(UnsafeExternalSorter other) throws IOException {
    +    other.spill();
    +    spillWriters.addAll(other.spillWriters);
    +    // remove them from `spillWriters`, or the files will be deleted in `cleanupResources`.
    +    other.spillWriters.clear();
    +    other.cleanupResources();
    +  }
    +
    +  /**
        * Returns a sorted iterator. It is the caller's responsibility to call `cleanupResources()`
        * after consuming this iterator.
        */
       public UnsafeSorterIterator getSortedIterator() throws IOException {
    -    assert(inMemSorter != null);
    -    readingIterator = new SpillableIterator(inMemSorter.getSortedIterator());
    -    int numIteratorsToMerge = spillWriters.size() + (readingIterator.hasNext() ? 1 : 0);
         if (spillWriters.isEmpty()) {
    +      assert(inMemSorter != null);
    +      readingIterator = new SpillableIterator(inMemSorter.getSortedIterator());
           return readingIterator;
         } else {
           final UnsafeSorterSpillMerger spillMerger =
    -        new UnsafeSorterSpillMerger(recordComparator, prefixComparator, numIteratorsToMerge);
    +        new UnsafeSorterSpillMerger(recordComparator, prefixComparator, spillWriters.size());
           for (UnsafeSorterSpillWriter spillWriter : spillWriters) {
             spillMerger.addSpillIfNotEmpty(spillWriter.getReader(blockManager));
           }
    -      spillWriters.clear();
    --- End diff --
    
    Chatted with @JoshRosen offline, we should not clear `spillWriters` here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43954658
  
    --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java ---
    @@ -386,24 +388,37 @@ public void insertKVRecord(Object keyBase, long keyOffset, int keyLen,
       }
     
       /**
    +   * Merges another UnsafeExternalSorters into this one, the other one will be emptied.
    +   *
    +   * @throws IOException
    +   */
    +  public void merge(UnsafeExternalSorter other) throws IOException {
    +    other.spill();
    +    spillWriters.addAll(other.spillWriters);
    +    // remove them from `spillWriters`, or the files will be deleted in `cleanupResources`.
    +    other.spillWriters.clear();
    +    other.cleanupResources();
    +  }
    +
    +  /**
        * Returns a sorted iterator. It is the caller's responsibility to call `cleanupResources()`
        * after consuming this iterator.
        */
       public UnsafeSorterIterator getSortedIterator() throws IOException {
    -    assert(inMemSorter != null);
    -    readingIterator = new SpillableIterator(inMemSorter.getSortedIterator());
    -    int numIteratorsToMerge = spillWriters.size() + (readingIterator.hasNext() ? 1 : 0);
         if (spillWriters.isEmpty()) {
    +      assert(inMemSorter != null);
    +      readingIterator = new SpillableIterator(inMemSorter.getSortedIterator());
           return readingIterator;
         } else {
           final UnsafeSorterSpillMerger spillMerger =
    -        new UnsafeSorterSpillMerger(recordComparator, prefixComparator, numIteratorsToMerge);
    +        new UnsafeSorterSpillMerger(recordComparator, prefixComparator, spillWriters.size());
           for (UnsafeSorterSpillWriter spillWriter : spillWriters) {
             spillMerger.addSpillIfNotEmpty(spillWriter.getReader(blockManager));
           }
    -      spillWriters.clear();
    --- End diff --
    
    just a note, we had a quick discussion. Seems we should not call `spillWriters.clear()`. Otherwise those spilled files will not be deleted.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43954523
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java ---
    @@ -83,11 +83,8 @@ public UnsafeKVExternalSorter(
             /* initialSize */ 4096,
             pageSizeBytes);
         } else {
    -      // The memory needed for UnsafeInMemorySorter should be less than longArray in map.
    -      map.freeArray();
    -      // The memory used by UnsafeInMemorySorter will be counted later (end of this block)
           final UnsafeInMemorySorter inMemSorter = new UnsafeInMemorySorter(
    -        taskMemoryManager, recordComparator, prefixComparator, Math.max(1, map.numElements()));
    +        taskMemoryManager, recordComparator, prefixComparator, map.getArray());
    --- End diff --
    
    yes


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152669491
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44710/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153926674
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153776895
  
    **[Test build #1978 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1978/consoleFull)** for PR 9383 at commit [`1c0c6c3`](https://github.com/apache/spark/commit/1c0c6c36a5a16c33ceb4cd43534ce02ec3c2b286).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43841671
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala ---
    @@ -502,44 +511,37 @@ class TungstenAggregationIterator(
             processRow(buffer, newInput)
           }
         } else {
    -      while (!sortBased && inputIter.hasNext) {
    +      var i = 0
    +      while (inputIter.hasNext) {
             val newInput = inputIter.next()
             numInputRows += 1
             val groupingKey = groupProjection.apply(newInput)
    -        val buffer: UnsafeRow = hashMap.getAggregationBufferFromUnsafeRow(groupingKey)
    +        var buffer: UnsafeRow = null
    +        if (i < fallbackStartsAt) {
    +          buffer = hashMap.getAggregationBufferFromUnsafeRow(groupingKey)
    +        }
             if (buffer == null) {
    -          // buffer == null means that we could not allocate more memory.
    -          // Now, we need to spill the map and switch to sort-based aggregation.
    -          switchToSortBasedAggregation(groupingKey, newInput)
    -        } else {
    -          processRow(buffer, newInput)
    +          val sorter = hashMap.destructAndCreateExternalSorter()
    +          if (externalSorter == null) {
    +            externalSorter = sorter
    +          } else {
    +            externalSorter.merge(sorter)
    +          }
    +          i = 0
    +          hashMap = createHashMap()
    --- End diff --
    
    It's freed in destructAndCreateExternalSorter, I think that's the mean of `destruct`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152670180
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153189652
  
    **[Test build #44830 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44830/consoleFull)** for PR 9383 at commit [`3864095`](https://github.com/apache/spark/commit/3864095906cb083a3a75db1811c5cc84312beeec).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152697605
  
    **[Test build #44730 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44730/consoleFull)** for PR 9383 at commit [`6fde4d5`](https://github.com/apache/spark/commit/6fde4d54ea7a6d3bf08684d545017e5a4a593b47).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153447090
  
    @JoshRosen I meant the old implementation of switching, it's broken in master (not 1.5 branch), had not figured out the root cause.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153572253
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152663280
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153591839
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153897490
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153451560
  
    I _do_ think that the change in this patch makes sense, but I'd like to really get to the bottom of why the old code malfunctioned.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152670753
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153164295
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153528306
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152711765
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44730/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153167316
  
    **[Test build #44834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44834/consoleFull)** for PR 9383 at commit [`df44fc6`](https://github.com/apache/spark/commit/df44fc64ed1495a1c0f6f51a7014327b6a8750b7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43950924
  
    --- Diff: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java ---
    @@ -647,7 +648,7 @@ public boolean putNewKey(Object keyBase, long keyOffset, int keyLength,
           assert(bitset != null);
           assert(longArray != null);
     
    -      if (numElements == MAX_CAPACITY || !canGrowArray) {
    +      if (numElements == MAX_CAPACITY || numElements > growthThreshold && !canGrowArray) {
    --- End diff --
    
    No, we check the space in page later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43949296
  
    --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java ---
    @@ -386,24 +388,37 @@ public void insertKVRecord(Object keyBase, long keyOffset, int keyLen,
       }
     
       /**
    +   * Merges another UnsafeExternalSorters into this one, the other one will be emptied.
    +   *
    +   * @throws IOException
    +   */
    +  public void merge(UnsafeExternalSorter other) throws IOException {
    +    other.spill();
    +    spillWriters.addAll(other.spillWriters);
    +    // remove them from `spillWriters`, or the files will be deleted in `cleanupResources`.
    +    other.spillWriters.clear();
    +    other.cleanupResources();
    +  }
    +
    +  /**
        * Returns a sorted iterator. It is the caller's responsibility to call `cleanupResources()`
        * after consuming this iterator.
        */
       public UnsafeSorterIterator getSortedIterator() throws IOException {
    -    assert(inMemSorter != null);
    -    readingIterator = new SpillableIterator(inMemSorter.getSortedIterator());
    -    int numIteratorsToMerge = spillWriters.size() + (readingIterator.hasNext() ? 1 : 0);
         if (spillWriters.isEmpty()) {
    +      assert(inMemSorter != null);
    +      readingIterator = new SpillableIterator(inMemSorter.getSortedIterator());
           return readingIterator;
         } else {
           final UnsafeSorterSpillMerger spillMerger =
    -        new UnsafeSorterSpillMerger(recordComparator, prefixComparator, numIteratorsToMerge);
    +        new UnsafeSorterSpillMerger(recordComparator, prefixComparator, spillWriters.size());
           for (UnsafeSorterSpillWriter spillWriter : spillWriters) {
             spillMerger.addSpillIfNotEmpty(spillWriter.getReader(blockManager));
           }
    -      spillWriters.clear();
    --- End diff --
    
    ping @JoshRosen 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153897527
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43785456
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala ---
    @@ -473,24 +473,28 @@ class TungstenAggregationIterator(
       // Part 3: Methods and fields used by hash-based aggregation.
       ///////////////////////////////////////////////////////////////////////////
     
    +  private def createHashMap(): UnsafeFixedWidthAggregationMap = {
    +    new UnsafeFixedWidthAggregationMap(
    +      initialAggregationBuffer,
    +      StructType.fromAttributes(allAggregateFunctions.flatMap(_.aggBufferAttributes)),
    +      StructType.fromAttributes(groupingExpressions.map(_.toAttribute)),
    +      TaskContext.get().taskMemoryManager(),
    +      1024 * 16, // initial capacity
    +      TaskContext.get().taskMemoryManager().pageSizeBytes,
    +      false // disable tracking of performance metrics
    +    )
    +  }
    +
       // This is the hash map used for hash-based aggregation. It is backed by an
       // UnsafeFixedWidthAggregationMap and it is used to store
       // all groups and their corresponding aggregation buffers for hash-based aggregation.
    -  private[this] val hashMap = new UnsafeFixedWidthAggregationMap(
    -    initialAggregationBuffer,
    -    StructType.fromAttributes(allAggregateFunctions.flatMap(_.aggBufferAttributes)),
    -    StructType.fromAttributes(groupingExpressions.map(_.toAttribute)),
    -    TaskContext.get().taskMemoryManager(),
    -    1024 * 16, // initial capacity
    -    TaskContext.get().taskMemoryManager().pageSizeBytes,
    -    false // disable tracking of performance metrics
    -  )
    +  private[this] var hashMap = createHashMap()
     
       // The function used to read and process input rows. When processing input rows,
       // it first uses hash-based aggregation by putting groups and their buffers in
       // hashMap. If we could not allocate more memory for the map, we switch to
       // sort-based aggregation (by calling switchToSortBasedAggregation).
    --- End diff --
    
    Update comment to reflect the fact that we use multiple hash-maps, spilling after each becomes full then using sort to merge the spills?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153164925
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153205634
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44834/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153504624
  
    There was a problem in my test-selection regex... trying again.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43567640
  
    --- Diff: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java ---
    @@ -860,45 +844,48 @@ void growAndRehash() {
           resizeStartTime = System.nanoTime();
         }
         // Store references to the old data structures to be used when we re-hash
    -    final LongArray oldLongArray = longArray;
    -    final BitSet oldBitSet = bitset;
    -    final int oldCapacity = (int) oldBitSet.capacity();
    -
    -    // Allocate the new data structures
    -    try {
    -      allocate(Math.min(growthStrategy.nextCapacity(oldCapacity), MAX_CAPACITY));
    -    } catch (OutOfMemoryError oom) {
    -      longArray = oldLongArray;
    -      bitset = oldBitSet;
    -      throw oom;
    -    }
    +    final long[] oldLongArray = longArray;
    +    allocate(Math.min(growthStrategy.nextCapacity(longArray.length / 2), MAX_CAPACITY));
     
         // Re-mask (we don't recompute the hashcode because we stored all 32 bits of it)
    -    for (int pos = oldBitSet.nextSetBit(0); pos >= 0; pos = oldBitSet.nextSetBit(pos + 1)) {
    -      final long keyPointer = oldLongArray.get(pos * 2);
    -      final int hashcode = (int) oldLongArray.get(pos * 2 + 1);
    +    for (int pos = 0; pos < oldLongArray.length; pos += 2) {
    +      final long keyPointer = oldLongArray[pos];
    +      final int hashcode = (int) oldLongArray[pos + 1];
           int newPos = hashcode & mask;
           int step = 1;
    -      boolean keepGoing = true;
    -
    -      // No need to check for equality here when we insert so this has one less if branch than
    -      // the similar code path in addWithoutResize.
    -      while (keepGoing) {
    -        if (!bitset.isSet(newPos)) {
    -          bitset.set(newPos);
    -          longArray.set(newPos * 2, keyPointer);
    -          longArray.set(newPos * 2 + 1, hashcode);
    -          keepGoing = false;
    -        } else {
    -          newPos = (newPos + step) & mask;
    -          step++;
    -        }
    +      while (longArray[newPos * 2] != 0) {
    +        newPos = (newPos + step) & mask;
    +        step++;
           }
    +      longArray[newPos * 2] = keyPointer;
    +      longArray[newPos * 2 + 1] = hashcode;
         }
    -    releaseMemory(oldLongArray.memoryBlock().size());
    +
    +    releaseMemory(oldLongArray.length * 8L);
     
         if (enablePerfMetrics) {
           timeSpentResizingNs += System.nanoTime() - resizeStartTime;
         }
       }
    +
    +  /**
    +   * Return the compacted long array.
    +   */
    +  public long[] getCompactArray() {
    --- End diff --
    
    After call `getCompactArray`, the content of `longArray` is modified. Can this `BytesToBytesMap` be normally used later? Because the position in `longArray` for a key should be determined by `(keyBase, keyOffset, keyLength)`. If the positions are modified, can the methods such as `safeLookup` work?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153155439
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153591862
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152661686
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153125505
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152711759
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153503203
  
    I tried with master, it did failed with the sam exception:
    ```
    
    [info] *** 1 TEST FAILED ***
    [error] Failed: Total 6, Failed 1, Errors 0, Passed 5
    [error] Failed tests:
    [error] 	org.apache.spark.sql.hive.execution.TungstenAggregationQueryWithControlledFallbackSuite
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152697240
  
    **[Test build #44728 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44728/consoleFull)** for PR 9383 at commit [`28f84e1`](https://github.com/apache/spark/commit/28f84e1f2703547a5102ee5ae277e5dfbd522403).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152669489
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152696861
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43951086
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java ---
    @@ -122,11 +119,7 @@ public UnsafeKVExternalSorter(
             /* initialSize */ 4096,
             pageSizeBytes,
             inMemSorter);
    -
    -      sorter.spill();
    -      map.free();
    -      // counting the memory used UnsafeInMemorySorter
    -      taskMemoryManager.acquireExecutionMemory(inMemSorter.getMemoryUsage(), sorter);
    +      map.reset();
    --- End diff --
    
    Yes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152669192
  
    **[Test build #44710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44710/consoleFull)** for PR 9383 at commit [`752c8e7`](https://github.com/apache/spark/commit/752c8e70e663d993e3f0586d99ddaef32df7a60a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153489816
  
    What do you think about adding a dedicated regression test for that one case as part of a separate PR? I'll try that now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153905171
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45064/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152625318
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152669486
  
    **[Test build #44710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44710/consoleFull)** for PR 9383 at commit [`752c8e7`](https://github.com/apache/spark/commit/752c8e70e663d993e3f0586d99ddaef32df7a60a).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153901178
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45060/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153595974
  
    **[Test build #45001 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45001/consoleFull)** for PR 9383 at commit [`1c0c6c3`](https://github.com/apache/spark/commit/1c0c6c36a5a16c33ceb4cd43534ce02ec3c2b286).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152711698
  
    **[Test build #44730 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44730/consoleFull)** for PR 9383 at commit [`6fde4d5`](https://github.com/apache/spark/commit/6fde4d54ea7a6d3bf08684d545017e5a4a593b47).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152683164
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44714/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43827738
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala ---
    @@ -502,44 +511,37 @@ class TungstenAggregationIterator(
             processRow(buffer, newInput)
           }
         } else {
    -      while (!sortBased && inputIter.hasNext) {
    +      var i = 0
    +      while (inputIter.hasNext) {
             val newInput = inputIter.next()
             numInputRows += 1
             val groupingKey = groupProjection.apply(newInput)
    -        val buffer: UnsafeRow = hashMap.getAggregationBufferFromUnsafeRow(groupingKey)
    +        var buffer: UnsafeRow = null
    +        if (i < fallbackStartsAt) {
    +          buffer = hashMap.getAggregationBufferFromUnsafeRow(groupingKey)
    +        }
             if (buffer == null) {
    -          // buffer == null means that we could not allocate more memory.
    -          // Now, we need to spill the map and switch to sort-based aggregation.
    -          switchToSortBasedAggregation(groupingKey, newInput)
    -        } else {
    -          processRow(buffer, newInput)
    +          val sorter = hashMap.destructAndCreateExternalSorter()
    +          if (externalSorter == null) {
    +            externalSorter = sorter
    +          } else {
    +            externalSorter.merge(sorter)
    +          }
    +          i = 0
    +          hashMap = createHashMap()
    --- End diff --
    
    Before we create the new map, I guess it will not hurt if we call `hashMap`? We just make our clear that the previous map will be freed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43827568
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala ---
    @@ -762,15 +679,7 @@ class TungstenAggregationIterator(
       /**
        * Start processing input rows.
        */
    -  testFallbackStartsAt match {
    -    case None =>
    -      processInputs()
    -    case Some(fallbackStartsAt) =>
    -      // This is the testing path. processInputsWithControlledFallback is same as processInputs
    -      // except that it switches to sort-based aggregation after `fallbackStartsAt` input rows
    -      // have been processed.
    -      processInputsWithControlledFallback(fallbackStartsAt)
    -  }
    +  processInputs(testFallbackStartsAt.getOrElse(Int.MaxValue))
    --- End diff --
    
    Each record needs 30+ bytes, it needs to have more than 60G memory for single task to trigger this spilling, I think that's fine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43828335
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala ---
    @@ -589,6 +588,13 @@ abstract class AggregationQuerySuite extends QueryTest with SQLTestUtils with Te
         }
       }
     
    +  test("no aggregation function") {
    +    val df = sqlContext.range(20).selectExpr("id", "repeat(id, 1) as s")
    +      .groupBy("s").count()
    +      .groupBy().count()
    +    checkAnswer(df, Row(20) :: Nil)
    +  }
    --- End diff --
    
    Maybe add a description of the original problem at here or just add a link?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153119226
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153595672
  
    @JoshRosen @yhuai pushed a refactor on this (reduce possibility of full GC by re-use the array in map), please take another look.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152630329
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152668968
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152623420
  
    **[Test build #44698 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44698/consoleFull)** for PR 9383 at commit [`5707f5b`](https://github.com/apache/spark/commit/5707f5b3421fdab0b01ee2a66acf50b59752152b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43786437
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMapSuite.scala ---
    @@ -291,28 +278,27 @@ class UnsafeFixedWidthAggregationMapSuite
         val sorter = map.destructAndCreateExternalSorter()
     
         // Add more keys to the sorter and make sure the results come out sorted.
    +    map = createMap()
         (1 to 4096).foreach { i =>
    -      sorter.insertKV(UnsafeRow.createFromByteArray(0, 0), UnsafeRow.createFromByteArray(0, 0))
    +      map.getAggregationBufferFromUnsafeRow(UnsafeRow.createFromByteArray(0, 0))
     
           if ((i % 100) == 0) {
    -        memoryManager.markExecutionAsOutOfMemoryOnce()
    -        sorter.closeCurrentPage()
    +        val sorter2 = map.destructAndCreateExternalSorter()
    +        sorter.merge(sorter2)
    +        map = createMap()
           }
         }
    +    val sorter2 = map.destructAndCreateExternalSorter()
    +    sorter.merge(sorter2)
     
         var count = 0
         val iter = sorter.sortedIterator()
         while (iter.next()) {
    -      // At here, we also test if copy is correct.
    -      iter.getKey.copy()
    -      iter.getValue.copy()
           count += 1
         }
     
    -    // 1 record was from the map and 4096 records were explicitly inserted.
    -    assert(count === 4097)
    -
    -    map.free()
    +    // 1 record in one map.
    +    assert(count === 42)
    --- End diff --
    
    The sorter will not combine the multiple records with same keys, but map will.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153492838
  
    I filed https://issues.apache.org/jira/browse/SPARK-11486 to describe the String aggregation bug.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43786489
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMapSuite.scala ---
    @@ -291,28 +278,27 @@ class UnsafeFixedWidthAggregationMapSuite
         val sorter = map.destructAndCreateExternalSorter()
     
         // Add more keys to the sorter and make sure the results come out sorted.
    +    map = createMap()
         (1 to 4096).foreach { i =>
    -      sorter.insertKV(UnsafeRow.createFromByteArray(0, 0), UnsafeRow.createFromByteArray(0, 0))
    +      map.getAggregationBufferFromUnsafeRow(UnsafeRow.createFromByteArray(0, 0))
     
           if ((i % 100) == 0) {
    -        memoryManager.markExecutionAsOutOfMemoryOnce()
    -        sorter.closeCurrentPage()
    +        val sorter2 = map.destructAndCreateExternalSorter()
    +        sorter.merge(sorter2)
    +        map = createMap()
           }
         }
    +    val sorter2 = map.destructAndCreateExternalSorter()
    +    sorter.merge(sorter2)
     
         var count = 0
         val iter = sorter.sortedIterator()
         while (iter.next()) {
    -      // At here, we also test if copy is correct.
    -      iter.getKey.copy()
    --- End diff --
    
    Could add that back.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43947627
  
    --- Diff: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java ---
    @@ -647,7 +648,7 @@ public boolean putNewKey(Object keyBase, long keyOffset, int keyLength,
           assert(bitset != null);
           assert(longArray != null);
     
    -      if (numElements == MAX_CAPACITY || !canGrowArray) {
    +      if (numElements == MAX_CAPACITY || numElements > growthThreshold && !canGrowArray) {
    --- End diff --
    
    When we have `numElements  <= growthThreshold && !canGrowArray`, it is guaranteed that our page still has space to put this key, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153953990
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/9383


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153596430
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153924528
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153440260
  
    @davies, the block comment at the top of `TungstenAggregationIterator` is now out-of-date; do you mind updating it to reflect the new behavior?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43951682
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java ---
    @@ -83,11 +83,8 @@ public UnsafeKVExternalSorter(
             /* initialSize */ 4096,
             pageSizeBytes);
         } else {
    -      // The memory needed for UnsafeInMemorySorter should be less than longArray in map.
    -      map.freeArray();
    -      // The memory used by UnsafeInMemorySorter will be counted later (end of this block)
           final UnsafeInMemorySorter inMemSorter = new UnsafeInMemorySorter(
    -        taskMemoryManager, recordComparator, prefixComparator, Math.max(1, map.numElements()));
    +        taskMemoryManager, recordComparator, prefixComparator, map.getArray());
    --- End diff --
    
    ok. Then, this `UnsafeInMemorySorter` is the only object that can hold the reference other than the original bytes to bytes map, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152694417
  
    Currently, the old one is broken, I'd to remove that one. The new should be as fast as the old one in worst case, I think we don't need a configuration for this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43785059
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala ---
    @@ -762,15 +679,7 @@ class TungstenAggregationIterator(
       /**
        * Start processing input rows.
        */
    -  testFallbackStartsAt match {
    -    case None =>
    -      processInputs()
    -    case Some(fallbackStartsAt) =>
    -      // This is the testing path. processInputsWithControlledFallback is same as processInputs
    -      // except that it switches to sort-based aggregation after `fallbackStartsAt` input rows
    -      // have been processed.
    -      processInputsWithControlledFallback(fallbackStartsAt)
    -  }
    +  processInputs(testFallbackStartsAt.getOrElse(Int.MaxValue))
    --- End diff --
    
    I guess it isn't a huge deal to have a really high limit here, as opposed to no limit, because the cost of spilling after processing a couple of billion rows will be fairly minimal; either we would have spilled earlier or we're experiencing a huge reduction factor, so the cost of spilling should be minimal.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153961472
  
    Merging into master, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43949044
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java ---
    @@ -83,11 +83,8 @@ public UnsafeKVExternalSorter(
             /* initialSize */ 4096,
             pageSizeBytes);
         } else {
    -      // The memory needed for UnsafeInMemorySorter should be less than longArray in map.
    -      map.freeArray();
    -      // The memory used by UnsafeInMemorySorter will be counted later (end of this block)
           final UnsafeInMemorySorter inMemSorter = new UnsafeInMemorySorter(
    -        taskMemoryManager, recordComparator, prefixComparator, Math.max(1, map.numElements()));
    +        taskMemoryManager, recordComparator, prefixComparator, map.getArray());
    --- End diff --
    
    Is there any chance that multiple objects may change the same array?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152670207
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152630333
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44700/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43785559
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java ---
    @@ -142,6 +142,10 @@ public void insertKV(UnsafeRow key, UnsafeRow value) throws IOException {
           value.getBaseObject(), value.getBaseOffset(), value.getSizeInBytes(), prefix);
       }
     
    +  public void merge(UnsafeKVExternalSorter other) throws IOException {
    --- End diff --
    
    Add comment?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152691155
  
    Besides, as we discussed in #9067, should we add a configuration for turning on/off this feature? This feature may not always have performance gain.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152697504
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153434983
  
    > Currently, the old one is broken, I'd to remove that one.
    
    @davies, are you referring to the old Aggregate1 interface or the old implementation of sort fallback here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153953872
  
    **[Test build #45078 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45078/consoleFull)** for PR 9383 at commit [`b1f8a99`](https://github.com/apache/spark/commit/b1f8a99b27727212b29fbd3e7316fa1da3b5aa23).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152697242
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153451108
  
    @davies: in the stacktrace that you posted, it looks like it's failing when trying to update a string column when extracting a partial aggregate result from an input row. What's confusing to me, though, is the fact that a string column in a partial aggregate result buffer should have prevented TungstenAggregate from being planned in the first place (see `TungstenAggregate.supportsAggregate()`). Do you think that there might be an input attribute binding error that's causing this (i.e. we mistakenly treat grouping columns as belonging to the partial aggregate result)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153924544
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43786160
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMapSuite.scala ---
    @@ -291,28 +278,27 @@ class UnsafeFixedWidthAggregationMapSuite
         val sorter = map.destructAndCreateExternalSorter()
     
         // Add more keys to the sorter and make sure the results come out sorted.
    +    map = createMap()
         (1 to 4096).foreach { i =>
    -      sorter.insertKV(UnsafeRow.createFromByteArray(0, 0), UnsafeRow.createFromByteArray(0, 0))
    +      map.getAggregationBufferFromUnsafeRow(UnsafeRow.createFromByteArray(0, 0))
     
           if ((i % 100) == 0) {
    -        memoryManager.markExecutionAsOutOfMemoryOnce()
    -        sorter.closeCurrentPage()
    +        val sorter2 = map.destructAndCreateExternalSorter()
    +        sorter.merge(sorter2)
    +        map = createMap()
           }
         }
    +    val sorter2 = map.destructAndCreateExternalSorter()
    +    sorter.merge(sorter2)
     
         var count = 0
         val iter = sorter.sortedIterator()
         while (iter.next()) {
    -      // At here, we also test if copy is correct.
    -      iter.getKey.copy()
    -      iter.getValue.copy()
           count += 1
         }
     
    -    // 1 record was from the map and 4096 records were explicitly inserted.
    -    assert(count === 4097)
    -
    -    map.free()
    +    // 1 record in one map.
    +    assert(count === 42)
    --- End diff --
    
    Why this change? Shouldn't the total number of records have been unchanged by this refactoring?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153527299
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44959/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43826631
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala ---
    @@ -762,15 +679,7 @@ class TungstenAggregationIterator(
       /**
        * Start processing input rows.
        */
    -  testFallbackStartsAt match {
    -    case None =>
    -      processInputs()
    -    case Some(fallbackStartsAt) =>
    -      // This is the testing path. processInputsWithControlledFallback is same as processInputs
    -      // except that it switches to sort-based aggregation after `fallbackStartsAt` input rows
    -      // have been processed.
    -      processInputsWithControlledFallback(fallbackStartsAt)
    -  }
    +  processInputs(testFallbackStartsAt.getOrElse(Int.MaxValue))
    --- End diff --
    
    It still possible that we basically have lot of memory space to use, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153490648
  
    Will add that into this PR, working on it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153121664
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44820/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153118810
  
    After some benchmark, realized that using hashcode as prefix in timsort will cause regression in timsort and snappy compression (especially for aggregation after join, the order of records will become random). I will revert that part.
    
    benchmark code:
    ```
    sqlContext.setConf("spark.sql.shuffle.partitions", "1")
    N = 1<<25
    M = 1<<20
    df = sqlContext.range(N).selectExpr("id", "repeat(id, 2) as s")
    df.show()
    df2 = df.select(df.id.alias('id2'), df.s.alias('s2'))
    j = df.join(df2, df.id==df2.id2).groupBy(df.s).max("id", "id2")
    n = j.count()
    ```
    
    Another interesting finding is that Snappy will slowdown the spilling by 50% of end-to-end time, LZ4 will be faster than Snappy, but still 10% slower than not-compressed. Should we use `false` as the default value for `spark.shuffle.spill.compress`?(PS: tested on Mac with SSD, it may not be true on HD)
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43787218
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMapSuite.scala ---
    @@ -291,28 +278,27 @@ class UnsafeFixedWidthAggregationMapSuite
         val sorter = map.destructAndCreateExternalSorter()
     
         // Add more keys to the sorter and make sure the results come out sorted.
    +    map = createMap()
         (1 to 4096).foreach { i =>
    -      sorter.insertKV(UnsafeRow.createFromByteArray(0, 0), UnsafeRow.createFromByteArray(0, 0))
    +      map.getAggregationBufferFromUnsafeRow(UnsafeRow.createFromByteArray(0, 0))
     
           if ((i % 100) == 0) {
    -        memoryManager.markExecutionAsOutOfMemoryOnce()
    -        sorter.closeCurrentPage()
    +        val sorter2 = map.destructAndCreateExternalSorter()
    +        sorter.merge(sorter2)
    +        map = createMap()
           }
         }
    +    val sorter2 = map.destructAndCreateExternalSorter()
    +    sorter.merge(sorter2)
     
         var count = 0
         val iter = sorter.sortedIterator()
         while (iter.next()) {
    -      // At here, we also test if copy is correct.
    -      iter.getKey.copy()
    -      iter.getValue.copy()
           count += 1
         }
     
    -    // 1 record was from the map and 4096 records were explicitly inserted.
    -    assert(count === 4097)
    -
    -    map.free()
    +    // 1 record in one map.
    +    assert(count === 42)
    --- End diff --
    
    Ah. Update comment to clarify that we spilled 42 times, hence this number?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153590823
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43828243
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMapSuite.scala ---
    @@ -183,31 +180,30 @@ class UnsafeFixedWidthAggregationMapSuite
           false // disable perf metrics
         )
     
    +    var map = createMap()
         val keys = randomStrings(1024).take(512)
         keys.foreach { keyString =>
           val buf = map.getAggregationBuffer(InternalRow(UTF8String.fromString(keyString)))
           buf.setInt(0, keyString.length)
           assert(buf != null)
         }
    -
    -    // Convert the map into a sorter
         val sorter = map.destructAndCreateExternalSorter()
     
         // Add more keys to the sorter and make sure the results come out sorted.
         val additionalKeys = randomStrings(1024)
    -    val keyConverter = UnsafeProjection.create(groupKeySchema)
    -    val valueConverter = UnsafeProjection.create(aggBufferSchema)
    -
    +    map = createMap()
         additionalKeys.zipWithIndex.foreach { case (str, i) =>
    -      val k = InternalRow(UTF8String.fromString(str))
    -      val v = InternalRow(str.length)
    -      sorter.insertKV(keyConverter.apply(k), valueConverter.apply(v))
    +      val buf = map.getAggregationBuffer(InternalRow(UTF8String.fromString(str)))
    +      buf.setInt(0, str.length)
     
           if ((i % 100) == 0) {
    -        memoryManager.markExecutionAsOutOfMemoryOnce()
    -        sorter.closeCurrentPage()
    +        val sorter2 = map.destructAndCreateExternalSorter()
    +        sorter.merge(sorter2)
    +        map = createMap()
           }
         }
    +    val sorter2 = map.destructAndCreateExternalSorter()
    +    sorter.merge(sorter2)
    --- End diff --
    
    Looks like this updated test is not doing the exactly same thing with the previous version? In the old version, after we get the sorter we add new records to the sorter directly. In this updated version, we will always use merge to add spilled files to the sorter. Should we just create new tests for this new behavior (using `merge`)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153447195
  
    Do you remember what the symptoms were? Just curious.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153901565
  
    test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152623883
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153618538
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44997/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153900874
  
    **[Test build #45062 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45062/consoleFull)** for PR 9383 at commit [`10d7169`](https://github.com/apache/spark/commit/10d71694ae07af68265bb36a957b4ff5320d8e72).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153824690
  
    **[Test build #1978 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1978/consoleFull)** for PR 9383 at commit [`1c0c6c3`](https://github.com/apache/spark/commit/1c0c6c36a5a16c33ceb4cd43534ce02ec3c2b286).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43949255
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java ---
    @@ -122,11 +119,7 @@ public UnsafeKVExternalSorter(
             /* initialSize */ 4096,
             pageSizeBytes,
             inMemSorter);
    -
    -      sorter.spill();
    -      map.free();
    -      // counting the memory used UnsafeInMemorySorter
    -      taskMemoryManager.acquireExecutionMemory(inMemSorter.getMemoryUsage(), sorter);
    +      map.reset();
    --- End diff --
    
    I think we need comments at here. Basically, we need to explain why we reset. Also, once `sorter` is created, `inMemSorter` will not be needed, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153572263
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153501959
  
    So, you tried `TungstenAggregationQueryWithControlledFallbackSuite` and it did not fail?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153527858
  
    test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153552009
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44972/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152671953
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44712/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153164298
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44823/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153155474
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153902240
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43951008
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java ---
    @@ -83,11 +83,8 @@ public UnsafeKVExternalSorter(
             /* initialSize */ 4096,
             pageSizeBytes);
         } else {
    -      // The memory needed for UnsafeInMemorySorter should be less than longArray in map.
    -      map.freeArray();
    -      // The memory used by UnsafeInMemorySorter will be counted later (end of this block)
           final UnsafeInMemorySorter inMemSorter = new UnsafeInMemorySorter(
    -        taskMemoryManager, recordComparator, prefixComparator, Math.max(1, map.numElements()));
    +        taskMemoryManager, recordComparator, prefixComparator, map.getArray());
    --- End diff --
    
    No, the array from map is not thread safe, only accessible from current thread.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153527297
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153189858
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152696868
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153899674
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153506241
  
    Yep, confirmed and reproduced. If we just wanted to address that bug, we could do something like #9447, but I think the broader set of changes here are a good idea.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43567875
  
    --- Diff: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java ---
    @@ -860,45 +844,48 @@ void growAndRehash() {
           resizeStartTime = System.nanoTime();
         }
         // Store references to the old data structures to be used when we re-hash
    -    final LongArray oldLongArray = longArray;
    -    final BitSet oldBitSet = bitset;
    -    final int oldCapacity = (int) oldBitSet.capacity();
    -
    -    // Allocate the new data structures
    -    try {
    -      allocate(Math.min(growthStrategy.nextCapacity(oldCapacity), MAX_CAPACITY));
    -    } catch (OutOfMemoryError oom) {
    -      longArray = oldLongArray;
    -      bitset = oldBitSet;
    -      throw oom;
    -    }
    +    final long[] oldLongArray = longArray;
    +    allocate(Math.min(growthStrategy.nextCapacity(longArray.length / 2), MAX_CAPACITY));
     
         // Re-mask (we don't recompute the hashcode because we stored all 32 bits of it)
    -    for (int pos = oldBitSet.nextSetBit(0); pos >= 0; pos = oldBitSet.nextSetBit(pos + 1)) {
    -      final long keyPointer = oldLongArray.get(pos * 2);
    -      final int hashcode = (int) oldLongArray.get(pos * 2 + 1);
    +    for (int pos = 0; pos < oldLongArray.length; pos += 2) {
    +      final long keyPointer = oldLongArray[pos];
    +      final int hashcode = (int) oldLongArray[pos + 1];
           int newPos = hashcode & mask;
           int step = 1;
    -      boolean keepGoing = true;
    -
    -      // No need to check for equality here when we insert so this has one less if branch than
    -      // the similar code path in addWithoutResize.
    -      while (keepGoing) {
    -        if (!bitset.isSet(newPos)) {
    -          bitset.set(newPos);
    -          longArray.set(newPos * 2, keyPointer);
    -          longArray.set(newPos * 2 + 1, hashcode);
    -          keepGoing = false;
    -        } else {
    -          newPos = (newPos + step) & mask;
    -          step++;
    -        }
    +      while (longArray[newPos * 2] != 0) {
    +        newPos = (newPos + step) & mask;
    +        step++;
           }
    +      longArray[newPos * 2] = keyPointer;
    +      longArray[newPos * 2 + 1] = hashcode;
         }
    -    releaseMemory(oldLongArray.memoryBlock().size());
    +
    +    releaseMemory(oldLongArray.length * 8L);
     
         if (enablePerfMetrics) {
           timeSpentResizingNs += System.nanoTime() - resizeStartTime;
         }
       }
    +
    +  /**
    +   * Return the compacted long array.
    +   */
    +  public long[] getCompactArray() {
    --- End diff --
    
    Is it needed to add comment for it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153119257
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153164966
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153905169
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153902260
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153205520
  
    **[Test build #44834 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44834/consoleFull)** for PR 9383 at commit [`df44fc6`](https://github.com/apache/spark/commit/df44fc64ed1495a1c0f6f51a7014327b6a8750b7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153528665
  
    **[Test build #44972 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44972/consoleFull)** for PR 9383 at commit [`6f3bb15`](https://github.com/apache/spark/commit/6f3bb15b19cd326f677f15860cf215f57fd3671a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153953991
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45078/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153156635
  
    **[Test build #44830 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44830/consoleFull)** for PR 9383 at commit [`3864095`](https://github.com/apache/spark/commit/3864095906cb083a3a75db1811c5cc84312beeec).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153482419
  
    @JoshRosen The cause is that if the buffer is an empty one (e.g. we do not have agg function or all agg functions do not use any buffer slot), the initialInputBufferOffset is still 0 and when the key is a string, the following code will trigger the exception. Unfortunately, our test only use int columns as the key columns.
    
    ```
    newMutableProjection(
      originalInputAttributes.drop(initialInputBufferOffset),
      originalInputAttributes)()
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153448265
  
    @JoshRosen @davies has a case to reproduce that problem


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152697511
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152623879
  
    **[Test build #44698 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44698/consoleFull)** for PR 9383 at commit [`5707f5b`](https://github.com/apache/spark/commit/5707f5b3421fdab0b01ee2a66acf50b59752152b).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152668952
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153205632
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9383#discussion_r43955966
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMap.java ---
    @@ -243,9 +243,8 @@ public void printPerfMetrics() {
        * Note that this destroys the map, and as a result, the map cannot be used anymore after this.
        */
       public UnsafeKVExternalSorter destructAndCreateExternalSorter() throws IOException {
    -    UnsafeKVExternalSorter sorter = new UnsafeKVExternalSorter(
    +    return new UnsafeKVExternalSorter(
           groupingKeySchema, aggregationBufferSchema,
           SparkEnv.get().blockManager(), map.getPageSizeBytes(), map);
    --- End diff --
    
    ok. I see, we cannot directly insert records to the external sorter returned by `destructAndCreateExternalSorter`. We need to use merge.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153925266
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152683137
  
    **[Test build #44714 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44714/consoleFull)** for PR 9383 at commit [`0512f1e`](https://github.com/apache/spark/commit/0512f1e5f91b81af669e7fdbee26da764e9deb02).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153901175
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve hybrid agg...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-153902062
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9383#issuecomment-152625342
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org