Posted to reviews@spark.apache.org by sameeragarwal <gi...@git.apache.org> on 2016/04/13 03:53:26 UTC

[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

GitHub user sameeragarwal opened a pull request:

    https://github.com/apache/spark/pull/12345

    [SPARK-14447][SQL] Speed up TungstenAggregate w/ keys using AggregateHashMap

    ## What changes were proposed in this pull request?
    
    This patch speeds up group-by aggregates by around 3-5x by leveraging an in-memory `AggregateHashMap` (please see https://github.com/apache/spark/pull/12161), an append-only aggregate hash map that acts as a 'cache' for extremely fast key-value lookups while evaluating aggregates, falling back to the `BytesToBytesMap` if a given key isn't found.
    
    Architecturally, it is backed by a power-of-2-sized array for index lookups and a columnar batch that stores the key-value pairs. Index lookups in the array rely on linear probing (with a small maximum number of probes) and an inexpensive hash function, which makes the map very efficient for the majority of lookups. However, linear probing and an inexpensive hash function also make it less robust than the `BytesToBytesMap` (especially for a large number of keys or certain key distributions), so we fall back to the latter for correctness.
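    
    For illustration, here is a minimal, self-contained sketch of that design (power-of-2 bucket array, linear probing with a small maximum number of probes, and a 'not found' result that tells the caller to fall back). It uses plain `long[]` arrays in place of Spark's `ColumnarBatch`, and the class and method names are illustrative rather than the ones generated by this patch:
    
    ```java
    // Minimal sketch of an append-only aggregate hash map for long keys/values.
    // Plain arrays stand in for Spark's ColumnarBatch; all names are illustrative.
    public class SimpleAggregateHashMap {
      private final int capacity = 1 << 16;              // power of 2
      private final int numBuckets = capacity * 4;       // capacity / 0.25 load factor
      private final int maxSteps = 2;                    // max linear-probe attempts
      private final int[] buckets = new int[numBuckets]; // bucket -> row index, -1 if empty
      private final long[] keys = new long[capacity];
      private final long[] sums = new long[capacity];
      private int numRows = 0;
    
      public SimpleAggregateHashMap() {
        java.util.Arrays.fill(buckets, -1);
      }
    
      // Deliberately cheap hash; improving it is left as a TODO in the patch.
      private long hash(long key) { return key; }
    
      /** Returns the row index for this key, or -1 if the caller must fall back
       *  to the regular (BytesToBytesMap-backed) aggregation path. */
      public int findOrInsert(long key) {
        int idx = (int) (hash(key) & (numBuckets - 1));
        for (int step = 0; step < maxSteps; step++) {
          if (buckets[idx] == -1) {                      // empty slot: append a new row
            if (numRows == capacity) return -1;
            keys[numRows] = key;
            sums[numRows] = 0L;
            buckets[idx] = numRows;
            return numRows++;
          } else if (keys[buckets[idx]] == key) {
            return buckets[idx];                         // existing key
          }
          idx = (idx + 1) & (numBuckets - 1);            // linear probing
        }
        return -1;                                       // too many collisions: fall back
      }
    
      public void addToSum(int row, long value) { sums[row] += value; }
    
      public static void main(String[] args) {
        SimpleAggregateHashMap map = new SimpleAggregateHashMap();
        for (long id = 0; id < 1_000_000L; id++) {
          long k = id & 65535;                           // same grouping key as the benchmark query
          int row = map.findOrInsert(k);
          if (row >= 0) map.addToSum(row, k);            // fast path
          // else: a real implementation would fall back to the BytesToBytesMap here
        }
      }
    }
    ```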
    
    ## How was this patch tested?
    
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
        Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
        Aggregate w keys:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        -------------------------------------------------------------------------------------------
        codegen = F                              2124 / 2204          9.9         101.3       1.0X
        codegen = T hashmap = F                  1198 / 1364         17.5          57.1       1.8X
        codegen = T hashmap = T                   369 /  600         56.8          17.6       5.8X

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark tungsten-aggregate-integration

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12345.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12345
    
----
commit 7c158bd137f057453d17ef360906e5be90bf5004
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-03-31T21:15:34Z

    [SPARK-14394]

commit ebaea6a87b704afedd47bdd2dd17c92c3ffc6e8e
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-04-07T00:37:08Z

    Integrating AggregateHashMap for Aggregates with Group By

commit cee7e65b3cf7569b4e46941158f164c2130c3981
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-04-12T17:33:42Z

    Add SQLConf

commit 8c9e17a1d40e3014e39b1d04f3a458aa129784f8
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-04-12T23:01:03Z

    20ns

commit 3379294b76d91a55dbe86e31efb9812c8d37768c
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-04-12T23:18:36Z

    generated code

commit 4ee56873764d62efdaf8c47cb74aa399f2194fde
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-04-13T01:23:27Z

    benchmark

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209791378
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209790828
  
    **[Test build #55797 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55797/consoleFull)** for PR 12345 at commit [`041c001`](https://github.com/apache/spark/commit/041c001957999de8d383a10c0b7126c62f98ad9b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209695875
  
    @sameeragarwal have you updated the generated code?




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12345#discussion_r59814567
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregate.scala ---
    @@ -437,17 +444,21 @@ case class TungstenAggregate(
         val initAgg = ctx.freshName("initAgg")
         ctx.addMutableState("boolean", initAgg, s"$initAgg = false;")
     
    -    // create AggregateHashMap
    -    val isAggregateHashMapEnabled: Boolean = false
    -    val isAggregateHashMapSupported: Boolean =
    +    // We currently only enable vectorized hashmap for long key/value types
    +    isVectorizedHashMapEnabled = isVectorizedHashMapEnabled &&
    --- End diff --
    
    I think it's weird that you have the check split between here and line 268. Maybe combine all these checks at line 268 and make isVectorizedHashMapEnabled a val.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12345#discussion_r59650089
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregate.scala ---
    @@ -533,56 +578,104 @@ case class TungstenAggregate(
     
         val inputAttr = aggregateBufferAttributes ++ child.output
         ctx.currentVars = new Array[ExprCode](aggregateBufferAttributes.length) ++ input
    +
    +    ctx.INPUT_ROW = aggregateRow
    +    // TODO: support subexpression elimination
    +    val aggregateRowEvals = updateExpr.map(BindReferences.bindReference(_, inputAttr).gen(ctx))
    --- End diff --
    
    This code needs some more comments to understand the high-level parts. It's not obvious why you do this twice, once with ctx.INPUT_ROW = aggregateRow and once again with the buffer.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209622188
  
    **[Test build #55735 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55735/consoleFull)** for PR 12345 at commit [`fc6b8cb`](https://github.com/apache/spark/commit/fc6b8cb337e11d3d92f0f13828b8a0b85e30929c).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209693606
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55750/
    Test PASSed.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209290930
  
    I don't think this works if we have NULL keys. This is kind of annoying; let's think about that tomorrow.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12345#discussion_r59649417
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ColumnarAggMapCodeGenerator.scala ---
    @@ -65,27 +69,43 @@ class ColumnarAggMapCodeGenerator(
               .mkString("\n")};
           """.stripMargin
     
    +    val generatedAggBufferSchema: String =
    +      s"""
    +         |new org.apache.spark.sql.types.StructType()
    +         |${bufferSchema.map(key =>
    +        s""".add("${key.name}", org.apache.spark.sql.types.DataTypes.${key.dataType})""")
    +        .mkString("\n")};
    +      """.stripMargin
    +
         s"""
            |  private org.apache.spark.sql.execution.vectorized.ColumnarBatch batch;
    +       |  private org.apache.spark.sql.execution.vectorized.ColumnarBatch aggregateBufferBatch;
            |  private int[] buckets;
            |  private int numBuckets;
            |  private int maxSteps;
            |  private int numRows = 0;
            |  private org.apache.spark.sql.types.StructType schema = $generatedSchema
    +       |  private org.apache.spark.sql.types.StructType aggregateBufferSchema =
    +       |    $generatedAggBufferSchema
            |
    -       |  public $generatedClassName(int capacity, double loadFactor, int maxSteps) {
    -       |    assert (capacity > 0 && ((capacity & (capacity - 1)) == 0));
    -       |    this.maxSteps = maxSteps;
    -       |    numBuckets = (int) (capacity / loadFactor);
    +       |  public $generatedClassName() {
    +       |    // TODO: These should be generated based on the schema
    +       |    int DEFAULT_CAPACITY = 1 << 16;
    +       |    double DEFAULT_LOAD_FACTOR = 0.25;
    +       |    int DEFAULT_MAX_STEPS = 2;
    +       |    assert (DEFAULT_CAPACITY > 0 && ((DEFAULT_CAPACITY & (DEFAULT_CAPACITY - 1)) == 0));
    +       |    this.maxSteps = DEFAULT_MAX_STEPS;
    +       |    numBuckets = (int) (DEFAULT_CAPACITY / DEFAULT_LOAD_FACTOR);
            |    batch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(schema,
    -       |      org.apache.spark.memory.MemoryMode.ON_HEAP, capacity);
    +       |      org.apache.spark.memory.MemoryMode.ON_HEAP, DEFAULT_CAPACITY);
    +       |    aggregateBufferBatch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(
    +       |      aggregateBufferSchema, org.apache.spark.memory.MemoryMode.ON_HEAP, DEFAULT_CAPACITY);
    --- End diff --
    
    Let's leave a TODO to fix this. There should be a nicer way to get a projection of a batch instead of this.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209694646
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-210236603
  
    **[Test build #55869 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55869/consoleFull)** for PR 12345 at commit [`ec66a54`](https://github.com/apache/spark/commit/ec66a5404d07da359763e4e8f8af909d045d3eb1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209190745
  
    cc @nongli 




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209693602
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209622370
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/12345




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12345#discussion_r59637455
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ColumnarAggMapCodeGenerator.scala ---
    @@ -103,7 +119,7 @@ class ColumnarAggMapCodeGenerator(
         s"""
            |// TODO: Improve this hash function
            |private long hash($groupingKeySignature) {
    -       |  return ${groupingKeys.map(_._2).mkString(" ^ ")};
    +       |  return ${groupingKeys.map(_._2).mkString(" | ")};
    --- End diff --
    
    No particular reason, was planning to do this as part of a separate small PR (along with the benchmarks). Please let me know if you'd prefer it here instead.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209671390
  
    **[Test build #55750 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55750/consoleFull)** for PR 12345 at commit [`ececd57`](https://github.com/apache/spark/commit/ececd5770e0c5410cf3da463f3b52db467d1f5ca).




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-210249105
  
    LGTM




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-210261671
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209694476
  
    **[Test build #55753 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55753/consoleFull)** for PR 12345 at commit [`0ca0db1`](https://github.com/apache/spark/commit/0ca0db17130bb6a1e59cdbc699f5abd946821d44).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12345#discussion_r59649964
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ColumnarAggMapCodeGenerator.scala ---
    @@ -21,10 +21,10 @@ import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
     import org.apache.spark.sql.types.StructType
     
     /**
    - * This is a helper object to generate an append-only single-key/single value aggregate hash
    - * map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates
    - * (and fall back to the `BytesToBytesMap` if a given key isn't found). This is 'codegened' in
    - * TungstenAggregate to speed up aggregates w/ key.
    + * This is a helper class to generate an append-only aggregate hash map that can act as a 'cache'
    + * for extremely fast key-value lookups while evaluating aggregates (and fall back to the
    + * `BytesToBytesMap` if a given key isn't found). This is 'codegened' in TungstenAggregate to speed
    + * up aggregates w/ key.
    --- End diff --
    
    You chose to have this not handle null keys, right? Comment that.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209208765
  
    **[Test build #55677 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55677/consoleFull)** for PR 12345 at commit [`4ee5687`](https://github.com/apache/spark/commit/4ee56873764d62efdaf8c47cb74aa399f2194fde).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209593060
  
    **[Test build #55735 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55735/consoleFull)** for PR 12345 at commit [`fc6b8cb`](https://github.com/apache/spark/commit/fc6b8cb337e11d3d92f0f13828b8a0b85e30929c).




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12345#discussion_r59637760
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/BenchmarkWholeStageCodegen.scala ---
    @@ -153,16 +153,36 @@ class BenchmarkWholeStageCodegen extends SparkFunSuite {
       ignore("aggregate with keys") {
         val N = 20 << 20
     
    -    runBenchmark("Aggregate w keys", N) {
    -      sqlContext.range(N).selectExpr("(id & 65535) as k").groupBy("k").sum().collect()
    +    val benchmark = new Benchmark("Aggregate w keys", N)
    --- End diff --
    
    Extended `testFallbackStartsAt` to support 2 offsets (for the aggregate hash map and the BytesToBytesMap) and modified `AggregationQuerySuite` so that all existing tests exercise all the new codepaths.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209622374
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55735/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209791385
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55797/
    Test PASSed.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209209164
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55677/
    Test PASSed.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209771837
  
    **[Test build #55797 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55797/consoleFull)** for PR 12345 at commit [`041c001`](https://github.com/apache/spark/commit/041c001957999de8d383a10c0b7126c62f98ad9b).




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12345#discussion_r59814608
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregate.scala ---
    @@ -261,7 +263,12 @@ case class TungstenAggregate(
         .map(_.asInstanceOf[DeclarativeAggregate])
       private val bufferSchema = StructType.fromAttributes(aggregateBufferAttributes)
     
    -  // The name for HashMap
    +  // The name for Vectorized HashMap
    +  private var vectorizedHashMapTerm: String = _
    +  private var isVectorizedHashMapEnabled: Boolean = sqlContext.conf.columnarAggregateMapEnabled &&
    +    (modes.contains(Partial) || modes.contains(PartialMerge))
    --- End diff --
    
    I don't think this is the correct logic. I think `modes.forall(_ == Partial)` is what we want.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209210454
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55678/
    Test PASSed.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12345#discussion_r59505821
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ColumnarAggMapCodeGenerator.scala ---
    @@ -103,7 +119,7 @@ class ColumnarAggMapCodeGenerator(
         s"""
            |// TODO: Improve this hash function
            |private long hash($groupingKeySignature) {
    -       |  return ${groupingKeys.map(_._2).mkString(" ^ ")};
    +       |  return ${groupingKeys.map(_._2).mkString(" | ")};
    --- End diff --
    
    Any reason not to implement the h = h * 37 + v hash function?
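    
    For reference, a minimal sketch of the multiply-and-add combine being suggested here, written in the style of the generated `hash` method. It assumes two long grouping keys (names illustrative); the patch currently combines the keys with a bitwise OR instead:
    
    ```java
    // Sketch of the suggested h = h * 37 + v combine over two long grouping keys.
    // Key names are illustrative, not the ones emitted by the code generator.
    private long hash(long agg_key1, long agg_key2) {
      long h = 0;
      h = h * 37 + agg_key1;
      h = h * 37 + agg_key2;
      return h;
    }
    ```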




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209209159
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12345#discussion_r59649860
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregate.scala ---
    @@ -533,56 +578,104 @@ case class TungstenAggregate(
     
         val inputAttr = aggregateBufferAttributes ++ child.output
         ctx.currentVars = new Array[ExprCode](aggregateBufferAttributes.length) ++ input
    +
    +    ctx.INPUT_ROW = aggregateRow
    +    // TODO: support subexpression elimination
    --- End diff --
    
    Let's remove this TODO in both places; not sure what it means anymore.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209210278
  
    **[Test build #55678 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55678/consoleFull)** for PR 12345 at commit [`c2fc385`](https://github.com/apache/spark/commit/c2fc38584dd073036a1f04f7cd7da9fcf50739e8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-210261256
  
    **[Test build #55886 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55886/consoleFull)** for PR 12345 at commit [`9b5ee1d`](https://github.com/apache/spark/commit/9b5ee1d9d648ae454cc0749047528509a032c7ab).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-210261676
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55886/
    Test PASSed.




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209696691
  
    @nongli just did!




[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209190135
  
    Generated code for a query of the form `sqlContext.range(N).selectExpr("(id & 65535) as k").groupBy("k").sum().collect()` looks like:
    
    ```java
    /* 009 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
    /* 010 */   private Object[] references;
    /* 011 */   private boolean agg_initAgg;
    /* 012 */   private agg_GeneratedAggregateHashMap agg_aggregateHashMap;
    /* 013 */   private java.util.Iterator<org.apache.spark.sql.execution.vectorized.ColumnarBatch.Row> agg_genMapIter;
    /* 014 */   private org.apache.spark.sql.execution.aggregate.TungstenAggregate agg_plan;
    /* 015 */   private org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap agg_hashMap;
    /* 016 */   private org.apache.spark.sql.execution.UnsafeKVExternalSorter agg_sorter;
    /* 017 */   private org.apache.spark.unsafe.KVIterator agg_mapIter;
    /* 018 */   private org.apache.spark.sql.execution.metric.LongSQLMetric range_numOutputRows;
    /* 019 */   private org.apache.spark.sql.execution.metric.LongSQLMetricValue range_metricValue;
    /* 020 */   private boolean range_initRange;
    /* 021 */   private long range_partitionEnd;
    /* 022 */   private long range_number;
    /* 023 */   private boolean range_overflow;
    /* 024 */   private scala.collection.Iterator range_input;
    /* 025 */   private UnsafeRow range_result;
    /* 026 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder range_holder;
    /* 027 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter range_rowWriter;
    /* 028 */   private UnsafeRow project_result;
    /* 029 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder;
    /* 030 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter project_rowWriter;
    /* 031 */   private UnsafeRow agg_result;
    /* 032 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
    /* 033 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter;
    /* 034 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowJoiner agg_unsafeRowJoiner;
    /* 035 */   private org.apache.spark.sql.execution.metric.LongSQLMetric wholestagecodegen_numOutputRows;
    /* 036 */   private org.apache.spark.sql.execution.metric.LongSQLMetricValue wholestagecodegen_metricValue;
    /* 037 */   private UnsafeRow wholestagecodegen_result;
    /* 038 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder wholestagecodegen_holder;
    /* 039 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter wholestagecodegen_rowWriter;
    /* 040 */   
    /* 041 */   public GeneratedIterator(Object[] references) {
    /* 042 */     this.references = references;
    /* 043 */   }
    /* 044 */   
    /* 045 */   public void init(int index, scala.collection.Iterator inputs[]) {
    /* 046 */     partitionIndex = index;
    /* 047 */     agg_initAgg = false;
    /* 048 */     agg_aggregateHashMap = new agg_GeneratedAggregateHashMap();
    /* 049 */     
    /* 050 */     this.agg_plan = (org.apache.spark.sql.execution.aggregate.TungstenAggregate) references[0];
    /* 051 */     
    /* 052 */     this.range_numOutputRows = (org.apache.spark.sql.execution.metric.LongSQLMetric) references[1];
    /* 053 */     range_metricValue = (org.apache.spark.sql.execution.metric.LongSQLMetricValue) range_numOutputRows.localValue();
    /* 054 */     range_initRange = false;
    /* 055 */     range_partitionEnd = 0L;
    /* 056 */     range_number = 0L;
    /* 057 */     range_overflow = false;
    /* 058 */     range_input = inputs[0];
    /* 059 */     range_result = new UnsafeRow(1);
    /* 060 */     this.range_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(range_result, 0);
    /* 061 */     this.range_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(range_holder, 1);
    /* 062 */     project_result = new UnsafeRow(1);
    /* 063 */     this.project_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 0);
    /* 064 */     this.project_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder, 1);
    /* 065 */     agg_result = new UnsafeRow(1);
    /* 066 */     this.agg_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(agg_result, 0);
    /* 067 */     this.agg_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(agg_holder, 1);
    /* 068 */     agg_unsafeRowJoiner = agg_plan.createUnsafeJoiner();
    /* 069 */     this.wholestagecodegen_numOutputRows = (org.apache.spark.sql.execution.metric.LongSQLMetric) references[2];
    /* 070 */     wholestagecodegen_metricValue = (org.apache.spark.sql.execution.metric.LongSQLMetricValue) wholestagecodegen_numOutputRows.localValue();
    /* 071 */     wholestagecodegen_result = new UnsafeRow(2);
    /* 072 */     this.wholestagecodegen_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(wholestagecodegen_result, 0);
    /* 073 */     this.wholestagecodegen_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(wholestagecodegen_holder, 2);
    /* 074 */   }
    /* 075 */   
    /* 076 */   public class agg_GeneratedAggregateHashMap {
    /* 077 */     public org.apache.spark.sql.execution.vectorized.ColumnarBatch batch;
    /* 078 */     public org.apache.spark.sql.execution.vectorized.ColumnarBatch aggregateBufferBatch;
    /* 079 */     private int[] buckets;
    /* 080 */     private int numBuckets;
    /* 081 */     private int maxSteps;
    /* 082 */     private int numRows = 0;
    /* 083 */     private org.apache.spark.sql.types.StructType schema =
    /* 084 */     new org.apache.spark.sql.types.StructType()
    /* 085 */     .add("k", org.apache.spark.sql.types.DataTypes.LongType)
    /* 086 */     .add("sum", org.apache.spark.sql.types.DataTypes.LongType);
    /* 087 */     
    /* 088 */     private org.apache.spark.sql.types.StructType aggregateBufferSchema =
    /* 089 */     
    /* 090 */     new org.apache.spark.sql.types.StructType()
    /* 091 */     .add("sum", org.apache.spark.sql.types.DataTypes.LongType);
    /* 092 */     
    /* 093 */     public agg_GeneratedAggregateHashMap() {
    /* 094 */       // TODO: These should be generated based on the schema
    /* 095 */       int DEFAULT_CAPACITY = 1 << 16;
    /* 096 */       double DEFAULT_LOAD_FACTOR = 0.25;
    /* 097 */       int DEFAULT_MAX_STEPS = 2;
    /* 098 */       assert (DEFAULT_CAPACITY > 0 && ((DEFAULT_CAPACITY & (DEFAULT_CAPACITY - 1)) == 0));
    /* 099 */       this.maxSteps = DEFAULT_MAX_STEPS;
    /* 100 */       numBuckets = (int) (DEFAULT_CAPACITY / DEFAULT_LOAD_FACTOR);
    /* 101 */       batch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(schema,
    /* 102 */         org.apache.spark.memory.MemoryMode.ON_HEAP, DEFAULT_CAPACITY);
    /* 103 */       aggregateBufferBatch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(
    /* 104 */         aggregateBufferSchema, org.apache.spark.memory.MemoryMode.ON_HEAP, DEFAULT_CAPACITY);
    /* 105 */       for (int i = 0 ; i < aggregateBufferBatch.numCols(); i++) {
    /* 106 */         aggregateBufferBatch.setColumn(i, batch.column(i+1));
    /* 107 */       }
    /* 108 */       buckets = new int[numBuckets];
    /* 109 */       java.util.Arrays.fill(buckets, -1);
    /* 110 */     }
    /* 111 */     
    /* 112 */     public org.apache.spark.sql.execution.vectorized.ColumnarBatch.Row findOrInsert(long agg_key) {
    /* 113 */       long h = hash(agg_key);
    /* 114 */       int step = 0;
    /* 115 */       int idx = (int) h & (numBuckets - 1);
    /* 116 */       while (step < maxSteps) {
    /* 117 */         // Return bucket index if it's either an empty slot or already contains the key
    /* 118 */         if (buckets[idx] == -1) {
    /* 119 */           batch.column(0).putLong(numRows, agg_key);
    /* 120 */           batch.column(1).putLong(numRows, 0);
    /* 121 */           buckets[idx] = numRows++;
    /* 122 */           batch.setNumRows(numRows);
    /* 123 */           return aggregateBufferBatch.getRow(buckets[idx]);
    /* 124 */         } else if (equals(idx, agg_key)) {
    /* 125 */           return aggregateBufferBatch.getRow(buckets[idx]);
    /* 126 */         }
    /* 127 */         idx = (idx + 1) & (numBuckets - 1);
    /* 128 */         step++;
    /* 129 */       }
    /* 130 */       // Didn't find it
    /* 131 */       return null;
    /* 132 */     }
    /* 133 */     
    /* 134 */     private boolean equals(int idx, long agg_key) {
    /* 135 */       return batch.column(0).getLong(buckets[idx]) == agg_key;
    /* 136 */     }
    /* 137 */     
    /* 138 */     // TODO: Improve this hash function
    /* 139 */     private long hash(long agg_key) {
    /* 140 */       return agg_key;
    /* 141 */     }
    /* 142 */     
    /* 143 */   }
    /* 144 */   
    /* 145 */   private void agg_doAggregateWithKeys() throws java.io.IOException {
    /* 146 */     agg_hashMap = agg_plan.createHashMap();
    /* 147 */     
    /* 148 */     /*** PRODUCE: Project [(id#224L & 65535) AS k#227L] */
    /* 149 */     
    /* 150 */     /*** PRODUCE: Range 0, 1, 1, 20971520, [id#224L] */
    /* 151 */     
    /* 152 */     // initialize Range
    /* 153 */     if (!range_initRange) {
    /* 154 */       range_initRange = true;
    /* 155 */       initRange(partitionIndex);
    /* 156 */     }
    /* 157 */     
    /* 158 */     while (!range_overflow && range_number < range_partitionEnd) {
    /* 159 */       long range_value = range_number;
    /* 160 */       range_number += 1L;
    /* 161 */       if (range_number < range_value ^ 1L < 0) {
    /* 162 */         range_overflow = true;
    /* 163 */       }
    /* 164 */       
    /* 165 */       /*** CONSUME: Project [(id#224L & 65535) AS k#227L] */
    /* 166 */       
    /* 167 */       /*** CONSUME: TungstenAggregate(key=[k#227L], functions=[(sum(k#227L),mode=Partial,isDistinct=false)], output=[k#227L,sum#235L]) */
    /* 168 */       /* (input[0, bigint] & 65535) */
    /* 169 */       long project_value = -1L;
    /* 170 */       project_value = range_value & 65535L;
    /* 171 */       
    /* 172 */       UnsafeRow agg_aggBuffer = null;
    /* 173 */       org.apache.spark.sql.execution.vectorized.ColumnarBatch.Row agg_aggregateRow = null;
    /* 174 */       
    /* 175 */       agg_aggregateRow =
    /* 176 */       agg_aggregateHashMap.findOrInsert(project_value);
    /* 177 */       
    /* 178 */       if (agg_aggregateRow == null) {
    /* 179 */         // generate grouping key
    /* 180 */         agg_rowWriter.write(0, project_value);
    /* 181 */         /* hash(input[0, bigint], 42) */
    /* 182 */         int agg_value3 = 42;
    /* 183 */         
    /* 184 */         agg_value3 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(project_value, agg_value3);
    /* 185 */         if (true) {
    /* 186 */           // try to get the buffer from hash map
    /* 187 */           agg_aggBuffer = agg_hashMap.getAggregationBufferFromUnsafeRow(agg_result, agg_value3);
    /* 188 */         }
    /* 189 */         if (agg_aggBuffer == null) {
    /* 190 */           if (agg_sorter == null) {
    /* 191 */             agg_sorter = agg_hashMap.destructAndCreateExternalSorter();
    /* 192 */           } else {
    /* 193 */             agg_sorter.merge(agg_hashMap.destructAndCreateExternalSorter());
    /* 194 */           }
    /* 195 */           
    /* 196 */           // the hash map had been spilled; it should have enough memory now,
    /* 197 */           // try to allocate the buffer again.
    /* 198 */           agg_aggBuffer = agg_hashMap.getAggregationBufferFromUnsafeRow(agg_result, agg_value3);
    /* 199 */           if (agg_aggBuffer == null) {
    /* 200 */             // failed to allocate the first page
    /* 201 */             throw new OutOfMemoryError("No enough memory for aggregation");
    /* 202 */           }
    /* 203 */         }
    /* 204 */       }
    /* 205 */       
    /* 206 */       if (agg_aggregateRow != null) {
    /* 207 */         // evaluate aggregate function
    /* 208 */         /* (coalesce(input[0, bigint], cast(0 as bigint)) + cast(input[1, bigint] as bigint)) */
    /* 209 */         /* coalesce(input[0, bigint], cast(0 as bigint)) */
    /* 210 */         /* input[0, bigint] */
    /* 211 */         boolean agg_isNull6 = agg_aggregateRow.isNullAt(0);
    /* 212 */         long agg_value7 = agg_isNull6 ? -1L : (agg_aggregateRow.getLong(0));
    /* 213 */         boolean agg_isNull5 = agg_isNull6;
    /* 214 */         long agg_value6 = agg_value7;
    /* 215 */         
    /* 216 */         if (agg_isNull5) {
    /* 217 */           /* cast(0 as bigint) */
    /* 218 */           boolean agg_isNull7 = false;
    /* 219 */           long agg_value8 = -1L;
    /* 220 */           if (!false) {
    /* 221 */             agg_value8 = (long) 0;
    /* 222 */           }
    /* 223 */           if (!agg_isNull7) {
    /* 224 */             agg_isNull5 = false;
    /* 225 */             agg_value6 = agg_value8;
    /* 226 */           }
    /* 227 */         }
    /* 228 */         /* cast(input[1, bigint] as bigint) */
    /* 229 */         boolean agg_isNull9 = false;
    /* 230 */         long agg_value10 = -1L;
    /* 231 */         if (!false) {
    /* 232 */           agg_value10 = project_value;
    /* 233 */         }
    /* 234 */         long agg_value5 = -1L;
    /* 235 */         agg_value5 = agg_value6 + agg_value10;
    /* 236 */         // update aggregate row
    /* 237 */         agg_aggregateRow.setLong(0, agg_value5);
    /* 238 */       } else {
    /* 239 */         // evaluate aggregate function
    /* 240 */         /* (coalesce(input[0, bigint], cast(0 as bigint)) + cast(input[1, bigint] as bigint)) */
    /* 241 */         /* coalesce(input[0, bigint], cast(0 as bigint)) */
    /* 242 */         /* input[0, bigint] */
    /* 243 */         boolean agg_isNull13 = agg_aggBuffer.isNullAt(0);
    /* 244 */         long agg_value14 = agg_isNull13 ? -1L : (agg_aggBuffer.getLong(0));
    /* 245 */         boolean agg_isNull12 = agg_isNull13;
    /* 246 */         long agg_value13 = agg_value14;
    /* 247 */         
    /* 248 */         if (agg_isNull12) {
    /* 249 */           /* cast(0 as bigint) */
    /* 250 */           boolean agg_isNull14 = false;
    /* 251 */           long agg_value15 = -1L;
    /* 252 */           if (!false) {
    /* 253 */             agg_value15 = (long) 0;
    /* 254 */           }
    /* 255 */           if (!agg_isNull14) {
    /* 256 */             agg_isNull12 = false;
    /* 257 */             agg_value13 = agg_value15;
    /* 258 */           }
    /* 259 */         }
    /* 260 */         /* cast(input[1, bigint] as bigint) */
    /* 261 */         boolean agg_isNull16 = false;
    /* 262 */         long agg_value17 = -1L;
    /* 263 */         if (!false) {
    /* 264 */           agg_value17 = project_value;
    /* 265 */         }
    /* 266 */         long agg_value12 = -1L;
    /* 267 */         agg_value12 = agg_value13 + agg_value17;
    /* 268 */         // update aggregate buffer
    /* 269 */         agg_aggBuffer.setLong(0, agg_value12);
    /* 270 */       }
    /* 271 */       
    /* 272 */       if (shouldStop()) return;
    /* 273 */     }
    /* 274 */     
    /* 275 */     agg_genMapIter = agg_aggregateHashMap.batch.rowIterator();
    /* 276 */     
    /* 277 */     agg_mapIter = agg_plan.finishAggregate(agg_hashMap, agg_sorter);
    /* 278 */   }
    /* 279 */   
    /* 280 */   private void initRange(int idx) {
    /* 281 */     java.math.BigInteger index = java.math.BigInteger.valueOf(idx);
    /* 282 */     java.math.BigInteger numSlice = java.math.BigInteger.valueOf(1L);
    /* 283 */     java.math.BigInteger numElement = java.math.BigInteger.valueOf(20971520L);
    /* 284 */     java.math.BigInteger step = java.math.BigInteger.valueOf(1L);
    /* 285 */     java.math.BigInteger start = java.math.BigInteger.valueOf(0L);
    /* 286 */     
    /* 287 */     java.math.BigInteger st = index.multiply(numElement).divide(numSlice).multiply(step).add(start);
    /* 288 */     if (st.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
    /* 289 */       range_number = Long.MAX_VALUE;
    /* 290 */     } else if (st.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
    /* 291 */       range_number = Long.MIN_VALUE;
    /* 292 */     } else {
    /* 293 */       range_number = st.longValue();
    /* 294 */     }
    /* 295 */     
    /* 296 */     java.math.BigInteger end = index.add(java.math.BigInteger.ONE).multiply(numElement).divide(numSlice)
    /* 297 */     .multiply(step).add(start);
    /* 298 */     if (end.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
    /* 299 */       range_partitionEnd = Long.MAX_VALUE;
    /* 300 */     } else if (end.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
    /* 301 */       range_partitionEnd = Long.MIN_VALUE;
    /* 302 */     } else {
    /* 303 */       range_partitionEnd = end.longValue();
    /* 304 */     }
    /* 305 */     
    /* 306 */     range_metricValue.add((range_partitionEnd - range_number) / 1L);
    /* 307 */   }
    /* 308 */   
    /* 309 */   protected void processNext() throws java.io.IOException {
    /* 310 */     /*** PRODUCE: TungstenAggregate(key=[k#227L], functions=[(sum(k#227L),mode=Partial,isDistinct=false)], output=[k#227L,sum#235L]) */
    /* 311 */     
    /* 312 */     if (!agg_initAgg) {
    /* 313 */       agg_initAgg = true;
    /* 314 */       agg_doAggregateWithKeys();
    /* 315 */     }
    /* 316 */     
    /* 317 */     // output the result
    /* 318 */     
    /* 319 */     while (agg_genMapIter.hasNext()) {
    /* 320 */       wholestagecodegen_metricValue.add(1);
    /* 321 */       org.apache.spark.sql.execution.vectorized.ColumnarBatch.Row wholestagecodegen_aggregateHashMapRow =
    /* 322 */       (org.apache.spark.sql.execution.vectorized.ColumnarBatch.Row)
    /* 323 */       agg_genMapIter.next();
    /* 324 */       
    /* 325 */       wholestagecodegen_rowWriter.zeroOutNullBytes();
    /* 326 */       
    /* 327 */       /* input[0, bigint] */
    /* 328 */       long wholestagecodegen_value = wholestagecodegen_aggregateHashMapRow.getLong(0);
    /* 329 */       wholestagecodegen_rowWriter.write(0, wholestagecodegen_value);
    /* 330 */       
    /* 331 */       /* input[1, bigint] */
    /* 332 */       boolean wholestagecodegen_isNull1 = wholestagecodegen_aggregateHashMapRow.isNullAt(1);
    /* 333 */       long wholestagecodegen_value1 = wholestagecodegen_isNull1 ? -1L : (wholestagecodegen_aggregateHashMapRow.getLong(1));
    /* 334 */       if (wholestagecodegen_isNull1) {
    /* 335 */         wholestagecodegen_rowWriter.setNullAt(1);
    /* 336 */       } else {
    /* 337 */         wholestagecodegen_rowWriter.write(1, wholestagecodegen_value1);
    /* 338 */       }
    /* 339 */       
    /* 340 */       /*** CONSUME: WholeStageCodegen */
    /* 341 */       
    /* 342 */       append(wholestagecodegen_result);
    /* 343 */       
    /* 344 */       if (shouldStop()) return;
    /* 345 */     }
    /* 346 */     
    /* 347 */     agg_aggregateHashMap.batch.close();
    /* 348 */     
    /* 349 */     while (agg_mapIter.next()) {
    /* 350 */       wholestagecodegen_metricValue.add(1);
    /* 351 */       UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey();
    /* 352 */       UnsafeRow agg_aggBuffer1 = (UnsafeRow) agg_mapIter.getValue();
    /* 353 */       
    /* 354 */       UnsafeRow agg_resultRow = agg_unsafeRowJoiner.join(agg_aggKey, agg_aggBuffer1);
    /* 355 */       
    /* 356 */       /*** CONSUME: WholeStageCodegen */
    /* 357 */       
    /* 358 */       append(agg_resultRow);
    /* 359 */       
    /* 360 */       if (shouldStop()) return;
    /* 361 */     }
    /* 362 */     
    /* 363 */     agg_mapIter.close();
    /* 364 */     if (agg_sorter == null) {
    /* 365 */       agg_hashMap.free();
    /* 366 */     }
    /* 367 */   }
    /* 368 */ }
    ```



[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209290367
  
    
    /* 120 */           batch.column(1).putLong(numRows, 0);
    I don't think this is right. You should initialize the agg exprs to NULL, not 0.
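    A minimal sketch of the suggested change to the generated findOrInsert (assuming the fields and
    helpers shown in the generated code above: batch, aggregateBufferBatch, buckets, numRows,
    numBuckets, maxSteps, equals, hash; not necessarily the exact fix that landed): leave the
    aggregate buffer column NULL on insert, so the coalesce(...) in the generated update expression
    treats a freshly inserted slot as "no value yet".
    
    ```java
    public org.apache.spark.sql.execution.vectorized.ColumnarBatch.Row findOrInsert(long agg_key) {
      long h = hash(agg_key);
      int step = 0;
      int idx = (int) h & (numBuckets - 1);
      while (step < maxSteps) {
        if (buckets[idx] == -1) {
          // New key: write the grouping key, but leave the aggregate column NULL
          // instead of writing 0 (the line quoted above).
          batch.column(0).putLong(numRows, agg_key);
          batch.column(1).putNull(numRows);
          buckets[idx] = numRows++;
          batch.setNumRows(numRows);
          return aggregateBufferBatch.getRow(buckets[idx]);
        } else if (equals(idx, agg_key)) {
          return aggregateBufferBatch.getRow(buckets[idx]);
        }
        idx = (idx + 1) & (numBuckets - 1);
        step++;
      }
      return null; // not found within maxSteps; caller falls back to the BytesToBytesMap path
    }
    ```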



[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12345#discussion_r59649610
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregate.scala ---
    @@ -484,13 +497,42 @@ case class TungstenAggregate(
         // so `copyResult` should be reset to `false`.
         ctx.copyResult = false
     
    +    def outputFromGeneratedMap: Option[String] = {
    +      if (isAggregateHashMapEnabled) {
    +        val row = ctx.freshName("aggregateHashMapRow")
    --- End diff --
    
    Comment at a high level what this is doing. Something like: "Iterate over the aggregate rows and convert them from ColumnarBatch.Row to UnsafeRow."



[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12345#discussion_r59506077
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ColumnarAggMapCodeGenerator.scala ---
    @@ -65,27 +65,43 @@ class ColumnarAggMapCodeGenerator(
               .mkString("\n")};
           """.stripMargin
     
    +    val generatedAggBufferSchema: String =
    +      s"""
    +         |new org.apache.spark.sql.types.StructType()
    +         |${bufferSchema.map(key =>
    +        s""".add("${key.name}", org.apache.spark.sql.types.DataTypes.${key.dataType})""")
    +        .mkString("\n")};
    +      """.stripMargin
    +
         s"""
    -       |  private org.apache.spark.sql.execution.vectorized.ColumnarBatch batch;
    +       |  public org.apache.spark.sql.execution.vectorized.ColumnarBatch batch;
    --- End diff --
    
    Can we avoid making either of these public? Also, add a comment that explains why there are two batches and how they relate.
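    For reference, a hedged illustration of how the two batches relate, based only on the generated
    code quoted earlier in this thread (not on the generator itself): batch holds the grouping key
    column(s) followed by the aggregate buffer column(s), while aggregateBufferBatch re-exposes just
    the buffer column(s) so that ordinals in the generated update code start at 0.
    
    ```java
    // Illustrative only; rowId is any valid row index in the map's batch.
    long key          = batch.column(0).getLong(rowId);                   // grouping key, only in the full batch
    long sumViaBatch  = batch.column(1).getLong(rowId);                   // sum buffer, ordinal 1 in the full batch
    long sumViaBuffer = aggregateBufferBatch.getRow(rowId).getLong(0);    // same buffer, ordinal 0 in the buffer view
    assert sumViaBatch == sumViaBuffer;  // both address the same underlying column data
    ```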



[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209210452
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209694650
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55753/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209675767
  
    **[Test build #55753 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55753/consoleFull)** for PR 12345 at commit [`0ca0db1`](https://github.com/apache/spark/commit/0ca0db17130bb6a1e59cdbc699f5abd946821d44).



[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209192126
  
    **[Test build #55678 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55678/consoleFull)** for PR 12345 at commit [`c2fc385`](https://github.com/apache/spark/commit/c2fc38584dd073036a1f04f7cd7da9fcf50739e8).



[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209190140
  
    **[Test build #55677 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55677/consoleFull)** for PR 12345 at commit [`4ee5687`](https://github.com/apache/spark/commit/4ee56873764d62efdaf8c47cb74aa399f2194fde).



[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12345#discussion_r59505993
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/BenchmarkWholeStageCodegen.scala ---
    @@ -153,16 +153,36 @@ class BenchmarkWholeStageCodegen extends SparkFunSuite {
       ignore("aggregate with keys") {
         val N = 20 << 20
     
    -    runBenchmark("Aggregate w keys", N) {
    -      sqlContext.range(N).selectExpr("(id & 65535) as k").groupBy("k").sum().collect()
    +    val benchmark = new Benchmark("Aggregate w keys", N)
    --- End diff --
    
    Do we have tests that exercise the correctness here? It should group by a varying number of keys to test the paths you added.
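    A sketch of the kind of check that could cover this (hypothetical test code, not from the PR;
    written in Java against the public DataFrame API and assuming a SQLContext named sqlContext is
    in scope, as in BenchmarkWholeStageCodegen): vary the number of distinct keys so the query
    exercises both the in-memory AggregateHashMap path and the BytesToBytesMap fallback, then
    validate the number of groups. Ideally the collected results would also be compared against a
    run with the new hash map disabled.
    
    ```java
    long n = 1L << 20;
    for (int distinctKeys : new int[] {1, 16, 65536, 1 << 20}) {
      long groups = sqlContext.range(n)
          .selectExpr("(id % " + distinctKeys + ") as k")
          .groupBy("k")
          .sum("k")
          .count();
      // Every residue class 0..distinctKeys-1 occurs at least once since distinctKeys <= n.
      assert groups == distinctKeys;
    }
    ```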



[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209771906
  
    @nongli thanks, comments addressed.



[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-210208658
  
    **[Test build #55869 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55869/consoleFull)** for PR 12345 at commit [`ec66a54`](https://github.com/apache/spark/commit/ec66a5404d07da359763e4e8f8af909d045d3eb1).



[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-210237677
  
    **[Test build #55886 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55886/consoleFull)** for PR 12345 at commit [`9b5ee1d`](https://github.com/apache/spark/commit/9b5ee1d9d648ae454cc0749047528509a032c7ab).



[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-210236961
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55869/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12345#discussion_r59649827
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregate.scala ---
    @@ -513,8 +555,11 @@ case class TungstenAggregate(
         ctx.currentVars = input
         val keyCode = GenerateUnsafeProjection.createCode(
    --- End diff --
    
    I think we can do better with the variable names to improve readability.
    
    keyCode, groupByKeys, and key all sound too similar. Add "row" or "unsafeRow" to the names of the keys that have been collected into an UnsafeRow, or differentiate them in some other way.



[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-210236958
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14447][SQL] Speed up TungstenAggregate ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12345#issuecomment-209693429
  
    **[Test build #55750 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55750/consoleFull)** for PR 12345 at commit [`ececd57`](https://github.com/apache/spark/commit/ececd5770e0c5410cf3da463f3b52db467d1f5ca).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.

