You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ooq <gi...@git.apache.org> on 2016/07/13 08:15:58 UTC

[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

GitHub user ooq opened a pull request:

    https://github.com/apache/spark/pull/14176

    [SPARK-16525][SQL] Enable Row Based HashMap in HashAggregateExec

    ## What changes were proposed in this pull request?
    
    This PR is the second step for the following feature:
    
    For hash aggregation in Spark SQL, we use a fast aggregation hashmap to act as a "cache" in order to boost aggregation performance. Previously, the hashmap is backed by a `ColumnarBatch`. This has performance issues when we have wide schema for the aggregation table (large number of key fields or value fields). 
    In this JIRA, we support another implementation of fast hashmap, which is backed by a `RowBatch`. We then automatically pick between the two implementations based on certain knobs.
    
    In this second-step PR, we enable `RowBasedHashMapGenerator` in `HashAggregateExec`. 
    
    ## How was this patch tested?
    
    Tests and benchmarks will be added in a separate PR in the series. 


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ooq/spark rowbasedfastaggmap-pr2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14176.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14176
    
----
commit c87f26b318b5d673ac95454df5c1cb9a56c677eb
Author: Qifan Pu <qi...@gmail.com>
Date:   2016-07-13T07:35:06Z

    add RowBatch and RowBasedHashMapGenerator

commit a3360e0ab1223dd43f891e755e648680a402b7df
Author: Qifan Pu <qi...@gmail.com>
Date:   2016-07-13T08:08:35Z

    enable row based hashmap

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r73743987
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/AggregateHashMapSuite.scala ---
    @@ -0,0 +1,41 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql
    +
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types.DecimalType
    +
    +abstract class AggregateHashMapSuite extends DataFrameAggregateSuite {
    +  import testImplicits._
    +
    +  protected def setAggregateHashMapImpl(): Unit
    +
    +  protected override def beforeAll(): Unit = {
    +      setAggregateHashMapImpl()
    +      sparkConf.set("spark.sql.codegen.fallback", "false")
    +      super.beforeAll()
    +  }
    +
    +  test("SQL decimal test") {
    --- End diff --
    
    can we just add this in `DataFrameAggregateSuite`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62944 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62944/consoleFull)** for PR 14176 at commit [`7194394`](https://github.com/apache/spark/commit/71943941ebe548e0a2c66d633893b7e2196b94a6).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62990/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62227 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62227/consoleFull)** for PR 14176 at commit [`a3360e0`](https://github.com/apache/spark/commit/a3360e0ab1223dd43f891e755e648680a402b7df).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    @clockfly as Qifan said, the rationale for not deleting the old vectorized hashmap code in the short-term was to enable us to quickly benchmark and compare the two implementations for a wide variety of workloads.
    
    That said, I think the high level issue is that we don't currently expose a good interface/hooks in our generated code that can be used to test custom operator implementations while running benchmarks or tests (... and given these first level aggregate hashmap are entirely generated during query compilation, injecting a class that can work for all schema types during testing isn't very straightforward).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by ooq <gi...@git.apache.org>.
Github user ooq commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Thanks for the comments @davies @sameeragarwal . This PR has been updated. Basically the only public boolean flag now is called `spark.sql.codegen.aggregate.map.twolevel.enable`. There is a separate non-public flag `spark.sql.codegen.aggregate.map.vectorized.enable` that allows testing and benchmarking of vectorized hashmap before we remove vectorized hashmap completely. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62990 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62990/consoleFull)** for PR 14176 at commit [`def94cc`](https://github.com/apache/spark/commit/def94ccf2b0630bcbc88067dc4e6f57434afb8ee).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62934 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62934/consoleFull)** for PR 14176 at commit [`5fae053`](https://github.com/apache/spark/commit/5fae053421658e8194a16752eeebe31a92177e67).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r73036933
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
    @@ -279,9 +280,15 @@ case class HashAggregateExec(
         .map(_.asInstanceOf[DeclarativeAggregate])
       private val bufferSchema = StructType.fromAttributes(aggregateBufferAttributes)
     
    -  // The name for Vectorized HashMap
    -  private var vectorizedHashMapTerm: String = _
    -  private var isVectorizedHashMapEnabled: Boolean = _
    +  // The name for Fast HashMap
    +  private var fastHashMapTerm: String = _
    +  // whether vectorized hashmap or row based hashmap is enabled
    +  // we make sure that at most one of the two flags is true
    +  // i.e., assertFalse(isVectorizedHashMapEnabled && isRowBasedHashMapEnabled)
    +  private var isVectorizedHashMapEnabled: Boolean = false
    +  private var isRowBasedHashMapEnabled: Boolean = false
    +  // auxiliary flag, true if any of two above is true
    +  private var isFastHashMapEnabled: Boolean = false
    --- End diff --
    
    `isFastHashMapEnabled  = isVectorizedHashMapEnabled  || isRowBasedHashMapEnabled`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62382 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62382/consoleFull)** for PR 14176 at commit [`a158125`](https://github.com/apache/spark/commit/a158125956627e502a8045fb077760063a3ca397).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r76340181
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
    @@ -518,15 +573,30 @@ case class HashAggregateExec(
         val doAgg = ctx.freshName("doAggregateWithKeys")
         val peakMemory = metricTerm(ctx, "peakMemory")
         val spillSize = metricTerm(ctx, "spillSize")
    +
    +    def generateGenerateCode(): String = {
    +      if (isFastHashMapEnabled) {
    +        if (isVectorizedHashMapEnabled) {
    +          s"""
    +               | ${fastHashMapGenerator.asInstanceOf[VectorizedHashMapGenerator].generate()}
    +          """.stripMargin
    +        } else {
    +          s"""
    +               | ${fastHashMapGenerator.asInstanceOf[RowBasedHashMapGenerator].generate()}
    +          """.stripMargin
    +        }
    --- End diff --
    
    If we are sure RowBasedHashMapGenerator is better than VectorizedHashMapGenerator in all cases, then probably we should not put both implementations in the production code. If we have requirements like performance benchmarking, maybe we should add the code in test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r76340251
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
    @@ -518,15 +573,30 @@ case class HashAggregateExec(
         val doAgg = ctx.freshName("doAggregateWithKeys")
         val peakMemory = metricTerm(ctx, "peakMemory")
         val spillSize = metricTerm(ctx, "spillSize")
    +
    +    def generateGenerateCode(): String = {
    +      if (isFastHashMapEnabled) {
    +        if (isVectorizedHashMapEnabled) {
    +          s"""
    +               | ${fastHashMapGenerator.asInstanceOf[VectorizedHashMapGenerator].generate()}
    +          """.stripMargin
    +        } else {
    +          s"""
    +               | ${fastHashMapGenerator.asInstanceOf[RowBasedHashMapGenerator].generate()}
    +          """.stripMargin
    +        }
    --- End diff --
    
    It is confusing to see different flags flying, isFastHashMapEnabled, isVectorizedHashMapEnabled...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62482 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62482/consoleFull)** for PR 14176 at commit [`461028e`](https://github.com/apache/spark/commit/461028e62c9d9821cf11abdb9d85e9a8edb58ba4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r76338880
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
    @@ -459,52 +475,91 @@ case class HashAggregateExec(
       }
     
       /**
    -   * Using the vectorized hash map in HashAggregate is currently supported for all primitive
    -   * data types during partial aggregation. However, we currently only enable the hash map for a
    -   * subset of cases that've been verified to show performance improvements on our benchmarks
    -   * subject to an internal conf that sets an upper limit on the maximum length of the aggregate
    -   * key/value schema.
    -   *
    +   * A required check for any fast hash map implementation (basically the common requirements
    +   * for row-based and vectorized).
    +   * Currently fast hash map is supported for primitive data types during partial aggregation.
        * This list of supported use-cases should be expanded over time.
        */
    -  private def enableVectorizedHashMap(ctx: CodegenContext): Boolean = {
    -    val schemaLength = (groupingKeySchema ++ bufferSchema).length
    +  private def checkIfFastHashMapSupported(ctx: CodegenContext): Boolean = {
         val isSupported =
           (groupingKeySchema ++ bufferSchema).forall(f => ctx.isPrimitiveType(f.dataType) ||
             f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType]) &&
             bufferSchema.nonEmpty && modes.forall(mode => mode == Partial || mode == PartialMerge)
     
    -    // We do not support byte array based decimal type for aggregate values as
    -    // ColumnVector.putDecimal for high-precision decimals doesn't currently support in-place
    +    // For vectorized hash map, We do not support byte array based decimal type for aggregate values
    --- End diff --
    
    I am not sure whether it still make sense to mention "vectorized hash map," here, if the vectorized hash map is not used in the code implementation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r73039235
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -499,14 +499,16 @@ object SQLConf {
           .intConf
           .createWithDefault(40)
     
    -  val VECTORIZED_AGG_MAP_MAX_COLUMNS =
    -    SQLConfigBuilder("spark.sql.codegen.aggregate.map.columns.max")
    +  val ENFORCE_FAST_AGG_MAP_IMPL =
    --- End diff --
    
    nit: Maybe just call this `FAST_AGG_MAP_IMP`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r76341601
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
    @@ -459,52 +475,91 @@ case class HashAggregateExec(
       }
     
       /**
    -   * Using the vectorized hash map in HashAggregate is currently supported for all primitive
    -   * data types during partial aggregation. However, we currently only enable the hash map for a
    -   * subset of cases that've been verified to show performance improvements on our benchmarks
    -   * subject to an internal conf that sets an upper limit on the maximum length of the aggregate
    -   * key/value schema.
    -   *
    +   * A required check for any fast hash map implementation (basically the common requirements
    +   * for row-based and vectorized).
    +   * Currently fast hash map is supported for primitive data types during partial aggregation.
        * This list of supported use-cases should be expanded over time.
        */
    -  private def enableVectorizedHashMap(ctx: CodegenContext): Boolean = {
    -    val schemaLength = (groupingKeySchema ++ bufferSchema).length
    +  private def checkIfFastHashMapSupported(ctx: CodegenContext): Boolean = {
         val isSupported =
           (groupingKeySchema ++ bufferSchema).forall(f => ctx.isPrimitiveType(f.dataType) ||
             f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType]) &&
             bufferSchema.nonEmpty && modes.forall(mode => mode == Partial || mode == PartialMerge)
     
    -    // We do not support byte array based decimal type for aggregate values as
    -    // ColumnVector.putDecimal for high-precision decimals doesn't currently support in-place
    +    // For vectorized hash map, We do not support byte array based decimal type for aggregate values
    +    // as ColumnVector.putDecimal for high-precision decimals doesn't currently support in-place
         // updates. Due to this, appending the byte array in the vectorized hash map can turn out to be
         // quite inefficient and can potentially OOM the executor.
    +    // For row-based hash map, while decimal update is supported in UnsafeRow, we will just act
    +    // conservative here, due to lack of testing and benchmarking.
         val isNotByteArrayDecimalType = bufferSchema.map(_.dataType).filter(_.isInstanceOf[DecimalType])
           .forall(!DecimalType.isByteArrayDecimalType(_))
     
    -    isSupported  && isNotByteArrayDecimalType &&
    -      schemaLength <= sqlContext.conf.vectorizedAggregateMapMaxColumns
    +    isSupported  && isNotByteArrayDecimalType
    +  }
    +
    +  private def enableTwoLevelHashMap(ctx: CodegenContext) = {
    +    if (!checkIfFastHashMapSupported(ctx)) {
    +      if (modes.forall(mode => mode == Partial || mode == PartialMerge) && !Utils.isTesting) {
    +        logInfo("spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but"
    +          + " current version of codegened fast hashmap does not support this aggregate.")
    +      }
    +    } else {
    +      isFastHashMapEnabled = true
    +
    +      // This is for testing/benchmarking only.
    +      // We enforce to first level to be a vectorized hashmap, instead of the default row-based one.
    +      sqlContext.getConf("spark.sql.codegen.aggregate.map.vectorized.enable", null) match {
    +        case "true" => isVectorizedHashMapEnabled = true
    +        case null | "" | "false" => None      }
    +    }
       }
     
       private def doProduceWithKeys(ctx: CodegenContext): String = {
         val initAgg = ctx.freshName("initAgg")
         ctx.addMutableState("boolean", initAgg, s"$initAgg = false;")
    -    isVectorizedHashMapEnabled = enableVectorizedHashMap(ctx)
    -    vectorizedHashMapTerm = ctx.freshName("vectorizedHashMap")
    -    val vectorizedHashMapClassName = ctx.freshName("VectorizedHashMap")
    -    val vectorizedHashMapGenerator = new VectorizedHashMapGenerator(ctx, aggregateExpressions,
    -      vectorizedHashMapClassName, groupingKeySchema, bufferSchema)
    +    if (sqlContext.conf.enableTwoLevelAggMap) {
    +      enableTwoLevelHashMap(ctx)
    +    } else {
    +      sqlContext.getConf("spark.sql.codegen.aggregate.map.vectorized.enable", null) match {
    +        case "true" => logWarning("Two level hashmap is disabled but vectorized hashmap is " +
    +          "enabled.")
    +        case null | "" | "false" => None
    +      }
    +    }
    +    fastHashMapTerm = ctx.freshName("fastHashMap")
    +    val fastHashMapClassName = ctx.freshName("FastHashMap")
    +    val fastHashMapGenerator =
    +      if (isVectorizedHashMapEnabled) {
    +        new VectorizedHashMapGenerator(ctx, aggregateExpressions,
    +          fastHashMapClassName, groupingKeySchema, bufferSchema)
    +      } else {
    +        new RowBasedHashMapGenerator(ctx, aggregateExpressions,
    +          fastHashMapClassName, groupingKeySchema, bufferSchema)
    +      }
    +
    +    val thisPlan = ctx.addReferenceObj("plan", this)
    +
         // Create a name for iterator from vectorized HashMap
    -    val iterTermForVectorizedHashMap = ctx.freshName("vectorizedHashMapIter")
    -    if (isVectorizedHashMapEnabled) {
    -      ctx.addMutableState(vectorizedHashMapClassName, vectorizedHashMapTerm,
    -        s"$vectorizedHashMapTerm = new $vectorizedHashMapClassName();")
    -      ctx.addMutableState(
    -        "java.util.Iterator<org.apache.spark.sql.execution.vectorized.ColumnarBatch.Row>",
    -        iterTermForVectorizedHashMap, "")
    +    val iterTermForFastHashMap = ctx.freshName("fastHashMapIter")
    +    if (isFastHashMapEnabled) {
    +      if (isVectorizedHashMapEnabled) {
    +        ctx.addMutableState(fastHashMapClassName, fastHashMapTerm,
    --- End diff --
    
    Ah, I see VectorizedHashMap is a special Fast hashmap. Do we really need to introduce two flags that has a hierarchy. 
     
    Will two parallel flags like `isRowbasedHashMapEnabled` and `isVectorizedHashMapEnabled` better?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #63164 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63164/consoleFull)** for PR 14176 at commit [`b32cb7b`](https://github.com/apache/spark/commit/b32cb7b4825ab37326c40141e2cb26922d005ef4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62934 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62934/consoleFull)** for PR 14176 at commit [`5fae053`](https://github.com/apache/spark/commit/5fae053421658e8194a16752eeebe31a92177e67).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by ooq <gi...@git.apache.org>.
Github user ooq commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Added the explicit SQL tests for both hash map implementations. The test suites extend `DataFrameAggregateSuite` and reuse all tests there. For the two bugs that failed previous builds: the length bug would be caught by those tests; the decimal bug is tested with an added "SQL decimal test".


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62952 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62952/consoleFull)** for PR 14176 at commit [`122cf18`](https://github.com/apache/spark/commit/122cf181841046a026c4ffd7fd0c597a03ff30dd).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by ooq <gi...@git.apache.org>.
Github user ooq commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r73050331
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
    @@ -279,9 +280,15 @@ case class HashAggregateExec(
         .map(_.asInstanceOf[DeclarativeAggregate])
       private val bufferSchema = StructType.fromAttributes(aggregateBufferAttributes)
     
    -  // The name for Vectorized HashMap
    -  private var vectorizedHashMapTerm: String = _
    -  private var isVectorizedHashMapEnabled: Boolean = _
    +  // The name for Fast HashMap
    +  private var fastHashMapTerm: String = _
    +  // whether vectorized hashmap or row based hashmap is enabled
    +  // we make sure that at most one of the two flags is true
    +  // i.e., assertFalse(isVectorizedHashMapEnabled && isRowBasedHashMapEnabled)
    +  private var isVectorizedHashMapEnabled: Boolean = false
    +  private var isRowBasedHashMapEnabled: Boolean = false
    +  // auxiliary flag, true if any of two above is true
    +  private var isFastHashMapEnabled: Boolean = false
    --- End diff --
    
    This flag check if one of them is enabled. As some of the generated code is same for both hash maps, the flag could make condition-checking clearer. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62440 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62440/consoleFull)** for PR 14176 at commit [`ecff4ff`](https://github.com/apache/spark/commit/ecff4ff3f30aefbaea89a12d2d5b3fda062b0f38).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62229/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62452 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62452/consoleFull)** for PR 14176 at commit [`ce72d90`](https://github.com/apache/spark/commit/ce72d900004bfa720460126a3573642a8a97bc53).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #64387 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64387/consoleFull)** for PR 14176 at commit [`a58314c`](https://github.com/apache/spark/commit/a58314cc110899b19499b1d06bdc04cf8439a79f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62952 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62952/consoleFull)** for PR 14176 at commit [`122cf18`](https://github.com/apache/spark/commit/122cf181841046a026c4ffd7fd0c597a03ff30dd).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62952/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62382/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62440 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62440/consoleFull)** for PR 14176 at commit [`ecff4ff`](https://github.com/apache/spark/commit/ecff4ff3f30aefbaea89a12d2d5b3fda062b0f38).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r73038709
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
    @@ -660,36 +778,74 @@ case class HashAggregateExec(
           }
         }
     
    +
         val updateRowInVectorizedHashMap: Option[String] = {
    --- End diff --
    
    This and `updateRowInRowBasedHashMap` seem to have a lot of duplicated code. Can this be avoided?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by ooq <gi...@git.apache.org>.
Github user ooq commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    @davies @sameeragarwal I updated more results in the benchmark PR https://github.com/apache/spark/pull/14266 . 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by ooq <gi...@git.apache.org>.
Github user ooq commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    cc @sameeragarwal @davies @rxin 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62482/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62440/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62934/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by ooq <gi...@git.apache.org>.
Github user ooq commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Thanks for the comments @clockfly ! As per discussion with @sameeragarwal, I think the plan is to give users the option to turn on/off two-level hashmap. This is why we have this first level logic for enabling two-level/fast map. We also want to keep both impls (vectorized /row-based) for a while before deleting vectorized in the future, which leads to the internal flags that pick between the two impls. If you guys decide otherwise, I'm happy to update the PR accordingly. @clockfly @sameeragarwal @davies Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r76338535
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
    @@ -279,9 +280,14 @@ case class HashAggregateExec(
         .map(_.asInstanceOf[DeclarativeAggregate])
       private val bufferSchema = StructType.fromAttributes(aggregateBufferAttributes)
     
    -  // The name for Vectorized HashMap
    -  private var vectorizedHashMapTerm: String = _
    -  private var isVectorizedHashMapEnabled: Boolean = _
    +  // The name for Fast HashMap
    +  private var fastHashMapTerm: String = _
    --- End diff --
    
    Should we use a more descriptive name than "fast", there can always be faster implementation?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62227/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64387/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62944 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62944/consoleFull)** for PR 14176 at commit [`7194394`](https://github.com/apache/spark/commit/71943941ebe548e0a2c66d633893b7e2196b94a6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    LGTM, I will merge this one to master (enable us to do more benchmarks with these two implementations).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14176


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62349/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Can we make this `spark.sql.codegen.aggregate.map.twolevel.enable` internal? otherwise we should have a better name.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #64387 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64387/consoleFull)** for PR 14176 at commit [`a58314c`](https://github.com/apache/spark/commit/a58314cc110899b19499b1d06bdc04cf8439a79f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r73747068
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/AggregateHashMapSuite.scala ---
    @@ -0,0 +1,41 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql
    +
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types.DecimalType
    +
    +abstract class AggregateHashMapSuite extends DataFrameAggregateSuite {
    --- End diff --
    
    As discussed offline, let's just move this in `DataFrameAggregateSuite` to prevent inadvertent overrides.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by ooq <gi...@git.apache.org>.
Github user ooq commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    @davies I guess there is still benefit to make it public? If the user knows that their workload would always run faster with single-level, e.g., many distinct keys. I thought about `spark.sql.codegen.aggregate.map.fast.enable` or `spark.sql.codegen.aggregate.map.codegen.enable`, but none of them captures the fact that the biggest distinction is the two level design.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62452/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63164/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62229 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62229/consoleFull)** for PR 14176 at commit [`9b0b294`](https://github.com/apache/spark/commit/9b0b294013239f4db744d7f5f5c1bdf838dd0559).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r73037506
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
    @@ -459,52 +476,115 @@ case class HashAggregateExec(
       }
     
       /**
    -   * Using the vectorized hash map in HashAggregate is currently supported for all primitive
    -   * data types during partial aggregation. However, we currently only enable the hash map for a
    -   * subset of cases that've been verified to show performance improvements on our benchmarks
    -   * subject to an internal conf that sets an upper limit on the maximum length of the aggregate
    -   * key/value schema.
    -   *
    +   * A required check for any fast hash map implementation (basically the common requirements
    +   * for row-based and vectorized).
    +   * Currently fast hash map is supported for primitive data types during partial aggregation.
        * This list of supported use-cases should be expanded over time.
        */
    -  private def enableVectorizedHashMap(ctx: CodegenContext): Boolean = {
    -    val schemaLength = (groupingKeySchema ++ bufferSchema).length
    +  private def checkIfFastHashMapSupported(ctx: CodegenContext): Boolean = {
         val isSupported =
           (groupingKeySchema ++ bufferSchema).forall(f => ctx.isPrimitiveType(f.dataType) ||
             f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType]) &&
             bufferSchema.nonEmpty && modes.forall(mode => mode == Partial || mode == PartialMerge)
     
    -    // We do not support byte array based decimal type for aggregate values as
    -    // ColumnVector.putDecimal for high-precision decimals doesn't currently support in-place
    -    // updates. Due to this, appending the byte array in the vectorized hash map can turn out to be
    -    // quite inefficient and can potentially OOM the executor.
    +    // Acting conservative and do not support byte array based decimal type for aggregate values
    +    // for now.
    --- End diff --
    
    let's keep the old comment


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62349 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62349/consoleFull)** for PR 14176 at commit [`225b661`](https://github.com/apache/spark/commit/225b6619cd070ac9da3846a3bd02fa730e4ec835).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r76338652
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
    @@ -279,9 +280,14 @@ case class HashAggregateExec(
         .map(_.asInstanceOf[DeclarativeAggregate])
       private val bufferSchema = StructType.fromAttributes(aggregateBufferAttributes)
     
    -  // The name for Vectorized HashMap
    -  private var vectorizedHashMapTerm: String = _
    -  private var isVectorizedHashMapEnabled: Boolean = _
    +  // The name for Fast HashMap
    +  private var fastHashMapTerm: String = _
    +  private var isFastHashMapEnabled: Boolean = false
    +
    +  // whether a vectorized hashmap is used instead
    +  // we have decided to always use the row-based hashmap,
    +  // but the vectorized hashmap can still be switched on for testing and benchmarking purposes.
    +  private var isVectorizedHashMapEnabled: Boolean = false
    --- End diff --
    
    If it is only used in testing/benchmarking, is it worthy to put this piece in the production code?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r73747170
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -499,14 +499,16 @@ object SQLConf {
           .intConf
           .createWithDefault(40)
     
    -  val VECTORIZED_AGG_MAP_MAX_COLUMNS =
    -    SQLConfigBuilder("spark.sql.codegen.aggregate.map.columns.max")
    --- End diff --
    
    Let's also make sure that all references to the old config are also appropriately modified.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r73070005
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
    @@ -279,9 +280,15 @@ case class HashAggregateExec(
         .map(_.asInstanceOf[DeclarativeAggregate])
       private val bufferSchema = StructType.fromAttributes(aggregateBufferAttributes)
     
    -  // The name for Vectorized HashMap
    -  private var vectorizedHashMapTerm: String = _
    -  private var isVectorizedHashMapEnabled: Boolean = _
    +  // The name for Fast HashMap
    +  private var fastHashMapTerm: String = _
    +  // whether vectorized hashmap or row based hashmap is enabled
    +  // we make sure that at most one of the two flags is true
    +  // i.e., assertFalse(isVectorizedHashMapEnabled && isRowBasedHashMapEnabled)
    +  private var isVectorizedHashMapEnabled: Boolean = false
    +  private var isRowBasedHashMapEnabled: Boolean = false
    +  // auxiliary flag, true if any of two above is true
    +  private var isFastHashMapEnabled: Boolean = false
    --- End diff --
    
    Sure, what I meant was that we can even initialize it with `isVectorizedHashMapEnabled || isRowBasedHashMapEnabled` to make the implied semantics clear.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62482 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62482/consoleFull)** for PR 14176 at commit [`461028e`](https://github.com/apache/spark/commit/461028e62c9d9821cf11abdb9d85e9a8edb58ba4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #63164 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63164/consoleFull)** for PR 14176 at commit [`b32cb7b`](https://github.com/apache/spark/commit/b32cb7b4825ab37326c40141e2cb26922d005ef4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62349 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62349/consoleFull)** for PR 14176 at commit [`225b661`](https://github.com/apache/spark/commit/225b6619cd070ac9da3846a3bd02fa730e4ec835).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    @ooq @sameeragarwal, it looks like this patch is the culprit behind some OOMs that I'm observing with random queries; see https://issues.apache.org/jira/browse/SPARK-17405


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Let's hold on this, if we are going to have single implementation for fast hash map (based on the benchmark result in another PR), do need to merge this fancy implementation choosing. cc @rxin 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62227 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62227/consoleFull)** for PR 14176 at commit [`a3360e0`](https://github.com/apache/spark/commit/a3360e0ab1223dd43f891e755e648680a402b7df).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by ooq <gi...@git.apache.org>.
Github user ooq commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Thanks. I will take a look tonight.
    
    > On Sep 5, 2016, at 4:46 PM, Josh Rosen <no...@github.com> wrote:
    > 
    > @ooq @sameeragarwal, it looks like this patch is the culprit behind some OOMs that I'm observing with random queries; see https://issues.apache.org/jira/browse/SPARK-17405
    > 
    > \u2014
    > You are receiving this because you were mentioned.
    > Reply to this email directly, view it on GitHub, or mute the thread.
    > 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62382 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62382/consoleFull)** for PR 14176 at commit [`a158125`](https://github.com/apache/spark/commit/a158125956627e502a8045fb077760063a3ca397).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r76339716
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
    @@ -459,52 +475,91 @@ case class HashAggregateExec(
       }
     
       /**
    -   * Using the vectorized hash map in HashAggregate is currently supported for all primitive
    -   * data types during partial aggregation. However, we currently only enable the hash map for a
    -   * subset of cases that've been verified to show performance improvements on our benchmarks
    -   * subject to an internal conf that sets an upper limit on the maximum length of the aggregate
    -   * key/value schema.
    -   *
    +   * A required check for any fast hash map implementation (basically the common requirements
    +   * for row-based and vectorized).
    +   * Currently fast hash map is supported for primitive data types during partial aggregation.
        * This list of supported use-cases should be expanded over time.
        */
    -  private def enableVectorizedHashMap(ctx: CodegenContext): Boolean = {
    -    val schemaLength = (groupingKeySchema ++ bufferSchema).length
    +  private def checkIfFastHashMapSupported(ctx: CodegenContext): Boolean = {
         val isSupported =
           (groupingKeySchema ++ bufferSchema).forall(f => ctx.isPrimitiveType(f.dataType) ||
             f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType]) &&
             bufferSchema.nonEmpty && modes.forall(mode => mode == Partial || mode == PartialMerge)
     
    -    // We do not support byte array based decimal type for aggregate values as
    -    // ColumnVector.putDecimal for high-precision decimals doesn't currently support in-place
    +    // For vectorized hash map, We do not support byte array based decimal type for aggregate values
    +    // as ColumnVector.putDecimal for high-precision decimals doesn't currently support in-place
         // updates. Due to this, appending the byte array in the vectorized hash map can turn out to be
         // quite inefficient and can potentially OOM the executor.
    +    // For row-based hash map, while decimal update is supported in UnsafeRow, we will just act
    +    // conservative here, due to lack of testing and benchmarking.
         val isNotByteArrayDecimalType = bufferSchema.map(_.dataType).filter(_.isInstanceOf[DecimalType])
           .forall(!DecimalType.isByteArrayDecimalType(_))
     
    -    isSupported  && isNotByteArrayDecimalType &&
    -      schemaLength <= sqlContext.conf.vectorizedAggregateMapMaxColumns
    +    isSupported  && isNotByteArrayDecimalType
    +  }
    +
    +  private def enableTwoLevelHashMap(ctx: CodegenContext) = {
    +    if (!checkIfFastHashMapSupported(ctx)) {
    +      if (modes.forall(mode => mode == Partial || mode == PartialMerge) && !Utils.isTesting) {
    +        logInfo("spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but"
    +          + " current version of codegened fast hashmap does not support this aggregate.")
    +      }
    +    } else {
    +      isFastHashMapEnabled = true
    +
    +      // This is for testing/benchmarking only.
    +      // We enforce to first level to be a vectorized hashmap, instead of the default row-based one.
    +      sqlContext.getConf("spark.sql.codegen.aggregate.map.vectorized.enable", null) match {
    +        case "true" => isVectorizedHashMapEnabled = true
    +        case null | "" | "false" => None      }
    +    }
       }
     
       private def doProduceWithKeys(ctx: CodegenContext): String = {
         val initAgg = ctx.freshName("initAgg")
         ctx.addMutableState("boolean", initAgg, s"$initAgg = false;")
    -    isVectorizedHashMapEnabled = enableVectorizedHashMap(ctx)
    -    vectorizedHashMapTerm = ctx.freshName("vectorizedHashMap")
    -    val vectorizedHashMapClassName = ctx.freshName("VectorizedHashMap")
    -    val vectorizedHashMapGenerator = new VectorizedHashMapGenerator(ctx, aggregateExpressions,
    -      vectorizedHashMapClassName, groupingKeySchema, bufferSchema)
    +    if (sqlContext.conf.enableTwoLevelAggMap) {
    +      enableTwoLevelHashMap(ctx)
    +    } else {
    +      sqlContext.getConf("spark.sql.codegen.aggregate.map.vectorized.enable", null) match {
    +        case "true" => logWarning("Two level hashmap is disabled but vectorized hashmap is " +
    +          "enabled.")
    +        case null | "" | "false" => None
    +      }
    +    }
    --- End diff --
    
    Maybe you want to wrap line 521-529 in a function like `checkAndEnableTwoLevelHashMap`, and do a call like
    ```
    isFastHashMapEnabled = checkAndEnableTwoLevelHashMap(ctx)
    ```
    
    Current impl of `enableTwoLevelHashMap` has a side effect of updating boolean value `isFastHashMapEnabled`. User is not clear of the side effect if he doesn't inspect the code of `enableTwoLevelHashMap` 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62229 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62229/consoleFull)** for PR 14176 at commit [`9b0b294`](https://github.com/apache/spark/commit/9b0b294013239f4db744d7f5f5c1bdf838dd0559).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r73038918
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/RowBasedHashMapGenerator.scala ---
    @@ -141,8 +141,16 @@ class RowBasedHashMapGenerator(
         }
     
         val createUnsafeRowForKey = groupingKeys.zipWithIndex.map { case (key: Buffer, ordinal: Int) =>
    -      s"agg_rowWriter.write(${ordinal}, ${key.name})"}
    -      .mkString(";\n")
    +      key.dataType match {
    --- End diff --
    
    Can we add a regression test for this in [sql]?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r73036686
  
    --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/FixedLengthRowBasedKeyValueBatch.java ---
    @@ -165,10 +165,16 @@ private void freeCurrentPage() {
       protected FixedLengthRowBasedKeyValueBatch(StructType keySchema, StructType valueSchema,
                                                  int maxRows, TaskMemoryManager manager) {
         super(keySchema, valueSchema, maxRows, manager);
    -    klen = keySchema.defaultSize()
    -            + UnsafeRow.calculateBitSetWidthInBytes(keySchema.length());
    -    vlen = valueSchema.defaultSize()
    -            + UnsafeRow.calculateBitSetWidthInBytes(valueSchema.length());
    +    int keySize = 0;
    +    int valueSize = 0;
    +    for (String name : keySchema.fieldNames()) {
    +      keySize += (keySchema.apply(name).dataType().defaultSize() + 7) / 8 * 8;
    --- End diff --
    
    Let's add a small comment about this implicit ceiling logic and the reason why `schema.defaultSize()` doesn't work.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62944/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r73039793
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -499,14 +499,16 @@ object SQLConf {
           .intConf
           .createWithDefault(40)
     
    -  val VECTORIZED_AGG_MAP_MAX_COLUMNS =
    -    SQLConfigBuilder("spark.sql.codegen.aggregate.map.columns.max")
    +  val ENFORCE_FAST_AGG_MAP_IMPL =
    +    SQLConfigBuilder("spark.sql.codegen.aggregate.map.enforce.impl")
           .internal()
    -      .doc("Sets the maximum width of schema (aggregate keys + values) for which aggregate with" +
    -        "keys uses an in-memory columnar map to speed up execution. Setting this to 0 effectively" +
    -        "disables the columnar map")
    -      .intConf
    -      .createWithDefault(3)
    +      .doc("Sets the implementation for fast hash map during aggregation. Could be one of the " +
    +        "following: rowbased, vectorized, skip, auto. Defaults to auto, and should only be other " +
    +        "values for testing purposes.")
    +      .stringConf
    +      .transform(_.toLowerCase())
    +      .checkValues(Set("rowbased", "vectorized", "skip", "auto"))
    +      .createWithDefault("auto")
    --- End diff --
    
    Note that this removes the entire vectorized hashmap code from the the test path even though we claim to support it. Let's make sure that we have explicit tests that test for both "rowbased" and "vectorized".


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14176#discussion_r76339927
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
    @@ -459,52 +475,91 @@ case class HashAggregateExec(
       }
     
       /**
    -   * Using the vectorized hash map in HashAggregate is currently supported for all primitive
    -   * data types during partial aggregation. However, we currently only enable the hash map for a
    -   * subset of cases that've been verified to show performance improvements on our benchmarks
    -   * subject to an internal conf that sets an upper limit on the maximum length of the aggregate
    -   * key/value schema.
    -   *
    +   * A required check for any fast hash map implementation (basically the common requirements
    +   * for row-based and vectorized).
    +   * Currently fast hash map is supported for primitive data types during partial aggregation.
        * This list of supported use-cases should be expanded over time.
        */
    -  private def enableVectorizedHashMap(ctx: CodegenContext): Boolean = {
    -    val schemaLength = (groupingKeySchema ++ bufferSchema).length
    +  private def checkIfFastHashMapSupported(ctx: CodegenContext): Boolean = {
         val isSupported =
           (groupingKeySchema ++ bufferSchema).forall(f => ctx.isPrimitiveType(f.dataType) ||
             f.dataType.isInstanceOf[DecimalType] || f.dataType.isInstanceOf[StringType]) &&
             bufferSchema.nonEmpty && modes.forall(mode => mode == Partial || mode == PartialMerge)
     
    -    // We do not support byte array based decimal type for aggregate values as
    -    // ColumnVector.putDecimal for high-precision decimals doesn't currently support in-place
    +    // For vectorized hash map, We do not support byte array based decimal type for aggregate values
    +    // as ColumnVector.putDecimal for high-precision decimals doesn't currently support in-place
         // updates. Due to this, appending the byte array in the vectorized hash map can turn out to be
         // quite inefficient and can potentially OOM the executor.
    +    // For row-based hash map, while decimal update is supported in UnsafeRow, we will just act
    +    // conservative here, due to lack of testing and benchmarking.
         val isNotByteArrayDecimalType = bufferSchema.map(_.dataType).filter(_.isInstanceOf[DecimalType])
           .forall(!DecimalType.isByteArrayDecimalType(_))
     
    -    isSupported  && isNotByteArrayDecimalType &&
    -      schemaLength <= sqlContext.conf.vectorizedAggregateMapMaxColumns
    +    isSupported  && isNotByteArrayDecimalType
    +  }
    +
    +  private def enableTwoLevelHashMap(ctx: CodegenContext) = {
    +    if (!checkIfFastHashMapSupported(ctx)) {
    +      if (modes.forall(mode => mode == Partial || mode == PartialMerge) && !Utils.isTesting) {
    +        logInfo("spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but"
    +          + " current version of codegened fast hashmap does not support this aggregate.")
    +      }
    +    } else {
    +      isFastHashMapEnabled = true
    +
    +      // This is for testing/benchmarking only.
    +      // We enforce to first level to be a vectorized hashmap, instead of the default row-based one.
    +      sqlContext.getConf("spark.sql.codegen.aggregate.map.vectorized.enable", null) match {
    +        case "true" => isVectorizedHashMapEnabled = true
    +        case null | "" | "false" => None      }
    +    }
       }
     
       private def doProduceWithKeys(ctx: CodegenContext): String = {
         val initAgg = ctx.freshName("initAgg")
         ctx.addMutableState("boolean", initAgg, s"$initAgg = false;")
    -    isVectorizedHashMapEnabled = enableVectorizedHashMap(ctx)
    -    vectorizedHashMapTerm = ctx.freshName("vectorizedHashMap")
    -    val vectorizedHashMapClassName = ctx.freshName("VectorizedHashMap")
    -    val vectorizedHashMapGenerator = new VectorizedHashMapGenerator(ctx, aggregateExpressions,
    -      vectorizedHashMapClassName, groupingKeySchema, bufferSchema)
    +    if (sqlContext.conf.enableTwoLevelAggMap) {
    +      enableTwoLevelHashMap(ctx)
    +    } else {
    +      sqlContext.getConf("spark.sql.codegen.aggregate.map.vectorized.enable", null) match {
    +        case "true" => logWarning("Two level hashmap is disabled but vectorized hashmap is " +
    +          "enabled.")
    +        case null | "" | "false" => None
    +      }
    +    }
    +    fastHashMapTerm = ctx.freshName("fastHashMap")
    +    val fastHashMapClassName = ctx.freshName("FastHashMap")
    +    val fastHashMapGenerator =
    +      if (isVectorizedHashMapEnabled) {
    +        new VectorizedHashMapGenerator(ctx, aggregateExpressions,
    +          fastHashMapClassName, groupingKeySchema, bufferSchema)
    +      } else {
    +        new RowBasedHashMapGenerator(ctx, aggregateExpressions,
    +          fastHashMapClassName, groupingKeySchema, bufferSchema)
    +      }
    +
    +    val thisPlan = ctx.addReferenceObj("plan", this)
    +
         // Create a name for iterator from vectorized HashMap
    -    val iterTermForVectorizedHashMap = ctx.freshName("vectorizedHashMapIter")
    -    if (isVectorizedHashMapEnabled) {
    -      ctx.addMutableState(vectorizedHashMapClassName, vectorizedHashMapTerm,
    -        s"$vectorizedHashMapTerm = new $vectorizedHashMapClassName();")
    -      ctx.addMutableState(
    -        "java.util.Iterator<org.apache.spark.sql.execution.vectorized.ColumnarBatch.Row>",
    -        iterTermForVectorizedHashMap, "")
    +    val iterTermForFastHashMap = ctx.freshName("fastHashMapIter")
    +    if (isFastHashMapEnabled) {
    +      if (isVectorizedHashMapEnabled) {
    +        ctx.addMutableState(fastHashMapClassName, fastHashMapTerm,
    --- End diff --
    
    Since `isFastHashMapEnabled` and `isVectorizedHashMapEnabled` both can be true, does this means that both fastHashMap and vectorized hash map can be used at the same time?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62452 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62452/consoleFull)** for PR 14176 at commit [`ce72d90`](https://github.com/apache/spark/commit/ce72d900004bfa720460126a3573642a8a97bc53).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14176
  
    **[Test build #62990 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62990/consoleFull)** for PR 14176 at commit [`def94cc`](https://github.com/apache/spark/commit/def94ccf2b0630bcbc88067dc4e6f57434afb8ee).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org