Posted to reviews@spark.apache.org by sameeragarwal <gi...@git.apache.org> on 2016/04/05 03:07:33 UTC

[GitHub] spark pull request: [WIP][SPARK-14394][SQL] Generate AggregateHash...

GitHub user sameeragarwal opened a pull request:

    https://github.com/apache/spark/pull/12161

    [WIP][SPARK-14394][SQL] Generate AggregateHashMap class during TungstenAggregate codegen

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    
    
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark tungsten-aggregate

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12161.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12161
    
----
commit a1fe9f83c182ea02b6c9a8825ba90f10a5e6d638
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-03-31T21:15:34Z

    Make ColumnarBatch.Row mutable

commit 2ea924397c45ad69e88b38c5a8a2ea9d7b926a64
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-03-31T21:15:34Z

    Make ColumnarBatch.Row mutable

commit 3c54cd0072bc7c3191419c2a9a379b9377941152
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-04-01T06:12:42Z

    insert hashmap

commit ec14a7f63106917aecf9cc4e372436cfe7f8ac52
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-04-01T18:50:36Z

    initial attempts

commit 2002131156909e42d0617e684a7f6fa373699d92
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-04-02T00:14:44Z

    CR

commit 85e3c908a86c8f4d3d994360f72ea64c64b8bc5b
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-04-04T18:47:37Z

    Merge branch 'mutable-row' of github.com:sameeragarwal/spark into tungsten-aggregate

commit 1f5a60fbc60436f0409cd4e7ec4b12937b6e4294
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-04-04T23:45:30Z

    codegened

commit 8a47e1ea9886a3389d7254da4391fa2446e74b40
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-04-05T00:48:33Z

    Generate codegened Hashmap

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-206025288
  
    **[Test build #55036 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55036/consoleFull)** for PR 12161 at commit [`a31be48`](https://github.com/apache/spark/commit/a31be487e9f369c0eb30c4e22df85765867a3478).



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12161#discussion_r58978723
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregateHashMap.scala ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.aggregate
    +
    +import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
    +import org.apache.spark.sql.types.StructType
    +
    +class TungstenAggregateHashMap(
    +    ctx: CodegenContext,
    +    generatedClassName: String,
    +    groupingKeySchema: StructType,
    +    bufferSchema: StructType) {
    +  val groupingKeys = groupingKeySchema.map(k => (k.dataType.typeName, ctx.freshName("key")))
    +  val bufferValues = bufferSchema.map(k => (k.dataType.typeName, ctx.freshName("value")))
    +  val groupingKeySignature = groupingKeys.map(_.productIterator.toList.mkString(" ")).mkString(", ")
    +
    +  def generate(): String = {
    +    s"""
    +       |public class $generatedClassName {
    +       |${initializeAggregateHashMap()}
    +       |
    +       |${generateFindOrInsert()}
    +       |
    +       |${generateEquals()}
    +       |
    +       |${generateHashFunction()}
    +       |}
    +     """.stripMargin
    +  }
    +
    +  def initializeAggregateHashMap(): String = {
    +    val generatedSchema: String =
    +      s"""
    +         |new org.apache.spark.sql.types.StructType()
    +         |${(groupingKeySchema ++ bufferSchema).map(key =>
    +          s""".add("${key.name}", org.apache.spark.sql.types.DataTypes.${key.dataType})""")
    +          .mkString("\n")};
    +      """.stripMargin
    +
    +    s"""
    +       |  private org.apache.spark.sql.execution.vectorized.ColumnarBatch batch;
    +       |  private int[] buckets;
    +       |  private int numBuckets;
    +       |  private int maxSteps;
    +       |  private int numRows = 0;
    +       |  private org.apache.spark.sql.types.StructType schema = $generatedSchema
    +       |
    +       |  public $generatedClassName(int capacity, double loadFactor, int maxSteps) {
    +       |    assert (capacity > 0 && ((capacity & (capacity - 1)) == 0));
    +       |    this.maxSteps = maxSteps;
    +       |    numBuckets = (int) (capacity / loadFactor);
    +       |    batch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(schema,
    +       |      org.apache.spark.memory.MemoryMode.ON_HEAP, capacity);
    +       |    buckets = new int[numBuckets];
    +       |    java.util.Arrays.fill(buckets, -1);
    +       |  }
    +       |
    +       |  public $generatedClassName() {
    +       |    this(1 << 16, 0.25, 5);
    +       |  }
    +     """.stripMargin
    +  }
    +
    +  def generateHashFunction(): String = {
    --- End diff --
    
    One thing that might be useful is to include the generated code itself as comments.
    
    Same for generateEquals and generateFindOrInsert.
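
As an illustration of what such a comment could document, here is a hand-written sketch of the kind of class the templates in the diff might render for a schema with one long grouping key and one long buffer value. It is not the PR's actual output. The names `GeneratedAggregateHashMapSketch`, `key0`, `add`, and `get` are hypothetical, plain arrays stand in for `ColumnarBatch` so the sketch runs without Spark, `findOrInsert` returns a row index rather than a `ColumnarBatch.Row`, and the linear probing bounded by `maxSteps` is an assumption because the body of `findOrInsert` is not visible in the quoted diff.

```java
// Hand-written stand-in, NOT actual output of TungstenAggregateHashMap.generate().
// Assumptions: one long grouping key (key0), one long buffer value, and plain
// arrays instead of ColumnarBatch so the sketch runs without Spark.
final class GeneratedAggregateHashMapSketch {
    private final long[] keyColumn;   // stands in for batch.column(0)
    private final long[] valueColumn; // stands in for batch.column(1)
    private final int[] buckets;      // bucket -> row index, -1 when empty
    private final int capacity;
    private final int numBuckets;
    private final int maxSteps;
    private int numRows = 0;

    GeneratedAggregateHashMapSketch(int capacity, double loadFactor, int maxSteps) {
        if (capacity <= 0 || (capacity & (capacity - 1)) != 0) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        this.capacity = capacity;
        this.maxSteps = maxSteps;
        this.numBuckets = (int) (capacity / loadFactor);
        this.keyColumn = new long[capacity];
        this.valueColumn = new long[capacity];
        this.buckets = new int[numBuckets];
        java.util.Arrays.fill(buckets, -1);
    }

    GeneratedAggregateHashMapSketch() {
        this(1 << 16, 0.25, 5); // default values taken from the diff
    }

    // With a single key, the template's " & "-joined expression is just the key.
    private long hash(long key0) {
        return key0;
    }

    private boolean equals(int idx, long key0) {
        return keyColumn[buckets[idx]] == key0;
    }

    // Returns the row index for key0, inserting a zeroed row if the key is absent.
    // Returns -1 when no slot is found within maxSteps probes or the rows are
    // full; the linear probing here is an assumption, not taken from the PR.
    int findOrInsert(long key0) {
        int idx = (int) Math.floorMod(hash(key0), (long) numBuckets);
        for (int step = 0; step < maxSteps; step++) {
            if (buckets[idx] == -1) {
                if (numRows == capacity) {
                    return -1;
                }
                keyColumn[numRows] = key0;
                valueColumn[numRows] = 0L;
                buckets[idx] = numRows;
                return numRows++;
            }
            if (equals(idx, key0)) {
                return buckets[idx];
            }
            idx = (idx + 1) % numBuckets;
        }
        return -1;
    }

    // Hypothetical helpers for updating and reading the aggregation buffer.
    void add(int row, long delta) { valueColumn[row] += delta; }
    long get(int row) { return valueColumn[row]; }
}
```

With the defaults from the diff (capacity `1 << 16`, load factor `0.25`), the bucket array has 262144 entries, four times the row capacity, which keeps probe sequences short at the cost of memory.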



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-207267713
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12161#discussion_r59043603
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregateHashMap.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.aggregate
    +
    +import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
    +import org.apache.spark.sql.types.StructType
    +
    +class TungstenAggregateHashMap(
    +    ctx: CodegenContext,
    +    generatedClassName: String,
    +    groupingKeySchema: StructType,
    +    bufferSchema: StructType) {
    +  val groupingKeys = groupingKeySchema.map(key => (key.dataType.typeName, ctx.freshName("key")))
    +  val bufferValues = bufferSchema.map(key => (ctx.freshName("value"), key.dataType.typeName))
    +  val groupingKeySignature = groupingKeys.map(_.productIterator.toList.mkString(" ")).mkString(", ")
    +
    +  def generate(): String = {
    +    s"""
    +       |public class $generatedClassName {
    +       |${initializeAggregateHashMap()}
    +       |
    +       |${generateFindOrInsert()}
    +       |
    +       |${generateEquals()}
    +       |
    +       |${generateHashFunction()}
    +       |}
    +     """.stripMargin
    +  }
    +
    +  def initializeAggregateHashMap(): String = {
    +    val generatedSchema: String =
    --- End diff --
    
    That was my initial thought too, but this generated class only works for one schema because of the specialized equals/hash/find signatures. It's not particularly useful to pass in a schema if only one works.
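
To make the point about specialized signatures concrete, the following sketch mirrors, in plain Java, how the `groupingKeySignature` and `generateHashFunction` templates in the diff turn a list of (type, name) pairs into a method whose parameter list is fixed by the schema. The class name `SignatureSketch` and the key names `key0` and `key1` are hypothetical; the names actually produced by `ctx.freshName("key")` are not shown in the thread.

```java
// Illustrative only: re-creates the string templating from the diff without Spark.
final class SignatureSketch {
    // Joins (javaType, name) pairs into a parameter list, e.g. "long key0, long key1".
    static String signature(String[][] keys) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < keys.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(keys[i][0]).append(' ').append(keys[i][1]);
        }
        return sb.toString();
    }

    // Renders the hash method the way the template does: key names joined with " & ".
    static String hashMethod(String[][] keys) {
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < keys.length; i++) {
            if (i > 0) {
                body.append(" & ");
            }
            body.append(keys[i][1]);
        }
        return "private long hash(" + signature(keys) + ") {\n  return " + body + ";\n}";
    }

    public static void main(String[] args) {
        String[][] keys = { {"long", "key0"}, {"long", "key1"} };
        System.out.println(hashMethod(keys));
    }
}
```

For two long keys, the rendered method takes exactly `(long key0, long key1)`, so a map compiled from this text cannot serve a schema with a different number or type of grouping keys.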



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-207235417
  
    test this please



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-207236603
  
    **[Test build #55327 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55327/consoleFull)** for PR 12161 at commit [`eb8a020`](https://github.com/apache/spark/commit/eb8a020abb5521bd71fa5d683d8bb3d5857a0287).



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-207511433
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55354/
    Test FAILed.



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-206026230
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55036/
    Test FAILed.



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-207498660
  
    **[Test build #55354 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55354/consoleFull)** for PR 12161 at commit [`ec74328`](https://github.com/apache/spark/commit/ec74328ab73766481d3aa7e566fe592bbde747eb).



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-206024191
  
    **[Test build #55034 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55034/consoleFull)** for PR 12161 at commit [`a31be48`](https://github.com/apache/spark/commit/a31be487e9f369c0eb30c4e22df85765867a3478).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TungstenAggregateHashMap(`
      * `       |public class $generatedClassName `



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-207537480
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55371/
    Test FAILed.



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-207159772
  
    **[Test build #55272 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55272/consoleFull)** for PR 12161 at commit [`071a900`](https://github.com/apache/spark/commit/071a90066eab9f672b561e7db0cab577bb9c38fe).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-207321981
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-207134229
  
    I think the old version makes more sense. The generated code only works for a particular schema, so there is no reason to pass it in.



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-207234721
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55318/
    Test FAILed.



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-206275251
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55104/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-207321884
  
    **[Test build #55336 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55336/consoleFull)** for PR 12161 at commit [`eb8a020`](https://github.com/apache/spark/commit/eb8a020abb5521bd71fa5d683d8bb3d5857a0287).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-207146943
  
    LGTM



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12161#discussion_r58978809
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregateHashMap.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.aggregate
    +
    +import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
    +import org.apache.spark.sql.types.StructType
    +
    +class TungstenAggregateHashMap(
    +    ctx: CodegenContext,
    +    generatedClassName: String,
    +    groupingKeySchema: StructType,
    +    bufferSchema: StructType) {
    +  val groupingKeys = groupingKeySchema.map(key => (key.dataType.typeName, ctx.freshName("key")))
    +  val bufferValues = bufferSchema.map(key => (ctx.freshName("value"), key.dataType.typeName))
    +  val groupingKeySignature = groupingKeys.map(_.productIterator.toList.mkString(" ")).mkString(", ")
    +
    +  def generate(): String = {
    +    s"""
    +       |public class $generatedClassName {
    +       |${initializeAggregateHashMap()}
    +       |
    +       |${generateFindOrInsert()}
    +       |
    +       |${generateEquals()}
    +       |
    +       |${generateHashFunction()}
    +       |}
    +     """.stripMargin
    +  }
    +
    +  def initializeAggregateHashMap(): String = {
    +    val generatedSchema: String =
    --- End diff --
    
    @nongli How come you asked him to revert to the generated schema? It looks pretty weird to generate code to create the schema when it is already available outside codegen.



[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12161#discussion_r58788712
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregateHashMap.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.aggregate
    +
    +import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
    +import org.apache.spark.sql.types.StructType
    +
    +class TungstenAggregateHashMap(
    +    ctx: CodegenContext,
    +    generatedClassName: String,
    +    groupingKeySchema: StructType,
    +    bufferSchema: StructType) {
    +  val groupingKeys = groupingKeySchema.map(key => (key.dataType.typeName, ctx.freshName("key")))
    +  val bufferValues = bufferSchema.map(key => (ctx.freshName("value"), key.dataType.typeName))
    +  val groupingKeySignature = groupingKeys.map(_.productIterator.toList.mkString(" ")).mkString(", ")
    +
    +  def generate(): String = {
    +    s"""
    +       |public class $generatedClassName {
    +       |${initializeAggregateHashMap()}
    +       |
    +       |${generateFindOrInsert()}
    +       |
    +       |${generateEquals()}
    +       |
    +       |${generateHashFunction()}
    +       |}
    +     """.stripMargin
    +  }
    +
    +  def initializeAggregateHashMap(): String = {
    +    val generatedSchema: String =
    +      s"""
    +         |new org.apache.spark.sql.types.StructType()
    +         |${(groupingKeySchema ++ bufferSchema).map(key =>
    +            s""".add("${key.name}", org.apache.spark.sql.types.DataTypes.${key.dataType})""")
    +            .mkString("\n")};
    +       """.stripMargin
    +
    +    s"""
    +       |  private org.apache.spark.sql.execution.vectorized.ColumnarBatch batch;
    +       |  private int[] buckets;
    +       |  private int numBuckets;
    +       |  private int maxSteps;
    +       |  private int numRows = 0;
    +       |  private org.apache.spark.sql.types.StructType schema = $generatedSchema
    +       |
    +       |  public $generatedClassName(int capacity, double loadFactor, int maxSteps) {
    +       |    assert (capacity > 0 && ((capacity & (capacity - 1)) == 0));
    +       |    this.maxSteps = maxSteps;
    +       |    numBuckets = (int) (capacity / loadFactor);
    +       |    batch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(schema,
    +       |      org.apache.spark.memory.MemoryMode.ON_HEAP, capacity);
    +       |    buckets = new int[numBuckets];
    +       |    java.util.Arrays.fill(buckets, -1);
    +       |  }
    +       |
    +       |  public $generatedClassName() {
    +       |    new $generatedClassName(1 << 16, 0.25, 5);
    +       |  }
    +     """.stripMargin
    +  }
    +
    +  def generateHashFunction(): String = {
    +    s"""
    +       |// TODO: Improve this Hash Function
    +       |private long hash($groupingKeySignature) {
    +       |  return ${groupingKeys.map(_._2).mkString(" & ")};
    +       |}
    +     """.stripMargin
    +  }
    +
    +  def generateEquals(): String = {
    +    s"""
    +       |private boolean equals(int idx, $groupingKeySignature) {
    +       |  return ${groupingKeys.zipWithIndex.map(key =>
    +            s"batch.column(${key._2}).getLong(buckets[idx]) == ${key._1._2}").mkString(" && ")};
    +       |}
    +     """.stripMargin
    +  }
    +
    +  def generateFindOrInsert(): String = {
    +    s"""
    +       |public org.apache.spark.sql.execution.vectorized.ColumnarBatch.Row findOrInsert(${
    +          groupingKeySignature}) {
    +       |  int idx = find(${groupingKeys.map(_._2).mkString(", ")});
    +       |  if (idx != -1 && buckets[idx] == -1) {
    +       |    ${groupingKeys.zipWithIndex.map(key =>
    +              s"batch.column(${key._2}).putLong(numRows, ${key._1._2});").mkString("\n")}
    +       |    ${bufferValues.zipWithIndex.map(key =>
    +              s"batch.column(${groupingKeys.length + key._2}).putLong(numRows, 0);")
    +              .mkString("\n")}
    +       |    buckets[idx] = numRows++;
    +       |  }
    +       |  return batch.getRow(buckets[idx]);
    +       |}
    +       |
    +       |private int find($groupingKeySignature) {
    --- End diff --
    
    Let's simplify this. The generated code only needs findOrInsert() and doesn't need find.
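
As a rough illustration of this suggestion, the lookup and the insert can be folded into one probing loop. The sketch below is hand-written Java, not the generator's output. It assumes a single long grouping key, a single long buffer value, and plain arrays in place of ColumnarBatch. The class name FindOrInsertSketch, the helper methods, and the -1 "no slot" return value are all hypothetical.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the generated class: one long key, one long buffer value. */
class FindOrInsertSketch {
  private final long[] keys;    // stands in for batch.column(0)
  private final long[] values;  // stands in for batch.column(1)
  private final int[] buckets;  // bucket -> row index, -1 when the bucket is empty
  private final int maxSteps;   // upper bound on linear-probing steps
  private int numRows = 0;

  FindOrInsertSketch(int capacity, double loadFactor, int maxSteps) {
    this.keys = new long[capacity];
    this.values = new long[capacity];
    this.buckets = new int[(int) (capacity / loadFactor)];
    this.maxSteps = maxSteps;
    Arrays.fill(buckets, -1);
  }

  /** Returns the row index for key, appending a zeroed row if absent; -1 if no slot is found. */
  int findOrInsert(long key) {
    int idx = (int) Math.floorMod(hash(key), (long) buckets.length);
    for (int step = 0; step < maxSteps; step++) {
      if (buckets[idx] == -1) {
        if (numRows == keys.length) {
          return -1; // row storage is full (assumed policy: caller falls back)
        }
        keys[numRows] = key;
        values[numRows] = 0L;
        buckets[idx] = numRows++;
        return buckets[idx];
      }
      if (keys[buckets[idx]] == key) {
        return buckets[idx]; // key already present
      }
      idx = (idx + 1) % buckets.length; // linear probing
    }
    return -1; // gave up after maxSteps probes
  }

  // Placeholder hash, mirroring the TODO in the diff; not a recommendation.
  private long hash(long key) {
    return key;
  }

  long getValue(int row) { return values[row]; }
  void addToValue(int row, long delta) { values[row] += delta; }
  int numRows() { return numRows; }
}
```

A caller would use the returned row index to update the aggregation buffer in place and would treat -1 as a signal to fall back to another aggregation path; that fallback policy is an assumption here, not something stated in the diff.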


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-207534882
  
    test this please


[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-207511426
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-206062012
  
    **[Test build #55068 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55068/consoleFull)** for PR 12161 at commit [`bd96657`](https://github.com/apache/spark/commit/bd96657854cc643547ddabd8efee6f645ee7a7ff).


[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-207321991
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55336/
    Test FAILed.


[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-207267628
  
    **[Test build #55327 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55327/consoleFull)** for PR 12161 at commit [`eb8a020`](https://github.com/apache/spark/commit/eb8a020abb5521bd71fa5d683d8bb3d5857a0287).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12161#issuecomment-207579368
  
    Merged build finished. Test PASSed.


[GitHub] spark pull request: [SPARK-14394][SQL] Generate AggregateHashMap c...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12161#discussion_r58978492
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregateHashMap.scala ---
    @@ -0,0 +1,125 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.aggregate
    +
    +import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
    +import org.apache.spark.sql.types.StructType
    +
    +class TungstenAggregateHashMap(
    --- End diff --
    
    we should document how this thing works in the classdoc (i.e. explain the physical layout).
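
One way to picture the physical layout implied by the diff: an int array of buckets, initialized to -1, where each occupied bucket holds a row index into column-major storage whose first columns are the grouping keys and whose remaining columns are the aggregation buffer values. The sketch below models that with plain Java arrays instead of ColumnarBatch and assumes every column is a long. The class name LayoutSketch and its append method are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical model of the layout: bucket array of row indices over column-major storage. */
class LayoutSketch {
  final int[] buckets;     // bucket -> row index into the columns below, -1 when empty
  final long[][] columns;  // columns[c][row]; key columns first, then buffer columns
  final int numKeys;
  int numRows = 0;

  LayoutSketch(int capacity, double loadFactor, int numKeys, int numBufferValues) {
    this.numKeys = numKeys;
    this.columns = new long[numKeys + numBufferValues][capacity];
    // As in the diff, the bucket array is larger than the row capacity.
    this.buckets = new int[(int) (capacity / loadFactor)];
    Arrays.fill(buckets, -1);
  }

  /** Appends a row for the given keys at the next free row and points the bucket at it. */
  int append(int bucket, long... keys) {
    for (int c = 0; c < numKeys; c++) {
      columns[c][numRows] = keys[c];   // grouping keys occupy the leading columns
    }
    for (int c = numKeys; c < columns.length; c++) {
      columns[c][numRows] = 0L;        // buffer values start at zero
    }
    buckets[bucket] = numRows;
    return numRows++;
  }
}
```

With a capacity of 4 and a load factor of 0.25 this gives 16 buckets over 4 rows, so rows are packed densely in insertion order while the sparse bucket array absorbs the hash collisions.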

