
Posted to reviews@spark.apache.org by sameeragarwal <gi...@git.apache.org> on 2016/03/30 06:51:22 UTC

[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

GitHub user sameeragarwal opened a pull request:

    https://github.com/apache/spark/pull/12055

    [SPARK-14263][SQL] Benchmark Vectorized HashMap for GroupBy Aggregates

    ## What changes were proposed in this pull request?
    
    This PR proposes a new data structure, based on a vectorized hashmap, that can potentially be _codegened_ in `TungstenAggregate` to speed up aggregates with group by. Micro-benchmarks show roughly a 9x improvement over the current `BytesToBytesMap` aggregation map (54 ms vs. 483 ms best time).
    
    ## How was this patch tested?
    
        Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
        BytesToBytesMap:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        -------------------------------------------------------------------------------------------
        hash                                      108 /  119         96.9          10.3       1.0X
        fast hash                                  63 /   70        166.2           6.0       1.7X
        arrayEqual                                 70 /   73        150.8           6.6       1.6X
        Java HashMap (Long)                       141 /  200         74.3          13.5       0.8X
        Java HashMap (two ints)                   145 /  185         72.3          13.8       0.7X
        Java HashMap (UnsafeRow)                  499 /  524         21.0          47.6       0.2X
        BytesToBytesMap (off Heap)                483 /  548         21.7          46.0       0.2X
        BytesToBytesMap (on Heap)                 485 /  562         21.6          46.2       0.2X
        Vectorized Hashmap                         54 /   60        193.7           5.2       2.0X


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark vectorized-hashmap

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12055.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12055
    
----
commit 7fd7d705347cb3f45f07d8db2e0188cb534f220d
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-03-28T18:21:13Z

    Vectorized HashMap

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203676418
  
    **[Test build #54563 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54563/consoleFull)** for PR 12055 at commit [`f11c12f`](https://github.com/apache/spark/commit/f11c12f946fc13afcafc99c850d4a3063f032429).



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203248157
  
    **[Test build #54493 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54493/consoleFull)** for PR 12055 at commit [`7fd7d70`](https://github.com/apache/spark/commit/7fd7d705347cb3f45f07d8db2e0188cb534f220d).
     * This patch **fails RAT tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class VectorizedHashMap `



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203699659
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54563/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203673714
  
    LGTM



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12055#discussion_r57844416
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/VectorizedHashMap.java ---
    @@ -0,0 +1,79 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.vectorized;
    +
    +import java.util.Arrays;
    +
    +import org.apache.spark.memory.MemoryMode;
    +import org.apache.spark.sql.types.StructType;
    +
    +import static org.apache.spark.sql.types.DataTypes.LongType;
    +
    +/**
    + * This is an illustrative implementation of a single-key/single value vectorized hash map that can
    + * be potentially 'codegened' in TungstenAggregate to speed up aggregate w/ key
    + */
    +public class VectorizedHashMap {
    +  public ColumnarBatch batch;
    +  public int[] buckets;
    +  private int numBuckets;
    +  private int numRows = 0;
    +  private int maxSteps = 3;
    +
    +  public VectorizedHashMap(int capacity, double loadFactor, int maxSteps) {
    +    StructType schema = new StructType()
    +        .add("key", LongType)
    +        .add("value", LongType);
    +    this.maxSteps = maxSteps;
    +    numBuckets = capacity;
    +    batch = ColumnarBatch.allocate(schema, MemoryMode.ON_HEAP, (int) (numBuckets * loadFactor));
    +    buckets = new int[numBuckets];
    +    Arrays.fill(buckets, -1);
    +  }
    +
    +  public int findOrInsert(long key) {
    +    int idx = find(key);
    +    if (idx != -1 && buckets[idx] == -1) {
    +      batch.column(0).putLong(numRows, key);
    +      batch.column(1).putLong(numRows, 0);
    +      buckets[idx] = numRows++;
    +    }
    +    return idx;
    +  }
    +
    +  public int find(long key) {
    +    long h = hash(key);
    +    int step = 0;
    +    int idx = (int) h & (numBuckets - 1);
    +    while (step < maxSteps) {
    +      if ((buckets[idx] == -1) || (buckets[idx] != -1 && equals(idx, key))) return idx;
    --- End diff --
    
    I don't think you need the check for -1 twice.
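The simplification suggested here follows from short-circuit evaluation: once `buckets[idx] == -1` is false, the second `buckets[idx] != -1` test is always true. Below is a minimal sketch of the probe loop with the redundant test dropped. It is not the code from the PR: the class name `ProbeSketch`, the plain `long[] keys` array standing in for `ColumnarBatch`, the placeholder hash (low bits of the key), and the tail of the loop (which the quoted diff cuts off) are all assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical stand-in for the class under review; plain arrays replace ColumnarBatch.
final class ProbeSketch {
  private final int[] buckets;   // bucket -> row index, or -1 when the bucket is empty
  private final long[] keys;     // row index -> key
  private final int numBuckets;  // must be a power of two for the mask below
  private final int maxSteps;    // maximum number of buckets probed per lookup
  private int numRows = 0;

  ProbeSketch(int numBuckets, int maxSteps) {
    if (Integer.bitCount(numBuckets) != 1) {
      throw new IllegalArgumentException("numBuckets must be a power of two");
    }
    this.numBuckets = numBuckets;
    this.maxSteps = maxSteps;
    this.buckets = new int[numBuckets];
    this.keys = new long[numBuckets];
    Arrays.fill(buckets, -1);
  }

  // Returns the bucket that holds `key`, or the first empty bucket on its probe
  // path, or -1 if neither is found within maxSteps (assumed loop tail).
  int find(long key) {
    int idx = (int) key & (numBuckets - 1);  // placeholder hash: low bits of the key
    for (int step = 0; step < maxSteps; step++) {
      // One emptiness check is enough: the right-hand side only runs when the
      // bucket is occupied, so buckets[idx] is a valid row index there.
      if (buckets[idx] == -1 || keys[buckets[idx]] == key) return idx;
      idx = (idx + 1) & (numBuckets - 1);    // linear probing with wrap-around
    }
    return -1;
  }

  // Same shape as the findOrInsert in the diff: claim the bucket if it was empty.
  int findOrInsert(long key) {
    int idx = find(key);
    if (idx != -1 && buckets[idx] == -1) {
      keys[numRows] = key;
      buckets[idx] = numRows++;
    }
    return idx;
  }
}
```

With 8 buckets and `maxSteps = 3`, keys 1, 9 and 17 all hash to bucket 1 and occupy buckets 1, 2 and 3; a fourth colliding key (25) exhausts its three probes and gets -1, which is the case where a caller would have to fall back to another map.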



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203272766
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12055#discussion_r57978007
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/AggregateHashMap.java ---
    @@ -0,0 +1,107 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.vectorized;
    +
    +import java.util.Arrays;
    +
    +import org.apache.spark.memory.MemoryMode;
    +import org.apache.spark.sql.types.StructType;
    +
    +import static org.apache.spark.sql.types.DataTypes.LongType;
    +
    +/**
    + * This is an illustrative implementation of an append-only single-key/single value aggregate hash
    + * map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates
    + * (and fall back to the `BytesToBytesMap` if a given key isn't found). This can be potentially
    + * 'codegened' in TungstenAggregate to speed up aggregates w/ key.
    + *
    + * It is backed by a power-of-2-sized array for index lookups and a columnar batch that stores the
    + * key-value pairs. The index lookups in the array rely on linear probing (with a small number of
    + * maximum tries) and use an inexpensive hash function which makes it really efficient for a
    + * majority of lookups. However, using linear probing and an inexpensive hash function also makes it
    + * less robust as compared to the `BytesToBytesMap` (especially for a large number of keys or even
    + * for certain distribution of keys) and requires us to fall back on the latter for correctness.
    + */
    +public class AggregateHashMap {
    +  public ColumnarBatch batch;
    +  public int[] buckets;
    +
    +  private int numBuckets;
    +  private int numRows = 0;
    +  private int maxSteps = 3;
    +
    +  private static int DEFAULT_NUM_BUCKETS = 65536 * 4;
    --- End diff --
    
    By `capacity` I meant `numBuckets` rather than the capacity of the batch, but yes, the latter makes more sense.
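The Javadoc quoted above describes an append-only map that acts as a fast cache and defers to `BytesToBytesMap` when a key cannot be placed. A rough sketch of that division of labour follows, under several assumptions: `CachedAggregateSketch` and its methods are hypothetical names, plain arrays replace `ColumnarBatch`, `java.util.HashMap` stands in for `BytesToBytesMap`, the hash is just the low bits of the key, and the aggregate is a running sum. The row capacity (`maxRows`) is kept separate from `numBuckets`, matching the distinction drawn in the comment.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: fast append-only map in front of a general-purpose fallback map.
final class CachedAggregateSketch {
  private final int[] buckets;   // bucket -> row index, or -1 when empty
  private final long[] keys;     // row index -> key (stand-in for the batch's key column)
  private final long[] values;   // row index -> running sum (stand-in for the value column)
  private final int mask;        // numBuckets - 1, valid because numBuckets is a power of two
  private final int maxSteps;
  private int numRows = 0;
  private final Map<Long, Long> fallback = new HashMap<>();  // stand-in for BytesToBytesMap

  CachedAggregateSketch(int numBuckets, int maxSteps, int maxRows) {
    if (Integer.bitCount(numBuckets) != 1) {
      throw new IllegalArgumentException("numBuckets must be a power of two");
    }
    this.buckets = new int[numBuckets];
    this.keys = new long[maxRows];
    this.values = new long[maxRows];
    this.mask = numBuckets - 1;
    this.maxSteps = maxSteps;
    Arrays.fill(buckets, -1);
  }

  // Returns the row for `key`, appending a new row if possible; -1 means "use the fallback".
  private int findOrInsertRow(long key) {
    int idx = (int) key & mask;
    for (int step = 0; step < maxSteps; step++) {
      int row = buckets[idx];
      if (row == -1) {
        if (numRows == keys.length) return -1;  // no room left for another row
        keys[numRows] = key;
        values[numRows] = 0L;
        buckets[idx] = numRows;
        return numRows++;
      }
      if (keys[row] == key) return row;
      idx = (idx + 1) & mask;
    }
    return -1;  // probe limit reached
  }

  void add(long key, long delta) {
    int row = findOrInsertRow(key);
    if (row >= 0) {
      values[row] += delta;
    } else {
      fallback.merge(key, delta, Long::sum);
    }
  }

  long get(long key) {
    int idx = (int) key & mask;
    for (int step = 0; step < maxSteps; step++) {
      int row = buckets[idx];
      if (row == -1) break;                     // key was never placed in the fast map
      if (keys[row] == key) return values[row];
      idx = (idx + 1) & mask;
    }
    return fallback.getOrDefault(key, 0L);
  }

  int fallbackSize() {
    return fallback.size();
  }
}
```

Because buckets are never emptied, a key that misses the fast map once will miss it on every later call as well, so each key lives in exactly one of the two maps and the sums stay consistent.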



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203247691
  
    cc @nongli 



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12055#discussion_r57844531
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/VectorizedHashMap.java ---
    @@ -0,0 +1,79 @@
    +package org.apache.spark.sql.execution.vectorized;
    +
    +import java.util.Arrays;
    +
    +import org.apache.spark.memory.MemoryMode;
    +import org.apache.spark.sql.types.StructType;
    +
    +import static org.apache.spark.sql.types.DataTypes.LongType;
    +
    +/**
    + * This is an illustrative implementation of a single-key/single value vectorized hash map that can
    + * be potentially 'codegened' in TungstenAggregate to speed up aggregate w/ key
    + */
    +public class VectorizedHashMap {
    +  public ColumnarBatch batch;
    --- End diff --
    
    Do these need to be public?



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203693531
  
    **[Test build #54562 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54562/consoleFull)** for PR 12055 at commit [`69a0302`](https://github.com/apache/spark/commit/69a0302294fff37945d3bd5be6da5feae348446a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12055#discussion_r57844666
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/VectorizedHashMap.java ---
    @@ -0,0 +1,79 @@
    +package org.apache.spark.sql.execution.vectorized;
    +
    +import java.util.Arrays;
    +
    +import org.apache.spark.memory.MemoryMode;
    +import org.apache.spark.sql.types.StructType;
    +
    +import static org.apache.spark.sql.types.DataTypes.LongType;
    +
    +/**
    + * This is an illustrative implementation of a single-key/single value vectorized hash map that can
    + * be potentially 'codegened' in TungstenAggregate to speed up aggregate w/ key
    + */
    +public class VectorizedHashMap {
    +  public ColumnarBatch batch;
    +  public int[] buckets;
    +  private int numBuckets;
    +  private int numRows = 0;
    +  private int maxSteps = 3;
    +
    +  public VectorizedHashMap(int capacity, double loadFactor, int maxSteps) {
    --- End diff --
    
    I think this should take the schema as the parameter.
    
    `capacity` needs to be a power of 2 for the bit-mask mod (`h & (numBuckets - 1)`) to work. I'm not sure these should be exposed to the typical caller. At the very least, expose a ctor with reasonable defaults.
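To illustrate the power-of-2 point: `h & (numBuckets - 1)` equals `h mod numBuckets` only when `numBuckets` has a single bit set. The sketch below shows the mask next to one possible way to round a requested capacity up to a power of two. The class and method names (`PowerOfTwoSketch`, `roundUpToPowerOfTwo`, `maskIndex`) are hypothetical and not part of the PR.

```java
// Hypothetical helpers illustrating the power-of-two requirement.
final class PowerOfTwoSketch {
  // Smallest power of two >= n, for 1 <= n <= 2^30.
  static int roundUpToPowerOfTwo(int n) {
    if (n <= 0 || n > (1 << 30)) {
      throw new IllegalArgumentException("n must be in [1, 2^30]");
    }
    return n == 1 ? 1 : Integer.highestOneBit(n - 1) << 1;
  }

  // Same expression as in the diff: truncate the hash to int, then mask.
  static int maskIndex(long hash, int numBuckets) {
    return (int) hash & (numBuckets - 1);
  }
}
```

For `numBuckets = 8` the mask is `0b111`, so `maskIndex(13, 8)` and `Math.floorMod(13, 8)` agree. For `numBuckets = 6` the mask is `0b101`, so only buckets 0, 1, 4 and 5 are ever reachable, and a hash of 2 maps to bucket 0 rather than 2. A default constructor could call a helper like `roundUpToPowerOfTwo` so that callers never need to know about the constraint.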



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203272570
  
    **[Test build #54494 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54494/consoleFull)** for PR 12055 at commit [`20a932f`](https://github.com/apache/spark/commit/20a932fe039bc7737b38a889b739382a19aeaba9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203693939
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203272770
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54494/
    Test PASSed.



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203248160
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/12055



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203248630
  
    **[Test build #54494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54494/consoleFull)** for PR 12055 at commit [`20a932f`](https://github.com/apache/spark/commit/20a932fe039bc7737b38a889b739382a19aeaba9).



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203248162
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54493/
    Test FAILed.



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12055#discussion_r57977135
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/VectorizedHashMap.java ---
    @@ -0,0 +1,79 @@
    +package org.apache.spark.sql.execution.vectorized;
    +
    +import java.util.Arrays;
    +
    +import org.apache.spark.memory.MemoryMode;
    +import org.apache.spark.sql.types.StructType;
    +
    +import static org.apache.spark.sql.types.DataTypes.LongType;
    +
    +/**
    + * This is an illustrative implementation of a single-key/single value vectorized hash map that can
    + * be potentially 'codegened' in TungstenAggregate to speed up aggregate w/ key
    + */
    +public class VectorizedHashMap {
    +  public ColumnarBatch batch;
    --- End diff --
    
    Currently I need to access this in the benchmark to update the aggregated value.



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12055#discussion_r57844469
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/VectorizedHashMap.java ---
    @@ -0,0 +1,79 @@
    +package org.apache.spark.sql.execution.vectorized;
    +
    +import java.util.Arrays;
    +
    +import org.apache.spark.memory.MemoryMode;
    +import org.apache.spark.sql.types.StructType;
    +
    +import static org.apache.spark.sql.types.DataTypes.LongType;
    +
    +/**
    + * This is an illustrative implementation of a single-key/single value vectorized hash map that can
    + * be potentially 'codegened' in TungstenAggregate to speed up aggregate w/ key
    + */
    +public class VectorizedHashMap {
    --- End diff --
    
    I think you should add a comment describing the overall design of this data structure: where it works well and where it does not.



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203248086
  
    **[Test build #54493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54493/consoleFull)** for PR 12055 at commit [`7fd7d70`](https://github.com/apache/spark/commit/7fd7d705347cb3f45f07d8db2e0188cb534f220d).



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203699480
  
    **[Test build #54563 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54563/consoleFull)** for PR 12055 at commit [`f11c12f`](https://github.com/apache/spark/commit/f11c12f946fc13afcafc99c850d4a3063f032429).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203671857
  
    **[Test build #54562 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54562/consoleFull)** for PR 12055 at commit [`69a0302`](https://github.com/apache/spark/commit/69a0302294fff37945d3bd5be6da5feae348446a).



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203699658
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203671183
  
    Thanks, comments addressed



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12055#issuecomment-203693942
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54562/



[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12055#discussion_r57977414
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/AggregateHashMap.java ---
    @@ -0,0 +1,107 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.vectorized;
    +
    +import java.util.Arrays;
    +
    +import org.apache.spark.memory.MemoryMode;
    +import org.apache.spark.sql.types.StructType;
    +
    +import static org.apache.spark.sql.types.DataTypes.LongType;
    +
    +/**
    + * This is an illustrative implementation of an append-only single-key/single value aggregate hash
    + * map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates
    + * (and fall back to the `BytesToBytesMap` if a given key isn't found). This can be potentially
    + * 'codegened' in TungstenAggregate to speed up aggregates w/ key.
    + *
    + * It is backed by a power-of-2-sized array for index lookups and a columnar batch that stores the
    + * key-value pairs. The index lookups in the array rely on linear probing (with a small number of
    + * maximum tries) and use an inexpensive hash function which makes it really efficient for a
    + * majority of lookups. However, using linear probing and an inexpensive hash function also makes it
    + * less robust as compared to the `BytesToBytesMap` (especially for a large number of keys or even
    + * for certain distribution of keys) and requires us to fall back on the latter for correctness.
    + */
    +public class AggregateHashMap {
    +  public ColumnarBatch batch;
    +  public int[] buckets;
    +
    +  private int numBuckets;
    +  private int numRows = 0;
    +  private int maxSteps = 3;
    +
    +  private static int DEFAULT_NUM_BUCKETS = 65536 * 4;
    --- End diff --
    
    This is weird. Configure the max capacity (in the batch) and the load factor, and size numBuckets to capacity / load_factor. You have dependent constants here.
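
One way to read this suggestion, as a minimal sketch and not the PR's actual code: take the capacity and the load factor as the only independent settings and derive the bucket count from them. Everything named here is hypothetical (`SizedAggregateMap`, `capacity`, `loadFactor`, `findOrInsert`), and plain `long[]` arrays stand in for the `ColumnarBatch` so the example runs on its own. The hash function is a placeholder, not the one used in the patch. Assuming a capacity of 65536 rows, a load factor of 0.25 gives the same `65536 * 4` buckets as the constant in the diff.

```java
import java.util.Arrays;

/** Hypothetical sketch: numBuckets is derived from capacity / loadFactor, not configured separately. */
final class SizedAggregateMap {
  private final int capacity;    // max number of key-value rows (stands in for the batch capacity)
  private final int numBuckets;  // derived, rounded up to a power of 2
  private final int maxSteps;    // max linear-probing attempts before giving up
  private final int[] buckets;   // bucket -> row index, or -1 if empty
  private final long[] keys;     // stand-in for the key column of the batch
  private final long[] values;   // stand-in for the value column of the batch
  private int numRows = 0;

  SizedAggregateMap(int capacity, double loadFactor, int maxSteps) {
    if (capacity <= 0 || loadFactor <= 0.0 || loadFactor > 1.0 || maxSteps <= 0) {
      throw new IllegalArgumentException("invalid capacity, load factor, or max steps");
    }
    double wanted = Math.ceil(capacity / loadFactor);
    if (wanted > (1 << 30)) {
      throw new IllegalArgumentException("bucket array would be too large");
    }
    this.capacity = capacity;
    this.maxSteps = maxSteps;
    // Round up to the next power of 2 so that (hash & (numBuckets - 1)) is a valid index.
    this.numBuckets = Math.max(1, Integer.highestOneBit((int) wanted - 1) << 1);
    this.buckets = new int[numBuckets];
    Arrays.fill(buckets, -1);
    this.keys = new long[capacity];
    this.values = new long[capacity];
  }

  int numBuckets() {
    return numBuckets;
  }

  /**
   * Returns the row index for the key, appending a new row if the key is absent.
   * Returns -1 if the probe limit is hit or the map is full; the caller is then
   * expected to fall back to a more robust map.
   */
  int findOrInsert(long key) {
    int idx = hash(key) & (numBuckets - 1);
    for (int step = 0; step < maxSteps; step++) {
      int row = buckets[idx];
      if (row == -1) {
        if (numRows == capacity) {
          return -1;
        }
        keys[numRows] = key;
        values[numRows] = 0L;
        buckets[idx] = numRows;
        return numRows++;
      }
      if (keys[row] == key) {
        return row;
      }
      idx = (idx + 1) & (numBuckets - 1);
    }
    return -1;
  }

  /** Adds delta to the aggregated value stored at the given row. */
  void add(int row, long delta) {
    values[row] += delta;
  }

  long get(int row) {
    return values[row];
  }

  // Placeholder hash: folds the high bits of the key into the low bits.
  private static int hash(long key) {
    return (int) (key ^ (key >>> 32));
  }
}
```

With this shape there is a single place to tune memory against collision rate, and the bucket array can never be sized inconsistently with the number of rows the batch can hold.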

