Posted to reviews@spark.apache.org by sameeragarwal <gi...@git.apache.org> on 2016/03/30 06:51:22 UTC
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
GitHub user sameeragarwal opened a pull request:
https://github.com/apache/spark/pull/12055
[SPARK-14263][SQL] Benchmark Vectorized HashMap for GroupBy Aggregates
## What changes were proposed in this pull request?
This PR proposes a new data structure based on a vectorized hashmap that can potentially be _codegened_ in `TungstenAggregate` to speed up aggregates with group by. Micro-benchmarks show a 10x improvement over the current `BytesToBytesMap` aggregation map (2.0X versus 0.2X relative to the `hash` baseline in the results below).
## How was this patch tested?
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
BytesToBytesMap:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
hash                                    108 /  119         96.9          10.3       1.0X
fast hash                                63 /   70        166.2           6.0       1.7X
arrayEqual                               70 /   73        150.8           6.6       1.6X
Java HashMap (Long)                     141 /  200         74.3          13.5       0.8X
Java HashMap (two ints)                 145 /  185         72.3          13.8       0.7X
Java HashMap (UnsafeRow)                499 /  524         21.0          47.6       0.2X
BytesToBytesMap (off Heap)              483 /  548         21.7          46.0       0.2X
BytesToBytesMap (on Heap)               485 /  562         21.6          46.2       0.2X
Vectorized Hashmap                       54 /   60        193.7           5.2       2.0X
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sameeragarwal/spark vectorized-hashmap
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12055.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12055
----
commit 7fd7d705347cb3f45f07d8db2e0188cb534f220d
Author: Sameer Agarwal <sa...@databricks.com>
Date: 2016-03-28T18:21:13Z
Vectorized HashMap
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203676418
**[Test build #54563 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54563/consoleFull)** for PR 12055 at commit [`f11c12f`](https://github.com/apache/spark/commit/f11c12f946fc13afcafc99c850d4a3063f032429).
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203248157
**[Test build #54493 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54493/consoleFull)** for PR 12055 at commit [`7fd7d70`](https://github.com/apache/spark/commit/7fd7d705347cb3f45f07d8db2e0188cb534f220d).
* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `public class VectorizedHashMap `
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203699659
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54563/
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203673714
LGTM
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:
https://github.com/apache/spark/pull/12055#discussion_r57844416
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/VectorizedHashMap.java ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.vectorized;
+
+import java.util.Arrays;
+
+import org.apache.spark.memory.MemoryMode;
+import org.apache.spark.sql.types.StructType;
+
+import static org.apache.spark.sql.types.DataTypes.LongType;
+
+/**
+ * This is an illustrative implementation of a single-key/single value vectorized hash map that can
+ * be potentially 'codegened' in TungstenAggregate to speed up aggregate w/ key
+ */
+public class VectorizedHashMap {
+ public ColumnarBatch batch;
+ public int[] buckets;
+ private int numBuckets;
+ private int numRows = 0;
+ private int maxSteps = 3;
+
+ public VectorizedHashMap(int capacity, double loadFactor, int maxSteps) {
+ StructType schema = new StructType()
+ .add("key", LongType)
+ .add("value", LongType);
+ this.maxSteps = maxSteps;
+ numBuckets = capacity;
+ batch = ColumnarBatch.allocate(schema, MemoryMode.ON_HEAP, (int) (numBuckets * loadFactor));
+ buckets = new int[numBuckets];
+ Arrays.fill(buckets, -1);
+ }
+
+ public int findOrInsert(long key) {
+ int idx = find(key);
+ if (idx != -1 && buckets[idx] == -1) {
+ batch.column(0).putLong(numRows, key);
+ batch.column(1).putLong(numRows, 0);
+ buckets[idx] = numRows++;
+ }
+ return idx;
+ }
+
+ public int find(long key) {
+ long h = hash(key);
+ int step = 0;
+ int idx = (int) h & (numBuckets - 1);
+ while (step < maxSteps) {
+ if ((buckets[idx] == -1) || (buckets[idx] != -1 && equals(idx, key))) return idx;
--- End diff --
I don't think you need the check for -1 twice.
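To illustrate the point, here is a minimal sketch, assuming plain arrays as stand-ins for the map's state. Only `buckets` comes from the patch; `keys`, `keyEquals`, and the class name are hypothetical. Because `||` short-circuits, the key comparison is only reached when the bucket is occupied, so the second `!= -1` test is redundant:

```java
class ProbeCheck {
    // Hypothetical state: buckets[i] holds a row index, or -1 when the bucket is empty.
    static int[] buckets = {-1, 0, 1, -1};
    // Hypothetical stand-in for the key column of the batch.
    static long[] keys = {42L, 7L};

    // Hypothetical equivalent of equals(idx, key); only safe when buckets[idx] != -1.
    static boolean keyEquals(int idx, long key) {
        return keys[buckets[idx]] == key;
    }

    // The condition as written in the diff.
    static boolean original(int idx, long key) {
        return (buckets[idx] == -1) || (buckets[idx] != -1 && keyEquals(idx, key));
    }

    // The same condition with the redundant check dropped. The right-hand side
    // is evaluated only when buckets[idx] != -1, so keyEquals stays in bounds.
    static boolean simplified(int idx, long key) {
        return buckets[idx] == -1 || keyEquals(idx, key);
    }

    public static void main(String[] args) {
        long[] probes = {42L, 7L, 99L};
        for (int idx = 0; idx < buckets.length; idx++) {
            for (long key : probes) {
                if (original(idx, key) != simplified(idx, key)) {
                    throw new AssertionError("mismatch at idx=" + idx + ", key=" + key);
                }
            }
        }
        System.out.println("conditions agree");
    }
}
```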
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203272766
Merged build finished. Test PASSed.
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:
https://github.com/apache/spark/pull/12055#discussion_r57978007
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/AggregateHashMap.java ---
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.vectorized;
+
+import java.util.Arrays;
+
+import org.apache.spark.memory.MemoryMode;
+import org.apache.spark.sql.types.StructType;
+
+import static org.apache.spark.sql.types.DataTypes.LongType;
+
+/**
+ * This is an illustrative implementation of an append-only single-key/single value aggregate hash
+ * map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates
+ * (and fall back to the `BytesToBytesMap` if a given key isn't found). This can be potentially
+ * 'codegened' in TungstenAggregate to speed up aggregates w/ key.
+ *
+ * It is backed by a power-of-2-sized array for index lookups and a columnar batch that stores the
+ * key-value pairs. The index lookups in the array rely on linear probing (with a small number of
+ * maximum tries) and use an inexpensive hash function which makes it really efficient for a
+ * majority of lookups. However, using linear probing and an inexpensive hash function also makes it
+ * less robust as compared to the `BytesToBytesMap` (especially for a large number of keys or even
+ * for certain distribution of keys) and requires us to fall back on the latter for correctness.
+ */
+public class AggregateHashMap {
+ public ColumnarBatch batch;
+ public int[] buckets;
+
+ private int numBuckets;
+ private int numRows = 0;
+ private int maxSteps = 3;
+
+ private static int DEFAULT_NUM_BUCKETS = 65536 * 4;
--- End diff --
By `capacity` I meant `numBuckets` rather than the capacity of the batch, but yes, the latter makes more sense.
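For readers following the thread, the design described in the class comment can be sketched roughly as below. This is an illustration under stated assumptions, not the code in the patch: plain `long[]` arrays stand in for the `ColumnarBatch`, a `java.util.HashMap` stands in for the `BytesToBytesMap` fallback, and the hash function plus all class and method names are made up for the example. It keeps `numBuckets` and the batch `capacity` as separate parameters, which is the distinction discussed above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class AggregateCacheSketch {
    private final int[] buckets;   // bucket -> row index, or -1 when empty
    private final long[] keys;     // stand-in for the batch's key column
    private final long[] values;   // stand-in for the batch's value column
    private final int maxSteps;    // maximum number of linear-probing tries
    private int numRows = 0;
    // Stand-in for the BytesToBytesMap that handles keys the cache cannot hold.
    private final Map<Long, Long> fallback = new HashMap<>();

    AggregateCacheSketch(int numBuckets, int capacity, int maxSteps) {
        if (numBuckets <= 0 || Integer.bitCount(numBuckets) != 1) {
            throw new IllegalArgumentException("numBuckets must be a power of 2");
        }
        this.buckets = new int[numBuckets];
        Arrays.fill(buckets, -1);
        this.keys = new long[capacity];
        this.values = new long[capacity];
        this.maxSteps = maxSteps;
    }

    // Assumed cheap hash; the patch's actual hash function is not shown in this thread.
    private static long hash(long key) {
        return key ^ (key >>> 32);
    }

    // Probes at most maxSteps buckets. Returns the row holding key, appending a new
    // row when an empty bucket is found and insert is true. Returns -1 when the
    // caller has to fall back (probe limit reached, batch full, or key absent).
    private int probe(long key, boolean insert) {
        int mask = buckets.length - 1;
        int idx = (int) hash(key) & mask;
        for (int step = 0; step < maxSteps; step++) {
            int row = buckets[idx];
            if (row == -1) {
                if (!insert || numRows == keys.length) return -1;
                keys[numRows] = key;
                values[numRows] = 0;
                buckets[idx] = numRows;
                return numRows++;
            }
            if (keys[row] == key) return row;
            idx = (idx + 1) & mask;
        }
        return -1;
    }

    // Adds delta to the running sum for key, using the fallback map if needed.
    void add(long key, long delta) {
        int row = probe(key, true);
        if (row >= 0) values[row] += delta;
        else fallback.merge(key, delta, Long::sum);
    }

    long get(long key) {
        int row = probe(key, false);
        return row >= 0 ? values[row] : fallback.getOrDefault(key, 0L);
    }

    int fallbackSize() {
        return fallback.size();
    }

    public static void main(String[] args) {
        AggregateCacheSketch map = new AggregateCacheSketch(4, 2, 2);
        map.add(1, 10);
        map.add(1, 5);
        map.add(5, 7);  // collides with key 1, lands in the next bucket
        map.add(9, 3);  // exceeds maxSteps, goes to the fallback map
        System.out.println(map.get(1) + " " + map.get(5) + " " + map.get(9));
    }
}
```

Since the structure is append-only, a key that misses the cache once keeps missing it, so each key lives in exactly one of the two maps.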
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203247691
cc @nongli
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:
https://github.com/apache/spark/pull/12055#discussion_r57844531
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/VectorizedHashMap.java ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.vectorized;
+
+import java.util.Arrays;
+
+import org.apache.spark.memory.MemoryMode;
+import org.apache.spark.sql.types.StructType;
+
+import static org.apache.spark.sql.types.DataTypes.LongType;
+
+/**
+ * This is an illustrative implementation of a single-key/single value vectorized hash map that can
+ * be potentially 'codegened' in TungstenAggregate to speed up aggregate w/ key
+ */
+public class VectorizedHashMap {
+ public ColumnarBatch batch;
--- End diff --
Do these need to be public?
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203693531
**[Test build #54562 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54562/consoleFull)** for PR 12055 at commit [`69a0302`](https://github.com/apache/spark/commit/69a0302294fff37945d3bd5be6da5feae348446a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:
https://github.com/apache/spark/pull/12055#discussion_r57844666
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/VectorizedHashMap.java ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.vectorized;
+
+import java.util.Arrays;
+
+import org.apache.spark.memory.MemoryMode;
+import org.apache.spark.sql.types.StructType;
+
+import static org.apache.spark.sql.types.DataTypes.LongType;
+
+/**
+ * This is an illustrative implementation of a single-key/single value vectorized hash map that can
+ * be potentially 'codegened' in TungstenAggregate to speed up aggregate w/ key
+ */
+public class VectorizedHashMap {
+ public ColumnarBatch batch;
+ public int[] buckets;
+ private int numBuckets;
+ private int numRows = 0;
+ private int maxSteps = 3;
+
+ public VectorizedHashMap(int capacity, double loadFactor, int maxSteps) {
--- End diff --
I think this should take the schema as a parameter.
`capacity` needs to be a power of 2 for the mod to work. I'm not sure these parameters should be exposed to the typical caller. At the very least, expose a constructor with reasonable defaults.
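A rough sketch of what that could look like, leaving the schema parameter aside so the example needs no Spark classes. The class name `BucketConfig` and the default load factor are assumptions made up for illustration; the other two defaults reuse values that appear in the diffs in this thread (`65536 * 4` buckets, `maxSteps = 3`). The check matters because `h & (n - 1)` equals `h mod n` only when `n` is a power of 2:

```java
class BucketConfig {
    static final int DEFAULT_NUM_BUCKETS = 65536 * 4;
    static final double DEFAULT_LOAD_FACTOR = 0.25; // assumed value, not from the patch
    static final int DEFAULT_MAX_STEPS = 3;

    final int numBuckets;
    final int maxRows;
    final int maxSteps;

    // Constructor with defaults, so typical callers never pick the tuning knobs.
    BucketConfig() {
        this(DEFAULT_NUM_BUCKETS, DEFAULT_LOAD_FACTOR, DEFAULT_MAX_STEPS);
    }

    BucketConfig(int capacity, double loadFactor, int maxSteps) {
        // A positive int is a power of 2 exactly when it has a single bit set.
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of 2: " + capacity);
        }
        if (loadFactor <= 0.0 || loadFactor > 1.0) {
            throw new IllegalArgumentException("loadFactor must be in (0, 1]: " + loadFactor);
        }
        if (maxSteps <= 0) {
            throw new IllegalArgumentException("maxSteps must be positive: " + maxSteps);
        }
        this.numBuckets = capacity;
        this.maxRows = (int) (capacity * loadFactor);
        this.maxSteps = maxSteps;
    }

    // Same masking expression as in the diff: the cast applies to the hash first.
    int bucketFor(long hash) {
        return (int) hash & (numBuckets - 1);
    }

    public static void main(String[] args) {
        BucketConfig config = new BucketConfig(8, 0.5, 3);
        // With a power-of-2 size, masking agrees with floorMod even for negative hashes.
        for (int h = -20; h <= 20; h++) {
            if (config.bucketFor(h) != Math.floorMod(h, 8)) {
                throw new AssertionError("mask and mod disagree for " + h);
            }
        }
        try {
            new BucketConfig(6, 0.5, 3);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```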
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203272570
**[Test build #54494 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54494/consoleFull)** for PR 12055 at commit [`20a932f`](https://github.com/apache/spark/commit/20a932fe039bc7737b38a889b739382a19aeaba9).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203693939
Merged build finished. Test PASSed.
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203272770
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54494/
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203248160
Merged build finished. Test FAILed.
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/12055
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203248630
**[Test build #54494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54494/consoleFull)** for PR 12055 at commit [`20a932f`](https://github.com/apache/spark/commit/20a932fe039bc7737b38a889b739382a19aeaba9).
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203248162
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54493/
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:
https://github.com/apache/spark/pull/12055#discussion_r57977135
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/VectorizedHashMap.java ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.vectorized;
+
+import java.util.Arrays;
+
+import org.apache.spark.memory.MemoryMode;
+import org.apache.spark.sql.types.StructType;
+
+import static org.apache.spark.sql.types.DataTypes.LongType;
+
+/**
+ * This is an illustrative implementation of a single-key/single value vectorized hash map that can
+ * be potentially 'codegened' in TungstenAggregate to speed up aggregate w/ key
+ */
+public class VectorizedHashMap {
+ public ColumnarBatch batch;
--- End diff --
Currently I need to access this in the benchmark to update the aggregated value.
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:
https://github.com/apache/spark/pull/12055#discussion_r57844469
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/VectorizedHashMap.java ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.vectorized;
+
+import java.util.Arrays;
+
+import org.apache.spark.memory.MemoryMode;
+import org.apache.spark.sql.types.StructType;
+
+import static org.apache.spark.sql.types.DataTypes.LongType;
+
+/**
+ * This is an illustrative implementation of a single-key/single value vectorized hash map that can
+ * be potentially 'codegened' in TungstenAggregate to speed up aggregate w/ key
+ */
+public class VectorizedHashMap {
--- End diff --
I think you should add a comment describing the overall design of this data structure, including where it works well and where it does not.
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203248086
**[Test build #54493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54493/consoleFull)** for PR 12055 at commit [`7fd7d70`](https://github.com/apache/spark/commit/7fd7d705347cb3f45f07d8db2e0188cb534f220d).
---
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203699480
**[Test build #54563 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54563/consoleFull)** for PR 12055 at commit [`f11c12f`](https://github.com/apache/spark/commit/f11c12f946fc13afcafc99c850d4a3063f032429).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203671857
**[Test build #54562 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54562/consoleFull)** for PR 12055 at commit [`69a0302`](https://github.com/apache/spark/commit/69a0302294fff37945d3bd5be6da5feae348446a).
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203699658
Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203671183
Thanks, comments addressed
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/12055#issuecomment-203693942
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54562/
[GitHub] spark pull request: [SPARK-14263][SQL] Benchmark Vectorized HashMa...
Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:
https://github.com/apache/spark/pull/12055#discussion_r57977414
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/AggregateHashMap.java ---
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.vectorized;
+
+import java.util.Arrays;
+
+import org.apache.spark.memory.MemoryMode;
+import org.apache.spark.sql.types.StructType;
+
+import static org.apache.spark.sql.types.DataTypes.LongType;
+
+/**
+ * This is an illustrative implementation of an append-only single-key/single value aggregate hash
+ * map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates
+ * (and fall back to the `BytesToBytesMap` if a given key isn't found). This can be potentially
+ * 'codegened' in TungstenAggregate to speed up aggregates w/ key.
+ *
+ * It is backed by a power-of-2-sized array for index lookups and a columnar batch that stores the
+ * key-value pairs. The index lookups in the array rely on linear probing (with a small number of
+ * maximum tries) and use an inexpensive hash function which makes it really efficient for a
+ * majority of lookups. However, using linear probing and an inexpensive hash function also makes it
+ * less robust as compared to the `BytesToBytesMap` (especially for a large number of keys or even
+ * for certain distribution of keys) and requires us to fall back on the latter for correctness.
+ */
+public class AggregateHashMap {
+ public ColumnarBatch batch;
+ public int[] buckets;
+
+ private int numBuckets;
+ private int numRows = 0;
+ private int maxSteps = 3;
+
+ private static int DEFAULT_NUM_BUCKETS = 65536 * 4;
--- End diff --
This is weird. Configure the max capacity (in the batch) and the load factor, and size `numBuckets` to capacity / load_factor. As written, you have dependent constants here.
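A minimal sketch of what this suggestion could look like, combined with the bounded linear probing described in the class comment. It is not the code in the PR: the class name `SizedAggregateMap`, its constructor parameters, and the use of plain `long[]` arrays in place of a `ColumnarBatch` are all assumptions made to keep the example self-contained. The only point is that `numBuckets` is derived from the capacity and the load factor rather than declared as a separate constant.

```java
import java.util.Arrays;

// Hypothetical illustration only; names and storage layout are assumptions,
// not the AggregateHashMap from the diff above.
final class SizedAggregateMap {
  private final int capacity;    // max number of distinct keys the map will hold
  private final int numBuckets;  // derived: capacity / loadFactor, rounded up to a power of 2
  private final int maxSteps;    // upper bound on linear-probing steps per lookup
  private final int[] buckets;   // bucket -> row index, or -1 if the bucket is empty
  private final long[] keys;     // stand-in for the key column of the batch
  private final long[] values;   // stand-in for the value column of the batch
  private int numRows = 0;

  SizedAggregateMap(int capacity, double loadFactor, int maxSteps) {
    if (capacity <= 0 || maxSteps <= 0 || !(loadFactor > 0.0 && loadFactor <= 1.0)) {
      throw new IllegalArgumentException("invalid capacity, load factor, or max steps");
    }
    long wanted = (long) Math.ceil(capacity / loadFactor);
    if (wanted > (1L << 30)) {
      throw new IllegalArgumentException("too many buckets requested");
    }
    this.capacity = capacity;
    this.maxSteps = maxSteps;
    this.numBuckets = nextPowerOfTwo((int) wanted);
    this.buckets = new int[numBuckets];
    Arrays.fill(buckets, -1);
    this.keys = new long[capacity];
    this.values = new long[capacity];
  }

  // Smallest power of two that is >= n, for 1 <= n <= 2^30.
  static int nextPowerOfTwo(int n) {
    int highest = Integer.highestOneBit(n);
    return highest == n ? n : highest << 1;
  }

  int numBuckets() {
    return numBuckets;
  }

  // Returns the row index for the key, inserting it if absent. Returns -1 when
  // the key cannot be placed (probing exhausted or capacity reached), which is
  // the signal for the caller to fall back to a more robust map.
  int findOrInsert(long key) {
    int mask = numBuckets - 1;
    int idx = hash(key) & mask;
    for (int step = 0; step < maxSteps; step++) {
      int row = buckets[idx];
      if (row == -1) {
        if (numRows == capacity) {
          return -1;
        }
        keys[numRows] = key;
        values[numRows] = 0L;
        buckets[idx] = numRows;
        return numRows++;
      }
      if (keys[row] == key) {
        return row;
      }
      idx = (idx + 1) & mask;
    }
    return -1;
  }

  void add(int row, long delta) {
    values[row] += delta;
  }

  long get(int row) {
    return values[row];
  }

  // Deliberately cheap hash: fold the upper 32 bits into the lower 32 bits.
  private static int hash(long key) {
    return (int) (key ^ (key >>> 32));
  }
}
```

With a capacity of 65536 and a load factor of 0.25, this derives the same 65536 * 4 buckets that the diff hard-codes, so the two values can no longer drift apart. A caller would treat a `-1` from `findOrInsert` as the cue to route that key to the fallback map.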