You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2016/01/16 10:21:54 UTC
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/10786
[SPARK-10735] [SQL] Generate aggregation w/o grouping keys [WIP]
PR is based on #10735
The benchmark show that generated aggregation could be 6X time faster than non-generated (with generated filter/range).
It also showed that declarative function is much faster than imperative one (5X faster for stddev), we may need to switch to a declarative version. (the difference need to be measured when there are grouping keys).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/davies/spark gen_agg2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10786.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10786
----
commit 2da493f06d8b9b9c0e5cf641a84fb35d5a3a7ad8
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-11T23:01:05Z
whole stage codegen
commit 998b6a150bdc2e76cea973eddedaf6ab98161637
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-12T21:30:01Z
support range
commit 218412fbacf37a9c9621c277c082f45a25a9fc5e
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-12T22:11:20Z
add benchmark
commit 158cb36bd74ba11ec7dd87adaa5f1c40e8931da8
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-12T22:15:47Z
remove println and test
commit 88c51a6aa09cb4b7abc0ec6d7609d9c823a07460
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-12T22:37:20Z
fix bug
commit b76c00e2b75a869e7ea3d6a7fda28b4bdfa85be1
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-12T23:38:38Z
fix tests
commit 4e2ee8889c18de75f29243ced2b8b52b1f0a5b1b
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-13T00:44:59Z
clean the interface
commit f05524c2f9aba8c422f571ba6e6a7d8b02a661d8
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-13T01:07:08Z
update test
commit 43139a84e637598953f0c757ba2ea62f4ad4bfe8
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-13T23:13:30Z
fix style, improve comments
commit 73fe07499faed9fdb33192337b0996a299d8e6bf
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-13T23:32:48Z
fix style
commit 32309e14e39af9f418faa3d837cebe64c3c87b84
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-14T06:45:09Z
fix test
commit 38029bc5eff03cd4a7334bbe7c16b543f7247030
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-15T07:29:00Z
WholeStageCodegen/FakeInput as adapter
commit 1f9ddb8538ed5c2017e6310815eef32f065ca0cb
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-15T07:54:26Z
address comments
commit 2a3f0e32a81898da1399ad1d0b98ec913dbc5668
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-15T08:00:08Z
Merge branch 'master' of github.com:apache/spark into whole2
Conflicts:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala
commit c1dd60a21e1a016b5479cf89cdef035ac53167f0
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-15T08:05:27Z
fix style
commit 16ce50c4458b7ab4bd4c28e88d6f7f600bbeebf7
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-15T16:28:44Z
Merge branch 'master' of github.com:apache/spark into whole2
Conflicts:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
commit d4530812c29ea15c92a07cc9aae3db05a9d8e06a
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-15T16:39:20Z
fix bug
commit 34a0a6f15c48935bc79afd7b08e5553552690adc
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-15T17:39:51Z
fix test
commit 908c8cb22596e0ae662d388014624eb289c753d5
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-15T18:11:15Z
address comments
commit 1feab20decb487ed66e712323565bb4d58f7637f
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-15T19:24:58Z
use sparkPlan for checking
commit c9741ea856ecccdcb2dc349ee576080b32b50e19
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-15T21:52:41Z
create a rule for whole stage codegen
commit 7c0570335da47847c3462306a930abc91ee1f932
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-15T22:02:07Z
renaming
commit 0b401067347f57d0f7b92c999bb2bcf7e3e95d3d
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-15T22:39:05Z
fix style
commit d12a8da88e37344bc429fb4f67c0258a28b209c4
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-16T03:21:00Z
Merge branch 'master' of github.com:apache/spark into whole2
Conflicts:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodegenFallback.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateMutableProjection.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratePredicate.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateSafeProjection.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/GenerateColumnAccessor.scala
commit 3df7e5dfad96d48eb47d6f81d85457ae6d3a4156
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-16T06:19:25Z
generate aggregation
commit b43f26257885a912a45850cd5fad1936a2dbfe1b
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-16T06:37:56Z
support impremitive function
commit 640a4b57a687e16ad02c22c4fe787a540ff890a8
Author: Davies Liu <da...@databricks.com>
Date: 2016-01-16T09:13:29Z
support imperative functions
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172294221
**[Test build #49540 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49540/consoleFull)** for PR 10786 at commit [`10f6bb9`](https://github.com/apache/spark/commit/10f6bb947073cd73c0038db86ba23a09cfb2f188).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172299073
Merged build finished. Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172250474
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49537/
Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172300223
**[Test build #49548 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49548/consoleFull)** for PR 10786 at commit [`c8ed5d9`](https://github.com/apache/spark/commit/c8ed5d9884a0b37587c1d6552dc31d56dad7cac5).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172299063
**[Test build #49546 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49546/consoleFull)** for PR 10786 at commit [`4c655dd`](https://github.com/apache/spark/commit/4c655dd45ec513c49bda8b827cac10407609a890).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172300233
Merged build finished. Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172293649
**[Test build #49540 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49540/consoleFull)** for PR 10786 at commit [`10f6bb9`](https://github.com/apache/spark/commit/10f6bb947073cd73c0038db86ba23a09cfb2f188).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172300234
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49548/
Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-174729286
This PR will be replaced by multiple small PRs.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172299253
Merged build finished. Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172299254
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49547/
Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172294229
Merged build finished. Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/10786#discussion_r50157642
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/BenchmarkWholeStageCodegen.scala ---
@@ -46,15 +47,64 @@ class BenchmarkWholeStageCodegen extends SparkFunSuite {
/*
Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
- Single Int Column Scan: Avg Time(ms) Avg Rate(M/s) Relative Rate
- -------------------------------------------------------------------------
- Without whole stage codegen 6725.52 31.18 1.00 X
- With whole stage codegen 2233.05 93.91 3.01 X
+ Single Int Column Scan: Avg Time(ms) Avg Rate(M/s) Relative Rate
+ -------------------------------------------------------------------------------
+ Without whole stage codegen 6585.36 31.85 1.00 X
+ With whole stage codegen 343.80 609.99 19.15 X
+ */
+ benchmark.run()
+ }
+
+ def testImperitaveAggregation(values: Int): Unit = {
+
+ val benchmark = new Benchmark("aggregation", values)
+
+ benchmark.addCase("ImpAgg w/o whole stage codegen") { iter =>
+ sqlContext.setConf("spark.sql.codegen.wholeStage", "false")
+ sqlContext.range(values).groupBy().agg("id" -> "stddev").collect()
+ }
+
+ benchmark.addCase("DeclAgg w/o whole stage codegen") { iter =>
+ sqlContext.setConf("spark.sql.codegen.wholeStage", "false")
+ sqlContext.range(values).groupBy().agg("id" -> "stddev1").collect()
+ }
+
+ benchmark.addCase("ImpAgg w whole stage codegen") { iter =>
+ sqlContext.setConf("spark.sql.codegen.wholeStage", "true")
+ sqlContext.range(values).groupBy().agg("id" -> "stddev").collect()
+ }
+
+ benchmark.addCase("DeclAgg w whole stage codegen") { iter =>
+ sqlContext.setConf("spark.sql.codegen.wholeStage", "true")
+ sqlContext.range(values).groupBy().agg("id" -> "stddev1").collect()
+ }
+
+ /*
+ Before optimizing CentralMomentAgg and generated mutable projection:
+
+ Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
+ aggregation: Avg Time(ms) Avg Rate(M/s) Relative Rate
+ -------------------------------------------------------------------------------
+ ImpAgg w/o whole stage codegen 9047.35 11.59 1.00 X
+ DeclAgg w/o whole stage codegen 6507.27 16.11 1.39 X
+ ImpAgg w whole stage codegen 6947.30 15.09 1.30 X
+ DeclAgg w whole stage codegen 1376.74 76.16 6.57 X
+
+ After optimization:
+
+ Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
+ aggregation: Avg Time(ms) Avg Rate(M/s) Relative Rate
+ -------------------------------------------------------------------------------
+ ImpAgg w/o whole stage codegen 6159.03 17.03 1.00 X
+ DeclAgg w/o whole stage codegen 5248.69 19.98 1.17 X
+ ImpAgg w whole stage codegen 4202.30 24.95 1.47 X
+ DeclAgg w whole stage codegen 1367.34 76.69 4.50 X
--- End diff --
Glad to see 5x speed-up! +1 on switching to declarative on `stddev`. But for `skewness` and `kurtosis`, we still need to benchmark the performance to decide because their expressions are more complex.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/10786#discussion_r50157926
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/BenchmarkWholeStageCodegen.scala ---
@@ -46,15 +47,64 @@ class BenchmarkWholeStageCodegen extends SparkFunSuite {
/*
Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
- Single Int Column Scan: Avg Time(ms) Avg Rate(M/s) Relative Rate
- -------------------------------------------------------------------------
- Without whole stage codegen 6725.52 31.18 1.00 X
- With whole stage codegen 2233.05 93.91 3.01 X
+ Single Int Column Scan: Avg Time(ms) Avg Rate(M/s) Relative Rate
+ -------------------------------------------------------------------------------
+ Without whole stage codegen 6585.36 31.85 1.00 X
+ With whole stage codegen 343.80 609.99 19.15 X
+ */
+ benchmark.run()
+ }
+
+ def testImperitaveAggregation(values: Int): Unit = {
+
+ val benchmark = new Benchmark("aggregation", values)
+
+ benchmark.addCase("ImpAgg w/o whole stage codegen") { iter =>
+ sqlContext.setConf("spark.sql.codegen.wholeStage", "false")
+ sqlContext.range(values).groupBy().agg("id" -> "stddev").collect()
+ }
+
+ benchmark.addCase("DeclAgg w/o whole stage codegen") { iter =>
+ sqlContext.setConf("spark.sql.codegen.wholeStage", "false")
+ sqlContext.range(values).groupBy().agg("id" -> "stddev1").collect()
+ }
+
+ benchmark.addCase("ImpAgg w whole stage codegen") { iter =>
+ sqlContext.setConf("spark.sql.codegen.wholeStage", "true")
+ sqlContext.range(values).groupBy().agg("id" -> "stddev").collect()
+ }
+
+ benchmark.addCase("DeclAgg w whole stage codegen") { iter =>
+ sqlContext.setConf("spark.sql.codegen.wholeStage", "true")
+ sqlContext.range(values).groupBy().agg("id" -> "stddev1").collect()
+ }
+
+ /*
+ Before optimizing CentralMomentAgg and generated mutable projection:
+
+ Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
+ aggregation: Avg Time(ms) Avg Rate(M/s) Relative Rate
+ -------------------------------------------------------------------------------
+ ImpAgg w/o whole stage codegen 9047.35 11.59 1.00 X
+ DeclAgg w/o whole stage codegen 6507.27 16.11 1.39 X
+ ImpAgg w whole stage codegen 6947.30 15.09 1.30 X
+ DeclAgg w whole stage codegen 1376.74 76.16 6.57 X
+
+ After optimization:
+
+ Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
+ aggregation: Avg Time(ms) Avg Rate(M/s) Relative Rate
+ -------------------------------------------------------------------------------
+ ImpAgg w/o whole stage codegen 6159.03 17.03 1.00 X
+ DeclAgg w/o whole stage codegen 5248.69 19.98 1.17 X
+ ImpAgg w whole stage codegen 4202.30 24.95 1.47 X
+ DeclAgg w whole stage codegen 1367.34 76.69 4.50 X
--- End diff --
I've always preferred declarative aggregate because its much easier to optimize (very cool the kind of speed ups you are getting!). As such, I'd support having all of our built in functions done this way.
@mengxr argues that its too confusing for users and that we should also support the imperative one. How high cost is this for us?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172250463
**[Test build #49537 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49537/consoleFull)** for PR 10786 at commit [`6911a05`](https://github.com/apache/spark/commit/6911a05e951d6452995936a50b5949e40595e7a0).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172177058
**[Test build #49526 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49526/consoleFull)** for PR 10786 at commit [`640a4b5`](https://github.com/apache/spark/commit/640a4b57a687e16ad02c22c4fe787a540ff890a8).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `abstract class StddevAgg(child: Expression) extends DeclarativeAggregate `
* `case class StddevPop1(child: Expression) extends StddevAgg(child) `
* `case class StddevSamp1(child: Expression) extends StddevAgg(child) `
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172298524
**[Test build #49546 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49546/consoleFull)** for PR 10786 at commit [`4c655dd`](https://github.com/apache/spark/commit/4c655dd45ec513c49bda8b827cac10407609a890).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172294230
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49540/
Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172248892
**[Test build #49537 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49537/consoleFull)** for PR 10786 at commit [`6911a05`](https://github.com/apache/spark/commit/6911a05e951d6452995936a50b5949e40595e7a0).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172177061
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49526/
Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172177060
Merged build finished. Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172299479
**[Test build #49548 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49548/consoleFull)** for PR 10786 at commit [`c8ed5d9`](https://github.com/apache/spark/commit/c8ed5d9884a0b37587c1d6552dc31d56dad7cac5).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172299074
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49546/
Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by davies <gi...@git.apache.org>.
Github user davies closed the pull request at:
https://github.com/apache/spark/pull/10786
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172250473
Merged build finished. Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10786#issuecomment-172176868
**[Test build #49526 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49526/consoleFull)** for PR 10786 at commit [`640a4b5`](https://github.com/apache/spark/commit/640a4b57a687e16ad02c22c4fe787a540ff890a8).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/10786#discussion_r50043531
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Stddev.scala ---
@@ -79,3 +81,116 @@ case class StddevPop(
}
}
}
+
+// Compute standard deviation based on online algorithm specified here:
+// http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
+abstract class StddevAgg(child: Expression) extends DeclarativeAggregate {
--- End diff --
This is included for benchmark, could be used to replace the imperative one (should be done in follow up PR or this one?)
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/10786#discussion_r50038294
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/BenchmarkWholeStageCodegen.scala ---
@@ -46,15 +47,64 @@ class BenchmarkWholeStageCodegen extends SparkFunSuite {
/*
Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
- Single Int Column Scan: Avg Time(ms) Avg Rate(M/s) Relative Rate
- -------------------------------------------------------------------------
- Without whole stage codegen 6725.52 31.18 1.00 X
- With whole stage codegen 2233.05 93.91 3.01 X
+ Single Int Column Scan: Avg Time(ms) Avg Rate(M/s) Relative Rate
+ -------------------------------------------------------------------------------
+ Without whole stage codegen 6585.36 31.85 1.00 X
+ With whole stage codegen 343.80 609.99 19.15 X
+ */
+ benchmark.run()
+ }
+
+ def testImperitaveAggregation(values: Int): Unit = {
+
+ val benchmark = new Benchmark("aggregation", values)
+
+ benchmark.addCase("ImpAgg w/o whole stage codegen") { iter =>
+ sqlContext.setConf("spark.sql.codegen.wholeStage", "false")
+ sqlContext.range(values).groupBy().agg("id" -> "stddev").collect()
+ }
+
+ benchmark.addCase("DeclAgg w/o whole stage codegen") { iter =>
+ sqlContext.setConf("spark.sql.codegen.wholeStage", "false")
+ sqlContext.range(values).groupBy().agg("id" -> "stddev1").collect()
+ }
+
+ benchmark.addCase("ImpAgg w whole stage codegen") { iter =>
+ sqlContext.setConf("spark.sql.codegen.wholeStage", "true")
+ sqlContext.range(values).groupBy().agg("id" -> "stddev").collect()
+ }
+
+ benchmark.addCase("DeclAgg w whole stage codegen") { iter =>
+ sqlContext.setConf("spark.sql.codegen.wholeStage", "true")
+ sqlContext.range(values).groupBy().agg("id" -> "stddev1").collect()
+ }
+
+ /*
+ Before optimizing CentralMomentAgg and generated mutable projection:
+
+ Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
+ aggregation: Avg Time(ms) Avg Rate(M/s) Relative Rate
+ -------------------------------------------------------------------------------
+ ImpAgg w/o whole stage codegen 9047.35 11.59 1.00 X
+ DeclAgg w/o whole stage codegen 6507.27 16.11 1.39 X
+ ImpAgg w whole stage codegen 6947.30 15.09 1.30 X
+ DeclAgg w whole stage codegen 1376.74 76.16 6.57 X
+
+ After optimization:
+
+ Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
+ aggregation: Avg Time(ms) Avg Rate(M/s) Relative Rate
+ -------------------------------------------------------------------------------
+ ImpAgg w/o whole stage codegen 6159.03 17.03 1.00 X
+ DeclAgg w/o whole stage codegen 5248.69 19.98 1.17 X
+ ImpAgg w whole stage codegen 4202.30 24.95 1.47 X
+ DeclAgg w whole stage codegen 1367.34 76.69 4.50 X
--- End diff --
Based on this benchmark, Declarative function is faster than imperative function, both in whole stage codegen or not. Should we switch to implement all builtin aggregate functions as declarative? cc @mengxr @rxin @marmbrus @rxin .
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org