Posted to reviews@spark.apache.org by peter-toth <gi...@git.apache.org> on 2018/10/23 08:50:07 UTC
[GitHub] spark pull request #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggre...
GitHub user peter-toth opened a pull request:
https://github.com/apache/spark/pull/22804
[SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExecBenchmark to…
## What changes were proposed in this pull request?
Refactor ObjectHashAggregateExecBenchmark to use main method
## How was this patch tested?
Manually tested:
```
bin/spark-submit --class org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark --jars sql/catalyst/target/spark-catalyst_2.11-3.0.0-SNAPSHOT-tests.jar,core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar,sql/hive/target/spark-hive_2.11-3.0.0-SNAPSHOT.jar --packages org.spark-project.hive:hive-exec:1.2.1.spark2 sql/hive/target/spark-hive_2.11-3.0.0-SNAPSHOT-tests.jar
```
Generated results with:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "hive/test:runMain org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark"
```
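For readers unfamiliar with the main-method style, the sketch below shows one way an environment variable such as `SPARK_GENERATE_BENCHMARK_FILES` could redirect benchmark output to a results file. It is a minimal illustration under stated assumptions, not Spark's actual `BenchmarkBase`: the names `MainMethodBenchmark`, `ExampleBenchmark`, `out`, and `resultFile` are hypothetical, and the file layout only mirrors the `benchmarks/<name>-results.txt` path quoted in this thread.

```scala
import java.io.{File, FileOutputStream, PrintStream}

// Hypothetical stand-in for a main-method benchmark base; NOT Spark's BenchmarkBase.
abstract class MainMethodBenchmark {
  // Results go to stdout by default, or to a file when regeneration is requested.
  protected var out: PrintStream = System.out

  // Subclasses register and run their benchmark cases here.
  def runBenchmarkSuite(): Unit

  // Assumed layout: benchmarks/<object name>-results.txt
  def resultFile: File = {
    val name = getClass.getSimpleName.replace("$", "")
    new File(s"benchmarks/$name-results.txt")
  }

  def main(args: Array[String]): Unit = {
    // Only the exact value "1" turns on file generation in this sketch.
    val regenerate = sys.env.get("SPARK_GENERATE_BENCHMARK_FILES").contains("1")
    if (regenerate) {
      Option(resultFile.getParentFile).foreach(_.mkdirs())
      out = new PrintStream(new FileOutputStream(resultFile))
    }
    try runBenchmarkSuite()
    finally if (regenerate) out.close()
  }
}

// Because this is an object with an inherited main method, it can be launched
// directly (for example via runMain) rather than through a test framework.
object ExampleBenchmark extends MainMethodBenchmark {
  override def runBenchmarkSuite(): Unit = {
    val start = System.nanoTime()
    val sum = (1L to 1000000L).sum
    val elapsedMs = (System.nanoTime() - start) / 1e6
    out.println(f"sum of 1..1000000 = $sum, took $elapsedMs%.2f ms")
  }
}
```

Keeping the output stream in one place means the same suite code serves both the interactive run and the regenerate-results run.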
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/peter-toth/spark SPARK-25665
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22804.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22804
----
commit cf2bb2c0bea88110d0b20347177bafa4f129499c
Author: Peter Toth <pe...@...>
Date: 2018-10-14T14:19:52Z
[SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExecBenchmark to use main method
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22804
**[Test build #97940 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97940/testReport)** for PR 22804 at commit [`2ed884b`](https://github.com/apache/spark/commit/2ed884b7c61e677e95da06b9d3376ed719afd862).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggre...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/22804
---
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22804
**[Test build #97972 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97972/testReport)** for PR 22804 at commit [`37b40ae`](https://github.com/apache/spark/commit/37b40aeec3e697af28d4d84fcf04570e8e03f329).
---
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22804
**[Test build #97940 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97940/testReport)** for PR 22804 at commit [`2ed884b`](https://github.com/apache/spark/commit/2ed884b7c61e677e95da06b9d3376ed719afd862).
---
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22804
Merged build finished. Test PASSed.
---
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22804
Merged build finished. Test PASSed.
---
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22804
Can one of the admins verify this patch?
---
[GitHub] spark pull request #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggre...
Posted by peter-toth <gi...@git.apache.org>.
Github user peter-toth commented on a diff in the pull request:
https://github.com/apache/spark/pull/22804#discussion_r227470048
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/execution/benchmark/ObjectHashAggregateExecBenchmark.scala ---
@@ -21,207 +21,212 @@ import scala.concurrent.duration._
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox
-import org.apache.spark.benchmark.Benchmark
-import org.apache.spark.sql.Column
-import org.apache.spark.sql.catalyst.FunctionIdentifier
-import org.apache.spark.sql.catalyst.catalog.CatalogFunction
+import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}
+import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
-import org.apache.spark.sql.hive.HiveSessionCatalog
+import org.apache.spark.sql.catalyst.plans.SQLHelper
import org.apache.spark.sql.hive.execution.TestingTypedCount
-import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.hive.test.TestHive
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.LongType
-class ObjectHashAggregateExecBenchmark extends BenchmarkWithCodegen with TestHiveSingleton {
- ignore("Hive UDAF vs Spark AF") {
- val N = 2 << 15
-
- val benchmark = new Benchmark(
- name = "hive udaf vs spark af",
- valuesPerIteration = N,
- minNumIters = 5,
- warmupTime = 5.seconds,
- minTime = 10.seconds,
- outputPerIteration = true
- )
-
- registerHiveFunction("hive_percentile_approx", classOf[GenericUDAFPercentileApprox])
-
- sparkSession.range(N).createOrReplaceTempView("t")
-
- benchmark.addCase("hive udaf w/o group by") { _ =>
- sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
- sparkSession.sql("SELECT hive_percentile_approx(id, 0.5) FROM t").collect()
- }
-
- benchmark.addCase("spark af w/o group by") { _ =>
- sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
- sparkSession.sql("SELECT percentile_approx(id, 0.5) FROM t").collect()
- }
-
- benchmark.addCase("hive udaf w/ group by") { _ =>
- sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
- sparkSession.sql(
- s"SELECT hive_percentile_approx(id, 0.5) FROM t GROUP BY CAST(id / ${N / 4} AS BIGINT)"
- ).collect()
- }
-
- benchmark.addCase("spark af w/ group by w/o fallback") { _ =>
- sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
- sparkSession.sql(
- s"SELECT percentile_approx(id, 0.5) FROM t GROUP BY CAST(id / ${N / 4} AS BIGINT)"
- ).collect()
- }
-
- benchmark.addCase("spark af w/ group by w/ fallback") { _ =>
- sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
- sparkSession.conf.set(SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key, "2")
- sparkSession.sql(
- s"SELECT percentile_approx(id, 0.5) FROM t GROUP BY CAST(id / ${N / 4} AS BIGINT)"
- ).collect()
- }
-
- benchmark.run()
-
- /*
- Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
- Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
-
- hive udaf vs spark af: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- ------------------------------------------------------------------------------------------------
- hive udaf w/o group by 5326 / 5408 0.0 81264.2 1.0X
- spark af w/o group by 93 / 111 0.7 1415.6 57.4X
- hive udaf w/ group by 3804 / 3946 0.0 58050.1 1.4X
- spark af w/ group by w/o fallback 71 / 90 0.9 1085.7 74.8X
- spark af w/ group by w/ fallback 98 / 111 0.7 1501.6 54.1X
- */
- }
-
- ignore("ObjectHashAggregateExec vs SortAggregateExec - typed_count") {
- val N: Long = 1024 * 1024 * 100
-
- val benchmark = new Benchmark(
- name = "object agg v.s. sort agg",
- valuesPerIteration = N,
- minNumIters = 1,
- warmupTime = 10.seconds,
- minTime = 45.seconds,
- outputPerIteration = true
- )
-
- import sparkSession.implicits._
-
- def typed_count(column: Column): Column =
- Column(TestingTypedCount(column.expr).toAggregateExpression())
-
- val df = sparkSession.range(N)
-
- benchmark.addCase("sort agg w/ group by") { _ =>
- sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
- df.groupBy($"id" < (N / 2)).agg(typed_count($"id")).collect()
- }
-
- benchmark.addCase("object agg w/ group by w/o fallback") { _ =>
- sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
- df.groupBy($"id" < (N / 2)).agg(typed_count($"id")).collect()
- }
-
- benchmark.addCase("object agg w/ group by w/ fallback") { _ =>
- sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
- sparkSession.conf.set(SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key, "2")
- df.groupBy($"id" < (N / 2)).agg(typed_count($"id")).collect()
- }
-
- benchmark.addCase("sort agg w/o group by") { _ =>
- sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
- df.select(typed_count($"id")).collect()
- }
-
- benchmark.addCase("object agg w/o group by w/o fallback") { _ =>
- sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
- df.select(typed_count($"id")).collect()
- }
-
- benchmark.run()
-
- /*
- Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
- Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
-
- object agg v.s. sort agg: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
- ------------------------------------------------------------------------------------------------
- sort agg w/ group by 31251 / 31908 3.4 298.0 1.0X
- object agg w/ group by w/o fallback 6903 / 7141 15.2 65.8 4.5X
- object agg w/ group by w/ fallback 20945 / 21613 5.0 199.7 1.5X
- sort agg w/o group by 4734 / 5463 22.1 45.2 6.6X
- object agg w/o group by w/o fallback 4310 / 4529 24.3 41.1 7.3X
- */
- }
-
- ignore("ObjectHashAggregateExec vs SortAggregateExec - percentile_approx") {
- val N = 2 << 20
-
- val benchmark = new Benchmark(
- name = "object agg v.s. sort agg",
- valuesPerIteration = N,
- minNumIters = 5,
- warmupTime = 15.seconds,
- minTime = 45.seconds,
- outputPerIteration = true
- )
-
- import sparkSession.implicits._
-
- val df = sparkSession.range(N).coalesce(1)
-
- benchmark.addCase("sort agg w/ group by") { _ =>
- sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
- df.groupBy($"id" / (N / 4) cast LongType).agg(percentile_approx($"id", 0.5)).collect()
- }
-
- benchmark.addCase("object agg w/ group by w/o fallback") { _ =>
- sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
- df.groupBy($"id" / (N / 4) cast LongType).agg(percentile_approx($"id", 0.5)).collect()
- }
-
- benchmark.addCase("object agg w/ group by w/ fallback") { _ =>
- sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
- sparkSession.conf.set(SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key, "2")
- df.groupBy($"id" / (N / 4) cast LongType).agg(percentile_approx($"id", 0.5)).collect()
+/**
+ * Benchmark to measure read performance with Filter pushdown.
--- End diff --
Thanks @wangyum , fixed.
---
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22804
**[Test build #98012 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98012/testReport)** for PR 22804 at commit [`6849a87`](https://github.com/apache/spark/commit/6849a87a424140f15ddc308cee4a0087715f2f0f).
---
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22804
**[Test build #98012 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98012/testReport)** for PR 22804 at commit [`6849a87`](https://github.com/apache/spark/commit/6849a87a424140f15ddc308cee4a0087715f2f0f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22804
Merged build finished. Test PASSed.
---
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22804
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97940/
Test PASSed.
---
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22804
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98012/
Test PASSed.
---
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22804
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97972/
Test PASSed.
---
[GitHub] spark pull request #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggre...
Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:
https://github.com/apache/spark/pull/22804#discussion_r227330755
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/execution/benchmark/ObjectHashAggregateExecBenchmark.scala ---
@@ -21,207 +21,212 @@ import scala.concurrent.duration._
+/**
+ * Benchmark to measure read performance with Filter pushdown.
--- End diff --
read performance with Filter pushdown?
---
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by peter-toth <gi...@git.apache.org>.
Github user peter-toth commented on the issue:
https://github.com/apache/spark/pull/22804
Thanks @dongjoon-hyun , @wangyum for the review.
---
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22804
ok to test
---
[GitHub] spark pull request #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggre...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/22804#discussion_r227582826
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/execution/benchmark/ObjectHashAggregateExecBenchmark.scala ---
@@ -21,207 +21,212 @@ import scala.concurrent.duration._
+/**
+ * Benchmark to measure hash based aggregation.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class>
+ *        --jars <spark catalyst test jar>,<spark core test jar>,<spark hive jar>
+ *        --packages org.spark-project.hive:hive-exec:1.2.1.spark2
+ *        <spark hive test jar>
+ *   2. build/sbt "hive/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "hive/test:runMain <this class>"
+ *      Results will be written to "benchmarks/ObjectHashAggregateExecBenchmark-results.txt".
+ * }}}
+ */
+object ObjectHashAggregateExecBenchmark extends BenchmarkBase with SQLHelper {
+
+  val spark: SparkSession = TestHive.sparkSession
+
+  override def runBenchmarkSuite(): Unit = {
+    runBenchmark("Hive UDAF vs Spark AF") {
--- End diff --
Hi, @peter-toth . Thank you for making this PR.
Currently, `runBenchmarkSuite` is too long. Could you make a separate function for each test case? For example, the former `ignore("Hive UDAF vs Spark AF")` body can become a single function, and `runBenchmarkSuite` then calls those functions in sequence.
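The structure being requested might look roughly like the following. This is only a sketch of the shape, assuming one private function per former test: `SplitSuiteSketch`, the three function names, and the local `runBenchmark` helper are hypothetical stand-ins so the example runs without Spark, and the benchmark bodies are deliberately left empty.

```scala
import scala.collection.mutable.ArrayBuffer

object SplitSuiteSketch {
  // Records which sections ran, so the call order is easy to inspect.
  private val executed = ArrayBuffer.empty[String]

  // Hypothetical stand-in for the suite helper; it prints a header and runs the body.
  def runBenchmark(name: String)(body: => Unit): Unit = {
    println(s"== $name ==")
    body
    executed += name
  }

  // One function per former ignore(...) test. Bodies are elided here; in the
  // real file each would build a Benchmark, add its cases, and run it.
  private def hiveUDAFvsSparkAF(): Unit = ()
  private def objectAggVsSortAggTypedCount(): Unit = ()
  private def objectAggVsSortAggPercentileApprox(): Unit = ()

  // The suite entry point stays short: it only sequences the functions above.
  def runBenchmarkSuite(): Unit = {
    runBenchmark("Hive UDAF vs Spark AF") {
      hiveUDAFvsSparkAF()
    }
    runBenchmark("ObjectHashAggregateExec vs SortAggregateExec - typed_count") {
      objectAggVsSortAggTypedCount()
    }
    runBenchmark("ObjectHashAggregateExec vs SortAggregateExec - percentile_approx") {
      objectAggVsSortAggPercentileApprox()
    }
  }

  def executedNames: List[String] = executed.toList
}
```

With this layout each benchmark can be read, edited, or temporarily commented out without touching the others.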
---
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22804
Can one of the admins verify this patch?
---
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22804
**[Test build #97972 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97972/testReport)** for PR 22804 at commit [`37b40ae`](https://github.com/apache/spark/commit/37b40aeec3e697af28d4d84fcf04570e8e03f329).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by peter-toth <gi...@git.apache.org>.
Github user peter-toth commented on the issue:
https://github.com/apache/spark/pull/22804
Thanks @dongjoon-hyun for the fixes. Merged.
---
[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22804
Can one of the admins verify this patch?
---