Posted to reviews@spark.apache.org by peter-toth <gi...@git.apache.org> on 2018/10/23 08:50:07 UTC

[GitHub] spark pull request #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggre...

GitHub user peter-toth opened a pull request:

    https://github.com/apache/spark/pull/22804

    [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExecBenchmark to…

    ## What changes were proposed in this pull request?
    
    Refactor ObjectHashAggregateExecBenchmark to use a main method.
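
    As a rough illustration of the "main method" style (a self-contained sketch, not the code in this PR: `MainMethodBenchmarkSketch`, `timeCase` and `bestOf` are hypothetical names, and Spark's `Benchmark`/`BenchmarkBase` are deliberately not used), a benchmark can be a plain object that is launched directly instead of an ignored test case:
    ```scala
    // Hypothetical stand-in for a main-method benchmark; not Spark's BenchmarkBase.
    object MainMethodBenchmarkSketch {
      final case class Result(name: String, bestNs: Long)

      // Best (smallest) time among the measured iterations.
      def bestOf(timingsNs: Seq[Long]): Long = timingsNs.min

      // Runs `body` `iters` times and records the best wall-clock time.
      def timeCase(name: String, iters: Int)(body: => Any): Result = {
        require(iters > 0, "iters must be positive")
        val timings = (1 to iters).map { _ =>
          val start = System.nanoTime()
          body
          System.nanoTime() - start
        }
        Result(name, bestOf(timings))
      }

      def main(args: Array[String]): Unit = {
        val n: Long = 2 << 15 // 65536, the same N as the first case in the diff

        val results = Seq(
          timeCase("sum via foldLeft", 5) { (0L until n).foldLeft(0L)(_ + _) },
          timeCase("sum via while loop", 5) {
            var i = 0L
            var sum = 0L
            while (i < n) { sum += i; i += 1 }
            sum
          }
        )

        // Report each case relative to the first one, which acts as the baseline.
        val baselineNs = results.head.bestNs.toDouble
        results.foreach { r =>
          println(f"${r.name}%-24s ${r.bestNs}%12d ns ${baselineNs / r.bestNs}%6.1fX")
        }
      }
    }
    ```
    Because the entry point is an ordinary `main`, such an object can be started with `spark-submit --class` or sbt `runMain`, which is how the commands in the next section launch the real benchmark.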
    
    ## How was this patch tested?
    
    Manually tested:
    ```
    bin/spark-submit --class org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark --jars sql/catalyst/target/spark-catalyst_2.11-3.0.0-SNAPSHOT-tests.jar,core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar,sql/hive/target/spark-hive_2.11-3.0.0-SNAPSHOT.jar --packages org.spark-project.hive:hive-exec:1.2.1.spark2 sql/hive/target/spark-hive_2.11-3.0.0-SNAPSHOT-tests.jar
    ```
    Generated results with:
    ```
    SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "hive/test:runMain org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark"
    ```
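
    The `SPARK_GENERATE_BENCHMARK_FILES=1` switch above can be pictured roughly as follows. This is only a sketch under stated assumptions: the object name, the helper names and the `benchmarks/<name>-results.txt` path are hypothetical and are not taken from this PR, and the printed line is a placeholder rather than a real measurement.
    ```scala
    import java.io.{File, FileOutputStream, OutputStream}
    import java.nio.charset.StandardCharsets

    // Hypothetical sketch of an environment-variable switch between
    // console output and a results file; not Spark's implementation.
    object BenchmarkOutputSketch {
      // Assumption: regeneration is requested by setting the variable to "1".
      def regenerate(env: Map[String, String]): Boolean =
        env.get("SPARK_GENERATE_BENCHMARK_FILES").contains("1")

      // Assumption: one results file per benchmark; this path is illustrative only.
      def resultsFile(benchmarkName: String): File =
        new File(s"benchmarks/$benchmarkName-results.txt")

      // Returns a file stream when regeneration is requested, otherwise None.
      def openOutput(benchmarkName: String, env: Map[String, String]): Option[OutputStream] =
        if (regenerate(env)) {
          val file = resultsFile(benchmarkName)
          Option(file.getParentFile).foreach(_.mkdirs())
          Some(new FileOutputStream(file))
        } else {
          None
        }

      def main(args: Array[String]): Unit = {
        val out = openOutput("ObjectHashAggregateExecBenchmark", sys.env)
        try {
          val line = "placeholder result line\n"
          out match {
            case Some(stream) => stream.write(line.getBytes(StandardCharsets.UTF_8))
            case None => print(line)
          }
        } finally {
          out.foreach(_.close())
        }
      }
    }
    ```
    Without the variable set, the sketch prints to the console, which matches the manual `spark-submit` run; with it set to `1`, the output goes to a file instead.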
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/peter-toth/spark SPARK-25665

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22804.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22804
    
----
commit cf2bb2c0bea88110d0b20347177bafa4f129499c
Author: Peter Toth <pe...@...>
Date:   2018-10-14T14:19:52Z

    [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExecBenchmark to use main method

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22804
  
    **[Test build #97940 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97940/testReport)** for PR 22804 at commit [`2ed884b`](https://github.com/apache/spark/commit/2ed884b7c61e677e95da06b9d3376ed719afd862).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark pull request #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggre...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22804


---

[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22804
  
    **[Test build #97972 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97972/testReport)** for PR 22804 at commit [`37b40ae`](https://github.com/apache/spark/commit/37b40aeec3e697af28d4d84fcf04570e8e03f329).


---

[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22804
  
    **[Test build #97940 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97940/testReport)** for PR 22804 at commit [`2ed884b`](https://github.com/apache/spark/commit/2ed884b7c61e677e95da06b9d3376ed719afd862).


---

[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22804
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22804
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22804
  
    Can one of the admins verify this patch?


---

[GitHub] spark pull request #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggre...

Posted by peter-toth <gi...@git.apache.org>.

Github user peter-toth commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22804#discussion_r227470048
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/execution/benchmark/ObjectHashAggregateExecBenchmark.scala ---
    @@ -21,207 +21,212 @@ import scala.concurrent.duration._
     
     import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox
     
    -import org.apache.spark.benchmark.Benchmark
    -import org.apache.spark.sql.Column
    -import org.apache.spark.sql.catalyst.FunctionIdentifier
    -import org.apache.spark.sql.catalyst.catalog.CatalogFunction
    +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}
    +import org.apache.spark.sql.{Column, SparkSession}
     import org.apache.spark.sql.catalyst.expressions.Literal
     import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
    -import org.apache.spark.sql.hive.HiveSessionCatalog
    +import org.apache.spark.sql.catalyst.plans.SQLHelper
     import org.apache.spark.sql.hive.execution.TestingTypedCount
    -import org.apache.spark.sql.hive.test.TestHiveSingleton
    +import org.apache.spark.sql.hive.test.TestHive
     import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.types.LongType
     
    -class ObjectHashAggregateExecBenchmark extends BenchmarkWithCodegen with TestHiveSingleton {
    -  ignore("Hive UDAF vs Spark AF") {
    -    val N = 2 << 15
    -
    -    val benchmark = new Benchmark(
    -      name = "hive udaf vs spark af",
    -      valuesPerIteration = N,
    -      minNumIters = 5,
    -      warmupTime = 5.seconds,
    -      minTime = 10.seconds,
    -      outputPerIteration = true
    -    )
    -
    -    registerHiveFunction("hive_percentile_approx", classOf[GenericUDAFPercentileApprox])
    -
    -    sparkSession.range(N).createOrReplaceTempView("t")
    -
    -    benchmark.addCase("hive udaf w/o group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
    -      sparkSession.sql("SELECT hive_percentile_approx(id, 0.5) FROM t").collect()
    -    }
    -
    -    benchmark.addCase("spark af w/o group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      sparkSession.sql("SELECT percentile_approx(id, 0.5) FROM t").collect()
    -    }
    -
    -    benchmark.addCase("hive udaf w/ group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
    -      sparkSession.sql(
    -        s"SELECT hive_percentile_approx(id, 0.5) FROM t GROUP BY CAST(id / ${N / 4} AS BIGINT)"
    -      ).collect()
    -    }
    -
    -    benchmark.addCase("spark af w/ group by w/o fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      sparkSession.sql(
    -        s"SELECT percentile_approx(id, 0.5) FROM t GROUP BY CAST(id / ${N / 4} AS BIGINT)"
    -      ).collect()
    -    }
    -
    -    benchmark.addCase("spark af w/ group by w/ fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      sparkSession.conf.set(SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key, "2")
    -      sparkSession.sql(
    -        s"SELECT percentile_approx(id, 0.5) FROM t GROUP BY CAST(id / ${N / 4} AS BIGINT)"
    -      ).collect()
    -    }
    -
    -    benchmark.run()
    -
    -    /*
    -    Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
    -    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
    -
    -    hive udaf vs spark af:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -    ------------------------------------------------------------------------------------------------
    -    hive udaf w/o group by                        5326 / 5408          0.0       81264.2       1.0X
    -    spark af w/o group by                           93 /  111          0.7        1415.6      57.4X
    -    hive udaf w/ group by                         3804 / 3946          0.0       58050.1       1.4X
    -    spark af w/ group by w/o fallback               71 /   90          0.9        1085.7      74.8X
    -    spark af w/ group by w/ fallback                98 /  111          0.7        1501.6      54.1X
    -     */
    -  }
    -
    -  ignore("ObjectHashAggregateExec vs SortAggregateExec - typed_count") {
    -    val N: Long = 1024 * 1024 * 100
    -
    -    val benchmark = new Benchmark(
    -      name = "object agg v.s. sort agg",
    -      valuesPerIteration = N,
    -      minNumIters = 1,
    -      warmupTime = 10.seconds,
    -      minTime = 45.seconds,
    -      outputPerIteration = true
    -    )
    -
    -    import sparkSession.implicits._
    -
    -    def typed_count(column: Column): Column =
    -      Column(TestingTypedCount(column.expr).toAggregateExpression())
    -
    -    val df = sparkSession.range(N)
    -
    -    benchmark.addCase("sort agg w/ group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
    -      df.groupBy($"id" < (N / 2)).agg(typed_count($"id")).collect()
    -    }
    -
    -    benchmark.addCase("object agg w/ group by w/o fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      df.groupBy($"id" < (N / 2)).agg(typed_count($"id")).collect()
    -    }
    -
    -    benchmark.addCase("object agg w/ group by w/ fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      sparkSession.conf.set(SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key, "2")
    -      df.groupBy($"id" < (N / 2)).agg(typed_count($"id")).collect()
    -    }
    -
    -    benchmark.addCase("sort agg w/o group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
    -      df.select(typed_count($"id")).collect()
    -    }
    -
    -    benchmark.addCase("object agg w/o group by w/o fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      df.select(typed_count($"id")).collect()
    -    }
    -
    -    benchmark.run()
    -
    -    /*
    -    Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
    -    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
    -
    -    object agg v.s. sort agg:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -    ------------------------------------------------------------------------------------------------
    -    sort agg w/ group by                        31251 / 31908          3.4         298.0       1.0X
    -    object agg w/ group by w/o fallback           6903 / 7141         15.2          65.8       4.5X
    -    object agg w/ group by w/ fallback          20945 / 21613          5.0         199.7       1.5X
    -    sort agg w/o group by                         4734 / 5463         22.1          45.2       6.6X
    -    object agg w/o group by w/o fallback          4310 / 4529         24.3          41.1       7.3X
    -     */
    -  }
    -
    -  ignore("ObjectHashAggregateExec vs SortAggregateExec - percentile_approx") {
    -    val N = 2 << 20
    -
    -    val benchmark = new Benchmark(
    -      name = "object agg v.s. sort agg",
    -      valuesPerIteration = N,
    -      minNumIters = 5,
    -      warmupTime = 15.seconds,
    -      minTime = 45.seconds,
    -      outputPerIteration = true
    -    )
    -
    -    import sparkSession.implicits._
    -
    -    val df = sparkSession.range(N).coalesce(1)
    -
    -    benchmark.addCase("sort agg w/ group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
    -      df.groupBy($"id" / (N / 4) cast LongType).agg(percentile_approx($"id", 0.5)).collect()
    -    }
    -
    -    benchmark.addCase("object agg w/ group by w/o fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      df.groupBy($"id" / (N / 4) cast LongType).agg(percentile_approx($"id", 0.5)).collect()
    -    }
    -
    -    benchmark.addCase("object agg w/ group by w/ fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      sparkSession.conf.set(SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key, "2")
    -      df.groupBy($"id" / (N / 4) cast LongType).agg(percentile_approx($"id", 0.5)).collect()
    +/**
    + * Benchmark to measure read performance with Filter pushdown.
    --- End diff --
    
    Thanks @wangyum, fixed.


---

[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22804
  
    **[Test build #98012 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98012/testReport)** for PR 22804 at commit [`6849a87`](https://github.com/apache/spark/commit/6849a87a424140f15ddc308cee4a0087715f2f0f).


---

[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22804
  
    **[Test build #98012 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98012/testReport)** for PR 22804 at commit [`6849a87`](https://github.com/apache/spark/commit/6849a87a424140f15ddc308cee4a0087715f2f0f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22804
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22804
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97940/
    Test PASSed.


---

[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22804
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98012/
    Test PASSed.


---

[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22804
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97972/
    Test PASSed.


---

[GitHub] spark pull request #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggre...

Posted by wangyum <gi...@git.apache.org>.

Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22804#discussion_r227330755
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/execution/benchmark/ObjectHashAggregateExecBenchmark.scala ---
    @@ -21,207 +21,212 @@ import scala.concurrent.duration._
     
     import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox
     
    -import org.apache.spark.benchmark.Benchmark
    -import org.apache.spark.sql.Column
    -import org.apache.spark.sql.catalyst.FunctionIdentifier
    -import org.apache.spark.sql.catalyst.catalog.CatalogFunction
    +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}
    +import org.apache.spark.sql.{Column, SparkSession}
     import org.apache.spark.sql.catalyst.expressions.Literal
     import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
    -import org.apache.spark.sql.hive.HiveSessionCatalog
    +import org.apache.spark.sql.catalyst.plans.SQLHelper
     import org.apache.spark.sql.hive.execution.TestingTypedCount
    -import org.apache.spark.sql.hive.test.TestHiveSingleton
    +import org.apache.spark.sql.hive.test.TestHive
     import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.types.LongType
     
    -class ObjectHashAggregateExecBenchmark extends BenchmarkWithCodegen with TestHiveSingleton {
    -  ignore("Hive UDAF vs Spark AF") {
    -    val N = 2 << 15
    -
    -    val benchmark = new Benchmark(
    -      name = "hive udaf vs spark af",
    -      valuesPerIteration = N,
    -      minNumIters = 5,
    -      warmupTime = 5.seconds,
    -      minTime = 10.seconds,
    -      outputPerIteration = true
    -    )
    -
    -    registerHiveFunction("hive_percentile_approx", classOf[GenericUDAFPercentileApprox])
    -
    -    sparkSession.range(N).createOrReplaceTempView("t")
    -
    -    benchmark.addCase("hive udaf w/o group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
    -      sparkSession.sql("SELECT hive_percentile_approx(id, 0.5) FROM t").collect()
    -    }
    -
    -    benchmark.addCase("spark af w/o group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      sparkSession.sql("SELECT percentile_approx(id, 0.5) FROM t").collect()
    -    }
    -
    -    benchmark.addCase("hive udaf w/ group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
    -      sparkSession.sql(
    -        s"SELECT hive_percentile_approx(id, 0.5) FROM t GROUP BY CAST(id / ${N / 4} AS BIGINT)"
    -      ).collect()
    -    }
    -
    -    benchmark.addCase("spark af w/ group by w/o fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      sparkSession.sql(
    -        s"SELECT percentile_approx(id, 0.5) FROM t GROUP BY CAST(id / ${N / 4} AS BIGINT)"
    -      ).collect()
    -    }
    -
    -    benchmark.addCase("spark af w/ group by w/ fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      sparkSession.conf.set(SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key, "2")
    -      sparkSession.sql(
    -        s"SELECT percentile_approx(id, 0.5) FROM t GROUP BY CAST(id / ${N / 4} AS BIGINT)"
    -      ).collect()
    -    }
    -
    -    benchmark.run()
    -
    -    /*
    -    Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
    -    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
    -
    -    hive udaf vs spark af:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -    ------------------------------------------------------------------------------------------------
    -    hive udaf w/o group by                        5326 / 5408          0.0       81264.2       1.0X
    -    spark af w/o group by                           93 /  111          0.7        1415.6      57.4X
    -    hive udaf w/ group by                         3804 / 3946          0.0       58050.1       1.4X
    -    spark af w/ group by w/o fallback               71 /   90          0.9        1085.7      74.8X
    -    spark af w/ group by w/ fallback                98 /  111          0.7        1501.6      54.1X
    -     */
    -  }
    -
    -  ignore("ObjectHashAggregateExec vs SortAggregateExec - typed_count") {
    -    val N: Long = 1024 * 1024 * 100
    -
    -    val benchmark = new Benchmark(
    -      name = "object agg v.s. sort agg",
    -      valuesPerIteration = N,
    -      minNumIters = 1,
    -      warmupTime = 10.seconds,
    -      minTime = 45.seconds,
    -      outputPerIteration = true
    -    )
    -
    -    import sparkSession.implicits._
    -
    -    def typed_count(column: Column): Column =
    -      Column(TestingTypedCount(column.expr).toAggregateExpression())
    -
    -    val df = sparkSession.range(N)
    -
    -    benchmark.addCase("sort agg w/ group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
    -      df.groupBy($"id" < (N / 2)).agg(typed_count($"id")).collect()
    -    }
    -
    -    benchmark.addCase("object agg w/ group by w/o fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      df.groupBy($"id" < (N / 2)).agg(typed_count($"id")).collect()
    -    }
    -
    -    benchmark.addCase("object agg w/ group by w/ fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      sparkSession.conf.set(SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key, "2")
    -      df.groupBy($"id" < (N / 2)).agg(typed_count($"id")).collect()
    -    }
    -
    -    benchmark.addCase("sort agg w/o group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
    -      df.select(typed_count($"id")).collect()
    -    }
    -
    -    benchmark.addCase("object agg w/o group by w/o fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      df.select(typed_count($"id")).collect()
    -    }
    -
    -    benchmark.run()
    -
    -    /*
    -    Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
    -    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
    -
    -    object agg v.s. sort agg:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -    ------------------------------------------------------------------------------------------------
    -    sort agg w/ group by                        31251 / 31908          3.4         298.0       1.0X
    -    object agg w/ group by w/o fallback           6903 / 7141         15.2          65.8       4.5X
    -    object agg w/ group by w/ fallback          20945 / 21613          5.0         199.7       1.5X
    -    sort agg w/o group by                         4734 / 5463         22.1          45.2       6.6X
    -    object agg w/o group by w/o fallback          4310 / 4529         24.3          41.1       7.3X
    -     */
    -  }
    -
    -  ignore("ObjectHashAggregateExec vs SortAggregateExec - percentile_approx") {
    -    val N = 2 << 20
    -
    -    val benchmark = new Benchmark(
    -      name = "object agg v.s. sort agg",
    -      valuesPerIteration = N,
    -      minNumIters = 5,
    -      warmupTime = 15.seconds,
    -      minTime = 45.seconds,
    -      outputPerIteration = true
    -    )
    -
    -    import sparkSession.implicits._
    -
    -    val df = sparkSession.range(N).coalesce(1)
    -
    -    benchmark.addCase("sort agg w/ group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
    -      df.groupBy($"id" / (N / 4) cast LongType).agg(percentile_approx($"id", 0.5)).collect()
    -    }
    -
    -    benchmark.addCase("object agg w/ group by w/o fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      df.groupBy($"id" / (N / 4) cast LongType).agg(percentile_approx($"id", 0.5)).collect()
    -    }
    -
    -    benchmark.addCase("object agg w/ group by w/ fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      sparkSession.conf.set(SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key, "2")
    -      df.groupBy($"id" / (N / 4) cast LongType).agg(percentile_approx($"id", 0.5)).collect()
    +/**
    + * Benchmark to measure read performance with Filter pushdown.
    --- End diff --
    
    "read performance with Filter pushdown"? Is this the right description for this benchmark?


---

[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...

Posted by peter-toth <gi...@git.apache.org>.

Github user peter-toth commented on the issue:

    https://github.com/apache/spark/pull/22804
  
    Thanks @dongjoon-hyun, @wangyum for the review.


---

[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22804
  
    ok to test


---

[GitHub] spark pull request #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggre...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22804#discussion_r227582826
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/execution/benchmark/ObjectHashAggregateExecBenchmark.scala ---
    @@ -21,207 +21,212 @@ import scala.concurrent.duration._
     
     import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox
     
    -import org.apache.spark.benchmark.Benchmark
    -import org.apache.spark.sql.Column
    -import org.apache.spark.sql.catalyst.FunctionIdentifier
    -import org.apache.spark.sql.catalyst.catalog.CatalogFunction
    +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}
    +import org.apache.spark.sql.{Column, SparkSession}
     import org.apache.spark.sql.catalyst.expressions.Literal
     import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
    -import org.apache.spark.sql.hive.HiveSessionCatalog
    +import org.apache.spark.sql.catalyst.plans.SQLHelper
     import org.apache.spark.sql.hive.execution.TestingTypedCount
    -import org.apache.spark.sql.hive.test.TestHiveSingleton
    +import org.apache.spark.sql.hive.test.TestHive
     import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.types.LongType
     
    -class ObjectHashAggregateExecBenchmark extends BenchmarkWithCodegen with TestHiveSingleton {
    -  ignore("Hive UDAF vs Spark AF") {
    -    val N = 2 << 15
    -
    -    val benchmark = new Benchmark(
    -      name = "hive udaf vs spark af",
    -      valuesPerIteration = N,
    -      minNumIters = 5,
    -      warmupTime = 5.seconds,
    -      minTime = 10.seconds,
    -      outputPerIteration = true
    -    )
    -
    -    registerHiveFunction("hive_percentile_approx", classOf[GenericUDAFPercentileApprox])
    -
    -    sparkSession.range(N).createOrReplaceTempView("t")
    -
    -    benchmark.addCase("hive udaf w/o group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
    -      sparkSession.sql("SELECT hive_percentile_approx(id, 0.5) FROM t").collect()
    -    }
    -
    -    benchmark.addCase("spark af w/o group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      sparkSession.sql("SELECT percentile_approx(id, 0.5) FROM t").collect()
    -    }
    -
    -    benchmark.addCase("hive udaf w/ group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
    -      sparkSession.sql(
    -        s"SELECT hive_percentile_approx(id, 0.5) FROM t GROUP BY CAST(id / ${N / 4} AS BIGINT)"
    -      ).collect()
    -    }
    -
    -    benchmark.addCase("spark af w/ group by w/o fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      sparkSession.sql(
    -        s"SELECT percentile_approx(id, 0.5) FROM t GROUP BY CAST(id / ${N / 4} AS BIGINT)"
    -      ).collect()
    -    }
    -
    -    benchmark.addCase("spark af w/ group by w/ fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      sparkSession.conf.set(SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key, "2")
    -      sparkSession.sql(
    -        s"SELECT percentile_approx(id, 0.5) FROM t GROUP BY CAST(id / ${N / 4} AS BIGINT)"
    -      ).collect()
    -    }
    -
    -    benchmark.run()
    -
    -    /*
    -    Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
    -    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
    -
    -    hive udaf vs spark af:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -    ------------------------------------------------------------------------------------------------
    -    hive udaf w/o group by                        5326 / 5408          0.0       81264.2       1.0X
    -    spark af w/o group by                           93 /  111          0.7        1415.6      57.4X
    -    hive udaf w/ group by                         3804 / 3946          0.0       58050.1       1.4X
    -    spark af w/ group by w/o fallback               71 /   90          0.9        1085.7      74.8X
    -    spark af w/ group by w/ fallback                98 /  111          0.7        1501.6      54.1X
    -     */
    -  }
    -
    -  ignore("ObjectHashAggregateExec vs SortAggregateExec - typed_count") {
    -    val N: Long = 1024 * 1024 * 100
    -
    -    val benchmark = new Benchmark(
    -      name = "object agg v.s. sort agg",
    -      valuesPerIteration = N,
    -      minNumIters = 1,
    -      warmupTime = 10.seconds,
    -      minTime = 45.seconds,
    -      outputPerIteration = true
    -    )
    -
    -    import sparkSession.implicits._
    -
    -    def typed_count(column: Column): Column =
    -      Column(TestingTypedCount(column.expr).toAggregateExpression())
    -
    -    val df = sparkSession.range(N)
    -
    -    benchmark.addCase("sort agg w/ group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
    -      df.groupBy($"id" < (N / 2)).agg(typed_count($"id")).collect()
    -    }
    -
    -    benchmark.addCase("object agg w/ group by w/o fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      df.groupBy($"id" < (N / 2)).agg(typed_count($"id")).collect()
    -    }
    -
    -    benchmark.addCase("object agg w/ group by w/ fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      sparkSession.conf.set(SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key, "2")
    -      df.groupBy($"id" < (N / 2)).agg(typed_count($"id")).collect()
    -    }
    -
    -    benchmark.addCase("sort agg w/o group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
    -      df.select(typed_count($"id")).collect()
    -    }
    -
    -    benchmark.addCase("object agg w/o group by w/o fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      df.select(typed_count($"id")).collect()
    -    }
    -
    -    benchmark.run()
    -
    -    /*
    -    Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
    -    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
    -
    -    object agg v.s. sort agg:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -    ------------------------------------------------------------------------------------------------
    -    sort agg w/ group by                        31251 / 31908          3.4         298.0       1.0X
    -    object agg w/ group by w/o fallback           6903 / 7141         15.2          65.8       4.5X
    -    object agg w/ group by w/ fallback          20945 / 21613          5.0         199.7       1.5X
    -    sort agg w/o group by                         4734 / 5463         22.1          45.2       6.6X
    -    object agg w/o group by w/o fallback          4310 / 4529         24.3          41.1       7.3X
    -     */
    -  }
    -
    -  ignore("ObjectHashAggregateExec vs SortAggregateExec - percentile_approx") {
    -    val N = 2 << 20
    -
    -    val benchmark = new Benchmark(
    -      name = "object agg v.s. sort agg",
    -      valuesPerIteration = N,
    -      minNumIters = 5,
    -      warmupTime = 15.seconds,
    -      minTime = 45.seconds,
    -      outputPerIteration = true
    -    )
    -
    -    import sparkSession.implicits._
    -
    -    val df = sparkSession.range(N).coalesce(1)
    -
    -    benchmark.addCase("sort agg w/ group by") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "false")
    -      df.groupBy($"id" / (N / 4) cast LongType).agg(percentile_approx($"id", 0.5)).collect()
    -    }
    -
    -    benchmark.addCase("object agg w/ group by w/o fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      df.groupBy($"id" / (N / 4) cast LongType).agg(percentile_approx($"id", 0.5)).collect()
    -    }
    -
    -    benchmark.addCase("object agg w/ group by w/ fallback") { _ =>
    -      sparkSession.conf.set(SQLConf.USE_OBJECT_HASH_AGG.key, "true")
    -      sparkSession.conf.set(SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key, "2")
    -      df.groupBy($"id" / (N / 4) cast LongType).agg(percentile_approx($"id", 0.5)).collect()
    +/**
    + * Benchmark to measure hash based aggregation.
    + * To run this benchmark:
    + * {{{
    + *   1. without sbt: bin/spark-submit --class <this class>
    + *        --jars <spark catalyst test jar>,<spark core test jar>,<spark hive jar>
    + *        --packages org.spark-project.hive:hive-exec:1.2.1.spark2
    + *        <spark hive test jar>
    + *   2. build/sbt "hive/test:runMain <this class>"
    + *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "hive/test:runMain <this class>"
    + *      Results will be written to "benchmarks/ObjectHashAggregateExecBenchmark-results.txt".
    + * }}}
    + */
    +object ObjectHashAggregateExecBenchmark extends BenchmarkBase with SQLHelper {
    +
    +  val spark: SparkSession = TestHive.sparkSession
    +
    +  override def runBenchmarkSuite(): Unit = {
    +    runBenchmark("Hive UDAF vs Spark AF") {
    --- End diff --
    
    Hi, @peter-toth. Thank you for making this PR.
    Currently, `runBenchmarkSuite` is too long. Could you move each test case into its own function? For example, the former `ignore("Hive UDAF vs Spark AF")` block can become a single function, and `runBenchmarkSuite` can then call those functions in sequence.
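
A minimal sketch of the structure requested above, assuming one function per former `ignore(...)` block. It is not the code that was merged. `BenchmarkBaseSketch` is a hypothetical stand-in for Spark's `BenchmarkBase` so the example runs without Spark on the classpath, the three function names are made up for illustration, and the bodies only print instead of building a real `Benchmark`. Only `runBenchmarkSuite`, `runBenchmark`, the benchmark titles, and the `N` values come from the diff.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for Spark's BenchmarkBase (assumption: the real trait
// supplies the main method and the runBenchmark wrapper used in the diff).
trait BenchmarkBaseSketch {
  // Records which benchmark groups ran, in order, so the flow can be inspected.
  val executed: ArrayBuffer[String] = ArrayBuffer.empty[String]

  def runBenchmarkSuite(): Unit

  def runBenchmark(name: String)(body: => Unit): Unit = {
    executed += name
    body
  }

  def main(args: Array[String]): Unit = {
    runBenchmarkSuite()
    executed.foreach(println)
  }
}

object ObjectHashAggregateExecBenchmarkSketch extends BenchmarkBaseSketch {

  // Each function would hold the Benchmark setup, addCase calls, and run()
  // that previously lived in one ignore(...) block. Names are hypothetical.
  private def hiveUdafVsSparkAf(n: Int): Unit =
    println(s"hive udaf vs spark af, N = $n")

  private def objectAggVsSortAggTypedCount(n: Long): Unit =
    println(s"object agg v.s. sort agg (typed_count), N = $n")

  private def objectAggVsSortAggPercentileApprox(n: Int): Unit =
    println(s"object agg v.s. sort agg (percentile_approx), N = $n")

  // The suite entry point stays short: it only names each group and delegates.
  override def runBenchmarkSuite(): Unit = {
    runBenchmark("Hive UDAF vs Spark AF") {
      hiveUdafVsSparkAf(2 << 15)
    }
    runBenchmark("ObjectHashAggregateExec vs SortAggregateExec - typed_count") {
      objectAggVsSortAggTypedCount(1024L * 1024 * 100)
    }
    runBenchmark("ObjectHashAggregateExec vs SortAggregateExec - percentile_approx") {
      objectAggVsSortAggPercentileApprox(2 << 20)
    }
  }
}
```

With this layout each benchmark group can be read, edited, or temporarily disabled on its own, and `runBenchmarkSuite` serves as a table of contents for the file.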


---


[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22804
  
    Can one of the admins verify this patch?


---


[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22804
  
    **[Test build #97972 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97972/testReport)** for PR 22804 at commit [`37b40ae`](https://github.com/apache/spark/commit/37b40aeec3e697af28d4d84fcf04570e8e03f329).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22804: [SPARK-25665][SQL][TEST] Refactor ObjectHashAggregateExe...

Posted by peter-toth <gi...@git.apache.org>.

Github user peter-toth commented on the issue:

    https://github.com/apache/spark/pull/22804
  
    Thanks @dongjoon-hyun for the fixes. Merged.


---

