Posted to reviews@spark.apache.org by wangyum <gi...@git.apache.org> on 2018/10/07 08:38:31 UTC

[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

GitHub user wangyum opened a pull request:

    https://github.com/apache/spark/pull/22661

    [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use main method

    ## What changes were proposed in this pull request?
    
    Refactor `JoinBenchmark` to use main method.
    1. Run with `spark-submit`:
    ```console
    bin/spark-submit --class org.apache.spark.sql.execution.benchmark.JoinBenchmark --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/core/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
    ```
    
    2. Generate benchmark result:
    ```console
    SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.JoinBenchmark"
    ```
    
    ## How was this patch tested?
    
    manual tests
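
For readers unfamiliar with the pattern, here is a minimal sketch of a benchmark driven by a `main` method that writes to a results file only when `SPARK_GENERATE_BENCHMARK_FILES=1` is set. It is not Spark's benchmark base class: `MainMethodBenchmark`, `ToyJoinBenchmark`, `resultFile`, and the placeholder workload are hypothetical names chosen for illustration, and the file name only mirrors the `benchmarks/JoinBenchmark-results.txt` convention mentioned in this thread.

```scala
import java.io.{File, FileOutputStream, PrintStream}

// Hypothetical stand-in for a benchmark base class; NOT Spark's actual API.
abstract class MainMethodBenchmark {
  // Name used to build the results file name (supplied by each benchmark).
  def name: String

  // Results go to stdout unless a results file is requested.
  protected var out: PrintStream = System.out

  // Subclasses put their benchmark cases here.
  def runBenchmarkSuite(): Unit

  // Pure helper: decide whether (and where) to write a results file.
  def resultFile(envValue: Option[String], dir: String): Option[File] =
    if (envValue.contains("1")) Some(new File(dir, s"$name-results.txt")) else None

  def main(args: Array[String]): Unit = {
    val target = resultFile(sys.env.get("SPARK_GENERATE_BENCHMARK_FILES"), "benchmarks")
    target.foreach { f =>
      Option(f.getParentFile).foreach(_.mkdirs())
      out = new PrintStream(new FileOutputStream(f))
    }
    try runBenchmarkSuite()
    finally if (target.isDefined) out.close()
  }
}

object ToyJoinBenchmark extends MainMethodBenchmark {
  override val name: String = "ToyJoinBenchmark"

  override def runBenchmarkSuite(): Unit = {
    val start = System.nanoTime()
    // Placeholder workload standing in for a real join.
    val matches = (0L until 1000000L).count(_ % 16 == 0)
    val ms = (System.nanoTime() - start) / 1e6
    out.println(f"toy join: $matches%d matches in $ms%.1f ms")
  }
}
```

Because the entry point is an ordinary `main`, the same object can be launched by any runner that accepts a main class, which is what makes both invocation styles above possible.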


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25664

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22661.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22661
    
----
commit 4339b1cbc5de7e54a7cd5be818fcf3dab249a351
Author: Yuming Wang <yu...@...>
Date:   2018-10-07T08:34:54Z

    Refactor JoinBenchmark

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    **[Test build #97299 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97299/testReport)** for PR 22661 at commit [`28f9b9a`](https://github.com/apache/spark/commit/28f9b9a8a26caf8750aa2e8c8e2bc793b3773d98).


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    **[Test build #97279 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97279/testReport)** for PR 22661 at commit [`3be13b1`](https://github.com/apache/spark/commit/3be13b16f1a59ffbd158265f54ad4f8d511d2018).


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97279/
    Test FAILed.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Merged build finished. Test PASSed.


---


[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224375578
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala ---
    @@ -48,13 +48,11 @@ object JoinBenchmark extends SqlBasedBenchmark {
         }
       }
     
    -
       def broadcastHashJoinLongKeyWithDuplicates(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -
    +    val dim = broadcast(spark.range(M).selectExpr("cast(id/10 as long) as k"))
    --- End diff --
    
    For this change, we need to rerun the benchmark to get a new result.
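
One way to see why moving a definition relative to the benchmarked closure can call for fresh numbers: only work done inside the closure is timed. The sketch below uses plain Scala collections rather than Spark, and `timed`, `buildDim`, and `probe` are hypothetical helpers, so it illustrates the measurement boundary only and says nothing about how costly `broadcast(...)` is in Spark itself.

```scala
object TimedClosureSketch {
  // Times only what runs inside `body`; returns elapsed nanoseconds and the result.
  def timed[A](body: => A): (Long, A) = {
    val start = System.nanoTime()
    val result = body
    (System.nanoTime() - start, result)
  }

  // Stand-in for building the dimension side: keys are id / 10, so they repeat.
  def buildDim(m: Int): Set[Long] =
    (0L until m.toLong).map(_ / 10).toSet

  // Stand-in for the probe side: count ids whose key (id % m) exists in dim.
  def probe(dim: Set[Long], n: Int, m: Int): Long =
    (0L until n.toLong).count(id => dim.contains(id % m)).toLong

  def main(args: Array[String]): Unit = {
    val n = 1 << 20
    val m = 1 << 16

    // Variant A: setup inside the timed closure, so its cost is measured.
    val (insideNs, a) = timed { val dim = buildDim(m); probe(dim, n, m) }

    // Variant B: setup hoisted out, so only the probe is measured.
    val dim = buildDim(m)
    val (outsideNs, b) = timed { probe(dim, n, m) }

    assert(a == b) // same answer, different measured region
    println(f"setup inside:  ${insideNs / 1e6}%.1f ms")
    println(f"setup outside: ${outsideNs / 1e6}%.1f ms")
  }
}
```

Both variants compute the same count; what differs is which statements fall inside the measured region, so results recorded under one layout are not directly comparable with the other.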


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    **[Test build #97301 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97301/testReport)** for PR 22661 at commit [`cd8b664`](https://github.com/apache/spark/commit/cd8b664e17ce613061cf046ee2b5c3f223c1afa7).


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3918/
    Test PASSed.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by wangyum <gi...@git.apache.org>.

Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    cc @dongjoon-hyun 


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3920/
    Test PASSed.


---


[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r223220438
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala ---
    @@ -19,229 +19,164 @@ package org.apache.spark.sql.execution.benchmark
     
     import org.apache.spark.sql.execution.joins._
     import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.types.IntegerType
     
     /**
      * Benchmark to measure performance for aggregate primitives.
    - * To run this:
    - *  build/sbt "sql/test-only *benchmark.JoinBenchmark"
    - *
    - * Benchmarks in this file are skipped in normal builds.
    + * To run this benchmark:
    + * {{{
    + *   1. without sbt:
    + *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
    + *   2. build/sbt "sql/test:runMain <this class>"
    + *   3. generate result:
    + *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
    + *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
    + * }}}
      */
    -class JoinBenchmark extends BenchmarkWithCodegen {
    +object JoinBenchmark extends SqlBasedBenchmark {
     
    -  ignore("broadcast hash join, long key") {
    +  def broadcastHashJoinLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
     
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long", N) {
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v"))
    +    codegenBenchmark("Join w long", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -    Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -    Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -    Join w long:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -    -------------------------------------------------------------------------------------------
    -    Join w long codegen=false                3002 / 3262          7.0         143.2       1.0X
    -    Join w long codegen=true                  321 /  371         65.3          15.3       9.3X
    -    */
       }
     
    -  ignore("broadcast hash join, long key with duplicates") {
    +
    +  def broadcastHashJoinLongKeyWithDuplicates(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
     
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long duplicated", N) {
    -      val dim = broadcast(sparkSession.range(M).selectExpr("cast(id/10 as long) as k"))
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    codegenBenchmark("Join w long duplicated", N) {
    +      val dim = broadcast(spark.range(M).selectExpr("cast(id/10 as long) as k"))
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w long duplicated:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w long duplicated codegen=false      3446 / 3478          6.1         164.3       1.0X
    -     *Join w long duplicated codegen=true       322 /  351         65.2          15.3      10.7X
    -     */
       }
     
    -  ignore("broadcast hash join, two int key") {
    +  def broadcastHashJoinTwoIntKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim2 = broadcast(sparkSession.range(M)
    +    val dim2 = broadcast(spark.range(M)
           .selectExpr("cast(id as int) as k1", "cast(id as int) as k2", "cast(id as string) as v"))
     
    -    runBenchmark("Join w 2 ints", N) {
    -      val df = sparkSession.range(N).join(dim2,
    +    codegenBenchmark("Join w 2 ints", N) {
    +      val df = spark.range(N).join(dim2,
             (col("id") % M).cast(IntegerType) === col("k1")
               && (col("id") % M).cast(IntegerType) === col("k2"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w 2 ints:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w 2 ints codegen=false              4426 / 4501          4.7         211.1       1.0X
    -     *Join w 2 ints codegen=true                791 /  818         26.5          37.7       5.6X
    -     */
       }
     
    -  ignore("broadcast hash join, two long key") {
    +  def broadcastHashJoinTwoLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim3 = broadcast(sparkSession.range(M)
    +    val dim3 = broadcast(spark.range(M)
           .selectExpr("id as k1", "id as k2", "cast(id as string) as v"))
     
    -    runBenchmark("Join w 2 longs", N) {
    -      val df = sparkSession.range(N).join(dim3,
    +    codegenBenchmark("Join w 2 longs", N) {
    +      val df = spark.range(N).join(dim3,
             (col("id") % M) === col("k1") && (col("id") % M) === col("k2"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w 2 longs:                     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w 2 longs codegen=false             5905 / 6123          3.6         281.6       1.0X
    -     *Join w 2 longs codegen=true              2230 / 2529          9.4         106.3       2.6X
    -     */
       }
     
    -  ignore("broadcast hash join, two long key with duplicates") {
    +  def broadcastHashJoinTwoLongKeyWithDuplicates(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim4 = broadcast(sparkSession.range(M)
    +    val dim4 = broadcast(spark.range(M)
           .selectExpr("cast(id/10 as long) as k1", "cast(id/10 as long) as k2"))
     
    -    runBenchmark("Join w 2 longs duplicated", N) {
    -      val df = sparkSession.range(N).join(dim4,
    +    codegenBenchmark("Join w 2 longs duplicated", N) {
    +      val df = spark.range(N).join(dim4,
             (col("id") bitwiseAND M) === col("k1") && (col("id") bitwiseAND M) === col("k2"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w 2 longs duplicated:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w 2 longs duplicated codegen=false      6420 / 6587          3.3         306.1       1.0X
    -     *Join w 2 longs duplicated codegen=true      2080 / 2139         10.1          99.2       3.1X
    -     */
       }
     
    -  ignore("broadcast hash join, outer join long key") {
    +
    +  def broadcastHashJoinOuterJoinLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("outer join w long", N) {
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"), "left")
    +    val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v"))
    +    codegenBenchmark("outer join w long", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"), "left")
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *outer join w long:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *outer join w long codegen=false          3055 / 3189          6.9         145.7       1.0X
    -     *outer join w long codegen=true            261 /  276         80.5          12.4      11.7X
    -     */
       }
     
    -  ignore("broadcast hash join, semi join long key") {
    +
    +  def broadcastHashJoinSemiJoinLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("semi join w long", N) {
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"), "leftsemi")
    +    val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v"))
    +    codegenBenchmark("semi join w long", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"), "leftsemi")
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *semi join w long:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *semi join w long codegen=false           1912 / 1990         11.0          91.2       1.0X
    -     *semi join w long codegen=true             237 /  244         88.3          11.3       8.1X
    -     */
       }
     
    -  ignore("sort merge join") {
    +  def sortMergeJoin(): Unit = {
         val N = 2 << 20
    -    runBenchmark("merge join", N) {
    -      val df1 = sparkSession.range(N).selectExpr(s"id * 2 as k1")
    -      val df2 = sparkSession.range(N).selectExpr(s"id * 3 as k2")
    +    codegenBenchmark("merge join", N) {
    +      val df1 = spark.range(N).selectExpr(s"id * 2 as k1")
    +      val df2 = spark.range(N).selectExpr(s"id * 3 as k2")
           val df = df1.join(df2, col("k1") === col("k2"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *merge join:                         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *merge join codegen=false                 1588 / 1880          1.3         757.1       1.0X
    -     *merge join codegen=true                  1477 / 1531          1.4         704.2       1.1X
    -     */
       }
     
    -  ignore("sort merge join with duplicates") {
    +  def sortMergeJoinWithDuplicates(): Unit = {
         val N = 2 << 20
    -    runBenchmark("sort merge join", N) {
    -      val df1 = sparkSession.range(N)
    +    codegenBenchmark("sort merge join with duplicates", N) {
    +      val df1 = spark.range(N)
             .selectExpr(s"(id * 15485863) % ${N*10} as k1")
    -      val df2 = sparkSession.range(N)
    +      val df2 = spark.range(N)
             .selectExpr(s"(id * 15485867) % ${N*10} as k2")
           val df = df1.join(df2, col("k1") === col("k2"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *sort merge join:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *sort merge join codegen=false            3626 / 3667          0.6        1728.9       1.0X
    -     *sort merge join codegen=true             3405 / 3438          0.6        1623.8       1.1X
    -     */
       }
     
    -  ignore("shuffle hash join") {
    -    val N = 4 << 20
    -    sparkSession.conf.set("spark.sql.shuffle.partitions", "2")
    -    sparkSession.conf.set("spark.sql.autoBroadcastJoinThreshold", "10000000")
    -    sparkSession.conf.set("spark.sql.join.preferSortMergeJoin", "false")
    -    runBenchmark("shuffle hash join", N) {
    -      val df1 = sparkSession.range(N).selectExpr(s"id as k1")
    -      val df2 = sparkSession.range(N / 3).selectExpr(s"id * 3 as k2")
    -      val df = df1.join(df2, col("k1") === col("k2"))
    -      assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[ShuffledHashJoinExec]).isDefined)
    -      df.count()
    +  def shuffleHashJoin(): Unit = {
    +    val N: Long = 4 << 20
    +    withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "2",
    --- End diff --
    
    nit: Could you put `SQLConf.SHUFFLE_PARTITIONS.key` on the next line?
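
As an illustration of the requested layout and of why a scoped helper is preferable to the removed `conf.set` calls, here is a self-contained sketch. `ScopedConfSketch`, its `conf` map, and `withConf` are hypothetical stand-ins and not Spark's `withSQLConf`; the three keys are the ones set in the removed lines of the diff.

```scala
import scala.collection.mutable

object ScopedConfSketch {
  // Hypothetical stand-in for a session configuration; NOT Spark's SQLConf.
  val conf: mutable.Map[String, String] = mutable.Map.empty

  // Applies `pairs` for the duration of `body`, then restores the prior values.
  def withConf[A](pairs: (String, String)*)(body: => A): A = {
    val previous = pairs.map { case (k, _) => k -> conf.get(k) }
    pairs.foreach { case (k, v) => conf(k) = v }
    try body
    finally previous.foreach {
      case (k, Some(old)) => conf(k) = old
      case (k, None)      => conf.remove(k)
    }
  }

  def main(args: Array[String]): Unit = {
    conf("spark.sql.shuffle.partitions") = "200"
    // Layout the review asks for: the first key starts on its own line.
    val seen = withConf(
      "spark.sql.shuffle.partitions" -> "2",
      "spark.sql.autoBroadcastJoinThreshold" -> "10000000",
      "spark.sql.join.preferSortMergeJoin" -> "false") {
      conf("spark.sql.shuffle.partitions")
    }
    println(s"inside: $seen, after: ${conf("spark.sql.shuffle.partitions")}")
  }
}
```

Unlike bare `set` calls, the scoped form restores the earlier values when the block exits, so settings made for one benchmark case do not leak into the next.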


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    **[Test build #97301 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97301/testReport)** for PR 22661 at commit [`cd8b664`](https://github.com/apache/spark/commit/cd8b664e17ce613061cf046ee2b5c3f223c1afa7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3906/
    Test PASSed.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97287/
    Test PASSed.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97301/
    Test PASSed.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    **[Test build #97090 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97090/testReport)** for PR 22661 at commit [`4859a9f`](https://github.com/apache/spark/commit/4859a9f5e78edf81c211c304a57e2603e60b2cc7).


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3899/
    Test PASSed.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Merged build finished. Test FAILed.


---


[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224523944
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala ---
    @@ -19,229 +19,161 @@ package org.apache.spark.sql.execution.benchmark
     
     import org.apache.spark.sql.execution.joins._
     import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.types.IntegerType
     
     /**
      * Benchmark to measure performance for aggregate primitives.
    --- End diff --
    
    `aggregate primitives` -> `joins`


---


[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224934704
  
    --- Diff: sql/core/benchmarks/JoinBenchmark-results.txt ---
    @@ -0,0 +1,75 @@
    +================================================================================================
    +Join Benchmark
    +================================================================================================
    +
    +OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
    +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
    +Join w long:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +------------------------------------------------------------------------------------------------
    +Join w long wholestage off                    4464 / 4483          4.7         212.9       1.0X
    +Join w long wholestage on                      289 /  339         72.6          13.8      15.5X
    +
    +OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
    +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
    +Join w long duplicated:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +------------------------------------------------------------------------------------------------
    +Join w long duplicated wholestage off         5662 / 5678          3.7         270.0       1.0X
    +Join w long duplicated wholestage on           332 /  345         63.1          15.8      17.0X
    +
    +OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
    +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
    +Join w 2 ints:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +------------------------------------------------------------------------------------------------
    +Join w 2 ints wholestage off              173174 / 173183          0.1        8257.6       1.0X
    +Join w 2 ints wholestage on               166350 / 198362          0.1        7932.2       1.0X
    --- End diff --
    
    +1.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97090/
    Test PASSed.


---


[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224374650
  
    --- Diff: sql/core/benchmarks/JoinBenchmark-results.txt ---
    @@ -0,0 +1,80 @@
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
    +
    +Join w long:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +------------------------------------------------------------------------------------------------
    +Join w long wholestage off                    4062 / 4709          5.2         193.7       1.0X
    +Join w long wholestage on                      152 /  163        138.4           7.2      26.8X
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
    +
    +Join w long duplicated:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +------------------------------------------------------------------------------------------------
    +Join w long duplicated wholestage off         3793 / 3801          5.5         180.9       1.0X
    +Join w long duplicated wholestage on           207 /  219        101.1           9.9      18.3X
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
    +
    +Join w 2 ints:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +------------------------------------------------------------------------------------------------
    +Join w 2 ints wholestage off              138514 / 139178          0.2        6604.9       1.0X
    +Join w 2 ints wholestage on               129908 / 140869          0.2        6194.5       1.1X
    --- End diff --
    
    Hmm, is this correct? Previously, we had the following.
    ```
         *Join w 2 ints codegen=false              4426 / 4501          4.7         211.1       1.0X
         *Join w 2 ints codegen=true                791 /  818         26.5          37.7       5.6X
    ```
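
For readers comparing the two result sets, here is a minimal sketch of how the table columns appear to relate to each other. The formulas are an assumption inferred from the numbers quoted in this thread (they reproduce the printed values to within rounding), not taken from the `Benchmark` source; `BenchmarkColumnsSketch` and its method names are hypothetical.

```scala
object BenchmarkColumnsSketch {
  // Row count used by the broadcast hash join cases in this PR: 20 << 20.
  val N: Long = 20L << 20

  // Assumed: "Per Row(ns)" is the best time divided by the row count.
  def perRowNs(bestMs: Double, rows: Long): Double = bestMs * 1e6 / rows

  // Assumed: "Rate(M/s)" is millions of rows processed per second.
  def rateMPerSec(bestMs: Double, rows: Long): Double = rows / (bestMs * 1e3)

  // Assumed: "Relative" is the baseline best time over this case's best time.
  def relative(baselineMs: Double, bestMs: Double): Double = baselineMs / bestMs

  def main(args: Array[String]): Unit = {
    // Old comment block: codegen=false 4426 ms, codegen=true 791 ms.
    println(f"old: ${perRowNs(4426, N)}%.1f ns/row, ${relative(4426, 791)}%.1fX")
    // New results file: wholestage off 138514 ms, wholestage on 129908 ms.
    println(f"new: ${perRowNs(138514, N)}%.1f ns/row, ${relative(138514, 129908)}%.1fX")
  }
}
```

Under these assumptions the new "Join w 2 ints" run spends roughly 30 times longer per row than the old one, and the speedup from whole-stage codegen drops from about 5.6X to about 1.1X, which is the discrepancy being questioned here.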


---

[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97243/


---

[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97299/


---

[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Merged build finished. Test PASSed.


---

[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224526143
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala ---
    @@ -19,229 +19,161 @@ package org.apache.spark.sql.execution.benchmark
     
     import org.apache.spark.sql.execution.joins._
     import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.types.IntegerType
     
     /**
      * Benchmark to measure performance for aggregate primitives.
    - * To run this:
    - *  build/sbt "sql/test-only *benchmark.JoinBenchmark"
    - *
    - * Benchmarks in this file are skipped in normal builds.
    + * To run this benchmark:
    + * {{{
    + *   1. without sbt:
    + *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
    + *   2. build/sbt "sql/test:runMain <this class>"
    + *   3. generate result:
    + *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
    + *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
    + * }}}
      */
    -class JoinBenchmark extends BenchmarkWithCodegen {
    +object JoinBenchmark extends SqlBasedBenchmark {
     
    -  ignore("broadcast hash join, long key") {
    +  def broadcastHashJoinLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
     
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long", N) {
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v"))
    +    codegenBenchmark("Join w long", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -    Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -    Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -    Join w long:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -    -------------------------------------------------------------------------------------------
    -    Join w long codegen=false                3002 / 3262          7.0         143.2       1.0X
    -    Join w long codegen=true                  321 /  371         65.3          15.3       9.3X
    -    */
       }
     
    -  ignore("broadcast hash join, long key with duplicates") {
    +  def broadcastHashJoinLongKeyWithDuplicates(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long duplicated", N) {
    -      val dim = broadcast(sparkSession.range(M).selectExpr("cast(id/10 as long) as k"))
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    val dim = broadcast(spark.range(M).selectExpr("cast(id/10 as long) as k"))
    +    codegenBenchmark("Join w long duplicated", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w long duplicated:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w long duplicated codegen=false      3446 / 3478          6.1         164.3       1.0X
    -     *Join w long duplicated codegen=true       322 /  351         65.2          15.3      10.7X
    -     */
       }
     
    -  ignore("broadcast hash join, two int key") {
    +  def broadcastHashJoinTwoIntKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim2 = broadcast(sparkSession.range(M)
    +    val dim2 = broadcast(spark.range(M)
           .selectExpr("cast(id as int) as k1", "cast(id as int) as k2", "cast(id as string) as v"))
     
    -    runBenchmark("Join w 2 ints", N) {
    -      val df = sparkSession.range(N).join(dim2,
    +    codegenBenchmark("Join w 2 ints", N) {
    +      val df = spark.range(N).join(dim2,
             (col("id") % M).cast(IntegerType) === col("k1")
               && (col("id") % M).cast(IntegerType) === col("k2"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w 2 ints:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w 2 ints codegen=false              4426 / 4501          4.7         211.1       1.0X
    -     *Join w 2 ints codegen=true                791 /  818         26.5          37.7       5.6X
    -     */
       }
     
    -  ignore("broadcast hash join, two long key") {
    +  def broadcastHashJoinTwoLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim3 = broadcast(sparkSession.range(M)
    +    val dim3 = broadcast(spark.range(M)
           .selectExpr("id as k1", "id as k2", "cast(id as string) as v"))
     
    -    runBenchmark("Join w 2 longs", N) {
    -      val df = sparkSession.range(N).join(dim3,
    +    codegenBenchmark("Join w 2 longs", N) {
    +      val df = spark.range(N).join(dim3,
             (col("id") % M) === col("k1") && (col("id") % M) === col("k2"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w 2 longs:                     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w 2 longs codegen=false             5905 / 6123          3.6         281.6       1.0X
    -     *Join w 2 longs codegen=true              2230 / 2529          9.4         106.3       2.6X
    -     */
       }
     
    -  ignore("broadcast hash join, two long key with duplicates") {
    +  def broadcastHashJoinTwoLongKeyWithDuplicates(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim4 = broadcast(sparkSession.range(M)
    +    val dim4 = broadcast(spark.range(M)
           .selectExpr("cast(id/10 as long) as k1", "cast(id/10 as long) as k2"))
     
    -    runBenchmark("Join w 2 longs duplicated", N) {
    -      val df = sparkSession.range(N).join(dim4,
    +    codegenBenchmark("Join w 2 longs duplicated", N) {
    +      val df = spark.range(N).join(dim4,
             (col("id") bitwiseAND M) === col("k1") && (col("id") bitwiseAND M) === col("k2"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w 2 longs duplicated:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w 2 longs duplicated codegen=false      6420 / 6587          3.3         306.1       1.0X
    -     *Join w 2 longs duplicated codegen=true      2080 / 2139         10.1          99.2       3.1X
    -     */
       }
     
    -  ignore("broadcast hash join, outer join long key") {
    +  def broadcastHashJoinOuterJoinLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("outer join w long", N) {
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"), "left")
    +    val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v"))
    +    codegenBenchmark("outer join w long", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"), "left")
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *outer join w long:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *outer join w long codegen=false          3055 / 3189          6.9         145.7       1.0X
    -     *outer join w long codegen=true            261 /  276         80.5          12.4      11.7X
    -     */
       }
     
    -  ignore("broadcast hash join, semi join long key") {
    +  def broadcastHashJoinSemiJoinLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("semi join w long", N) {
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"), "leftsemi")
    +    val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v"))
    +    codegenBenchmark("semi join w long", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"), "leftsemi")
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *semi join w long:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *semi join w long codegen=false           1912 / 1990         11.0          91.2       1.0X
    -     *semi join w long codegen=true             237 /  244         88.3          11.3       8.1X
    -     */
       }
     
    -  ignore("sort merge join") {
    +  def sortMergeJoin(): Unit = {
         val N = 2 << 20
    -    runBenchmark("merge join", N) {
    -      val df1 = sparkSession.range(N).selectExpr(s"id * 2 as k1")
    -      val df2 = sparkSession.range(N).selectExpr(s"id * 3 as k2")
    +    codegenBenchmark("merge join", N) {
    +      val df1 = spark.range(N).selectExpr(s"id * 2 as k1")
    +      val df2 = spark.range(N).selectExpr(s"id * 3 as k2")
           val df = df1.join(df2, col("k1") === col("k2"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *merge join:                         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *merge join codegen=false                 1588 / 1880          1.3         757.1       1.0X
    -     *merge join codegen=true                  1477 / 1531          1.4         704.2       1.1X
    -     */
       }
     
    -  ignore("sort merge join with duplicates") {
    +  def sortMergeJoinWithDuplicates(): Unit = {
         val N = 2 << 20
    -    runBenchmark("sort merge join", N) {
    -      val df1 = sparkSession.range(N)
    +    codegenBenchmark("sort merge join with duplicates", N) {
    +      val df1 = spark.range(N)
             .selectExpr(s"(id * 15485863) % ${N*10} as k1")
    -      val df2 = sparkSession.range(N)
    +      val df2 = spark.range(N)
             .selectExpr(s"(id * 15485867) % ${N*10} as k2")
           val df = df1.join(df2, col("k1") === col("k2"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *sort merge join:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *sort merge join codegen=false            3626 / 3667          0.6        1728.9       1.0X
    -     *sort merge join codegen=true             3405 / 3438          0.6        1623.8       1.1X
    -     */
       }
     
    -  ignore("shuffle hash join") {
    -    val N = 4 << 20
    -    sparkSession.conf.set("spark.sql.shuffle.partitions", "2")
    -    sparkSession.conf.set("spark.sql.autoBroadcastJoinThreshold", "10000000")
    -    sparkSession.conf.set("spark.sql.join.preferSortMergeJoin", "false")
    -    runBenchmark("shuffle hash join", N) {
    -      val df1 = sparkSession.range(N).selectExpr(s"id as k1")
    -      val df2 = sparkSession.range(N / 3).selectExpr(s"id * 3 as k2")
    -      val df = df1.join(df2, col("k1") === col("k2"))
    -      assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[ShuffledHashJoinExec]).isDefined)
    -      df.count()
    +  def shuffleHashJoin(): Unit = {
    +    val N: Long = 4 << 20
    +    withSQLConf(
    +      SQLConf.SHUFFLE_PARTITIONS.key -> "2",
    +      SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "10000000",
    +      SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
    +      codegenBenchmark("shuffle hash join", N) {
    +        val df1 = spark.range(N).selectExpr(s"id as k1")
    +        val df2 = spark.range(N / 3).selectExpr(s"id * 3 as k2")
    +        val df = df1.join(df2, col("k1") === col("k2"))
    +        assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[ShuffledHashJoinExec]).isDefined)
    +        df.count()
    +      }
         }
    +  }
     
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
    -     *Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
    -     *shuffle hash join:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *shuffle hash join codegen=false          2005 / 2010          2.1         478.0       1.0X
    -     *shuffle hash join codegen=true           1773 / 1792          2.4         422.7       1.1X
    -     */
    +  override def runBenchmarkSuite(): Unit = {
    --- End diff --
    
    Could you wrap the body of this method with something like `runBenchmark("Join Benchmark")`?
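
A minimal, self-contained sketch of the shape being requested: every benchmark case is called inside one named wrapper, so the output gets a single banner for the whole suite. The `runBenchmark` defined below is a hypothetical stand-in written for this example (Spark's own helper may differ in signature and output), and the two case methods are placeholders rather than the real joins.

```scala
import scala.collection.mutable.ArrayBuffer

object RunBenchmarkSketch {
  // Collected output lines, kept so the behaviour can be inspected.
  val log: ArrayBuffer[String] = ArrayBuffer.empty[String]

  private def emit(line: String): Unit = {
    log += line
    println(line)
  }

  // Hypothetical wrapper: writes a banner, then evaluates the by-name body once.
  def runBenchmark(name: String)(body: => Any): Unit = {
    val separator = "=" * 96
    emit(separator)
    emit(name)
    emit(separator)
    body
    emit("")
  }

  // Placeholders for the real cases such as broadcastHashJoinLongKey().
  def broadcastHashJoinLongKey(): Unit = emit("broadcast hash join, long key")
  def sortMergeJoin(): Unit = emit("sort merge join")

  def runBenchmarkSuite(): Unit = {
    runBenchmark("Join Benchmark") {
      broadcastHashJoinLongKey()
      sortMergeJoin()
    }
  }

  def main(args: Array[String]): Unit = runBenchmarkSuite()
}
```

Because the body is passed by name, the cases run only after the banner has been written, so the banner always precedes the results it labels.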


---

[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    **[Test build #97249 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97249/testReport)** for PR 22661 at commit [`00c4950`](https://github.com/apache/spark/commit/00c495091dfdfb9f647c0e66307b4cc8ef2a19a3).


---

[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    @wangyum . Could you review and merge https://github.com/wangyum/spark/pull/18 ?


---

[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    **[Test build #97279 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97279/testReport)** for PR 22661 at commit [`3be13b1`](https://github.com/apache/spark/commit/3be13b16f1a59ffbd158265f54ad4f8d511d2018).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224270597
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala ---
    @@ -19,229 +19,165 @@ package org.apache.spark.sql.execution.benchmark
     
     import org.apache.spark.sql.execution.joins._
     import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.types.IntegerType
     
     /**
      * Benchmark to measure performance for aggregate primitives.
    - * To run this:
    - *  build/sbt "sql/test-only *benchmark.JoinBenchmark"
    - *
    - * Benchmarks in this file are skipped in normal builds.
    + * To run this benchmark:
    + * {{{
    + *   1. without sbt:
    + *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
    + *   2. build/sbt "sql/test:runMain <this class>"
    + *   3. generate result:
    + *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
    + *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
    + * }}}
      */
    -class JoinBenchmark extends BenchmarkWithCodegen {
    +object JoinBenchmark extends SqlBasedBenchmark {
     
    -  ignore("broadcast hash join, long key") {
    +  def broadcastHashJoinLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
     
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long", N) {
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v"))
    +    codegenBenchmark("Join w long", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -    Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -    Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -    Join w long:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -    -------------------------------------------------------------------------------------------
    -    Join w long codegen=false                3002 / 3262          7.0         143.2       1.0X
    -    Join w long codegen=true                  321 /  371         65.3          15.3       9.3X
    -    */
       }
     
    -  ignore("broadcast hash join, long key with duplicates") {
    +
    +  def broadcastHashJoinLongKeyWithDuplicates(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
     
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    --- End diff --
    
    So, this removes the redundant `dim` definition, right?


---

[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3779/


---

[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    **[Test build #97287 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97287/testReport)** for PR 22661 at commit [`3be13b1`](https://github.com/apache/spark/commit/3be13b16f1a59ffbd158265f54ad4f8d511d2018).


---

[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by wangyum <gi...@git.apache.org>.

Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224714936
  
    --- Diff: core/src/test/scala/org/apache/spark/benchmark/Benchmark.scala ---
    @@ -200,11 +200,12 @@ private[spark] object Benchmark {
       def getProcessorName(): String = {
         val cpu = if (SystemUtils.IS_OS_MAC_OSX) {
           Utils.executeAndGetOutput(Seq("/usr/sbin/sysctl", "-n", "machdep.cpu.brand_string"))
    +        .stripLineEnd
    --- End diff --
    
    Because the output generated on Mac has one more line than the output generated on Linux:
    https://github.com/apache/spark/pull/22661/commits/28f9b9a8a26caf8750aa2e8c8e2bc793b3773d98#diff-45c96c65f7c46bc2d84843a7cb92f22fL7
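
A small sketch of the effect described above, assuming the extra line comes from a trailing newline in the command output. The `header` helper and the object name are hypothetical and only mimic how a results header might be assembled; `stripLineEnd` is the standard Scala string method used in the diff.

```scala
object StripLineEndSketch {
  // Stand-in for raw command output that ends with a line terminator.
  val rawCpu: String = "Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz\n"

  // Hypothetical header builder: one line for the JVM, one for the CPU.
  def header(jvm: String, cpu: String): String = s"$jvm\n$cpu\n"

  def main(args: Array[String]): Unit = {
    val jvm = "Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6"

    // Without trimming, the CPU string carries its own newline, leaving a blank line.
    val untrimmed = header(jvm, rawCpu)
    // stripLineEnd drops a single trailing line terminator, if there is one.
    val trimmed = header(jvm, rawCpu.stripLineEnd)

    println(untrimmed.count(_ == '\n')) // 3 line breaks
    println(trimmed.count(_ == '\n'))   // 2 line breaks
  }
}
```

Trimming at the point where the value is read keeps the header the same length whether or not the underlying command appends a newline.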


---

[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    **[Test build #97080 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97080/testReport)** for PR 22661 at commit [`4339b1c`](https://github.com/apache/spark/commit/4339b1cbc5de7e54a7cd5be818fcf3dab249a351).


---

[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224270755
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala ---
    @@ -19,229 +19,165 @@ package org.apache.spark.sql.execution.benchmark
     
     import org.apache.spark.sql.execution.joins._
     import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.types.IntegerType
     
     /**
      * Benchmark to measure performance for aggregate primitives.
    - * To run this:
    - *  build/sbt "sql/test-only *benchmark.JoinBenchmark"
    - *
    - * Benchmarks in this file are skipped in normal builds.
    + * To run this benchmark:
    + * {{{
    + *   1. without sbt:
    + *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
    + *   2. build/sbt "sql/test:runMain <this class>"
    + *   3. generate result:
    + *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
    + *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
    + * }}}
      */
    -class JoinBenchmark extends BenchmarkWithCodegen {
    +object JoinBenchmark extends SqlBasedBenchmark {
     
    -  ignore("broadcast hash join, long key") {
    +  def broadcastHashJoinLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
     
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long", N) {
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v"))
    +    codegenBenchmark("Join w long", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -    Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -    Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -    Join w long:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -    -------------------------------------------------------------------------------------------
    -    Join w long codegen=false                3002 / 3262          7.0         143.2       1.0X
    -    Join w long codegen=true                  321 /  371         65.3          15.3       9.3X
    -    */
       }
     
    -  ignore("broadcast hash join, long key with duplicates") {
    +
    +  def broadcastHashJoinLongKeyWithDuplicates(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
     
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long duplicated", N) {
    -      val dim = broadcast(sparkSession.range(M).selectExpr("cast(id/10 as long) as k"))
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    codegenBenchmark("Join w long duplicated", N) {
    +      val dim = broadcast(spark.range(M).selectExpr("cast(id/10 as long) as k"))
    --- End diff --
    
    In another benchmark case in this file, `broadcast` is placed outside of `codegenBenchmark`. Should this case do the same? What do you think?


---

[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Merged build finished. Test PASSed.


---


[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224520773
  
    --- Diff: sql/core/benchmarks/JoinBenchmark-results.txt ---
    @@ -0,0 +1,80 @@
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
    +
    +Join w long:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +------------------------------------------------------------------------------------------------
    +Join w long wholestage off                    4062 / 4709          5.2         193.7       1.0X
    +Join w long wholestage on                      152 /  163        138.4           7.2      26.8X
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
    +
    +Join w long duplicated:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +------------------------------------------------------------------------------------------------
    +Join w long duplicated wholestage off         3793 / 3801          5.5         180.9       1.0X
    +Join w long duplicated wholestage on           207 /  219        101.1           9.9      18.3X
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
    +
    +Join w 2 ints:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +------------------------------------------------------------------------------------------------
    +Join w 2 ints wholestage off              138514 / 139178          0.2        6604.9       1.0X
    +Join w 2 ints wholestage on               129908 / 140869          0.2        6194.5       1.1X
    --- End diff --
    
    Oh, interesting. Although it's beyond the scope, could you run on `branch-2.4` and `branch-2.3` please, too?


---


[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22661


---


[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224767594
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala ---
    @@ -19,229 +19,163 @@ package org.apache.spark.sql.execution.benchmark
     
     import org.apache.spark.sql.execution.joins._
     import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.types.IntegerType
     
     /**
    - * Benchmark to measure performance for aggregate primitives.
    - * To run this:
    - *  build/sbt "sql/test-only *benchmark.JoinBenchmark"
    - *
    - * Benchmarks in this file are skipped in normal builds.
    + * Benchmark to measure performance for joins.
    + * To run this benchmark:
    + * {{{
    + *   1. without sbt:
    + *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
    + *   2. build/sbt "sql/test:runMain <this class>"
    + *   3. generate result:
    + *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
    + *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
    + * }}}
      */
    -class JoinBenchmark extends BenchmarkWithCodegen {
    +object JoinBenchmark extends SqlBasedBenchmark {
     
    -  ignore("broadcast hash join, long key") {
    +  def broadcastHashJoinLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
     
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long", N) {
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v"))
    +    codegenBenchmark("Join w long", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -    Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -    Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -    Join w long:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -    -------------------------------------------------------------------------------------------
    -    Join w long codegen=false                3002 / 3262          7.0         143.2       1.0X
    -    Join w long codegen=true                  321 /  371         65.3          15.3       9.3X
    -    */
       }
     
    -  ignore("broadcast hash join, long key with duplicates") {
    +  def broadcastHashJoinLongKeyWithDuplicates(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long duplicated", N) {
    -      val dim = broadcast(sparkSession.range(M).selectExpr("cast(id/10 as long) as k"))
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    val dim = broadcast(spark.range(M).selectExpr("cast(id/10 as long) as k"))
    +    codegenBenchmark("Join w long duplicated", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w long duplicated:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w long duplicated codegen=false      3446 / 3478          6.1         164.3       1.0X
    -     *Join w long duplicated codegen=true       322 /  351         65.2          15.3      10.7X
    -     */
       }
     
    -  ignore("broadcast hash join, two int key") {
    +  def broadcastHashJoinTwoIntKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim2 = broadcast(sparkSession.range(M)
    +    val dim2 = broadcast(spark.range(M)
           .selectExpr("cast(id as int) as k1", "cast(id as int) as k2", "cast(id as string) as v"))
     
    -    runBenchmark("Join w 2 ints", N) {
    -      val df = sparkSession.range(N).join(dim2,
    +    codegenBenchmark("Join w 2 ints", N) {
    +      val df = spark.range(N).join(dim2,
             (col("id") % M).cast(IntegerType) === col("k1")
               && (col("id") % M).cast(IntegerType) === col("k2"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w 2 ints:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w 2 ints codegen=false              4426 / 4501          4.7         211.1       1.0X
    -     *Join w 2 ints codegen=true                791 /  818         26.5          37.7       5.6X
    -     */
    --- End diff --
    
    This seems to be caused by the bug fix: https://github.com/apache/spark/pull/15390
    
    So the performance is reasonable. 


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    **[Test build #97299 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97299/testReport)** for PR 22661 at commit [`28f9b9a`](https://github.com/apache/spark/commit/28f9b9a8a26caf8750aa2e8c8e2bc793b3773d98).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Merged build finished. Test PASSed.


---


[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by wangyum <gi...@git.apache.org>.

Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224396901
  
    --- Diff: sql/core/benchmarks/JoinBenchmark-results.txt ---
    @@ -0,0 +1,80 @@
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
    +
    +Join w long:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +------------------------------------------------------------------------------------------------
    +Join w long wholestage off                    4062 / 4709          5.2         193.7       1.0X
    +Join w long wholestage on                      152 /  163        138.4           7.2      26.8X
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
    +
    +Join w long duplicated:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +------------------------------------------------------------------------------------------------
    +Join w long duplicated wholestage off         3793 / 3801          5.5         180.9       1.0X
    +Join w long duplicated wholestage on           207 /  219        101.1           9.9      18.3X
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
    +
    +Join w 2 ints:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +------------------------------------------------------------------------------------------------
    +Join w 2 ints wholestage off              138514 / 139178          0.2        6604.9       1.0X
    +Join w 2 ints wholestage on               129908 / 140869          0.2        6194.5       1.1X
    --- End diff --
    
    I think it's correct; I ran it on master:
    ```
    build/sbt "sql/test-only *benchmark.JoinBenchmark"
    ......
    [info] JoinBenchmark:
    [info] - broadcast hash join, long key !!! IGNORED !!!
    [info] - broadcast hash join, long key with duplicates !!! IGNORED !!!
    Running benchmark: Join w 2 ints
      Running case: Join w 2 ints wholestage off
      Stopped after 2 iterations, 307335 ms
      Running case: Join w 2 ints wholestage on
      Stopped after 5 iterations, 687107 ms
    
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
    
    Join w 2 ints:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Join w 2 ints wholestage off              153532 / 153668          0.1        7321.0       1.0X
    Join w 2 ints wholestage on               132075 / 137422          0.2        6297.8       1.2X
    ```
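    
    For reading the table: as far as I understand the harness, the derived columns follow from the best time and the row count `N = 20 << 20` used by this case. Below is a minimal sketch of that arithmetic, assuming Per Row is the best time divided by N, Rate is its inverse in millions of rows per second, and Relative is the first case's best time over the current one. The object and method names are made up for illustration and are not part of `Benchmark`.
    
    ```scala
    object BenchmarkColumnsSketch {
      // Row count used by the "Join w 2 ints" case in JoinBenchmark.
      val N: Long = 20L << 20

      // Nanoseconds spent per row, from the best wall-clock time in milliseconds.
      def perRowNs(bestMs: Double, rows: Long): Double = bestMs * 1e6 / rows

      // Millions of rows processed per second.
      def rateMPerSec(bestMs: Double, rows: Long): Double = rows / (bestMs * 1e3)

      // Speedup relative to the baseline (first) case.
      def relative(baselineBestMs: Double, bestMs: Double): Double = baselineBestMs / bestMs

      def main(args: Array[String]): Unit = {
        val offMs = 153532.0 // wholestage off, best time from the run above
        val onMs = 132075.0  // wholestage on, best time from the run above
        println(f"off: ${rateMPerSec(offMs, N)}%.1f M/s, ${perRowNs(offMs, N)}%.1f ns/row")
        println(f"on:  ${rateMPerSec(onMs, N)}%.1f M/s, ${perRowNs(onMs, N)}%.1f ns/row, " +
          f"${relative(offMs, onMs)}%.1fX")
      }
    }
    ```
    
    Under those assumptions the recomputed values agree with the printed columns up to rounding, which suggests the slow numbers come from the query itself rather than from the reporting.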


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    **[Test build #97243 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97243/testReport)** for PR 22661 at commit [`2baaf35`](https://github.com/apache/spark/commit/2baaf35a89d2cd5f70a0c21c05c392af7affb403).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Merged build finished. Test PASSed.


---


[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224768758
  
    --- Diff: sql/core/benchmarks/JoinBenchmark-results.txt ---
    @@ -0,0 +1,75 @@
    +================================================================================================
    +Join Benchmark
    +================================================================================================
    +
    +OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
    +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
    +Join w long:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +------------------------------------------------------------------------------------------------
    +Join w long wholestage off                    4464 / 4483          4.7         212.9       1.0X
    +Join w long wholestage on                      289 /  339         72.6          13.8      15.5X
    +
    +OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
    +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
    +Join w long duplicated:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +------------------------------------------------------------------------------------------------
    +Join w long duplicated wholestage off         5662 / 5678          3.7         270.0       1.0X
    +Join w long duplicated wholestage on           332 /  345         63.1          15.8      17.0X
    +
    +OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
    +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
    +Join w 2 ints:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +------------------------------------------------------------------------------------------------
    +Join w 2 ints wholestage off              173174 / 173183          0.1        8257.6       1.0X
    +Join w 2 ints wholestage on               166350 / 198362          0.1        7932.2       1.0X
    --- End diff --
    
    It surprises me that whole-stage codegen doesn't help here. We should investigate it later.
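    
    To make the outlier easy to spot, here is a rough sketch that recomputes the speedup from the best times in the table above and flags cases below a cut-off. The `Case` class, the 2x threshold, and the helper names are hypothetical, not part of the benchmark framework.
    
    ```scala
    object CodegenSpeedupSketch {
      // Best times in milliseconds, copied from the results above.
      final case class Case(name: String, offBestMs: Double, onBestMs: Double) {
        // How much faster the "wholestage on" run is than the "wholestage off" run.
        def speedup: Double = offBestMs / onBestMs
      }

      val cases: Seq[Case] = Seq(
        Case("Join w long", 4464, 289),
        Case("Join w long duplicated", 5662, 332),
        Case("Join w 2 ints", 173174, 166350)
      )

      // Arbitrary cut-off for "whole-stage codegen barely helps".
      val threshold = 2.0

      // Names of the cases whose speedup falls below the cut-off.
      def suspicious: Seq[String] = cases.filter(_.speedup < threshold).map(_.name)

      def main(args: Array[String]): Unit =
        suspicious.foreach(n => println(s"needs investigation: $n"))
    }
    ```
    
    With the numbers above, only the two-int-key join falls below the threshold; the two single-long-key joins are both well above it.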


---


[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by wangyum <gi...@git.apache.org>.

Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224300031
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala ---
    @@ -19,229 +19,165 @@ package org.apache.spark.sql.execution.benchmark
     
     import org.apache.spark.sql.execution.joins._
     import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.types.IntegerType
     
     /**
      * Benchmark to measure performance for aggregate primitives.
    - * To run this:
    - *  build/sbt "sql/test-only *benchmark.JoinBenchmark"
    - *
    - * Benchmarks in this file are skipped in normal builds.
    + * To run this benchmark:
    + * {{{
    + *   1. without sbt:
    + *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
    + *   2. build/sbt "sql/test:runMain <this class>"
    + *   3. generate result:
    + *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
    + *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
    + * }}}
      */
    -class JoinBenchmark extends BenchmarkWithCodegen {
    +object JoinBenchmark extends SqlBasedBenchmark {
     
    -  ignore("broadcast hash join, long key") {
    +  def broadcastHashJoinLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
     
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long", N) {
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v"))
    +    codegenBenchmark("Join w long", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -    Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -    Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -    Join w long:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -    -------------------------------------------------------------------------------------------
    -    Join w long codegen=false                3002 / 3262          7.0         143.2       1.0X
    -    Join w long codegen=true                  321 /  371         65.3          15.3       9.3X
    -    */
       }
     
    -  ignore("broadcast hash join, long key with duplicates") {
    +
    +  def broadcastHashJoinLongKeyWithDuplicates(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
     
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    --- End diff --
    
    Yes


---


[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224934912
  
    --- Diff: core/src/test/scala/org/apache/spark/benchmark/Benchmark.scala ---
    @@ -200,11 +200,12 @@ private[spark] object Benchmark {
       def getProcessorName(): String = {
         val cpu = if (SystemUtils.IS_OS_MAC_OSX) {
           Utils.executeAndGetOutput(Seq("/usr/sbin/sysctl", "-n", "machdep.cpu.brand_string"))
    +        .stripLineEnd
    --- End diff --
    
    Hmm. I'm not a fan of piggybacking. Okay.
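    
    For context on what the one-line change does: captured command output normally ends with a newline, which presumably shows up in the report as a blank line after the CPU name. Here is a small sketch of `stripLineEnd`, using a hard-coded string as a stand-in for the `sysctl` output. The object name and the sample value are illustrative only.
    
    ```scala
    object StripLineEndSketch {
      // Stand-in for the output of `sysctl -n machdep.cpu.brand_string`:
      // the value followed by a trailing newline.
      val raw: String = "Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz\n"

      // stripLineEnd removes one trailing line terminator, if present, and nothing else.
      val cpu: String = raw.stripLineEnd

      def main(args: Array[String]): Unit = {
        println(s"[$raw]") // the closing bracket lands on the next line
        println(s"[$cpu]") // the closing bracket stays on the same line
      }
    }
    ```
    
    A string without a trailing newline passes through unchanged, so the call is harmless on platforms where the output is already trimmed.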


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3875/


---


[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224676911
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala ---
    @@ -19,229 +19,163 @@ package org.apache.spark.sql.execution.benchmark
     
     import org.apache.spark.sql.execution.joins._
     import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.types.IntegerType
     
     /**
    - * Benchmark to measure performance for aggregate primitives.
    - * To run this:
    - *  build/sbt "sql/test-only *benchmark.JoinBenchmark"
    - *
    - * Benchmarks in this file are skipped in normal builds.
    + * Benchmark to measure performance for joins.
    + * To run this benchmark:
    + * {{{
    + *   1. without sbt:
    + *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
    + *   2. build/sbt "sql/test:runMain <this class>"
    + *   3. generate result:
    + *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
    + *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
    + * }}}
      */
    -class JoinBenchmark extends BenchmarkWithCodegen {
    +object JoinBenchmark extends SqlBasedBenchmark {
     
    -  ignore("broadcast hash join, long key") {
    +  def broadcastHashJoinLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
     
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long", N) {
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v"))
    +    codegenBenchmark("Join w long", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -    Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -    Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -    Join w long:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -    -------------------------------------------------------------------------------------------
    -    Join w long codegen=false                3002 / 3262          7.0         143.2       1.0X
    -    Join w long codegen=true                  321 /  371         65.3          15.3       9.3X
    -    */
       }
     
    -  ignore("broadcast hash join, long key with duplicates") {
    +  def broadcastHashJoinLongKeyWithDuplicates(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long duplicated", N) {
    -      val dim = broadcast(sparkSession.range(M).selectExpr("cast(id/10 as long) as k"))
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    val dim = broadcast(spark.range(M).selectExpr("cast(id/10 as long) as k"))
    +    codegenBenchmark("Join w long duplicated", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w long duplicated:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w long duplicated codegen=false      3446 / 3478          6.1         164.3       1.0X
    -     *Join w long duplicated codegen=true       322 /  351         65.2          15.3      10.7X
    -     */
       }
     
    -  ignore("broadcast hash join, two int key") {
    +  def broadcastHashJoinTwoIntKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim2 = broadcast(sparkSession.range(M)
    +    val dim2 = broadcast(spark.range(M)
           .selectExpr("cast(id as int) as k1", "cast(id as int) as k2", "cast(id as string) as v"))
     
    -    runBenchmark("Join w 2 ints", N) {
    -      val df = sparkSession.range(N).join(dim2,
    +    codegenBenchmark("Join w 2 ints", N) {
    +      val df = spark.range(N).join(dim2,
             (col("id") % M).cast(IntegerType) === col("k1")
               && (col("id") % M).cast(IntegerType) === col("k2"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w 2 ints:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w 2 ints codegen=false              4426 / 4501          4.7         211.1       1.0X
    -     *Join w 2 ints codegen=true                791 /  818         26.5          37.7       5.6X
    -     */
       }
     
    -  ignore("broadcast hash join, two long key") {
    +  def broadcastHashJoinTwoLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim3 = broadcast(sparkSession.range(M)
    +    val dim3 = broadcast(spark.range(M)
           .selectExpr("id as k1", "id as k2", "cast(id as string) as v"))
     
    -    runBenchmark("Join w 2 longs", N) {
    -      val df = sparkSession.range(N).join(dim3,
    +    codegenBenchmark("Join w 2 longs", N) {
    +      val df = spark.range(N).join(dim3,
             (col("id") % M) === col("k1") && (col("id") % M) === col("k2"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w 2 longs:                     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w 2 longs codegen=false             5905 / 6123          3.6         281.6       1.0X
    -     *Join w 2 longs codegen=true              2230 / 2529          9.4         106.3       2.6X
    -     */
       }
     
    -  ignore("broadcast hash join, two long key with duplicates") {
    +  def broadcastHashJoinTwoLongKeyWithDuplicates(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim4 = broadcast(sparkSession.range(M)
    +    val dim4 = broadcast(spark.range(M)
           .selectExpr("cast(id/10 as long) as k1", "cast(id/10 as long) as k2"))
     
    -    runBenchmark("Join w 2 longs duplicated", N) {
    -      val df = sparkSession.range(N).join(dim4,
    +    codegenBenchmark("Join w 2 longs duplicated", N) {
    +      val df = spark.range(N).join(dim4,
             (col("id") bitwiseAND M) === col("k1") && (col("id") bitwiseAND M) === col("k2"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w 2 longs duplicated:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w 2 longs duplicated codegen=false      6420 / 6587          3.3         306.1       1.0X
    -     *Join w 2 longs duplicated codegen=true      2080 / 2139         10.1          99.2       3.1X
    -     */
       }
     
    -  ignore("broadcast hash join, outer join long key") {
    +  def broadcastHashJoinOuterJoinLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("outer join w long", N) {
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"), "left")
    +    val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v"))
    +    codegenBenchmark("outer join w long", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"), "left")
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *outer join w long:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *outer join w long codegen=false          3055 / 3189          6.9         145.7       1.0X
    -     *outer join w long codegen=true            261 /  276         80.5          12.4      11.7X
    -     */
       }
     
    -  ignore("broadcast hash join, semi join long key") {
    +  def broadcastHashJoinSemiJoinLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("semi join w long", N) {
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"), "leftsemi")
    +    val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v"))
    +    codegenBenchmark("semi join w long", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"), "leftsemi")
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *semi join w long:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *semi join w long codegen=false           1912 / 1990         11.0          91.2       1.0X
    -     *semi join w long codegen=true             237 /  244         88.3          11.3       8.1X
    -     */
       }
     
    -  ignore("sort merge join") {
    +  def sortMergeJoin(): Unit = {
         val N = 2 << 20
    -    runBenchmark("merge join", N) {
    -      val df1 = sparkSession.range(N).selectExpr(s"id * 2 as k1")
    -      val df2 = sparkSession.range(N).selectExpr(s"id * 3 as k2")
    +    codegenBenchmark("merge join", N) {
    --- End diff --
    
    Please rename the benchmark label: `merge join` -> `sort merge join`


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by wangyum <gi...@git.apache.org>.

Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    retest this please


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    **[Test build #97090 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97090/testReport)** for PR 22661 at commit [`4859a9f`](https://github.com/apache/spark/commit/4859a9f5e78edf81c211c304a57e2603e60b2cc7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    **[Test build #97243 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97243/testReport)** for PR 22661 at commit [`2baaf35`](https://github.com/apache/spark/commit/2baaf35a89d2cd5f70a0c21c05c392af7affb403).


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97080/
    Test PASSed.


---


[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224685493
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala ---
    @@ -19,229 +19,163 @@ package org.apache.spark.sql.execution.benchmark
     
     import org.apache.spark.sql.execution.joins._
     import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.types.IntegerType
     
     /**
    - * Benchmark to measure performance for aggregate primitives.
    - * To run this:
    - *  build/sbt "sql/test-only *benchmark.JoinBenchmark"
    - *
    - * Benchmarks in this file are skipped in normal builds.
    + * Benchmark to measure performance for joins.
    + * To run this benchmark:
    + * {{{
    + *   1. without sbt:
    + *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
    + *   2. build/sbt "sql/test:runMain <this class>"
    + *   3. generate result:
    + *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
    + *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
    + * }}}
      */
    -class JoinBenchmark extends BenchmarkWithCodegen {
    +object JoinBenchmark extends SqlBasedBenchmark {
     
    -  ignore("broadcast hash join, long key") {
    +  def broadcastHashJoinLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
     
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long", N) {
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v"))
    +    codegenBenchmark("Join w long", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -    Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -    Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -    Join w long:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -    -------------------------------------------------------------------------------------------
    -    Join w long codegen=false                3002 / 3262          7.0         143.2       1.0X
    -    Join w long codegen=true                  321 /  371         65.3          15.3       9.3X
    -    */
       }
     
    -  ignore("broadcast hash join, long key with duplicates") {
    +  def broadcastHashJoinLongKeyWithDuplicates(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long duplicated", N) {
    -      val dim = broadcast(sparkSession.range(M).selectExpr("cast(id/10 as long) as k"))
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    val dim = broadcast(spark.range(M).selectExpr("cast(id/10 as long) as k"))
    +    codegenBenchmark("Join w long duplicated", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w long duplicated:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w long duplicated codegen=false      3446 / 3478          6.1         164.3       1.0X
    -     *Join w long duplicated codegen=true       322 /  351         65.2          15.3      10.7X
    -     */
       }
     
    -  ignore("broadcast hash join, two int key") {
    +  def broadcastHashJoinTwoIntKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim2 = broadcast(sparkSession.range(M)
    +    val dim2 = broadcast(spark.range(M)
           .selectExpr("cast(id as int) as k1", "cast(id as int) as k2", "cast(id as string) as v"))
     
    -    runBenchmark("Join w 2 ints", N) {
    -      val df = sparkSession.range(N).join(dim2,
    +    codegenBenchmark("Join w 2 ints", N) {
    +      val df = spark.range(N).join(dim2,
             (col("id") % M).cast(IntegerType) === col("k1")
               && (col("id") % M).cast(IntegerType) === col("k2"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w 2 ints:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w 2 ints codegen=false              4426 / 4501          4.7         211.1       1.0X
    -     *Join w 2 ints codegen=true                791 /  818         26.5          37.7       5.6X
    -     */
    --- End diff --
    
    Any advice is welcome, and thank you in advance, @cloud-fan, @gatorsmile, @davies, @rxin.


---


[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224678241
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala ---
    @@ -19,229 +19,163 @@ package org.apache.spark.sql.execution.benchmark
     
     import org.apache.spark.sql.execution.joins._
     import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.types.IntegerType
     
     /**
    - * Benchmark to measure performance for aggregate primitives.
    - * To run this:
    - *  build/sbt "sql/test-only *benchmark.JoinBenchmark"
    - *
    - * Benchmarks in this file are skipped in normal builds.
    + * Benchmark to measure performance for joins.
    + * To run this benchmark:
    + * {{{
    + *   1. without sbt:
    + *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
    + *   2. build/sbt "sql/test:runMain <this class>"
    + *   3. generate result:
    + *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
    + *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
    + * }}}
      */
    -class JoinBenchmark extends BenchmarkWithCodegen {
    +object JoinBenchmark extends SqlBasedBenchmark {
     
    -  ignore("broadcast hash join, long key") {
    +  def broadcastHashJoinLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
     
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long", N) {
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v"))
    +    codegenBenchmark("Join w long", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -    Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -    Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -    Join w long:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -    -------------------------------------------------------------------------------------------
    -    Join w long codegen=false                3002 / 3262          7.0         143.2       1.0X
    -    Join w long codegen=true                  321 /  371         65.3          15.3       9.3X
    -    */
       }
     
    -  ignore("broadcast hash join, long key with duplicates") {
    +  def broadcastHashJoinLongKeyWithDuplicates(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long duplicated", N) {
    -      val dim = broadcast(sparkSession.range(M).selectExpr("cast(id/10 as long) as k"))
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    val dim = broadcast(spark.range(M).selectExpr("cast(id/10 as long) as k"))
    +    codegenBenchmark("Join w long duplicated", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w long duplicated:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w long duplicated codegen=false      3446 / 3478          6.1         164.3       1.0X
    -     *Join w long duplicated codegen=true       322 /  351         65.2          15.3      10.7X
    -     */
       }
     
    -  ignore("broadcast hash join, two int key") {
    +  def broadcastHashJoinTwoIntKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim2 = broadcast(sparkSession.range(M)
    +    val dim2 = broadcast(spark.range(M)
           .selectExpr("cast(id as int) as k1", "cast(id as int) as k2", "cast(id as string) as v"))
     
    -    runBenchmark("Join w 2 ints", N) {
    -      val df = sparkSession.range(N).join(dim2,
    +    codegenBenchmark("Join w 2 ints", N) {
    +      val df = spark.range(N).join(dim2,
             (col("id") % M).cast(IntegerType) === col("k1")
               && (col("id") % M).cast(IntegerType) === col("k2"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w 2 ints:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w 2 ints codegen=false              4426 / 4501          4.7         211.1       1.0X
    -     *Join w 2 ints codegen=true                791 /  818         26.5          37.7       5.6X
    -     */
    --- End diff --
    
    For now, I also cannot reproduce a consistent result like the one above; I got the same weird result as you. Let me take a closer look at this.
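
    When comparing a fresh run against the removed table, it may help to re-derive the derived columns from the best time and the row count. The sketch below is only an illustration: `BenchmarkColumns` and its methods are hypothetical names, not part of Spark, and the formulas assume the columns mean what their headers say (per-row time, throughput, and speedup over the first case). Because the table rounds times to whole milliseconds, the recomputed values can differ from the printed ones in the last digit.

    ```scala
    // Hypothetical helper, not part of the Spark code base. It re-derives the
    // derived columns of the quoted table from the best time and the row count.
    object BenchmarkColumns {
      /** Nanoseconds spent per row, given the best run time in milliseconds. */
      def perRowNs(bestMs: Double, rows: Long): Double = bestMs * 1e6 / rows

      /** Throughput in millions of rows per second. */
      def rateMPerSec(bestMs: Double, rows: Long): Double = rows / (bestMs * 1000.0)

      /** Speedup of a case relative to the baseline (first) case. */
      def relative(baselineBestMs: Double, bestMs: Double): Double = baselineBestMs / bestMs

      def main(args: Array[String]): Unit = {
        val rows = 20L << 20       // N in broadcastHashJoinTwoIntKey
        val codegenOffMs = 4426.0  // best time of "Join w 2 ints codegen=false"
        val codegenOnMs = 791.0    // best time of "Join w 2 ints codegen=true"

        println(f"codegen=false: ${rateMPerSec(codegenOffMs, rows)}%.1f M/s, " +
          f"${perRowNs(codegenOffMs, rows)}%.1f ns/row")
        println(f"codegen=true:  ${rateMPerSec(codegenOnMs, rows)}%.1f M/s, " +
          f"${perRowNs(codegenOnMs, rows)}%.1f ns/row, " +
          f"${relative(codegenOffMs, codegenOnMs)}%.1fX")
      }
    }
    ```

    Applying the same three formulas to the numbers of a new run gives a quick check on whether only the absolute times moved or the codegen speedup itself changed.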


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    **[Test build #97287 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97287/testReport)** for PR 22661 at commit [`3be13b1`](https://github.com/apache/spark/commit/3be13b16f1a59ffbd158265f54ad4f8d511d2018).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97249/
    Test PASSed.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3878/
    Test PASSed.


---


[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22661#discussion_r224934660
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala ---
    @@ -19,229 +19,163 @@ package org.apache.spark.sql.execution.benchmark
     
     import org.apache.spark.sql.execution.joins._
     import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.types.IntegerType
     
     /**
    - * Benchmark to measure performance for aggregate primitives.
    - * To run this:
    - *  build/sbt "sql/test-only *benchmark.JoinBenchmark"
    - *
    - * Benchmarks in this file are skipped in normal builds.
    + * Benchmark to measure performance for joins.
    + * To run this benchmark:
    + * {{{
    + *   1. without sbt:
    + *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
    + *   2. build/sbt "sql/test:runMain <this class>"
    + *   3. generate result:
    + *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
    + *      Results will be written to "benchmarks/JoinBenchmark-results.txt".
    + * }}}
      */
    -class JoinBenchmark extends BenchmarkWithCodegen {
    +object JoinBenchmark extends SqlBasedBenchmark {
     
    -  ignore("broadcast hash join, long key") {
    +  def broadcastHashJoinLongKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
     
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long", N) {
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v"))
    +    codegenBenchmark("Join w long", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -    Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -    Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -    Join w long:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -    -------------------------------------------------------------------------------------------
    -    Join w long codegen=false                3002 / 3262          7.0         143.2       1.0X
    -    Join w long codegen=true                  321 /  371         65.3          15.3       9.3X
    -    */
       }
     
    -  ignore("broadcast hash join, long key with duplicates") {
    +  def broadcastHashJoinLongKeyWithDuplicates(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -
    -    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))
    -    runBenchmark("Join w long duplicated", N) {
    -      val dim = broadcast(sparkSession.range(M).selectExpr("cast(id/10 as long) as k"))
    -      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
    +    val dim = broadcast(spark.range(M).selectExpr("cast(id/10 as long) as k"))
    +    codegenBenchmark("Join w long duplicated", N) {
    +      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w long duplicated:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w long duplicated codegen=false      3446 / 3478          6.1         164.3       1.0X
    -     *Join w long duplicated codegen=true       322 /  351         65.2          15.3      10.7X
    -     */
       }
     
    -  ignore("broadcast hash join, two int key") {
    +  def broadcastHashJoinTwoIntKey(): Unit = {
         val N = 20 << 20
         val M = 1 << 16
    -    val dim2 = broadcast(sparkSession.range(M)
    +    val dim2 = broadcast(spark.range(M)
           .selectExpr("cast(id as int) as k1", "cast(id as int) as k2", "cast(id as string) as v"))
     
    -    runBenchmark("Join w 2 ints", N) {
    -      val df = sparkSession.range(N).join(dim2,
    +    codegenBenchmark("Join w 2 ints", N) {
    +      val df = spark.range(N).join(dim2,
             (col("id") % M).cast(IntegerType) === col("k1")
               && (col("id") % M).cast(IntegerType) === col("k2"))
           assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
           df.count()
         }
    -
    -    /*
    -     *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
    -     *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -     *Join w 2 ints:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -     *-------------------------------------------------------------------------------------------
    -     *Join w 2 ints codegen=false              4426 / 4501          4.7         211.1       1.0X
    -     *Join w 2 ints codegen=true                791 /  818         26.5          37.7       5.6X
    -     */
    --- End diff --
    
    Thank you for the confirmation, @cloud-fan!


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3771/
    Test PASSed.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    **[Test build #97249 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97249/testReport)** for PR 22661 at commit [`00c4950`](https://github.com/apache/spark/commit/00c495091dfdfb9f647c0e66307b4cc8ef2a19a3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    **[Test build #97080 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97080/testReport)** for PR 22661 at commit [`4339b1c`](https://github.com/apache/spark/commit/4339b1cbc5de7e54a7cd5be818fcf3dab249a351).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22661
  
    Merged build finished. Test PASSed.


---
