Posted to reviews@spark.apache.org by inouehrs <gi...@git.apache.org> on 2016/06/02 03:11:53 UTC

[GitHub] spark pull request #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer ...

GitHub user inouehrs opened a pull request:

    https://github.com/apache/spark/pull/13459

    [SPARK-15726] [SQL] Make DatasetBenchmark fairer among Dataset, DataFrame and RDD

    ## What changes were proposed in this pull request?
    
    DatasetBenchmark compares the performance of RDD, DataFrame, and Dataset while running the same operations.
    In the backToBackMap test case, however, the DataFrame implementation does less work than the RDD and Dataset implementations. The test case processes Long+String pairs, but the output of the DataFrame implementation omits the String part, while the RDD and Dataset implementations produce Long+String pairs as output. This difference significantly changes the performance characteristics because of the String manipulation and creation overhead. With the fix, RDD outperforms DataFrame, whereas DataFrame was more than 2x faster than RDD without the fix. The performance gap between DataFrame and Dataset also becomes much narrower.
    
    Of course, this issue does not affect Spark users, but it may confuse Spark developers.
    
    ```scala
    // DataFrame
    val df = spark.range(1, numRows).select($"id".as("l"), $"id".cast(StringType).as("s"))
    var res = df
    res = res.select($"l" + 1 as "l")
    // this should be res = res.select($"l" + 1 as "l", $"s") for fair comparison
    
    // Dataset 
    case class Data(l: Long, s: String)
    val func = (d: Data) => Data(d.l + 1, d.s)
    var res = df.as[Data]
    res = res.map(func)
    ```
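
    A minimal sketch of the corrected DataFrame case (a hedged illustration only: the loop shape and the final action follow the existing benchmark, and the column names are the ones used above):

    ```scala
    // Keep the string column `s` in every chained select so the DataFrame case
    // does work comparable to the RDD and Dataset cases.
    var res = df
    var i = 0
    while (i < numChains) {
      res = res.select($"l" + 1 as "l", $"s")
      i += 1
    }
    res.queryExecution.toRdd.foreach(_ => Unit)   // force evaluation
    ```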
    
    Additionally, I added a new test case named "back-to-back map for primitive". This is almost equivalent to the old behavior of the DataFrame implementation of back-to-back map.
    
    ```
    without fix
    OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 3.10.0-229.el7.x86_64
    Intel Xeon E3-12xx v2 (Ivy Bridge)
    back-to-back map:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    RDD                                           2051 / 2077         48.7          20.5       1.0X
    DataFrame                                      755 /  940        132.5           7.5       2.7X
    Dataset                                       6155 / 6680         16.2          61.6       0.3X
    
    with fix
    back-to-back map:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    RDD                                           2077 / 2259         48.1          20.8       1.0X
    DataFrame                                     3030 / 3310         33.0          30.3       0.7X
    Dataset                                       6504 / 7006         15.4          65.0       0.3X
    
    back-to-back map for primitive:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    RDD                                           1073 / 1509         93.2          10.7       1.0X
    DataFrame                                      763 /  913        131.0           7.6       1.4X
    Dataset                                       4189 / 4312         23.9          41.9       0.3X
    ```
    
    Note that DatasetBenchmark causes a JVM crash in an aggregate test case. This is unrelated to this issue.
    I have already created a JIRA entry and submitted a pull request for it:
    https://issues.apache.org/jira/browse/SPARK-15704
    https://github.com/apache/spark/pull/13446
    
    
    ## How was this patch tested?
By executing DatasetBenchmark


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/inouehrs/spark fix_benchmark_fairness

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13459.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13459
    
----
commit ca21e673916ecc7c51f24ddfef8f748bade2e11b
Author: Hiroshi Inoue <in...@jp.ibm.com>
Date:   2016-05-31T07:17:07Z

    make backToBackMap of DatasetBenchmark fair
    
    In the original implementation, the DataFrame version does less work
    than the RDD and Dataset versions.

commit f1e49f35e03c1d69662273b91b67b734d95b117b
Author: Hiroshi Inoue <in...@jp.ibm.com>
Date:   2016-05-31T07:24:09Z

    add backToBackMapPrimitive in DatasetBenchmark
    
    The backToBackMapPrimitive benchmark is almost equivalent to the original
    backToBackMap implementation for DataFrame.

----




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by inouehrs <gi...@git.apache.org>.
Github user inouehrs commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    I found another bug that significantly affects the performance comparisons.
    In _back-to-back map_ and _back-to-back filter_, the map or filter operation is executed only once for the RDD case, regardless of `numChains`. Hence the execution times for RDD have been largely underestimated.
    
    Regarding the `toRdd` issue pointed out by @rxin, `toRdd` returns an RDD of `InternalRow` (not an RDD of `Data`), so I think it does not incur significant conversion overhead.





[GitHub] spark pull request #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer ...

Posted by inouehrs <gi...@git.apache.org>.
Github user inouehrs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13459#discussion_r73395438
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala ---
    @@ -43,7 +43,7 @@ object DatasetBenchmark {
           var res = rdd
           var i = 0
           while (i < numChains) {
    -        res = rdd.map(func)
    --- End diff --
    
    Here, obviously, this should be `res = res.map(func)`.
    The original code applies map only once to the input `rdd`, regardless of `numChains`.
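
    For clarity, a minimal sketch of the corrected loop (a hedged illustration; variable names follow the benchmark code shown in the diff):

    ```scala
    // Buggy version: every iteration maps over the original input, so only one
    // map is ever applied no matter how large numChains is.
    //   res = rdd.map(func)

    // Fixed version: each iteration chains onto the previous result, so the
    // amount of work grows with numChains as intended.
    var res = rdd
    var i = 0
    while (i < numChains) {
      res = res.map(func)
      i += 1
    }
    res.foreach(_ => Unit)   // force evaluation
    ```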




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    Hm, I think the test is actually wrong for DataFrame. It forces a conversion to RDD, which is unnecessary and probably dominates the time.





[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    LGTM, pending jenkins




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    **[Test build #63253 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63253/consoleFull)** for PR 13459 at commit [`1866ae4`](https://github.com/apache/spark/commit/1866ae4aab89e29577b27f38315ff26bd3bb636c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    Jenkins, test this please




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    ok to test




[GitHub] spark pull request #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13459#discussion_r73467020
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala ---
    @@ -72,6 +72,47 @@ object DatasetBenchmark {
         benchmark
       }
     
    +  def backToBackMapPrimitive(spark: SparkSession, numRows: Long, numChains: Int): Benchmark = {
    --- End diff --
    
    This new benchmark looks reasonable, but it's out of the scope of the JIRA. Can you open a new PR to add it? Other changes in this PR LGTM.




[GitHub] spark pull request #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13459#discussion_r65575316
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala ---
    @@ -166,21 +207,33 @@ object DatasetBenchmark {
         val numChains = 10
     
         val benchmark = backToBackMap(spark, numRows, numChains)
    -    val benchmark2 = backToBackFilter(spark, numRows, numChains)
    -    val benchmark3 = aggregate(spark, numRows)
    +    val benchmark2 = backToBackMapPrimitive(spark, numRows, numChains)
    +    val benchmark3 = backToBackFilter(spark, numRows, numChains)
    +    val benchmark4 = aggregate(spark, numRows)
     
         /*
    -    Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.11.4
    -    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
    --- End diff --
    
    Since the hardware is different, I think you should rerun the whole benchmark and update the results.




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    I'm not sure how to trigger a test manually, will check with my colleague later




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by inouehrs <gi...@git.apache.org>.
Github user inouehrs commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    @cloud-fan Thank you for the comment. I removed the backToBackMapPrimitive benchmark from this PR. I am going to open a new PR for this.




[GitHub] spark pull request #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13459#discussion_r65497161
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala ---
    @@ -72,6 +72,47 @@ object DatasetBenchmark {
         benchmark
       }
     
    +  def backToBackMapPrimitive(spark: SparkSession, numRows: Long, numChains: Int): Benchmark = {
    +    import spark.implicits._
    +
    +    val df = spark.range(1, numRows).select($"id".as("l"))
    +    val benchmark = new Benchmark("back-to-back map for primitive", numRows)
    +    val func = (d: Long) => d+1
    +
    +    val rdd = spark.sparkContext.range(1, numRows).map(l => l.toLong)
    +    benchmark.addCase("RDD") { iter =>
    +      var res = rdd
    +      var i = 0
    +      while (i < numChains) {
    +        res = rdd.map(func)
    +        i += 1
    +      }
    +      res.foreach(_ => Unit)
    +    }
    +
    +    benchmark.addCase("DataFrame") { iter =>
    +      var res = df
    +      var i = 0
    +      while (i < numChains) {
    +        res = res.select($"l" + 1 as "l")
    +        i += 1
    +      }
    +      res.queryExecution.toRdd.foreach(_ => Unit)
    --- End diff --
    
    This is what I'm talking about -- this forces a conversion, which is very expensive.




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    **[Test build #63253 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63253/consoleFull)** for PR 13459 at commit [`1866ae4`](https://github.com/apache/spark/commit/1866ae4aab89e29577b27f38315ff26bd3bb636c).




[GitHub] spark pull request #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13459#discussion_r65575045
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala ---
    @@ -72,6 +72,47 @@ object DatasetBenchmark {
         benchmark
       }
     
    +  def backToBackMapPrimitive(spark: SparkSession, numRows: Long, numChains: Int): Benchmark = {
    +    import spark.implicits._
    +
    +    val df = spark.range(1, numRows).select($"id".as("l"))
    +    val benchmark = new Benchmark("back-to-back map for primitive", numRows)
    +    val func = (d: Long) => d+1
    +
    +    val rdd = spark.sparkContext.range(1, numRows).map(l => l.toLong)
    +    benchmark.addCase("RDD") { iter =>
    +      var res = rdd
    +      var i = 0
    +      while (i < numChains) {
    +        res = rdd.map(func)
    +        i += 1
    +      }
    +      res.foreach(_ => Unit)
    +    }
    +
    +    benchmark.addCase("DataFrame") { iter =>
    +      var res = df
    +      var i = 0
    +      while (i < numChains) {
    +        res = res.select($"l" + 1 as "l")
    +        i += 1
    +      }
    +      res.queryExecution.toRdd.foreach(_ => Unit)
    --- End diff --
    
    Hmm, I think `queryExecution.toRdd` won't force a conversion, but `.rdd` will.
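
    A small sketch of the distinction (a hedged illustration, assuming a `DataFrame` named `df`):

    ```scala
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.InternalRow

    // .rdd deserializes every InternalRow into an external Row object; this is
    // the per-row conversion cost being discussed here.
    val externalRows: RDD[Row] = df.rdd

    // queryExecution.toRdd exposes the underlying InternalRows directly,
    // without that deserialization step.
    val internalRows: RDD[InternalRow] = df.queryExecution.toRdd
    ```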




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by inouehrs <gi...@git.apache.org>.
Github user inouehrs commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    I found that `count` does not make sense for this purpose, since `map` does not change the number of elements and so codegen can skip the map entirely.
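
    A rough sketch of the two kinds of actions being compared (a hedged illustration, reusing the benchmark's DataFrame `res`):

    ```scala
    // With count(), the planner only needs the number of rows, so chained
    // projections such as select($"l" + 1 as "l") can effectively be pruned away.
    val rowCount = res.count()

    // Materializing the rows forces every chained projection to be evaluated,
    // which is what the benchmark is intended to measure.
    res.queryExecution.toRdd.foreach(_ => Unit)
    ```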




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by inouehrs <gi...@git.apache.org>.
Github user inouehrs commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    From the participants list, it looks like @SparkQA is not checking this PR. Is there any way to get the attention of the Jenkins system?




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    Can one of the admins verify this patch?




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by inouehrs <gi...@git.apache.org>.
Github user inouehrs commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    @cloud-fan, I updated the description. Thank you.




[GitHub] spark pull request #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13459




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by inouehrs <gi...@git.apache.org>.
Github user inouehrs commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    I updated the description again. After fixing the second issue, the earlier performance numbers were no longer valid. Thank you for pointing this out.




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by inouehrs <gi...@git.apache.org>.
Github user inouehrs commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    @rxin Thank you for the suggestion.
    When `res.count` is used instead of `res.queryExecution.toRdd.foreach(_ => Unit)`, the execution times become much shorter, as shown below. In particular, the DataFrame performance is impressive.
    In this case, the overhead of the conversion to RDD is replaced by the aggregation overhead.
    When I used `res.foreach(_ => Unit)` instead of `res.queryExecution.toRdd.foreach(_ => Unit)`, the performance degraded.
    I am going to add these aggregate versions of the tests in my pull request.
    Do you have any suggestion for an action to use here instead of `count`?
    
    ```
    OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 3.10.0-229.el7.x86_64
    Intel Xeon E3-12xx v2 (Ivy Bridge)
    back-to-back map:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    RDD                                           2118 / 2300         47.2          21.2       1.0X
    DataFrame                                      172 /  280        582.3           1.7      12.3X
    Dataset                                       4895 / 4999         20.4          49.0       0.4X
    
    OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 3.10.0-229.el7.x86_64
    Intel Xeon E3-12xx v2 (Ivy Bridge)
    back-to-back map for primitive:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    RDD                                            883 / 1150        113.2           8.8       1.0X
    DataFrame                                      110 /  121        905.2           1.1       8.0X
    Dataset                                       3880 / 3915         25.8          38.8       0.2X
    ```




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    OK to test




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63253/
    Test PASSed.




[GitHub] spark pull request #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13459#discussion_r73466753
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala ---
    @@ -43,7 +43,7 @@ object DatasetBenchmark {
           var res = rdd
           var i = 0
           while (i < numChains) {
    -        res = rdd.map(func)
    --- End diff --
    
    ah, good catch!




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    > After the fix RDD outperforms DataFrame, while DataFrame was more than 2x faster than RDD without the fix. Also, the performance gap between DataFrame and Dataset becomes much narrower.
    
    Is this still true?




[GitHub] spark pull request #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13459#discussion_r65574835
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala ---
    @@ -72,6 +72,47 @@ object DatasetBenchmark {
         benchmark
       }
     
    +  def backToBackMapPrimitive(spark: SparkSession, numRows: Long, numChains: Int): Benchmark = {
    +    import spark.implicits._
    +
    +    val df = spark.range(1, numRows).select($"id".as("l"))
    +    val benchmark = new Benchmark("back-to-back map for primitive", numRows)
    +    val func = (d: Long) => d+1
    --- End diff --
    
    nit: `d + 1`




[GitHub] spark pull request #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer ...

Posted by inouehrs <gi...@git.apache.org>.
Github user inouehrs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13459#discussion_r65584459
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala ---
    @@ -72,6 +72,47 @@ object DatasetBenchmark {
         benchmark
       }
     
    +  def backToBackMapPrimitive(spark: SparkSession, numRows: Long, numChains: Int): Benchmark = {
    +    import spark.implicits._
    +
    +    val df = spark.range(1, numRows).select($"id".as("l"))
    +    val benchmark = new Benchmark("back-to-back map for primitive", numRows)
    +    val func = (d: Long) => d+1
    --- End diff --
    
    Oops. Thank you for pointing this out.




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    thanks, merging to master!




[GitHub] spark pull request #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13459#discussion_r65497164
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala ---
    @@ -72,6 +72,47 @@ object DatasetBenchmark {
         benchmark
       }
     
    +  def backToBackMapPrimitive(spark: SparkSession, numRows: Long, numChains: Int): Benchmark = {
    +    import spark.implicits._
    +
    +    val df = spark.range(1, numRows).select($"id".as("l"))
    +    val benchmark = new Benchmark("back-to-back map for primitive", numRows)
    +    val func = (d: Long) => d+1
    +
    +    val rdd = spark.sparkContext.range(1, numRows).map(l => l.toLong)
    +    benchmark.addCase("RDD") { iter =>
    +      var res = rdd
    +      var i = 0
    +      while (i < numChains) {
    +        res = rdd.map(func)
    +        i += 1
    +      }
    +      res.foreach(_ => Unit)
    +    }
    +
    +    benchmark.addCase("DataFrame") { iter =>
    +      var res = df
    +      var i = 0
    +      while (i < numChains) {
    +        res = res.select($"l" + 1 as "l")
    +        i += 1
    +      }
    +      res.queryExecution.toRdd.foreach(_ => Unit)
    +    }
    +
    +    benchmark.addCase("Dataset") { iter =>
    +      var res = df.as[Long]
    +      var i = 0
    +      while (i < numChains) {
    +        res = res.map(func)
    +        i += 1
    +      }
    +      res.queryExecution.toRdd.foreach(_ => Unit)
    --- End diff --
    
    same here




[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    Hi @inouehrs, can you update the PR description?

