Posted to reviews@spark.apache.org by dilipbiswal <gi...@git.apache.org> on 2018/10/05 08:04:14 UTC

[GitHub] spark pull request #22638: [SPARK-25610][SQL][TEST] Improve execution time o...

GitHub user dilipbiswal opened a pull request:

    https://github.com/apache/spark/pull/22638

    [SPARK-25610][SQL][TEST] Improve execution time of DatasetCacheSuite: cache UDF result correctly

    ## What changes were proposed in this pull request?
    In this test case, we verify that the result of a UDF is cached when the underlying DataFrame is cached, and that the UDF is not evaluated again when the cached DataFrame is used.
    
    To reduce the runtime, we:
    1) Use a single-partition DataFrame, so the total execution time of the UDF is more deterministic.
    2) Cut the size of the DataFrame from 10 rows to 2.
    3) Reduce the sleep time in the UDF from 5 seconds to 2 seconds.
    4) Reduce the failAfter timeout from 3 seconds to 2 seconds.
    
    With these changes, caching the first DataFrame takes about 4 seconds, and the subsequent check takes a few hundred milliseconds.
    The runtimes for 5 consecutive runs of this test are as follows:
    ```
    [info] - cache UDF result correctly (4 seconds, 906 milliseconds)
    [info] - cache UDF result correctly (4 seconds, 281 milliseconds)
    [info] - cache UDF result correctly (4 seconds, 288 milliseconds)
    [info] - cache UDF result correctly (4 seconds, 355 milliseconds)
    [info] - cache UDF result correctly (4 seconds, 280 milliseconds)
    ```
    ## How was this patch tested?
    This is a test fix.
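
The property this test checks can be illustrated without Spark. The following is a minimal plain-Scala sketch, not the code in the patch: `CachedUdfSketch`, `expensive`, and `sumWithin` are hypothetical names, a `Vector` stands in for the cached DataFrame, `Await.result` stands in for the failAfter timeout, and the sleep is scaled down from 2 seconds to 200 ms so the sketch runs quickly.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for the test's structure; none of these names come from Spark.
object CachedUdfSketch {
  // Counts how many times the slow function actually runs.
  val invocations = new AtomicInteger(0)

  // Stand-in for the expensive UDF (200 ms here instead of 2 s).
  def expensive(x: Int): Int = {
    invocations.incrementAndGet()
    Thread.sleep(200)
    x
  }

  // "Caching": evaluate once, eagerly and sequentially, and keep the results.
  val input: Seq[Int] = Seq(0, 1)
  val cached: Vector[Int] = input.map(expensive).toVector

  // Reading the cached values must finish inside the timeout;
  // Await.result throws a TimeoutException otherwise.
  def sumWithin(timeout: FiniteDuration): Int =
    Await.result(Future(cached.sum), timeout)
}
```

Calling `CachedUdfSketch.sumWithin(200.millis)` only sums the stored values, so it returns well inside the timeout and `invocations` stays at 2. If the sum re-ran `expensive` for both rows sequentially, it would need about 400 ms and the timeout would fire.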

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dilipbiswal/spark SPARK-25610

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22638.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22638
    
----
commit 97d18feeb8713da42b9f97d2343063bac1cba4b6
Author: Dilip Biswal <db...@...>
Date:   2018-10-05T07:51:35Z

    [SPARK-25610][TEST] Improve execution time of DatasetCacheSuite: cache UDF result correctly

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22638: [SPARK-25610][SQL][TEST] Improve execution time o...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22638#discussion_r222952924
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala ---
    @@ -127,16 +127,16 @@ class DatasetCacheSuite extends QueryTest with SharedSQLContext with TimeLimits
       }
     
       test("cache UDF result correctly") {
    -    val expensiveUDF = udf({x: Int => Thread.sleep(5000); x})
    -    val df = spark.range(0, 10).toDF("a").withColumn("b", expensiveUDF($"a"))
    +    val expensiveUDF = udf({x: Int => Thread.sleep(2000); x})
    --- End diff --
    
    Well, I do think this will pass 100% of the time; my concern was that, in case of a regression, we might fail to detect it. But yes, with the repartition to 1 you're right: I hadn't considered that. Otherwise the UDF calls could have run in parallel. So this seems sufficient.
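
The single-partition point can be checked with a little arithmetic. The sketch below is a simplified model rather than anything from the patch: `TimingBudget` and its methods are hypothetical names, and it assumes rows are split evenly across partitions whose tasks run fully in parallel.

```scala
// Hypothetical model of the timing budget; names are illustrative only.
object TimingBudget {
  /** Estimated wall-clock time (ms) to evaluate the UDF once per row,
    * assuming an even split over `partitions` tasks running in parallel. */
  def udfWallClockMs(rows: Int, sleepMs: Long, partitions: Int): Long = {
    val rowsPerTask = math.ceil(rows.toDouble / partitions).toLong
    rowsPerTask * sleepMs
  }

  /** True when re-evaluating the UDF would exceed the timeout,
    * i.e. the test would notice that the cache was not used. */
  def detectsRegression(rows: Int, sleepMs: Long, partitions: Int, failAfterMs: Long): Boolean =
    udfWallClockMs(rows, sleepMs, partitions) > failAfterMs
}
```

With 2 rows, a 2000 ms sleep, and 1 partition, `udfWallClockMs` gives 4000 ms, comfortably above a 2000 ms timeout. With 2 partitions the same model gives 2000 ms, which does not exceed the timeout, so a regression could go unnoticed.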


---



[GitHub] spark pull request #22638: [SPARK-25610][SQL][TEST] Improve execution time o...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22638#discussion_r222945036
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala ---
    @@ -127,16 +127,16 @@ class DatasetCacheSuite extends QueryTest with SharedSQLContext with TimeLimits
       }
     
       test("cache UDF result correctly") {
    -    val expensiveUDF = udf({x: Int => Thread.sleep(5000); x})
    -    val df = spark.range(0, 10).toDF("a").withColumn("b", expensiveUDF($"a"))
    +    val expensiveUDF = udf({x: Int => Thread.sleep(2000); x})
    --- End diff --
    
    Hmm... since we fail after 2 seconds, this test may pass even when caching doesn't work. Shall we raise the sleep to 3 seconds, or at least 2500 ms?


---



[GitHub] spark issue #22638: [SPARK-25610][SQL][TEST] Improve execution time of Datas...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22638
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22638: [SPARK-25610][SQL][TEST] Improve execution time of Datas...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22638
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3703/
    Test PASSed.


---



[GitHub] spark pull request #22638: [SPARK-25610][SQL][TEST] Improve execution time o...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22638#discussion_r222957505
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala ---
    @@ -127,16 +127,16 @@ class DatasetCacheSuite extends QueryTest with SharedSQLContext with TimeLimits
       }
     
       test("cache UDF result correctly") {
    -    val expensiveUDF = udf({x: Int => Thread.sleep(5000); x})
    -    val df = spark.range(0, 10).toDF("a").withColumn("b", expensiveUDF($"a"))
    +    val expensiveUDF = udf({x: Int => Thread.sleep(2000); x})
    --- End diff --
    
    @mgaido91 Thanks, Marco.


---



[GitHub] spark issue #22638: [SPARK-25610][SQL][TEST] Improve execution time of Datas...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22638
  
    **[Test build #96980 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96980/testReport)** for PR 22638 at commit [`97d18fe`](https://github.com/apache/spark/commit/97d18feeb8713da42b9f97d2343063bac1cba4b6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22638: [SPARK-25610][SQL][TEST] Improve execution time of Datas...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on the issue:

    https://github.com/apache/spark/pull/22638
  
    Thanks a lot @gatorsmile @mgaido91 


---



[GitHub] spark issue #22638: [SPARK-25610][SQL][TEST] Improve execution time of Datas...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22638
  
    **[Test build #96980 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96980/testReport)** for PR 22638 at commit [`97d18fe`](https://github.com/apache/spark/commit/97d18feeb8713da42b9f97d2343063bac1cba4b6).


---



[GitHub] spark pull request #22638: [SPARK-25610][SQL][TEST] Improve execution time o...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22638


---



[GitHub] spark issue #22638: [SPARK-25610][SQL][TEST] Improve execution time of Datas...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on the issue:

    https://github.com/apache/spark/pull/22638
  
    cc @gatorsmile 


---



[GitHub] spark issue #22638: [SPARK-25610][SQL][TEST] Improve execution time of Datas...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22638
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96980/
    Test PASSed.


---



[GitHub] spark issue #22638: [SPARK-25610][SQL][TEST] Improve execution time of Datas...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22638
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22638: [SPARK-25610][SQL][TEST] Improve execution time o...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22638#discussion_r222946564
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala ---
    @@ -127,16 +127,16 @@ class DatasetCacheSuite extends QueryTest with SharedSQLContext with TimeLimits
       }
     
       test("cache UDF result correctly") {
    -    val expensiveUDF = udf({x: Int => Thread.sleep(5000); x})
    -    val df = spark.range(0, 10).toDF("a").withColumn("b", expensiveUDF($"a"))
    +    val expensiveUDF = udf({x: Int => Thread.sleep(2000); x})
    --- End diff --
    
    @mgaido91 OK, please correct me on this one. We insert 2 rows, i.e. two invocations of the UDF, amounting to 2 * 2 sec = 4 secs of execution. So wouldn't a 2-second fail time be OK?


---
