Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2018/10/04 15:53:38 UTC

[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/22631

    [SPARK-25605][TESTS] Run cast string to timestamp tests for a subset of timezones

    ## What changes were proposed in this pull request?
    
    The test `cast string to timestamp` used to run for all time zones, so it ran more than 600 times. Running the tests for a significant subset of time zones is probably good enough, and doing it in a randomized manner still ensures that all time zones get exercised across different runs.
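    
    As a minimal sketch of the idea (assuming the suite's existing `ALL_TIMEZONES` list and its per-timezone assertions), the change boils down to:
    
        import scala.util.Random

        // Instead of iterating over every available time zone...
        //   for (tz <- ALL_TIMEZONES) { ... }
        // ...shuffle the list and exercise a random subset of 50 per run.
        for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
          // run the existing "cast string to timestamp" assertions against tz
        }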
    
    ## How was this patch tested?
    
    The test time drops from more than 2 minutes to about 11 seconds.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-25605

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22631.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22631
    
----
commit 48534796d8bd9d2fcd75b4df77149f6b89341dde
Author: Marco Gaido <ma...@...>
Date:   2018-10-04T15:47:38Z

    [SPARK-25605][TESTS] Run cast string to timestamp tests for a subset of timezones

----


---



[GitHub] spark issue #22631: [SPARK-25605][TESTS] Run cast string to timestamp tests ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22631
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3681/
    Test PASSed.


---



[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22631#discussion_r223314632
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala ---
    @@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     
       test("cast string to timestamp") {
    -    for (tz <- ALL_TIMEZONES) {
    +    for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
    --- End diff --
    
    > we should categorize timezones, pick up some timezones representing them and test fixed set
    
    That would be best, but it requires a deep understanding of timezones to make sure the test coverage is good.


---



[GitHub] spark issue #22631: [SPARK-25605][TESTS] Run cast string to timestamp tests ...

Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22631
  
    `JsonExpressionsSuite` has a test for all time zones too. Probably it makes sense to apply the same approach there:
    https://github.com/apache/spark/blob/1007cae20e8f566e7d7c25f0f81c9b84f352b6d5/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala#L506
    
    I guess the number of time zones can be reduced in `DateTimeUtilsSuite` too.


---



[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22631


---



[GitHub] spark issue #22631: [SPARK-25605][TESTS] Run cast string to timestamp tests ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22631
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22631: [SPARK-25605][TESTS] Run cast string to timestamp tests ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22631
  
    **[Test build #96945 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96945/testReport)** for PR 22631 at commit [`4853479`](https://github.com/apache/spark/commit/48534796d8bd9d2fcd75b4df77149f6b89341dde).


---



[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22631#discussion_r223292455
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala ---
    @@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     
       test("cast string to timestamp") {
    -    for (tz <- ALL_TIMEZONES) {
    +    for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
    --- End diff --
    
    I don't think that adding parallelism is a good way to improve test time. The amount of resources available for testing is limited anyway, and I think the goal here is not (only) to reduce the overall time of the test but also to reduce the amount of resources needed for testing.
    
    Problems with a specific timezone like the one you mentioned, @HyukjinKwon, are exactly the reason why I am proposing this randomized approach rather than picking 3 timezones and always using them, as done in `DateExpressionsSuite`: if there is a problem with a specific timezone, this way we will eventually be able to catch it (even though not necessarily on a single run). With a fixed subset, we will not.
    
    The only safe deterministic way would be to run against all of them, as was done before, but then I'd argue we should do the same everywhere tests involve different timezones (why test all timezones only for casting to timestamp and not for all the other functions involving dates and times, if checking every timezone is so important?). The time needed to run all the tests would then be prohibitive, so it is not doable.


---



[GitHub] spark issue #22631: [SPARK-25605][TESTS] Run cast string to timestamp tests ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22631
  
    **[Test build #96945 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96945/testReport)** for PR 22631 at commit [`4853479`](https://github.com/apache/spark/commit/48534796d8bd9d2fcd75b4df77149f6b89341dde).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22631: [SPARK-25605][TESTS] Run cast string to timestamp tests ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22631
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22631#discussion_r223224573
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala ---
    @@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     
       test("cast string to timestamp") {
    -    for (tz <- ALL_TIMEZONES) {
    +    for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
    --- End diff --
    
    Tests should be deterministic, ideally; any sources of randomness should be seeded. Do you see one that isn't? 
    
    I think this is like deciding we'll randomly run just 90% of all test suites every time to save time. I think it's simply against good practice.
    
    There are other solutions:
    1) pick a subset of timezones that we're confident do exercise the code and just explicitly test those
    2) parallelize these tests within the test suite
    
    The latter should be trivial in this case: `ALL_TIMEZONES.par.foreach { tz =>` instead. It's the same amount of work, but 8x or 16x faster in wall-clock time, depending on how many cores are available. What about that?
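    
    A minimal sketch of that suggestion (assuming the suite's existing `ALL_TIMEZONES` list, and the Scala 2.11/2.12 standard library in use at the time, where `.par` is available on ordinary collections without an extra module):
    
        // Run the same assertions for every time zone, spread across the available cores.
        // Note: on Scala 2.13+ this would need the separate scala-parallel-collections module.
        ALL_TIMEZONES.par.foreach { tz =>
          // existing per-timezone assertions for "cast string to timestamp" go here
        }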


---



[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22631#discussion_r223228194
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala ---
    @@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     
       test("cast string to timestamp") {
    -    for (tz <- ALL_TIMEZONES) {
    +    for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
    --- End diff --
    
    Surely not by design? Tests need to be deterministic, or else what's the value? Failures can't be reproduced. (I know that in practice many things are hard to make deterministic.)
    
    Of course, if you're worried that we might not be testing an important case, we have to test it. We can't just skip it sometimes to make some tests run a little faster.
    
    Testing just 3 timezones might be fine too; I don't know. Testing 50 at random seems suboptimal in all cases.
    
    I'll open a PR to try simply testing in parallel instead.


---



[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22631#discussion_r223190476
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala ---
    @@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     
       test("cast string to timestamp") {
    -    for (tz <- ALL_TIMEZONES) {
    +    for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
    --- End diff --
    
    @mgaido91 @gatorsmile surely we shouldn't do this. We're introducing nondeterminism into the test, and we are also no longer testing things we were testing before. If there are a lot of timezones we don't need to test, then don't test them, right? I didn't see this PR, but I would have objected to this change if I had.


---



[GitHub] spark issue #22631: [SPARK-25605][TESTS] Run cast string to timestamp tests ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22631
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96945/
    Test PASSed.


---



[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22631#discussion_r223202913
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala ---
    @@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     
       test("cast string to timestamp") {
    -    for (tz <- ALL_TIMEZONES) {
    +    for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
    --- End diff --
    
    @srowen I think the point here is to test that this works with different timezones. In `DateExpressionsSuite`, for instance, we test only 3-4 timezones. I don't think it makes sense to test every one of the 650 possible timezones: if it works with many of them, it means the code is generic and respects timezones. We could also define a fixed subset of timezones, but IMHO picking some of them at random provides the additional safety that if there is a single specific timezone causing problems, we are able to identify it over several subsequent runs.
    
    We have many places where data is generated randomly in tests, so we already have randomness in the test suite. I think the rationale behind them is the same: if the function works with varied data, then it generalizes properly.


---



[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22631#discussion_r223286771
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala ---
    @@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     
       test("cast string to timestamp") {
    -    for (tz <- ALL_TIMEZONES) {
    +    for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
    --- End diff --
    
    I mean, some tests with randomized input, say integers within a range, are fine in common sense, but this case is different, isn't it?


---



[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22631#discussion_r223285813
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala ---
    @@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     
       test("cast string to timestamp") {
    -    for (tz <- ALL_TIMEZONES) {
    +    for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
    --- End diff --
    
    I think tests need to be deterministic in general as well.
    
    In this particular case, ideally we should categorize timezones and test a fixed set: for instance, timezones with DST, timezones without DST, and some exceptional cases, such as the one Python 3.6 addressed lately (https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Lib/datetime.py#L1572-L1574), IMHO.
    
    Of course, this approach requires a lot of investigation and overhead. So, as an alternative, I would be inclined to go with Sean's approach (https://github.com/apache/spark/pull/22631/files#r223224573) for this particular case.
    
    As for randomness, I think we should first have a primary, deterministic set of tests. Maybe we could additionally have some randomized input to cover cases we haven't foreseen, but that's secondary.
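    
    As a rough illustration of what such a categorization could look like (the particular zone IDs and the `REPRESENTATIVE_TIMEZONES` name are assumptions for this sketch, not the suite's actual API):
    
        import java.util.TimeZone

        // Hypothetical fixed buckets: zones with DST, zones without DST, and known edge cases.
        val withDst    = Seq("America/Los_Angeles", "Europe/Paris").map(id => TimeZone.getTimeZone(id))
        val withoutDst = Seq("UTC", "Asia/Tokyo").map(id => TimeZone.getTimeZone(id))
        val edgeCases  = Seq("Pacific/Kiritimati", "Asia/Kathmandu").map(id => TimeZone.getTimeZone(id))  // UTC+14:00, UTC+05:45

        val REPRESENTATIVE_TIMEZONES = withDst ++ withoutDst ++ edgeCases

        REPRESENTATIVE_TIMEZONES.foreach { tz =>
          // run the existing "cast string to timestamp" assertions against tz
        }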


---



[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22631#discussion_r223354745
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala ---
    @@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     
       test("cast string to timestamp") {
    -    for (tz <- ALL_TIMEZONES) {
    +    for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
    --- End diff --
    
    We aren't blocked on CPU time or resources, no. The tests are mostly single-threaded, and the big test machines are mostly idle throughout the runs. I've opened https://github.com/apache/spark/pull/22672 to evaluate that.
    
    I feel strongly that we can't do this kind of thing and am opening a thread on dev@ for wider discussion.


---



[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22631#discussion_r223226770
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala ---
    @@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     
       test("cast string to timestamp") {
    -    for (tz <- ALL_TIMEZONES) {
    +    for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
    --- End diff --
    
    Yes, there are many tests where data is randomly generated, and of course they are not seeded.
    
    As I said, I think the goal here is to test that the function works well with different timezones: picking a fixed subset of timezones would be fine too, but I prefer taking them at random from the full list because, if there is a single timezone creating issues (very unlikely IMHO), we would still discover it eventually (not on a single run, though).
    
    Moreover, it would be great to be consistent across the codebase about what we test. In `DateExpressionsSuite` we test only 3 timezones and here we test all 650: that is odd, isn't it? We should probably decide what the right thing to do is when timezones are involved and always test the same set. Otherwise, testing 650 timezones for one specific function and 3 for most of the others is nonsense IMHO.


---



[GitHub] spark pull request #22631: [SPARK-25605][TESTS] Run cast string to timestamp...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22631#discussion_r223253381
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala ---
    @@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with ExpressionEvalHelper {
       }
     
       test("cast string to timestamp") {
    -    for (tz <- ALL_TIMEZONES) {
    +    for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
    --- End diff --
    
    > tests need to be deterministic, or else what's the value? failures can't be reproduced
    
    That's not true. AFAIK we have a lot of tests that generate data randomly, and if one fails, the test name will include the seed (or the generated data), so people can easily reproduce it.
    
    I think it's a good strategy to test a subset of the possible cases, as a tradeoff between how soon we can discover a bug and how fast we can iterate.
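    
    For reference, a minimal sketch of that pattern (the seed handling here is illustrative, assuming a ScalaTest-style `test` helper and the suite's `ALL_TIMEZONES` list, not Spark's exact helpers):
    
        import scala.util.Random

        // Pick a seed up front and bake it into the test name, so a random
        // failure can be reproduced by re-running with the same seed.
        val seed = System.nanoTime()
        val rng  = new Random(seed)

        test(s"cast string to timestamp (seed = $seed)") {
          for (tz <- rng.shuffle(ALL_TIMEZONES).take(50)) {
            // existing per-timezone assertions go here
          }
        }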


---



[GitHub] spark issue #22631: [SPARK-25605][TESTS] Run cast string to timestamp tests ...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22631
  
    cc @gatorsmile 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org