You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by tdas <gi...@git.apache.org> on 2018/08/22 04:55:04 UTC

[GitHub] spark pull request #22182: [SPARK-25184][SS] Fixed race condition in StreamE...

GitHub user tdas opened a pull request:

    https://github.com/apache/spark/pull/22182

    [SPARK-25184][SS] Fixed race condition in StreamExecution that caused flaky test in FlatMapGroupsWithState

    ## What changes were proposed in this pull request?
    
    The race condition that caused test failure is between 2 threads.
    - The MicrobatchExecution thread that processes inputs to produce answers and then generates progress events.
    - The test thread that generates some input data, checked the answer and then verified the query generated progress event.
    
    The synchronization structure between these threads is as follows
    1. MicrobatchExecution thread, in every batch, does the following in order.
       a. Processes batch input to generate answer.
       b. Signals `awaitProgressLockCondition` to wake up threads waiting for progress using `awaitOffset`
       c. Generates progress event
    
    2. Test execution thread
       a. Calls `awaitOffset` to wait for progress, which waits on `awaitProgressLockCondition`.
       b. As soon as `awaitProgressLockCondition` is signaled, it would move on the in the test to check answer.
      c. Finally, it would verify the last generated progress event.
    
    What can happen is the following sequence of events: 2a -> 1a -> 1b -> 2b -> 2c -> 1c.
    In other words, the progress event may be generated after the test tries to verify it.
    
    The solution has two steps.
    1. Signal the waiting thread after the progress event has been generated, that is, after `finishTrigger()`.
    2. Increase the timeout of `awaitProgressLockCondition.await(100 ms)` to a large value.
    
    This latter is to ensure that test thread for keeps waiting on `awaitProgressLockCondition`until the MicroBatchExecution thread explicitly signals it. With the existing small timeout of 100ms the following sequence can occur.
     - MicroBatchExecution thread updates committed offsets
     - Test thread waiting on `awaitProgressLockCondition` accidentally times out after 100 ms, finds that the committed offsets have been updated, therefore returns from `awaitOffset` and moves on to the progress event tests.
     - MicroBatchExecution thread then generates progress event and signals. But the test thread has already attempted to verify the event and failed.
    
    By increasing the timeout to large (e.g., `streamingTimeoutMs = 60 seconds`, similar to `awaitInitialization`), this above type of race condition is also avoided.
    
    ## How was this patch tested?
    Ran locally many times.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tdas/spark SPARK-25184

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22182.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22182
    
----
commit 319990ff60ad7b6fad6fd0cea5cada0b22e3f3c9
Author: Tathagata Das <ta...@...>
Date:   2018-08-22T04:44:59Z

    [SC-12136][SS][HOTFIX] Fixed race condition in StreamExecution that caused flaky test in FlatMapGroupsWithState
    
    The race condition that caused test failure is between 2 threads.
    - The MicrobatchExecution thread that processes inputs to produce answers and then generates progress events.
    - The test thread that generates some input data, checked the answer and then verified the query generated progress event.
    
    The synchronization structure between these threads is as follows
    1. MicrobatchExecution thread, in every batch, does the following in order.
       a. Processes batch input to generate answer.
       b. Signals `awaitProgressLockCondition` to wake up threads waiting for progress using `awaitOffset`
       c. Generates progress event
    
    2. Test execution thread
       a. Calls `awaitOffset` to wait for progress, which waits on `awaitProgressLockCondition`.
       b. As soon as `awaitProgressLockCondition` is signaled, it would move on the in the test to check answer.
      c. Finally, it would verify the last generated progress event.
    
    What can happen is the following sequence of events: 2a -> 1a -> 1b -> 2b -> 2c -> 1c.
    In other words, the progress event may be generated after the test tries to verify it.
    
    The solution has two steps.
    1. Signal the waiting thread after the progress event has been generated, that is, after `finishTrigger()`.
    2. Increase the timeout of `awaitProgressLockCondition.await(100 ms)` to a large value.
    
    This latter is to ensure that test thread for keeps waiting on `awaitProgressLockCondition`until the MicroBatchExecution thread explicitly signals it. With the existing small timeout of 100ms the following sequence can occur.
     - MicroBatchExecution thread updates committed offsets
     - Test thread waiting on `awaitProgressLockCondition` accidentally times out after 100 ms, finds that the committed offsets have been updated, therefore returns from `awaitOffset` and moves on to the progress event tests.
     - MicroBatchExecution thread then generates progress event and signals. But the test thread has already attempted to verify the event and failed.
    
    By increasing the timeout to large (e.g., `streamingTimeoutMs = 60 seconds`, similar to `awaitInitialization`), this above type of race condition is also avoided.
    
    Ran locally many times.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    If this PR needs to be warmfixed (i.e. merged into the release branch after the code freeze), please follow steps below.
    
    What type of warmfix is this? Please select **exactly one choice**, or write description in Other.
    
    - [ ] Regression (e.g. fixing the behavior of a feature that regressed in the current release cycle)
    - [ ] ES ticket fix (e.g. Customer or internally requested update/fix)
    - [x] Other (please describe): test flakiness in 4.x and 4.3 branches.
    
    Make the following updates to this PR:
    
    - [x] Add `[WARMFIX]` in the title of this PR.
    - [x] Label the PR using label(s) corrsponding to the WARMFIX branch(es). The label name should be in the format of `dbr-branch-a.b` (e.g. `dbr-branch-3.2`), which matches the release branch name for Runtime release `a.b`.
    - [ ] Ask your team lead to sign off this warmfix and add the `warmfix-approved` label.
    - [ ] When merging the PR using the merge script, make sure to get this PR merged into the following branches:
    
      - The branch against which your PR is opened, and
      - Any extra release branch(es) corresponding to the `dbr-branch-a.b` label(s) applied to your PR.
    
    Closes #3166 from tdas/SC-12136.
    
    Authored-by: Tathagata Das <ta...@gmail.com>
    Signed-off-by: Tathagata Das <ta...@gmail.com>

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    **[Test build #95084 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95084/testReport)** for PR 22182 at commit [`319990f`](https://github.com/apache/spark/commit/319990ff60ad7b6fad6fd0cea5cada0b22e3f3c9).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2424/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95098/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95084/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95090/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    **[Test build #95090 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95090/testReport)** for PR 22182 at commit [`319990f`](https://github.com/apache/spark/commit/319990ff60ad7b6fad6fd0cea5cada0b22e3f3c9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2430/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22182: [SPARK-25184][SS] Fixed race condition in StreamE...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22182


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    **[Test build #95090 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95090/testReport)** for PR 22182 at commit [`319990f`](https://github.com/apache/spark/commit/319990ff60ad7b6fad6fd0cea5cada0b22e3f3c9).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2438/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    **[Test build #95098 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95098/testReport)** for PR 22182 at commit [`319990f`](https://github.com/apache/spark/commit/319990ff60ad7b6fad6fd0cea5cada0b22e3f3c9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    **[Test build #95084 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95084/testReport)** for PR 22182 at commit [`319990f`](https://github.com/apache/spark/commit/319990ff60ad7b6fad6fd0cea5cada0b22e3f3c9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22182
  
    **[Test build #95098 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95098/testReport)** for PR 22182 at commit [`319990f`](https://github.com/apache/spark/commit/319990ff60ad7b6fad6fd0cea5cada0b22e3f3c9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org