You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by jose-torres <gi...@git.apache.org> on 2018/02/20 23:48:03 UTC

[GitHub] spark pull request #20646: [SPARK-23408][SS] Synchronize successive AddDataM...

GitHub user jose-torres opened a pull request:

    https://github.com/apache/spark/pull/20646

    [SPARK-23408][SS] Synchronize successive AddDataMemory actions in StreamTest.

    ## What changes were proposed in this pull request?
    
    The stream-stream join tests add data to multiple sources, and expect it all to show up in the next batch. But there's a race condition; the new batch might trigger when only one of the AddData actions has been reached.
    
    Fortunately, MemoryStream synchronizes batch generation on itself, and StreamExecution won't generate empty batches. So we can resolve this race condition by synchronizing successive AddDataMemory actions against every MemoryStream together. Then we can be sure StreamExecution won't start generating a batch before all the data is present.
    
    ## How was this patch tested?
    existing tests


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jose-torres/spark flaky

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20646.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20646
    
----
commit d540be6bb051a33d2f6bd69a49fbe11afe9f0a65
Author: Jose Torres <jo...@...>
Date:   2018-02-20T23:34:16Z

    just use synchronization

commit d91c55f1a17b03aa2d46682e76c6eb207e71a521
Author: Jose Torres <jo...@...>
Date:   2018-02-20T23:38:35Z

    Merge branch 'master' of https://github.com/apache/spark into flaky

commit dce075f53c8a1418dac99c9b7b7f9b7e79ed17ff
Author: Jose Torres <jo...@...>
Date:   2018-02-20T23:45:40Z

    fix merge

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20646
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20646
  
    **[Test build #87571 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87571/testReport)** for PR 20646 at commit [`1df90e7`](https://github.com/apache/spark/commit/1df90e796e9388d7992bc55f9f87bfd71af2f7f9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20646
  
    **[Test build #87571 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87571/testReport)** for PR 20646 at commit [`1df90e7`](https://github.com/apache/spark/commit/1df90e796e9388d7992bc55f9f87bfd71af2f7f9).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20646
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the issue:

    https://github.com/apache/spark/pull/20646
  
    I opened a new PR to test out an alternate approach. PTAL - https://github.com/apache/spark/pull/20650/files?w=1
    
    (note the `w=1`, that is to ignore whitespaces diffs in the diff view).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...

Posted by jose-torres <gi...@git.apache.org>.
Github user jose-torres commented on the issue:

    https://github.com/apache/spark/pull/20646
  
    (https://issues.apache.org/jira/browse/SPARK-23369 was already filed for previous flake)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...

Posted by jose-torres <gi...@git.apache.org>.
Github user jose-torres commented on the issue:

    https://github.com/apache/spark/pull/20646
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20646
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20646: [SPARK-23408][SS] Synchronize successive AddDataM...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20646#discussion_r169522041
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala ---
    @@ -429,7 +429,25 @@ trait StreamTest extends QueryTest with SharedSQLContext with TimeLimits with Be
         val defaultCheckpointLocation =
           Utils.createTempDir(namePrefix = "streaming.metadata").getCanonicalPath
         try {
    -      startedTest.foreach { action =>
    +      val actionIterator = startedTest.iterator.buffered
    +      while (actionIterator.hasNext) {
    +        // Synchronize sequential addDataMemory actions.
    --- End diff --
    
    // Synchronize --> // Collectively synchronize .... actions so that the data gets added together in a single batch.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...

Posted by jose-torres <gi...@git.apache.org>.
Github user jose-torres commented on the issue:

    https://github.com/apache/spark/pull/20646
  
    > java.lang.RuntimeException: [unresolved dependency: com.sun.jersey#jersey-core;1.14: configuration not found in com.sun.jersey#jersey-core;1.14: 'master(compile)'. Missing configuration: 'compile'. It was required from org.apache.hadoop#hadoop-yarn-common;2.6.5 compile]
    
    Surely unrelated to this change.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...

Posted by jose-torres <gi...@git.apache.org>.
Github user jose-torres commented on the issue:

    https://github.com/apache/spark/pull/20646
  
    @tdas 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20646
  
    **[Test build #87572 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87572/testReport)** for PR 20646 at commit [`1df90e7`](https://github.com/apache/spark/commit/1df90e796e9388d7992bc55f9f87bfd71af2f7f9).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20646
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87571/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20646: [SPARK-23408][SS] Synchronize successive AddDataM...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20646#discussion_r169521344
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala ---
    @@ -429,7 +429,25 @@ trait StreamTest extends QueryTest with SharedSQLContext with TimeLimits with Be
         val defaultCheckpointLocation =
           Utils.createTempDir(namePrefix = "streaming.metadata").getCanonicalPath
         try {
    -      startedTest.foreach { action =>
    +      val actionIterator = startedTest.iterator.buffered
    +      while (actionIterator.hasNext) {
    +        // Synchronize sequential addDataMemory actions.
    +        val addDataMemoryActions = ArrayBuffer[AddDataMemory[_]]()
    +        while (actionIterator.hasNext && actionIterator.head.isInstanceOf[AddDataMemory[_]]) {
    +          addDataMemoryActions.append(actionIterator.next().asInstanceOf[AddDataMemory[_]])
    +        }
    +        if (addDataMemoryActions.nonEmpty) {
    +          val synchronizeAll = addDataMemoryActions
    --- End diff --
    
    This is some magic-ish code. Can you add a bit more comments on how this compose thing works?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20646: [SPARK-23408][SS] Synchronize successive AddDataM...

Posted by jose-torres <gi...@git.apache.org>.
Github user jose-torres closed the pull request at:

    https://github.com/apache/spark/pull/20646


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20646
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87572/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the issue:

    https://github.com/apache/spark/pull/20646
  
    Actually, I am having second thoughts about this. This is fundamentally changing how the tests work, especially for stress tests. The stress tests actually test these corner cases (by randomly adding successive AddData) about what if data was being added while the previously added data is being picked up. With this change, we will accidentally not test those race-condition-prone cases. 
    
    Second, we are taking multiple locks here in multiple sources, and the StreamExecution is likely to take the same locks. I am really afraid that we are introducing deadlocks by doing this.
    
    I am still thinking what the right approach here. I think it should be
    - Explicit synchronized adding of data to multiple sources.
    - Not holding locks in multiple sources. 
    
    
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20646
  
    **[Test build #87572 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87572/testReport)** for PR 20646 at commit [`1df90e7`](https://github.com/apache/spark/commit/1df90e796e9388d7992bc55f9f87bfd71af2f7f9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org