You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by c-horn <gi...@git.apache.org> on 2018/06/29 23:05:16 UTC

[GitHub] spark pull request #21676: [SPARK-24699][SQL][WIP] Watermark / Append mode s...

GitHub user c-horn opened a pull request:

    https://github.com/apache/spark/pull/21676

    [SPARK-24699][SQL][WIP] Watermark / Append mode should work with Trigger.Once

    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-24699
    
    Structured streaming using `Trigger.Once` does not persist watermark state between batches, causing streams to never yield output. I will attach some scripts that reproduce the issue in the Jira issue.
    
    It seems like the microbatcher only calculates the watermark off of the previous batch's input and emits new aggs based off of that timestamp. I believe the issue here is that the previous batch state is not persisted to the checkpoint, and therefore cannot be used when the stream is started again with `Trigger.Once`.
    
    I will investigate ways of fixing this but I am definitely interested in input from anyone who worked on SS.
    
    ## How was this patch tested?
    
    Failing unit test provided.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/c-horn/spark SPARK-24699

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21676.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21676
    
----
commit 1b42cc4a449248da65402a6ea2112c55a3bb8501
Author: Chris Horn <ch...@...>
Date:   2018-06-29T22:54:45Z

    a failing test case

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21676: [SPARK-24699][SQL][WIP] Watermark / Append mode should w...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21676
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21676: [SPARK-24699][SS][WIP] Watermark / Append mode should wo...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the issue:

    https://github.com/apache/spark/pull/21676
  
    hey @c-horn , I am ready to merge your PR, and to add you as coauthor i think i need to know your email address i the github account. Can you provide me that?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21676: [SPARK-24699][SS][WIP] Watermark / Append mode should wo...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the issue:

    https://github.com/apache/spark/pull/21676
  
    ping ^^^


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21676: [SPARK-24699][SQL][WIP] Watermark / Append mode should w...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21676
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21676: [SPARK-24699][SS][WIP] Watermark / Append mode should wo...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the issue:

    https://github.com/apache/spark/pull/21676
  
    Here is my solution based on my suggestion - https://github.com/apache/spark/pull/21746
    I stole your unit test from this PR :) Thank you! I will add you as a co-author in that PR.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21676: [SPARK-24699][SS][WIP] Watermark / Append mode should wo...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the issue:

    https://github.com/apache/spark/pull/21676
  
    I think the right solution is to record the updateat watermark in the commit log, so that the updated watermark can be read back from the commit log next time the stream is started.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21676: [SPARK-24699][SS][WIP] Watermark / Append mode should wo...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the issue:

    https://github.com/apache/spark/pull/21676
  
    The offset log contains the watermark value that is going to be used in the batch corresponding to that offset. For example, "checkpoint/offsets/10" will contain the watermark value to be used for batch 10. The problem is that when batch 10 completes and new watermark values is computed, it is not saved in a persistent location until batch 11 is planned and "offsets/11" is written out. In trigger.once, this never happens as the query is terminated as soon as batch 10 completes. So the new watermark value is not saved. If the query running in trigger.once mode right from the beginning, that is batch 0, then no new watermark value is ever written, and so the watermark shows  up always as 0.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21676: [SPARK-24699][SS][WIP] Watermark / Append mode should wo...

Posted by c-horn <gi...@git.apache.org>.
Github user c-horn commented on the issue:

    https://github.com/apache/spark/pull/21676
  
    @tdas I merged your changes into my branch, test passed, thank you 👍 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21676: [SPARK-24699][SS][WIP] Watermark / Append mode should wo...

Posted by c-horn <gi...@git.apache.org>.
Github user c-horn commented on the issue:

    https://github.com/apache/spark/pull/21676
  
    I was under the assumption that the offset log contained this data?
    
    https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala#L32
    https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala#L81
    https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L260
    
    It does not seem to pull the correct data, however.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21676: [SPARK-24699][SS][WIP] Watermark / Append mode should wo...

Posted by c-horn <gi...@git.apache.org>.
Github user c-horn commented on the issue:

    https://github.com/apache/spark/pull/21676
  
    @tdas @marmbrus


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21676: [SPARK-24699][SS][WIP] Watermark / Append mode should wo...

Posted by c-horn <gi...@git.apache.org>.
Github user c-horn commented on the issue:

    https://github.com/apache/spark/pull/21676
  
    Hi @tdas sorry for delay.
    My email for github account: chorn4033@gmail.com
    
    This looks fine to me, we can close this PR (and jira ticket) when yours is merged.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21676: [SPARK-24699][SS][WIP] Watermark / Append mode sh...

Posted by c-horn <gi...@git.apache.org>.
Github user c-horn closed the pull request at:

    https://github.com/apache/spark/pull/21676


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21676: [SPARK-24699][SQL][WIP] Watermark / Append mode should w...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21676
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21676: [SPARK-24699][SS][WIP] Watermark / Append mode should wo...

Posted by c-horn <gi...@git.apache.org>.
Github user c-horn commented on the issue:

    https://github.com/apache/spark/pull/21676
  
    already resolved by https://github.com/apache/spark/pull/21746


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21676: [SPARK-24699][SS][WIP] Watermark / Append mode should wo...

Posted by c-horn <gi...@git.apache.org>.
Github user c-horn commented on the issue:

    https://github.com/apache/spark/pull/21676
  
    Changing `OneTimeExecutor` like this resolves this issue:
    ```
    case class OneTimeExecutor() extends TriggerExecutor {
    
      /**
       * Execute a single batch using `batchRunner`.
       */
    -  override def execute(batchRunner: () => Boolean): Unit = batchRunner()
    +  override def execute(batchRunner: () => Boolean): Unit = batchRunner() && batchRunner()
    }
    ```
    ... but the type becomes semantically incorrect.
    
    Is this an acceptable solution? it appears that a lot of the `MicroBatchExecution` code makes assumptions about state from the previous batch, which may or may not be realized in the first iteration of a stream restart.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org