Posted to reviews@spark.apache.org by tdas <gi...@git.apache.org> on 2018/02/17 02:10:58 UTC

[GitHub] spark pull request #20631: [SPARK-23454][SS][DOCS] Added trigger information...

GitHub user tdas opened a pull request:

    https://github.com/apache/spark/pull/20631

    [SPARK-23454][SS][DOCS] Added trigger information to the Structured Streaming programming guide

    ## What changes were proposed in this pull request?
    
    - Added clear information about triggers
    - Made the semantic guarantees of watermarks clearer for streaming aggregations and stream-stream joins.
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tdas/spark SPARK-23454

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20631.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20631
    
----
commit 31bf6537817d258918135a9700df484e0d6343e9
Author: Tathagata Das <ta...@...>
Date:   2018-02-17T02:03:52Z

    Improved docs

----


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    **[Test build #87561 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87561/testReport)** for PR 20631 at commit [`4f13e40`](https://github.com/apache/spark/commit/4f13e40152b5c7911d7e6f569754bfa0aaf23836).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    **[Test build #87561 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87561/testReport)** for PR 20631 at commit [`4f13e40`](https://github.com/apache/spark/commit/4f13e40152b5c7911d7e6f569754bfa0aaf23836).


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/941/
    Test PASSed.


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    **[Test build #87542 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87542/testReport)** for PR 20631 at commit [`4086237`](https://github.com/apache/spark/commit/4086237970c93bf6913fa176b60862e10a263904).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #20631: [SPARK-23454][SS][DOCS] Added trigger information...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20631#discussion_r169438010
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -1979,6 +2006,172 @@ which has methods that get called whenever there is a sequence of rows generated
     
     - Whenever `open` is called, `close` will also be called (unless the JVM exits due to some error). This is true even if `open` returns false. If there is any error in processing and writing the data, `close` will be called with the error. It is your responsibility to clean up state (e.g. connections, transactions, etc.) that have been created in `open` such that there are no resource leaks.
     
    +#### Triggers
    +The trigger settings of a streaming query define the timing of streaming data processing: whether
    +the query is going to be executed as a micro-batch query with a fixed batch interval or as a continuous
    +processing query. Here are the different kinds of triggers that are supported.
    +
    +<table class="table">
    +  <tr>
    +    <th>Trigger Type</th>
    +    <th>Description</th>
    +  </tr>
    +  <tr>
    +    <td><i>unspecified (default)</i></td>
    +    <td>
    +        If no trigger setting is explicitly specified, then by default, the query will be
    +        executed in micro-batch mode, where micro-batches will be generated as soon as
    +        the previous micro-batch has completed processing.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><b>Fixed interval micro-batches</b></td>
    +    <td>
    +        The query will be executed with micro-batches mode, where micro-batches will be kicked off
    +        at the user-specified intervals.
    +        <ul>
    +          <li>If the previous micro-batch completes within the interval, then the engine will wait until
    +          the interval is over before kicking off the next micro-batch.</li>
    +
    +          <li>If the previous micro-batch takes longer than the interval to complete (i.e. if an
    +          interval boundary is missed), then the next micro-batch will start as soon as the
    +          previous one completes (i.e., it will not wait for the next interval boundary).</li>
    +
    +          <li>If no new data is available, then no micro-batch will be kicked off.</li>
    +        </ul>
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><b>One-time micro-batch</b></td>
    +    <td>
    +        The query will execute *only one* micro-batch to process all the available data and then
    +        stop on its own. This is useful in scenarios you want to periodically spin up a cluster,
    +        process everything that is available since the last period, and then the shutdown the
    +        cluster. In some case, this may lead to significant cost savings.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><b>Continuous with fixed checkpoint interval</b><br/><i>(experimental)</i></td>
    +    <td>
    +        The query will be executed in the new low-latency, continuous processing mode. Read more
    +        about this in the <a href="#continuous-processing-experimental">Continuous Processing section</a> below.
    +    </td>
    +  </tr>
    +</table>
    +
    +Here are a few code examples.
    +
    +<div class="codetabs">
    +<div data-lang="scala"  markdown="1">
    +
    +{% highlight scala %}
    +import org.apache.spark.sql.streaming.Trigger
    +
    +// Default trigger (runs micro-batch as soon as it can)
    +df.writeStream
    +  .format("console")
    +  .start()
    +
    +// ProcessingTime trigger with two-second micro-batch interval
    +df.writeStream
    +  .format("console")
    +  .trigger(Trigger.ProcessingTime("2 seconds"))
    +  .start()
    +
    +// One-time trigger
    +df.writeStream
    +  .format("console")
    +  .trigger(Trigger.Once())
    +  .start()
    +
    +// Continuous trigger with one-second checkpointing interval
    +df.writeStream
    +  .format("console")
    +  .trigger(Trigger.Continuous())
    +  .start()
    +
    +{% endhighlight %}
    +
    +
    +</div>
    +<div data-lang="java"  markdown="1">
    +
    +{% highlight java %}
    +import org.apache.spark.sql.streaming.Trigger
    +
    +// Default trigger (runs micro-batch as soon as it can)
    +df.writeStream
    +  .format("console")
    +  .start();
    +
    +// ProcessingTime trigger with two-second micro-batch interval
    --- End diff --
    
    ditto
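
As an aside, the `open`/`close` contract quoted as diff context above is easiest to see in code. A minimal Scala sketch of a `ForeachWriter` that honors it (the class name and output path are illustrative assumptions, not code from this PR):

```
import org.apache.spark.sql.ForeachWriter

// Hedged sketch: a ForeachWriter honoring the open/close contract quoted above.
// The output path is an assumption for illustration.
class FileSinkWriter extends ForeachWriter[String] {
  @transient private var out: java.io.PrintWriter = _

  override def open(partitionId: Long, version: Long): Boolean = {
    out = new java.io.PrintWriter(s"/tmp/stream-output-$partitionId-$version.txt")
    true // even if this returned false, close() would still be called
  }

  override def process(value: String): Unit = out.println(value)

  // close() runs even if processing fails; release whatever open() created
  override def close(errorOrNull: Throwable): Unit = {
    if (out != null) out.close()
  }
}

// Usage, assuming `ds` is a streaming Dataset[String]:
// ds.writeStream.foreach(new FileSinkWriter).start()
```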


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    **[Test build #87570 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87570/testReport)** for PR 20631 at commit [`6ad07d8`](https://github.com/apache/spark/commit/6ad07d83459a98a7a914b7ed46cc512e80bc5ca5).


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87570/
    Test PASSed.


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    **[Test build #87542 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87542/testReport)** for PR 20631 at commit [`4086237`](https://github.com/apache/spark/commit/4086237970c93bf6913fa176b60862e10a263904).


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    **[Test build #87515 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87515/testReport)** for PR 20631 at commit [`31bf653`](https://github.com/apache/spark/commit/31bf6537817d258918135a9700df484e0d6343e9).


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    **[Test build #87570 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87570/testReport)** for PR 20631 at commit [`6ad07d8`](https://github.com/apache/spark/commit/6ad07d83459a98a7a914b7ed46cc512e80bc5ca5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #20631: [SPARK-23454][SS][DOCS] Added trigger information...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20631#discussion_r168941119
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -1979,6 +2004,157 @@ which has methods that get called whenever there is a sequence of rows generated
     
     - Whenever `open` is called, `close` will also be called (unless the JVM exits due to some error). This is true even if `open` returns false. If there is any error in processing and writing the data, `close` will be called with the error. It is your responsibility to clean up state (e.g. connections, transactions, etc.) that have been created in `open` such that there are no resource leaks.
     
    +<div data-lang="python"  markdown="1">
    +
    +{% highlight python %}
    +
    +# Default trigger (runs micro-batch as soon as it can)
    +df.writeStream \
    +  .format("console") \
    +  .start()
    +
    +# ProcessingTime trigger with two-second micro-batch interval
    +df.writeStream \
    +  .format("console") \
    +  .trigger(processingTime='2 seconds') \
    +  .start()
    +
    +# One-time trigger
    +df.writeStream \
    +  .format("console") \
    +  .trigger(once=True) \
    +  .start()
    +
    +# Continuous trigger with one-second checkpointing interval
    +df.writeStream
    +  .format("console")
    +  .trigger(continuous='1 second')
    +  .start()
    +
    +{% endhighlight %}
    --- End diff --
    
    R examples:
    
    ```
    # Default trigger (runs micro-batch as soon as it can)
    write.stream(df, "console")
    
    # ProcessingTime trigger with two-second micro-batch interval
    write.stream(df, "console", trigger.processingTime = "2 seconds")
    
    # One-time trigger
    write.stream(df, "console", trigger.once = TRUE)
    
    # Continuous trigger is not yet supported
    ```


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/960/
    Test PASSed.


---



[GitHub] spark pull request #20631: [SPARK-23454][SS][DOCS] Added trigger information...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20631#discussion_r169429748
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -1051,6 +1053,16 @@ from the aggregation column.
     For example, `df.groupBy("time").count().withWatermark("time", "1 min")` is invalid in Append 
     output mode.
     
    +##### Semantic Guarantees of Aggregation with Watermarking
    +{:.no_toc}
    +
    +- A watermark delay (set with `withWatermark`) of "2 hours" guarantees that the engine will never
    +drop any data that is less than 2 hours delayed. In other words, any data less than 2 hours behind
    +(in terms of event-time) the latest data processed till then is guaranteed to be aggregated.
    +
    +- However, the guarantee is strict only in one direction. Data delayed by more than 2 hours is
    +not guaranteed to be dropped; it may or may not get aggregated. More delayed is the data, less
    +likely is the engine going to process it.
    --- End diff --
    
    good catch. let me fix it.
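
For readers following the thread, a minimal Scala sketch of the watermarked aggregation these guarantees describe (the `events` DataFrame and its columns are assumptions for illustration, not code from this PR):

```
import org.apache.spark.sql.functions.{col, window}

// Hedged sketch: event-time windowed counts with a 2-hour watermark.
// Assumes `events` is a streaming DataFrame with columns (time: Timestamp, word: String).
val windowedCounts = events
  .withWatermark("time", "2 hours") // tolerate data up to 2 hours late in event time
  .groupBy(window(col("time"), "10 minutes"), col("word"))
  .count()

// The guarantee is one-directional: rows less than 2 hours behind the latest event
// time seen so far are definitely aggregated; later rows may or may not be dropped.
```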


---



[GitHub] spark pull request #20631: [SPARK-23454][SS][DOCS] Added trigger information...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20631#discussion_r169437351
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -1979,6 +2006,172 @@ which has methods that get called whenever there is a sequence of rows generated
     
     - Whenever `open` is called, `close` will also be called (unless the JVM exits due to some error). This is true even if `open` returns false. If there is any error in processing and writing the data, `close` will be called with the error. It is your responsibility to clean up state (e.g. connections, transactions, etc.) that have been created in `open` such that there are no resource leaks.
     
    +  <tr>
    +    <td><b>One-time micro-batch</b></td>
    +    <td>
    +        The query will execute *only one* micro-batch to process all the available data and then
    +        stop on its own. This is useful in scenarios you want to periodically spin up a cluster,
    +        process everything that is available since the last period, and then the shutdown the
    --- End diff --
    
    nit: then ~~the~~ shutdown


---



[GitHub] spark pull request #20631: [SPARK-23454][SS][DOCS] Added trigger information...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20631#discussion_r169017434
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -1979,6 +2004,157 @@ which has methods that get called whenever there is a sequence of rows generated
     
     - Whenever `open` is called, `close` will also be called (unless the JVM exits due to some error). This is true even if `open` returns false. If there is any error in processing and writing the data, `close` will be called with the error. It is your responsibility to clean up state (e.g. connections, transactions, etc.) that have been created in `open` such that there are no resource leaks.
     
    +# Continuous trigger with one-second checkpointing interval
    +df.writeStream
    +  .format("console")
    +  .trigger(continuous='1 second')
    +  .start()
    +
    +{% endhighlight %}
    --- End diff --
    
    Thank you!!
    Can you add support for continuous trigger in R APIs?


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    @zsxwing can you take a look?


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    **[Test build #87515 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87515/testReport)** for PR 20631 at commit [`31bf653`](https://github.com/apache/spark/commit/31bf6537817d258918135a9700df484e0d6343e9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/976/
    Test PASSed.


---



[GitHub] spark pull request #20631: [SPARK-23454][SS][DOCS] Added trigger information...

Posted by bersprockets <gi...@git.apache.org>.
Github user bersprockets commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20631#discussion_r169146175
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -1051,6 +1053,16 @@ from the aggregation column.
     For example, `df.groupBy("time").count().withWatermark("time", "1 min")` is invalid in Append 
     output mode.
     
    +##### Semantic Guarantees of Aggregation with Watermarking
    +{:.no_toc}
    +
    +- A watermark delay (set with `withWatermark`) of "2 hours" guarantees that the engine will never
    +drop any data that is less than 2 hours delayed. In other words, any data less than 2 hours behind
    +(in terms of event-time) the latest data processed till then is guaranteed to be aggregated.
    +
    +- However, the guarantee is strict only in one direction. Data delayed by more than 2 hours is
    +not guaranteed to be dropped; it may or may not get aggregated. More delayed is the data, less
    +likely is the engine going to process it.
    --- End diff --
    
    > However, the guarantee is strict only in one direction. Data delayed by more than 2 hours is not guaranteed to be dropped
    
    This might contradict an earlier statement, from "Handling Late Data and Watermarking", that says
    
    "In other words, late data within the threshold will be aggregated, but data later than the threshold will be dropped"


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87561/
    Test PASSed.


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87542/
    Test PASSed.


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    LGTM


---



[GitHub] spark pull request #20631: [SPARK-23454][SS][DOCS] Added trigger information...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20631#discussion_r169438035
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -1979,6 +2006,172 @@ which has methods that get called whenever there is a sequence of rows generated
     
     - Whenever `open` is called, `close` will also be called (unless the JVM exits due to some error). This is true even if `open` returns false. If there is any error in processing and writing the data, `close` will be called with the error. It is your responsibility to clean up state (e.g. connections, transactions, etc.) that have been created in `open` such that there are no resource leaks.
     
    +// Continuous trigger with one-second checkpointing interval
    +df.writeStream
    +  .format("console")
    +  .trigger(Trigger.Continuous())
    --- End diff --
    
    ditto


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #20631: [SPARK-23454][SS][DOCS] Added trigger information...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20631


---



[GitHub] spark pull request #20631: [SPARK-23454][SS][DOCS] Added trigger information...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20631#discussion_r169437894
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -1979,6 +2006,172 @@ which has methods that get called whenever there is a sequence of rows generated
     
     - Whenever `open` is called, `close` will also be called (unless the JVM exits due to some error). This is true even if `open` returns false. If there is any error in processing and writing the data, `close` will be called with the error. It is your responsibility to clean up state (e.g. connections, transactions, etc.) that have been created in `open` such that there are no resource leaks.
     
    +// Continuous trigger with one-second checkpointing interval
    +df.writeStream
    +  .format("console")
    +  .trigger(Trigger.Continuous())
    --- End diff --
    
    `Trigger.Continuous()` -> `Trigger.Continuous("1 second")`
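
Applied, the suggestion makes the Scala snippet read as follows (a sketch of the suggested change, assuming `Trigger` is imported as in the quoted example):

```
// Continuous trigger with one-second checkpointing interval
df.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()
```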


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #20631: [SPARK-23454][SS][DOCS] Added trigger information...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20631#discussion_r169437575
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -1979,6 +2006,172 @@ which has methods that get called whenever there is a sequence of rows generated
     
     - Whenever `open` is called, `close` will also be called (unless the JVM exits due to some error). This is true even if `open` returns false. If there is any error in processing and writing the data, `close` will be called with the error. It is your responsibility to clean up state (e.g. connections, transactions, etc.) that have been created in `open` such that there are no resource leaks.
     
    +// ProcessingTime trigger with two-second micro-batch interval
    --- End diff --
    
    nit: two-second`s`


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/970/
    Test PASSed.


---



[GitHub] spark pull request #20631: [SPARK-23454][SS][DOCS] Added trigger information...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20631#discussion_r169438480
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -1979,6 +2006,172 @@ which has methods that get called whenever there is a sequence of rows generated
     
     - Whenever `open` is called, `close` will also be called (unless the JVM exits due to some error). This is true even if `open` returns false. If there is any error in processing and writing the data, `close` will be called with the error. It is your responsibility to clean up state (e.g. connections, transactions, etc.) that have been created in `open` such that there are no resource leaks.
     
    +<div data-lang="python"  markdown="1">
    +
    +{% highlight python %}
    +
    +# Default trigger (runs micro-batch as soon as it can)
    +df.writeStream \
    +  .format("console") \
    +  .start()
    +
    +# ProcessingTime trigger with two-second micro-batch interval
    --- End diff --
    
    ditto


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87515/
    Test PASSed.


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    Yes! That’s the plan
    
    



---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20631: [SPARK-23454][SS][DOCS] Added trigger information to the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20631
  
    Merged build finished. Test PASSed.


---
