Posted to reviews@spark.apache.org by tdas <gi...@git.apache.org> on 2017/01/04 02:04:19 UTC

[GitHub] spark pull request #16468: [SPARK-19074][SS][DOCS] Updated Structured Stream...

GitHub user tdas opened a pull request:

    https://github.com/apache/spark/pull/16468

    [SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode

    ## What changes were proposed in this pull request?
    
    Updates
    - Updated Late Data Handling section by adding a figure for Update Mode. It's more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
    - Updated Output Modes section with Update Mode
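
To make the difference between output modes concrete, here is a minimal plain-Scala sketch. It does not use the Spark API. It models a running word count as the Result Table and shows what Complete mode and Update mode would hand to a sink after each trigger. The object name `OutputModeSketch`, its helper functions, and the sample batches are hypothetical and chosen only for illustration; Append mode and watermarks are not modeled.

```scala
object OutputModeSketch {
  // Hypothetical stand-in for the Result Table of a streaming word count.
  type Counts = Map[String, Int]

  // Fold one micro-batch of words into the running counts.
  def applyBatch(state: Counts, batch: Seq[String]): Counts =
    batch.foldLeft(state) { (acc, word) =>
      acc.updated(word, acc.getOrElse(word, 0) + 1)
    }

  // Complete mode: the whole Result Table is emitted after every trigger.
  def completeOutput(after: Counts): Counts = after

  // Update mode: only rows that are new or changed since the previous trigger.
  def updateOutput(before: Counts, after: Counts): Counts =
    after.filter { case (word, count) => !before.get(word).contains(count) }

  def main(args: Array[String]): Unit = {
    // Assumed sample input: two triggers, two words each.
    val batches = Seq(Seq("cat", "dog"), Seq("dog", "owl"))
    var state: Counts = Map.empty
    for (batch <- batches) {
      val next = applyBatch(state, batch)
      println(s"complete: ${completeOutput(next)}")
      println(s"update:   ${updateOutput(state, next)}")
      state = next
    }
  }
}
```

With these two sample batches, the second trigger emits the full table in Complete mode but only the rows for `dog` and `owl` in Update mode, because the count for `cat` did not change.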
    
    
    ## How was this patch tested?
    N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tdas/spark SPARK-19074

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16468.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16468
    
----
commit cebdc3bcd2c36d78412df80a68acde0b9b1bc9be
Author: Tathagata Das <ta...@gmail.com>
Date:   2017-01-04T01:59:35Z

    Updated text and figures for update mode

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16468: [SPARK-19074][SS][DOCS] Updated Structured Stream...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16468#discussion_r94877350
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -954,49 +1014,93 @@ There are a few types of built-in output sinks.
     
     - **File sink** - Stores the output to a directory. 
     
    +{% highlight scala %}
    +writeStream
    +    .format("parquet")        // can be "orc", "json", "csv", etc.
    +    .option("path", "path/to/destination/dir")
    +    .start()
    +{% endhighlight %}
    +
     - **Foreach sink** - Runs arbitrary computation on the records in the output. See later in the section for more details.
     
    +{% highlight scala %}
    +writeStream
    +    .foreach(...)
    +    .start()
    +{% endhighlight %}
    +
     - **Console sink (for debugging)** - Prints the output to the console/stdout every time there is a trigger. Both, Append and Complete output modes, are supported. This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver's memory after every trigger.
     
    -- **Memory sink (for debugging)** - The output is stored in memory as an in-memory table.  Both, Append and Complete output modes, are supported. This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver's memory after every trigger.
    +{% highlight scala %}
    +writeStream
    +    .format("console")
    +    .start()
    +{% endhighlight %}
    +
    +- **Memory sink (for debugging)** - The output is stored in memory as an in-memory table.
    +Both, Append and Complete output modes, are supported. This should be used for debugging purposes
    +on low data volumes as the entire output is collected and stored in the driver's memory after
    --- End diff --
    
    I kind of agree it's repetitive, but I don't want people to miss this.




[GitHub] spark pull request #16468: [SPARK-19074][SS][DOCS] Updated Structured Stream...

Posted by thomaso-mirodin <gi...@git.apache.org>.
Github user thomaso-mirodin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16468#discussion_r94866762
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -433,6 +433,51 @@ In Spark 2.0, there are a few built-in sources.
     
       - **Socket source (for testing)** - Reads UTF8 text data from a socket connection. The listening server socket is at the driver. Note that this should be used only for testing as this does not provide end-to-end fault-tolerance guarantees. 
     
    +Here are all the source details.
    +
    +<table class="table">
    +  <tr>
    +    <th>Source</th>
    +    <th>Options</th>
    +    <th>Fault-tolerant</th>
    --- End diff --
    
    Can we link back to `#fault-tolerance-semantics` here?




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by spoddutur <gi...@git.apache.org>.
Github user spoddutur commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    Hi TD, 
    
    As part of the 2.1.0 release, Kafka was added as a source
    (SPARK-17346: Kafka 0.10 support in Structured Streaming).
    I am wondering whether Kinesis support will be added in the future. If yes, when can we expect it?
    
    The reason for asking is that we currently use Kinesis with Spark Streaming on Spark 1.6 and are planning to upgrade to Spark 2 Structured Streaming, so we are eager to know when Kinesis support can be expected in Structured Streaming.
    
    Thanks in Advance,
    Sruthi




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    @david-weiluo-ren Yeah, the wording can be better. Maybe "all of the operations ... are not yet supported".




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    **[Test build #70952 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70952/testReport)** for PR 16468 at commit [`d29ee29`](https://github.com/apache/spark/commit/d29ee29be8ad5974f5d3eec866887669411b0248).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70952/
    Test PASSed.




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by david-weiluo-ren <gi...@git.apache.org>.
Github user david-weiluo-ren commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    @tdas 
    It says “However, note that all of the operations applicable on static DataFrames/Datasets are not supported in streaming DataFrames/Datasets yet” in https://spark.apache.org/docs/2.1.0/structured-streaming-programming-guide.html#unsupported-operations
    
    I think it should be “not all of the operations … are supported in … yet” instead of “all of the operations … are not supported in … yet”.




[GitHub] spark pull request #16468: [SPARK-19074][SS][DOCS] Updated Structured Stream...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16468#discussion_r94876743
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -433,6 +433,51 @@ In Spark 2.0, there are a few built-in sources.
     
       - **Socket source (for testing)** - Reads UTF8 text data from a socket connection. The listening server socket is at the driver. Note that this should be used only for testing as this does not provide end-to-end fault-tolerance guarantees. 
     
    +Here are all the source details.
    +
    +<table class="table">
    +  <tr>
    +    <th>Source</th>
    +    <th>Options</th>
    +    <th>Fault-tolerant</th>
    --- End diff --
    
    I don't want to make the table heading a link, but I will do something.




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70852/
    Test PASSed.




[GitHub] spark pull request #16468: [SPARK-19074][SS][DOCS] Updated Structured Stream...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16468#discussion_r94890250
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -954,49 +1014,93 @@ There are a few types of built-in output sinks.
     
     - **File sink** - Stores the output to a directory. 
     
    +{% highlight scala %}
    +writeStream
    +    .format("parquet")        // can be "orc", "json", "csv", etc.
    +    .option("path", "path/to/destination/dir")
    +    .start()
    +{% endhighlight %}
    +
     - **Foreach sink** - Runs arbitrary computation on the records in the output. See later in the section for more details.
     
    +{% highlight scala %}
    +writeStream
    +    .foreach(...)
    +    .start()
    +{% endhighlight %}
    +
     - **Console sink (for debugging)** - Prints the output to the console/stdout every time there is a trigger. Both, Append and Complete output modes, are supported. This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver's memory after every trigger.
     
    -- **Memory sink (for debugging)** - The output is stored in memory as an in-memory table.  Both, Append and Complete output modes, are supported. This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver's memory after every trigger.
    +{% highlight scala %}
    +writeStream
    +    .format("console")
    +    .start()
    +{% endhighlight %}
    +
    +- **Memory sink (for debugging)** - The output is stored in memory as an in-memory table.
    +Both, Append and Complete output modes, are supported. This should be used for debugging purposes
    +on low data volumes as the entire output is collected and stored in the driver's memory after
    --- End diff --
    
    Aah, sorry, I misunderstood. I thought the note above the table and the Notes column in the table were the repetition. But that's not the case. My bad.




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    Thank you very much @thomaso-mirodin @david-weiluo-ren @zsxwing 
    I have addressed your comments.




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70895/
    Test PASSed.




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    **[Test build #70853 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70853/testReport)** for PR 16468 at commit [`fbacbf4`](https://github.com/apache/spark/commit/fbacbf4f26afc5bd67a014b2134a5c97cb33cfda).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    **[Test build #70895 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70895/testReport)** for PR 16468 at commit [`8f01f56`](https://github.com/apache/spark/commit/8f01f563fad307ea1594a81c3c17e1871927bd52).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #16468: [SPARK-19074][SS][DOCS] Updated Structured Stream...

Posted by thomaso-mirodin <gi...@git.apache.org>.
Github user thomaso-mirodin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16468#discussion_r94866235
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -954,49 +1014,93 @@ There are a few types of built-in output sinks.
     
     - **File sink** - Stores the output to a directory. 
     
    +{% highlight scala %}
    +writeStream
    +    .format("parquet")        // can be "orc", "json", "csv", etc.
    +    .option("path", "path/to/destination/dir")
    +    .start()
    +{% endhighlight %}
    +
     - **Foreach sink** - Runs arbitrary computation on the records in the output. See later in the section for more details.
     
    +{% highlight scala %}
    +writeStream
    +    .foreach(...)
    +    .start()
    +{% endhighlight %}
    +
     - **Console sink (for debugging)** - Prints the output to the console/stdout every time there is a trigger. Both, Append and Complete output modes, are supported. This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver's memory after every trigger.
     
    -- **Memory sink (for debugging)** - The output is stored in memory as an in-memory table.  Both, Append and Complete output modes, are supported. This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver's memory after every trigger.
    +{% highlight scala %}
    +writeStream
    +    .format("console")
    +    .start()
    +{% endhighlight %}
    +
    +- **Memory sink (for debugging)** - The output is stored in memory as an in-memory table.
    +Both, Append and Complete output modes, are supported. This should be used for debugging purposes
    +on low data volumes as the entire output is collected and stored in the driver's memory after
    --- End diff --
    
    This is slightly repetitive: "[...] the entire output is collected and stored in the driver's memory [...]" is said again in the next sentence, "Note that the current implementations saves all the data in the driver memory".
    
    If we want to say this twice to make sure people read it, maybe we can move the "note" reminder into the `Notes` column in the table a few lines down? :D




[GitHub] spark pull request #16468: [SPARK-19074][SS][DOCS] Updated Structured Stream...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16468#discussion_r94671342
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -910,35 +923,37 @@ Here is the compatibility matrix.
       </tr>
       <tr>
         <td colspan="2" valign="middle"><br/>Queries without aggregation</td>
    -    <td>Append</td>
    +    <td>Append, Update</td>
         <td>
             Complete mode note supported as it is infeasible to keep all data in the Result Table.
    --- End diff --
    
    nit: Complete mode **note** supported




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    **[Test build #70852 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70852/testReport)** for PR 16468 at commit [`3285a2d`](https://github.com/apache/spark/commit/3285a2db9bf037eff1a9b7a18e961da0ad529f29).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    **[Test build #70895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70895/testReport)** for PR 16468 at commit [`8f01f56`](https://github.com/apache/spark/commit/8f01f563fad307ea1594a81c3c17e1871927bd52).




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    **[Test build #70852 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70852/testReport)** for PR 16468 at commit [`3285a2d`](https://github.com/apache/spark/commit/3285a2db9bf037eff1a9b7a18e961da0ad529f29).




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #16468: [SPARK-19074][SS][DOCS] Updated Structured Stream...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16468#discussion_r94698982
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -433,6 +433,51 @@ In Spark 2.0, there are a few built-in sources.
     
       - **Socket source (for testing)** - Reads UTF8 text data from a socket connection. The listening server socket is at the driver. Note that this should be used only for testing as this does not provide end-to-end fault-tolerance guarantees. 
     
    +Here are all the source details.
    +
    +<table class="table">
    +  <tr>
    +    <th>Source</th>
    +    <th>Options</th>
    +    <th>Fault-tolerant</th>
    +    <th>Notes</th>
    +  </tr>
    +  <tr>
    +    <td><b>File source</b></td>
    +    <td>
    +        <code>path</code>: path to the input directory, and common to all file formats.
    +        <br/><br/>
    +        For file-format-specific options, see the related methods in <code>DataStreamReader</code>
    +        (<a href="api/scala/index.html#org.apache.spark.sql.streaming.DataStreamReader">Scala</a>/<a href="api/java/org/apache/spark/sql/streaming/DataStreamReader.html">Java</a>/<a href="api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamReader">Python</a>).
    +        E.g. for "parquet" format options see <code>DataStreamReader.parquet()</code></td>
    +    <td>Yes</td>
    +    <td>Supports regular expressions, but does not support multiple comma-separated paths/expressions.</td>
    --- End diff --
    
    nit: `regular expressions` -> `glob paths`
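
The distinction matters because the same pattern string means different things under the two syntaxes. The sketch below is plain Scala using the JDK's own glob matcher, not Spark or Hadoop code. Spark's file source resolves globs through Hadoop, whose glob syntax is similar to the JDK's but not guaranteed to be identical, so treat this only as an illustration. The object name `GlobVsRegex`, the helper names, and the sample path are hypothetical.

```scala
import java.nio.file.{FileSystems, Paths}

object GlobVsRegex {
  private val fs = FileSystems.getDefault

  // Interpret `pattern` as a glob, using the JDK matcher (an assumption:
  // Hadoop's glob rules are close to, but not the same as, these).
  def globMatches(pattern: String, path: String): Boolean =
    fs.getPathMatcher("glob:" + pattern).matches(Paths.get(path))

  // Interpret `pattern` as a regular expression over the whole path string.
  def regexMatches(pattern: String, path: String): Boolean =
    path.matches(pattern)

  def main(args: Array[String]): Unit = {
    val path = "logs/2017-01-04/part-0.json" // hypothetical input file

    // As a glob, `*` stands for any run of characters within one path segment.
    println(globMatches("logs/*/part-*.json", path))

    // As a regex, `/*` means "zero or more slashes", so the same string fails.
    println(regexMatches("logs/*/part-*.json", path))

    // An equivalent regex has to be written quite differently.
    println(regexMatches("logs/[^/]+/part-.*\\.json", path))
  }
}
```

A reader who takes "regular expressions" literally would write the third pattern, which a glob matcher would not interpret as intended; this is why "glob paths" is the more accurate term.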




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    **[Test build #70952 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70952/testReport)** for PR 16468 at commit [`d29ee29`](https://github.com/apache/spark/commit/d29ee29be8ad5974f5d3eec866887669411b0248).




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    **[Test build #70853 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70853/testReport)** for PR 16468 at commit [`fbacbf4`](https://github.com/apache/spark/commit/fbacbf4f26afc5bd67a014b2134a5c97cb33cfda).




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    Good catch about non-aggregation queries. We should support update mode for them, since for such queries it is the same as append mode. I will fix that in a follow-up PR.
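
For reference, a query without aggregation written in append mode, the mode the compatibility matrix listed for such queries at the time of this review, might look like the sketch below. The socket host and port, the object name, and the `local[2]` master are assumptions chosen for illustration; whether `outputMode("update")` is also accepted for this query depends on the Spark version, as the comment above notes.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example: host, port and names are illustrative only.
object AppendModeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("AppendModeSketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // The socket source produces a single string column named "value".
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // No aggregation: each input row yields at most one output row,
    // so rows that were already emitted never change afterwards.
    val nonEmpty = lines.as[String].filter(_.nonEmpty)

    val query = nonEmpty.writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

Because no previously emitted row is ever revised here, "only new rows" and "only changed rows" describe the same output, which is the sense in which update mode coincides with append mode for non-aggregation queries.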




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    LGTM




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70853/




[GitHub] spark pull request #16468: [SPARK-19074][SS][DOCS] Updated Structured Stream...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16468#discussion_r94672967
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -910,35 +923,37 @@ Here is the compatibility matrix.
       </tr>
       <tr>
         <td colspan="2" valign="middle"><br/>Queries without aggregation</td>
    -    <td>Append</td>
    +    <td>Append, Update</td>
    --- End diff --
    
    nit: it doesn't support `update`. See https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala#L76




[GitHub] spark pull request #16468: [SPARK-19074][SS][DOCS] Updated Structured Stream...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/16468




[GitHub] spark pull request #16468: [SPARK-19074][SS][DOCS] Updated Structured Stream...

Posted by thomaso-mirodin <gi...@git.apache.org>.
Github user thomaso-mirodin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16468#discussion_r94866947
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -954,49 +1014,93 @@ There are a few types of built-in output sinks.
     
     - **File sink** - Stores the output to a directory. 
     
    +{% highlight scala %}
    +writeStream
    +    .format("parquet")        // can be "orc", "json", "csv", etc.
    +    .option("path", "path/to/destination/dir")
    +    .start()
    +{% endhighlight %}
    +
     - **Foreach sink** - Runs arbitrary computation on the records in the output. See later in the section for more details.
     
    +{% highlight scala %}
    +writeStream
    +    .foreach(...)
    +    .start()
    +{% endhighlight %}
    +
     - **Console sink (for debugging)** - Prints the output to the console/stdout every time there is a trigger. Both, Append and Complete output modes, are supported. This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver's memory after every trigger.
     
    -- **Memory sink (for debugging)** - The output is stored in memory as an in-memory table.  Both, Append and Complete output modes, are supported. This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver's memory after every trigger.
    +{% highlight scala %}
    +writeStream
    +    .format("console")
    +    .start()
    +{% endhighlight %}
    +
    +- **Memory sink (for debugging)** - The output is stored in memory as an in-memory table.
    +Both, Append and Complete output modes, are supported. This should be used for debugging purposes
    +on low data volumes as the entire output is collected and stored in the driver's memory after
    +every trigger. Note that the current implementations saves all the data in the driver memory.
    +Hence, use it with caution.
    +
    +{% highlight scala %}
    +writeStream
    +    .format("memory")
    +    .queryName("tableName")
    +    .start()
    +{% endhighlight %}
     
    -Here is a table of all the sinks, and the corresponding settings.
    +
    +Here are all the sinks details.
     
     <table class="table">
       <tr>
         <th>Sink</th>
         <th>Supported Output Modes</th>
    -    <th style="width:30%">Usage</th>
    +    <th>Options</th>
         <th>Fault-tolerant</th>
    --- End diff --
    
    Ditto




[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16468
  
    Merged build finished. Test PASSed.

