You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spark.apache.org by ka...@apache.org on 2021/02/18 00:35:15 UTC

[spark] branch master updated: [SPARK-34446][SS][DOCS] Update doc for stream-stream join (full outer + left semi)

This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new a575e80  [SPARK-34446][SS][DOCS] Update doc for stream-stream join (full outer + left semi)
a575e80 is described below

commit a575e805a18a515f8707f74cf2b22777474f2f06
Author: Cheng Su <ch...@fb.com>
AuthorDate: Thu Feb 18 09:34:33 2021 +0900

    [SPARK-34446][SS][DOCS] Update doc for stream-stream join (full outer + left semi)
    
    ### What changes were proposed in this pull request?
    
    Per discussion in https://issues.apache.org/jira/browse/SPARK-32883?focusedCommentId=17285057&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17285057, we should add documentation for added new features of full outer and left semi joins into SS programming guide.
    
    * Reworded the section for "Outer Joins with Watermarking", to make it work for full outer join. Updated the code snippet to show up full outer and left semi join.
    * Added one section for "Semi Joins with Watermarking", similar to "Outer Joins with Watermarking".
    * Updated "Support matrix for joins in streaming queries" to reflect latest fact for full outer and left semi join.
    
    ### Why are the changes needed?
    
    Good for users and developers to follow guide to try out these two new features.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. They will see the corresponding updated guide.
    
    ### How was this patch tested?
    
    No, just documentation change. Previewed the markdown file in browser.
    Also attached here for the change to the "Support matrix for joins in streaming queries" table.
    
    <img width="896" alt="Screen Shot 2021-02-16 at 8 12 07 PM" src="https://user-images.githubusercontent.com/4629931/108155275-73c92e80-7093-11eb-9f0b-c8b4bb7321e5.png">
    
    Closes #31572 from c21/ss-doc.
    
    Authored-by: Cheng Su <ch...@fb.com>
    Signed-off-by: Jungtaek Lim <ka...@gmail.com>
---
 docs/structured-streaming-programming-guide.md | 60 +++++++++++++++++++-------
 1 file changed, 45 insertions(+), 15 deletions(-)

diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index bea38ed..9c6d47b 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -1098,7 +1098,7 @@ likely is the engine going to process it.
 Structured Streaming supports joining a streaming Dataset/DataFrame with a static Dataset/DataFrame
 as well as another streaming Dataset/DataFrame. The result of the streaming join is generated
 incrementally, similar to the results of streaming aggregations in the previous section. In this
-section we will explore what type of joins (i.e. inner, outer, etc.) are supported in the above
+section we will explore what type of joins (i.e. inner, outer, semi, etc.) are supported in the above
 cases. Note that in all the supported join types, the result of the join with a streaming
 Dataset/DataFrame will be the exactly the same as if it was with a static Dataset/DataFrame
 containing the same data in the stream.
@@ -1318,8 +1318,8 @@ A watermark delay of "2 hours" guarantees that the engine will never drop any da
  2 hours delayed. But data delayed by more than 2 hours may or may not get processed.
 
 ##### Outer Joins with Watermarking
-While the watermark + event-time constraints is optional for inner joins, for left and right outer
-joins they must be specified. This is because for generating the NULL results in outer join, the
+While the watermark + event-time constraints is optional for inner joins, for outer joins
+they must be specified. This is because for generating the NULL results in outer join, the
 engine must know when an input row is not going to match with anything in future. Hence, the
 watermark + event-time constraints must be specified for generating correct results. Therefore,
 a query with outer-join will look quite like the ad-monetization example earlier, except that
@@ -1337,7 +1337,7 @@ impressionsWithWatermark.join(
     clickTime >= impressionTime AND
     clickTime <= impressionTime + interval 1 hour
     """),
-  joinType = "leftOuter"      // can be "inner", "leftOuter", "rightOuter"
+  joinType = "leftOuter"      // can be "inner", "leftOuter", "rightOuter", "fullOuter", "leftSemi"
  )
 
 {% endhighlight %}
@@ -1352,7 +1352,7 @@ impressionsWithWatermark.join(
     "clickAdId = impressionAdId AND " +
     "clickTime >= impressionTime AND " +
     "clickTime <= impressionTime + interval 1 hour "),
-  "leftOuter"                 // can be "inner", "leftOuter", "rightOuter"
+  "leftOuter"                 // can be "inner", "leftOuter", "rightOuter", "fullOuter", "leftSemi"
 );
 
 {% endhighlight %}
@@ -1369,7 +1369,7 @@ impressionsWithWatermark.join(
     clickTime >= impressionTime AND
     clickTime <= impressionTime + interval 1 hour
     """),
-  "leftOuter"                 # can be "inner", "leftOuter", "rightOuter"
+  "leftOuter"                 # can be "inner", "leftOuter", "rightOuter", "fullOuter", "leftSemi"
 )
 
 {% endhighlight %}
@@ -1386,7 +1386,7 @@ joined <- join(
       "clickAdId = impressionAdId AND",
       "clickTime >= impressionTime AND",
       "clickTime <= impressionTime + interval 1 hour"),
-  "left_outer"                 # can be "inner", "left_outer", "right_outer"
+  "left_outer"                 # can be "inner", "left_outer", "right_outer", "full_outer", "left_semi"
 ))
 
 {% endhighlight %}
@@ -1415,6 +1415,18 @@ generation of the outer result may get delayed if there no new data being receiv
 *In short, if any of the two input streams being joined does not receive data for a while, the
 outer (both cases, left or right) output may get delayed.*
 
+##### Semi Joins with Watermarking
+A semi join returns values from the left side of the relation that has a match with the right.
+It is also referred to as a left semi join. Similar to outer joins, watermark + event-time
+constraints must be specified for semi join. This is to evict unmatched input rows on left side,
+the engine must know when an input row on left side is not going to match with anything on right
+side in future.
+
+###### Semantic Guarantees of Stream-stream Semi Joins with Watermarking
+{:.no_toc}
+Semi joins have the same guarantees as [inner joins](#semantic-guarantees-of-stream-stream-inner-joins-with-watermarking)
+regarding watermark delays and whether data will be dropped or not.
+
 ##### Support matrix for joins in streaming queries
 
 <table class ="table">
@@ -1434,8 +1446,8 @@ outer (both cases, left or right) output may get delayed.*
       </td>
   </tr>
   <tr>
-    <td rowspan="4" style="vertical-align: middle;">Stream</td>
-    <td rowspan="4" style="vertical-align: middle;">Static</td>
+    <td rowspan="5" style="vertical-align: middle;">Stream</td>
+    <td rowspan="5" style="vertical-align: middle;">Static</td>
     <td style="vertical-align: middle;">Inner</td>
     <td style="vertical-align: middle;">Supported, not stateful</td>
   </tr>
@@ -1452,8 +1464,12 @@ outer (both cases, left or right) output may get delayed.*
     <td style="vertical-align: middle;">Not supported</td>
   </tr>
   <tr>
-    <td rowspan="4" style="vertical-align: middle;">Static</td>
-    <td rowspan="4" style="vertical-align: middle;">Stream</td>
+    <td style="vertical-align: middle;">Left Semi</td>
+    <td style="vertical-align: middle;">Supported, not stateful</td>
+  </tr>
+  <tr>
+    <td rowspan="5" style="vertical-align: middle;">Static</td>
+    <td rowspan="5" style="vertical-align: middle;">Stream</td>
     <td style="vertical-align: middle;">Inner</td>
     <td style="vertical-align: middle;">Supported, not stateful</td>
   </tr>
@@ -1470,8 +1486,12 @@ outer (both cases, left or right) output may get delayed.*
     <td style="vertical-align: middle;">Not supported</td>
   </tr>
   <tr>
-    <td rowspan="4" style="vertical-align: middle;">Stream</td>
-    <td rowspan="4" style="vertical-align: middle;">Stream</td>
+    <td style="vertical-align: middle;">Left Semi</td>
+    <td style="vertical-align: middle;">Not supported</td>
+  </tr>
+  <tr>
+    <td rowspan="5" style="vertical-align: middle;">Stream</td>
+    <td rowspan="5" style="vertical-align: middle;">Stream</td>
     <td style="vertical-align: middle;">Inner</td>
     <td style="vertical-align: middle;">
       Supported, optionally specify watermark on both sides +
@@ -1494,9 +1514,19 @@ outer (both cases, left or right) output may get delayed.*
   </tr>
   <tr>
     <td style="vertical-align: middle;">Full Outer</td>
-    <td style="vertical-align: middle;">Not supported</td>
+    <td style="vertical-align: middle;">
+      Conditionally supported, must specify watermark on one side + time constraints for correct
+      results, optionally specify watermark on the other side for all state cleanup
+    </td>
+  </tr>
+  <tr>
+    <td style="vertical-align: middle;">Left Semi</td>
+    <td style="vertical-align: middle;">
+      Conditionally supported, must specify watermark on right + time constraints for correct
+      results, optionally specify watermark on left for all state cleanup
+    </td>
   </tr>
- <tr>
+  <tr>
     <td></td>
     <td></td>
     <td></td>


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org