You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spark.apache.org by td...@apache.org on 2018/02/03 01:37:56 UTC

spark git commit: [SPARK-23064][SS][DOCS] Stream-stream joins Documentation - follow up

Repository: spark
Updated Branches:
  refs/heads/master eefec93d1 -> eaf35de24


[SPARK-23064][SS][DOCS] Stream-stream joins Documentation - follow up

## What changes were proposed in this pull request?
Further clarification of caveats in using stream-stream outer joins.

## How was this patch tested?
N/A

Author: Tathagata Das <ta...@gmail.com>

Closes #20494 from tdas/SPARK-23064-2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eaf35de2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eaf35de2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eaf35de2

Branch: refs/heads/master
Commit: eaf35de2471fac4337dd2920026836d52b1ec847
Parents: eefec93
Author: Tathagata Das <ta...@gmail.com>
Authored: Fri Feb 2 17:37:51 2018 -0800
Committer: Tathagata Das <ta...@gmail.com>
Committed: Fri Feb 2 17:37:51 2018 -0800

----------------------------------------------------------------------
 docs/structured-streaming-programming-guide.md | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/eaf35de2/docs/structured-streaming-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index 62589a6..48d6d0b 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -1346,10 +1346,20 @@ joined <- join(
 </div>
 </div>
 
-However, note that the outer NULL results will be generated with a delay (depends on the specified
-watermark delay and the time range condition) because the engine has to wait for that long to ensure
+
+There are a few points to note regarding outer joins.
+
+- *The outer NULL results will be generated with a delay that depends on the specified watermark
+delay and the time range condition.* This is because the engine has to wait for that long to ensure
 there were no matches and there will be no more matches in future.
 
+- In the current implementation in the micro-batch engine, watermarks are advanced at the end of a
+micro-batch, and the next micro-batch uses the updated watermark to clean up state and output
+outer results. Since we trigger a micro-batch only when there is new data to be processed, the
+generation of the outer result may get delayed if there no new data being received in the stream.
+*In short, if any of the two input streams being joined does not receive data for a while, the
+outer (both cases, left or right) output may get delayed.*
+
 ##### Support matrix for joins in streaming queries
 
 <table class ="table">


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org