Posted to commits@spark.apache.org by td...@apache.org on 2018/02/03 01:37:56 UTC
spark git commit: [SPARK-23064][SS][DOCS] Stream-stream joins Documentation - follow up
Repository: spark
Updated Branches:
refs/heads/master eefec93d1 -> eaf35de24
[SPARK-23064][SS][DOCS] Stream-stream joins Documentation - follow up
## What changes were proposed in this pull request?
Further clarification of caveats in using stream-stream outer joins.
## How was this patch tested?
N/A
Author: Tathagata Das <ta...@gmail.com>
Closes #20494 from tdas/SPARK-23064-2.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eaf35de2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eaf35de2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eaf35de2
Branch: refs/heads/master
Commit: eaf35de2471fac4337dd2920026836d52b1ec847
Parents: eefec93
Author: Tathagata Das <ta...@gmail.com>
Authored: Fri Feb 2 17:37:51 2018 -0800
Committer: Tathagata Das <ta...@gmail.com>
Committed: Fri Feb 2 17:37:51 2018 -0800
----------------------------------------------------------------------
docs/structured-streaming-programming-guide.md | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/spark/blob/eaf35de2/docs/structured-streaming-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index 62589a6..48d6d0b 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -1346,10 +1346,20 @@ joined <- join(
</div>
</div>
-However, note that the outer NULL results will be generated with a delay (depends on the specified
-watermark delay and the time range condition) because the engine has to wait for that long to ensure
+
+There are a few points to note regarding outer joins.
+
+- *The outer NULL results will be generated with a delay that depends on the specified watermark
+delay and the time range condition.* This is because the engine has to wait for that long to ensure
there were no matches and there will be no more matches in future.
+- In the current implementation in the micro-batch engine, watermarks are advanced at the end of a
+micro-batch, and the next micro-batch uses the updated watermark to clean up state and output
+outer results. Since we trigger a micro-batch only when there is new data to be processed, the
+generation of the outer result may get delayed if there is no new data being received in the stream.
+*In short, if any of the two input streams being joined does not receive data for a while, the
+outer (both cases, left or right) output may get delayed.*
+
##### Support matrix for joins in streaming queries
<table class ="table">
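
Illustrative note (not part of the commit): the second caveat added in this diff can be reproduced with a small toy model. The sketch below is plain Python, not Spark code. The class `ToyLeftOuterJoin`, its fields, and the single shared watermark are all hypothetical simplifications made for illustration; it only assumes the behavior stated in the diff, namely that the watermark is advanced at the end of a micro-batch and that the next micro-batch uses it to clean up state and emit outer NULL results.

```python
class ToyLeftOuterJoin:
    """Toy model of a left outer stream-stream join (hypothetical, not Spark's code)."""

    def __init__(self, watermark_delay, time_range):
        self.watermark_delay = watermark_delay  # allowed lateness, in seconds
        self.time_range = time_range            # right row matches if left_t <= right_t <= left_t + time_range
        self.watermark = 0                      # only updated at the END of a batch
        self.max_event_time = 0
        self.left_state = {}                    # key -> [event_time, matched]
        self.right_state = {}                   # key -> event_time (never cleaned up in this toy)

    def run_batch(self, left_rows, right_rows):
        out = []
        # Step 1: use the watermark computed at the end of the PREVIOUS batch
        # to drop expired left rows and emit NULL results for unmatched ones.
        for key, (t, matched) in list(self.left_state.items()):
            if t + self.time_range < self.watermark:
                if not matched:
                    out.append((key, None))
                del self.left_state[key]
        # Step 2: buffer the new rows and emit any matches.
        for key, t in left_rows:
            self.left_state[key] = [t, False]
        for key, t in right_rows:
            self.right_state[key] = t
        for key, entry in self.left_state.items():
            rt = self.right_state.get(key)
            if rt is not None and not entry[1] and entry[0] <= rt <= entry[0] + self.time_range:
                entry[1] = True
                out.append((key, rt))
        # Step 3: advance the watermark at the end of the batch.
        for _, t in left_rows + right_rows:
            self.max_event_time = max(self.max_event_time, t)
        self.watermark = max(self.watermark, self.max_event_time - self.watermark_delay)
        return out


# A batch runs only when new data arrives, so each call below stands for new input.
join = ToyLeftOuterJoin(watermark_delay=10, time_range=5)
batch1 = join.run_batch([("a", 100)], [])  # watermark becomes 90 afterwards
batch2 = join.run_batch([("b", 120)], [])  # watermark becomes 110 afterwards
batch3 = join.run_batch([("c", 121)], [])  # only now is ("a", None) emitted

# A row that finds a match within the time range is emitted immediately.
matched = ToyLeftOuterJoin(watermark_delay=10, time_range=5).run_batch([("x", 100)], [("x", 102)])
```

In this toy run, row `a` can no longer match once the watermark passes 105, which happens at the end of the second batch. Its NULL result is still not produced until a third batch is triggered by new data; if no further data arrived, it would stay buffered, which is the delay the added paragraph describes.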