Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/15 07:22:48 UTC

[GitHub] [spark] viirya edited a comment on pull request #35521: [SPARK-38212][SS] Remove out-of-watermark states for streaming dedup

viirya edited a comment on pull request #35521:
URL: https://github.com/apache/spark/pull/35521#issuecomment-1039933470


   > Suppose we have perfect watermark with no delay allowance, and there is no event being out of order, then streaming dedup will do nothing on deduplication because effectively it will register the row in the state and evict it immediately. This will happen if you use watermark but use it like the semantic of "processing time".
   
   How could that happen? IIUC, the watermark predicate should be `watermark column <= current watermark` (where the current watermark is the max event time seen in the last batch?). When there are no out-of-order events, isn't an input row's watermark column always > the current watermark (i.e. the watermark predicate is false)? Why would the row be evicted immediately? Wouldn't it be evicted in the next batch?
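
To make the question concrete, here is a toy model of the eviction timing I have in mind. It is only a sketch under my own assumptions, not Spark's actual implementation: `DedupSimulator`, `run_batch`, and the integer event times are all hypothetical, and I assume the watermark used by a batch is fixed from the max event time seen in earlier batches (minus the delay) and that eviction runs after the batch's input is processed.

```python
from dataclasses import dataclass, field


@dataclass
class DedupSimulator:
    """Toy model of watermark-based streaming dedup (NOT Spark's real code)."""
    delay: int = 0            # watermark delay allowance (assumed unit: event-time ticks)
    watermark: int = -1       # watermark in effect for the current batch
    max_event_time: int = -1  # max event time observed so far
    state: dict = field(default_factory=dict)  # dedup key -> event time

    def run_batch(self, rows):
        """Process one micro-batch of (key, event_time) rows.

        Returns (emitted_rows, evicted_keys).
        """
        emitted = []
        for key, event_time in rows:
            if event_time <= self.watermark:
                continue  # late row: dropped, never registered
            if key not in self.state:
                self.state[key] = event_time
                emitted.append((key, event_time))
            self.max_event_time = max(self.max_event_time, event_time)

        # Eviction predicate: event time <= watermark of THIS batch,
        # which (by assumption) was derived from earlier batches only.
        evicted = sorted(k for k, t in self.state.items() if t <= self.watermark)
        for k in evicted:
            del self.state[k]

        # Only now advance the watermark; it takes effect in the next batch.
        self.watermark = max(self.watermark, self.max_event_time - self.delay)
        return emitted, evicted
```

Under these assumptions, with `delay=0` and in-order input, a row registered in one batch is still in state when that batch finishes and is evicted only when the next batch runs with the advanced watermark.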
   
   TTL may be a solution here. However, watermarks seem to be more commonly used in Structured Streaming operators. Do we have any stateful operators with TTL? Or do we need to introduce a state TTL mechanism for this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


