Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/15 06:59:45 UTC

[GitHub] [spark] HeartSaVioR commented on pull request #35521: [SPARK-38212][SS] Remove out-of-watermark states for streaming dedup

HeartSaVioR commented on pull request #35521:
URL: https://github.com/apache/spark/pull/35521#issuecomment-1039922686


   I agree with the problem description, but I'd like to see a more thoughtful solution.
   
   Suppose we have a perfect watermark with no delay allowance, and no events arrive out of order. Streaming dedup would then deduplicate nothing, because it would effectively register each row in the state and evict it immediately.
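
   To make the scenario concrete, here is a toy model in plain Python. It is not Spark code: the function name, the per-record watermark update, and the eviction rule (drop a state row once its event time is at or below the watermark) are all simplifying assumptions for illustration. Real Structured Streaming advances the watermark per micro-batch.

```python
def dedup_with_watermark_eviction(events, delay):
    """Toy model of dedup whose state is evicted by the watermark.

    events: in-order list of (key, event_time) tuples.
    delay:  watermark delay allowance, in the same unit as event_time.
    Hypothetical helper for illustration only, not a Spark API.
    """
    state = {}                      # key -> event time of the first occurrence
    max_event_time = float("-inf")
    emitted = []
    for key, ts in events:
        if key not in state:
            emitted.append((key, ts))   # first time we see this key in state
            state[key] = ts
        # Assumption: the watermark advances after every record.
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - delay
        # Evict every state row that is no longer ahead of the watermark.
        state = {k: t for k, t in state.items() if t > watermark}
    return emitted


events = [("a", 1), ("a", 2), ("b", 3), ("a", 4)]
no_delay = dedup_with_watermark_eviction(events, delay=0)
with_delay = dedup_with_watermark_eviction(events, delay=10)
```

   With `delay=0` each row is evicted right after it is registered, so every duplicate of `"a"` is emitted again. With `delay=10` the rows stay in state and the duplicates are suppressed.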
   
   Personally, for this case, applying a TTL to each state row would make more sense to me. If we don't want to require a watermark for this functionality, the TTL will have to be based on wall-clock/processing time, which may produce non-deterministic results; still, with a large TTL such as 2 hours, tolerating that behavior would be acceptable. If we do require a watermark so that the TTL works against the event-time column, we may even produce deterministic results, except for late events.
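
   A rough sketch of the TTL idea, again as a plain-Python toy model rather than Spark code. The function name and the expiry rule are hypothetical. The clock here is driven by event time (a zero-delay watermark), which is the deterministic variant; passing wall-clock timestamps instead would model the processing-time variant.

```python
def dedup_with_ttl(events, ttl):
    """Toy model of dedup whose state rows expire after a TTL.

    events: in-order list of (key, event_time) tuples.
    ttl:    how long a key is remembered after its first occurrence.
    Hypothetical helper for illustration only, not a Spark API.
    """
    state = {}                  # key -> expiry time of the state row
    clock = float("-inf")
    emitted = []
    for key, ts in events:
        # Assumption: event time drives the clock (zero-delay watermark).
        clock = max(clock, ts)
        # Drop state rows whose TTL has elapsed.
        state = {k: exp for k, exp in state.items() if exp > clock}
        if key not in state:
            emitted.append((key, ts))
            state[key] = ts + ttl   # remember the key until ts + ttl
    return emitted


# Times in minutes; a TTL of 120 corresponds to the 2-hour example.
ttl_result = dedup_with_ttl(
    [("a", 1), ("a", 2), ("b", 3), ("a", 4), ("a", 200)], ttl=120
)
```

   Here the duplicates of `"a"` at minutes 2 and 4 are suppressed even though the watermark has already passed minute 1, and `"a"` is emitted again at minute 200 only because its state row expired at minute 121.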


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
For additional commands, e-mail: reviews-help@spark.apache.org