You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/28 11:40:03 UTC

[GitHub] [iceberg] hililiwei commented on pull request #6253: Flink: Write watermark to the snapshot summary

hililiwei commented on PR #6253:
URL: https://github.com/apache/iceberg/pull/6253#issuecomment-1328934746

   hi @rdblue,  thank you so much for your feedback.
   Before I answer your question, I'd like to say something else around this.
   
   Actually, I want to solve a problem that is common in streaming scenarios: how does the downstream application know that all of the specified period data has been written if the application is not reading incremental data in real time, but is microbatching? Watermarking is an easy solution to think of. When downstream applications get the watermark, they can know whether the event time or process time of the data has reached a critical value. But it's not enough. 
   
   If our community accepts this PR solution, I would like to do one more thing, which is to support time-based partition commit. In some scenarios, when a new partition is written, it is usually necessary to notify the downstream application. For example, When all the data for this partition is written, commit this partition to iceberg, just as flink does for [hive\filesystem](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/table/filesystem/#partition-commit).  I once participated in the development of this part of Flink, and I hope to introduce it to iceberg sink. Because of iceberg's snapshot management feature, we may be able to do better than hive\filesystem. 
   
   The current iceberg flink sink can only commit based on checkpoint. When the time-based commit is complete, it provides a partition commit feature that allows configuring custom policies. Commit actions are based on a combination of triggers and policies.
   
   Back to your question.
   1. In PR, the commiter caches the watemark of the current data stream and writes it to the summary when the snapshot is committed. Strictly speaking, it doesn't represent the watermark of the iceberg table, if other applications are writing in at the same time, as shown in the second figure. If there's only one application, and we can think of it as.
   
   ![image](https://user-images.githubusercontent.com/59213263/204259097-55623449-9a38-4d85-b93c-b380154eb8f0.png)
   
   2. Where does the water level come from?
   Flink provides an interface method for us to catch it. Its value depends on the low-watermark of upstream data. 
   
   ![image](https://user-images.githubusercontent.com/59213263/204259514-b257455b-67ee-4f2e-9b7d-3d979ac4d72b.png)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org