Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/13 22:33:40 UTC

[GitHub] [iceberg] jackye1995 commented on pull request #2867: Flink: Auto compact file

jackye1995 commented on pull request #2867:
URL: https://github.com/apache/iceberg/pull/2867#issuecomment-918632602

I have also been following this thread, although I did not make any comments. Let me add some thoughts since I see you are making some new changes.

I am mostly on the same line of thought as @stevenzwu: I am a bit worried about the scalability of the current implementation. I think the parallel commit approach that @rdblue proposed could work, but in the end, running compaction inside the streaming pipeline is likely an unnecessary complication.

So far we have been advocating for streaming pipelines to just commit new files to storage and to use a separate process to handle compaction at the same time. Having the streaming pipeline also do compaction would mean that there might be two compaction processes competing with each other. This becomes especially complicated and error-prone when you have both batch jobs and streaming pipelines running at the same time (e.g. normal streaming plus daily loading of corrected and late data). I understand it is likely a good optimization for simple use cases, but if we open it up for general usage, I would expect it to be a feature that requires a lot of in-depth knowledge to use safely and correctly.

I wonder what the initial motivation behind this implementation is. Do you just want to avoid needing a separate Spark cluster to run compaction? If we had Flink actions specifically for `RewriteDataFiles` and `RewriteDeleteFiles` that you could schedule on the same Flink cluster, would that solve the issue?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: issues-help@iceberg.apache.org