Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/08/05 03:25:50 UTC

[GitHub] [iceberg] peterxiong13 commented on pull request #3990: Core: Use min sequence number on each partition to remove old delete files

peterxiong13 commented on PR #3990:
URL: https://github.com/apache/iceberg/pull/3990#issuecomment-1206001615

   @coolderli,
   I used the source code of 0.13.1 to modify, compile and test it. It's basically no problem. However, in the test, I found that some partitions failed to completely clean up the deleted files. I looked at it. The reason is that the serial number of the data files is smaller than that of the deleted files. The merge is run every 2 hours, and it is not merged? Is there no need to merge because the primary key of the data file does not appear in the deleted file? When Flink writes data, it turns on the upsert mode. In this way, if only pure insert data will also generate delete delete files?
   
   In addition, I also downloaded version 0.14.0 and found that this part of the code has changed substantially, so the original modification no longer seems applicable. Is there a PR that solves this problem there?
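   For context, the pruning rule this PR discusses can be sketched roughly as follows. This is a hypothetical, simplified illustration (the class and record names are made up, not the actual Iceberg API): a delete file in a partition can only be dropped once every live data file in that partition has a data sequence number at or above the delete file's sequence number, because the delete can then no longer apply to any of them. A single old data file with a smaller sequence number, like the ones observed in testing, keeps the delete file alive.

   ```java
   import java.util.List;

   // Hypothetical sketch of the min-sequence-number pruning rule.
   // Not Iceberg API; names are illustrative only.
   class DeleteFilePruning {
       // Stand-in for a content file's metadata.
       record ContentFile(long sequenceNumber) {}

       static boolean canDrop(ContentFile deleteFile, List<ContentFile> dataFilesInPartition) {
           // Minimum data sequence number among live data files in the partition.
           long minDataSeq = dataFilesInPartition.stream()
                   .mapToLong(ContentFile::sequenceNumber)
                   .min()
                   .orElse(Long.MAX_VALUE); // no data files left: delete is dead anyway

           // An equality delete applies only to data files with a strictly
           // smaller sequence number, so once minDataSeq >= the delete's own
           // sequence number it can never match anything and can be removed.
           return minDataSeq >= deleteFile.sequenceNumber();
       }
   }
   ```

   Under this rule, one long-lived data file with a small sequence number (e.g. never rewritten because its keys were never deleted) is enough to block removal of every newer delete file in the partition, which matches the behavior described above.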
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

