You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/01/27 02:41:41 UTC

[GitHub] [iceberg] coolderli opened a new pull request #3990: Core: Use min sequence number on each partition to remove old delete files

coolderli opened a new pull request #3990:
URL: https://github.com/apache/iceberg/pull/3990


   In https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java#L137, we use the min sequence number to filter the delete-manifests, then we can find the older delete file and drop them.
   
   But if we have some cold partition that never compaction again, then the min sequence number will never change. That makes the older delete file in other partitions will never be dropped. This will be worse and worse. It will make the spark driver oom and finally, the table becomes unavailable.
   
   In this PR, I use a map that contains a min sequence number on each partition instead of the global min sequence number. But I found the code changed a lot. I am not sure it's a good solution.
   
   And this PR can solve another problem. Currently, when we enabled partial compaction, we didn't drop the delete files as well. Because we can not make sure the delete files are not referenced by other data files when commit replace. That will leave a lot of useless delete files.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #3990: Core: Use min sequence number on each partition to remove old delete files

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on pull request #3990:
URL: https://github.com/apache/iceberg/pull/3990#issuecomment-1041137652


   Thanks for the pointer.  This looks like it reads each manifest to add a cache of partitions in memory , but the main concern is maybe that commit() should be fast and not OOM if there are a lot of partitions..  Probably the best way is to do a separate Spark job?  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] coolderli commented on pull request #3990: Core: Use min sequence number on each partition to remove old delete files

Posted by GitBox <gi...@apache.org>.
coolderli commented on pull request #3990:
URL: https://github.com/apache/iceberg/pull/3990#issuecomment-1041143130


   > Thanks for the pointer. This looks like it reads each manifest to add a cache of partitions in memory , but the main concern is maybe that commit() should be fast and not OOM if there are a lot of partitions.. Probably the best way is to do a separate Spark job?
   
   Yes, I can't agree anymore. Maybe a rewrite delete manifest should be considered.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] coolderli commented on pull request #3990: Core: Use min sequence number on each partition to remove old delete files

Posted by GitBox <gi...@apache.org>.
coolderli commented on pull request #3990:
URL: https://github.com/apache/iceberg/pull/3990#issuecomment-1022962719


   @rdblue Could you please review this?Are there any other better solutions?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] coolderli commented on a change in pull request #3990: Core: Use min sequence number on each partition to remove old delete files

Posted by GitBox <gi...@apache.org>.
coolderli commented on a change in pull request #3990:
URL: https://github.com/apache/iceberg/pull/3990#discussion_r793363591



##########
File path: core/src/main/java/org/apache/iceberg/ManifestFilterManager.java
##########
@@ -86,7 +86,10 @@ public String partition() {
   private final Map<ManifestFile, Iterable<F>> filteredManifestToDeletedFiles =
       Maps.newConcurrentMap();
 
-  protected ManifestFilterManager(Map<Integer, PartitionSpec> specsById) {
+  // tracking the min sequence number of data files on each partition
+  private final Map<String, Long> minSequenceNumberByPartition = Maps.newHashMap();

Review comment:
       There I use the `String` as the key of the map because `StructLikeMap` only supports one partition spec. When we scan the data manifest files, there may be multi-different partition specs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] coolderli commented on pull request #3990: Core: Use min sequence number on each partition to remove old delete files

Posted by GitBox <gi...@apache.org>.
coolderli commented on pull request #3990:
URL: https://github.com/apache/iceberg/pull/3990#issuecomment-1031043226


   @rdblue @jackye1995 Could you take a look at this, please? I'm not sure if this is the right direction. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org