Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/04 09:34:06 UTC

[GitHub] [iceberg] steveloughran commented on issue #4346: Make DeleteOrphanFiles in Spark reliable

steveloughran commented on issue #4346:
URL: https://github.com/apache/iceberg/issues/4346#issuecomment-1117110626

   fwiw, s3a and abfs in the not-yet-released hadoop branch-3.3 add an [EtagSource](https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/EtagSource.java) interface which FileStatus/LocatedFileStatus subclasses can implement. this lets you compare files: if the value is non-null and non-empty, then files with different etags are guaranteed to be different.
   
   i know that iceberg likes to build against very old versions of hadoop, but if you leave space in the indices for file etags, plus some pluggable mechanism to retrieve them, then etag-based checking would work.
    
   note also that s3 and abfs return those etags in list operations, so there's no need to do HEAD calls on each file. i think gcs does the same, though it doesn't have support through its client yet.
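
a rough sketch of what that check could look like, so it builds without hadoop on the classpath. assumptions: the `EtagSource` interface below is a local stand-in with the same single `getEtag()` method as the hadoop one, and `ListedFile`, `EtagCheck` and `Result` are hypothetical names made up for illustration; none of this is iceberg or hadoop code. matching etags are only reported as "same etag", since the guarantee above is about differing etags.

```java
import java.util.Optional;

/** Local stand-in (assumption) for org.apache.hadoop.fs.EtagSource: one getEtag() method. */
interface EtagSource {
  String getEtag();
}

/** Hypothetical listing entry; with hadoop this role is played by a FileStatus subclass. */
final class ListedFile implements EtagSource {
  private final String path;
  private final String etag;

  ListedFile(String path, String etag) {
    this.path = path;
    this.etag = etag;
  }

  String getPath() {
    return path;
  }

  @Override
  public String getEtag() {
    return etag;
  }
}

/** Hypothetical helper: compares an etag recorded in an index with one from a listing. */
final class EtagCheck {
  enum Result { DIFFERENT, SAME_ETAG, UNKNOWN }

  private EtagCheck() {
  }

  /** Returns the etag only if the status implements EtagSource and the value is non-null/non-empty. */
  static Optional<String> etagOf(Object status) {
    if (status instanceof EtagSource) {
      String etag = ((EtagSource) status).getEtag();
      if (etag != null && !etag.isEmpty()) {
        return Optional.of(etag);
      }
    }
    return Optional.empty();
  }

  /** DIFFERENT only when both etags are present and unequal; UNKNOWN when either is missing. */
  static Result compare(String recordedEtag, Object listedStatus) {
    Optional<String> listed = etagOf(listedStatus);
    if (recordedEtag == null || recordedEtag.isEmpty() || !listed.isPresent()) {
      return Result.UNKNOWN;
    }
    return recordedEtag.equals(listed.get()) ? Result.SAME_ETAG : Result.DIFFERENT;
  }
}
```

the `UNKNOWN` case is where a caller would fall back to whatever check it uses today, e.g. for stores or hadoop versions which don't expose etags.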


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

