Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/02/17 19:40:08 UTC

[GitHub] [iceberg] amogh-jahagirdar commented on pull request #4052: Add deleteFiles to FileIO. S3FileIO implementation will perform a batch deletion using RemoveObjects API

amogh-jahagirdar commented on pull request #4052:
URL: https://github.com/apache/iceberg/pull/4052#issuecomment-1043347180


   Sorry for the delay, folks. A few updates:
   
   1.) Updated the PR with some integration tests and more unit tests.
   
   2.) The deletion batch size is configurable through s3.delete.batch-size.
   
   3.) The default is 250 instead of 1000. To be honest, some rigorous benchmarking should be done here. I set it to 250 mostly mimicking the similar change in hadoop-aws https://github.com/apache/hadoop/commit/56dee667707926f3796c7757be1a133a362f05c9, which also used to perform batch deletions of 1000 keys until it ran into major throttling issues. For reference, if there are N keys in a batch, S3 counts them as N requests in the throughput calculation it uses for throttling. S3's limit is 3,500 delete requests per second per prefix, so with a batch size of 1000, in the worst case where most of the keys fall under the same prefix (e.g. a Hive-like file layout), we would easily hit that limit. If the prefixes are better distributed we could get more throughput, but I don't think we should rely on that assumption.
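
   To make the batching concrete, here is a minimal sketch (the class and method names are hypothetical, not the PR's actual S3FileIO code) of how a configured batch size splits the keys to delete into groups, where each group would map to one DeleteObjects (RemoveObjects) request. Note that each key in a batch still counts as one delete request toward S3's per-prefix throttling budget, per the reasoning above.

   ```java
   import java.util.ArrayList;
   import java.util.List;

   public class BatchDeleteSketch {
       // Hypothetical default mirroring the PR's s3.delete.batch-size property.
       static final int DELETE_BATCH_SIZE = 250;

       // Split the keys into batches no larger than batchSize; in the real
       // implementation each batch would become one DeleteObjects call.
       static List<List<String>> partition(List<String> keys, int batchSize) {
           List<List<String>> batches = new ArrayList<>();
           for (int i = 0; i < keys.size(); i += batchSize) {
               int end = Math.min(i + batchSize, keys.size());
               batches.add(new ArrayList<>(keys.subList(i, end)));
           }
           return batches;
       }

       public static void main(String[] args) {
           List<String> keys = new ArrayList<>();
           for (int i = 0; i < 600; i++) {
               keys.add("data/file-" + i + ".parquet");
           }
           List<List<String>> batches = partition(keys, DELETE_BATCH_SIZE);
           // 600 keys at a batch size of 250 -> 3 requests (250 + 250 + 100),
           // but still 600 delete operations counted against the prefix's
           // ~3,500 requests-per-second throttling limit.
           System.out.println(batches.size());
           System.out.println(batches.get(2).size());
       }
   }
   ```

   With a batch size of 1000, all 600 keys above would fit in a single request, which saves round trips but concentrates the per-prefix request count into one burst; the smaller default trades a few extra calls for a lower chance of throttling.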


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
For additional commands, e-mail: issues-help@iceberg.apache.org