Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/07 20:23:21 UTC

[GitHub] [iceberg] fbocse opened a new issue #1178: Parallelize metadata and data file removal on snapshot expiration?

fbocse opened a new issue #1178:
URL: https://github.com/apache/iceberg/issues/1178


   Is there a particular reason why the metadata and data file removal tasks aren't parallelized using a thread pool? https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/RemoveSnapshots.java#L319-L341


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] fbocse edited a comment on issue #1178: Parallelize metadata and data file removal on snapshot expiration?

Posted by GitBox <gi...@apache.org>.
fbocse edited a comment on issue #1178:
URL: https://github.com/apache/iceberg/issues/1178#issuecomment-655623016


   Oh I see, so throttling may occur if we parallelize these delete operations. How about looking at other `fs` delete functions, such as [deleteOnExit](https://hadoop.apache.org/docs/r2.8.2/api/org/apache/hadoop/fs/FileSystem.html#deleteOnExit)?
   It sounds like some sort of lazy file-delete API that may work around throttling issues. I could look into what its implementation looks like for S3 and ADLS, but I think this API would work better for this particular case. wdyt?




[GitHub] [iceberg] rdblue closed issue #1178: Parallelize metadata and data file removal on snapshot expiration?

Posted by GitBox <gi...@apache.org>.
rdblue closed issue #1178:
URL: https://github.com/apache/iceberg/issues/1178


   




[GitHub] [iceberg] fbocse commented on issue #1178: Parallelize metadata and data file removal on snapshot expiration?

Posted by GitBox <gi...@apache.org>.
fbocse commented on issue #1178:
URL: https://github.com/apache/iceberg/issues/1178#issuecomment-655684441


   You're right, the lack of consistency guarantees would make things complicated and eventually flawed.
   Your suggestion of allowing the user to pass in a configurable thread pool should help with parallelizing this effort.




[GitHub] [iceberg] rdblue commented on issue #1178: Parallelize metadata and data file removal on snapshot expiration?

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1178:
URL: https://github.com/apache/iceberg/issues/1178#issuecomment-655140373


   It's probably safe to parallelize this, but I think we didn't do it originally because we don't want to get "slow down" responses from services like S3. Seems like adding an option to the API here would be a good idea. Maybe allow passing in a thread pool for deletes?




[GitHub] [iceberg] fbocse commented on issue #1178: Parallelize metadata and data file removal on snapshot expiration?

Posted by GitBox <gi...@apache.org>.
fbocse commented on issue #1178:
URL: https://github.com/apache/iceberg/issues/1178#issuecomment-655623016


   Oh I see, so throttling may occur if we parallelize these delete operations. How about looking at other `fs` delete functions, such as [fs.deleteOnExit](https://hadoop.apache.org/docs/r2.8.2/api/org/apache/hadoop/fs/FileSystem.html#deleteOnExit(org.apache.hadoop.fs.Path))?
   It sounds like some sort of lazy file-delete API that may work around throttling issues. I could look into what its implementation looks like for S3 and ADLS, but I think this API would work better for this particular case. wdyt?




[GitHub] [iceberg] rdblue commented on issue #1178: Parallelize metadata and data file removal on snapshot expiration?

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1178:
URL: https://github.com/apache/iceberg/issues/1178#issuecomment-655643335


   I don't think we want to use `deleteOnExit`. That just queues the delete to happen in a shutdown hook, and I don't think there are many guarantees about its reliability. It would mean we no longer retry failed operations, and it might affect recovery when some, but not all, of the deletes fail. It also wouldn't help with parallelism because the deletes would run in a single shutdown hook thread.
   
   I think the best way to parallelize this operation is to run in a thread pool passed in by the user. Then the user could configure the parallelism of that pool to ensure that it is not so high that it hits throttling exceptions.
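
A minimal sketch of the thread-pool idea described above, not Iceberg's actual implementation. The class `ParallelDeleteSketch`, the method `deleteAll`, the `maxAttempts` parameter, and the backoff values are all hypothetical names and numbers chosen for illustration. The caller supplies both the delete function and the `ExecutorService`, so the pool size is what bounds how many deletes hit the store at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical helper for illustration only; names are assumptions.
class ParallelDeleteSketch {

  // Runs one delete per path on the caller-supplied pool and returns the
  // paths that could not be confirmed deleted after maxAttempts tries.
  static List<String> deleteAll(
      List<String> paths,
      Consumer<String> deleteFunc,
      ExecutorService pool,
      int maxAttempts) {
    List<Future<Boolean>> results = new ArrayList<>();
    for (String path : paths) {
      Callable<Boolean> task = () -> deleteWithRetry(path, deleteFunc, maxAttempts);
      results.add(pool.submit(task));
    }

    List<String> failed = new ArrayList<>();
    for (int i = 0; i < paths.size(); i++) {
      try {
        if (!results.get(i).get()) {
          failed.add(paths.get(i));
        }
      } catch (InterruptedException e) {
        // Preserve the interrupt and report the path as not confirmed.
        Thread.currentThread().interrupt();
        failed.add(paths.get(i));
      } catch (ExecutionException e) {
        failed.add(paths.get(i));
      }
    }
    return failed;
  }

  // Retries a single delete, sleeping a little longer after each failure.
  // The 10 ms step is an arbitrary illustrative value.
  private static boolean deleteWithRetry(
      String path, Consumer<String> deleteFunc, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        deleteFunc.accept(path);
        return true;
      } catch (RuntimeException e) {
        try {
          Thread.sleep(10L * attempt);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          return false;
        }
      }
    }
    return false;
  }
}
```

A caller would pass something like `Executors.newFixedThreadPool(4)` and lower the pool size if the object store starts returning throttling errors. Because failures are collected and returned rather than swallowed, the caller can still decide how to recover when only some deletes succeed.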




[GitHub] [iceberg] aokolnychyi commented on issue #1178: Parallelize metadata and data file removal on snapshot expiration?

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1178:
URL: https://github.com/apache/iceberg/issues/1178#issuecomment-686918773


   Seems like this can be resolved?




[GitHub] [iceberg] fbocse edited a comment on issue #1178: Parallelize metadata and data file removal on snapshot expiration?

Posted by GitBox <gi...@apache.org>.
fbocse edited a comment on issue #1178:
URL: https://github.com/apache/iceberg/issues/1178#issuecomment-655623016


   Oh I see, so throttling may occur if we parallelize these delete operations.
   @rdblue How about looking at other `fs` delete functions, such as [fs.deleteOnExit](https://hadoop.apache.org/docs/r2.8.2/api/org/apache/hadoop/fs/FileSystem.html#deleteOnExit), instead of `ops.io().deleteFile(file)`?
   It sounds like some sort of lazy file-delete API that may work around throttling issues.

   I could look into what its implementation looks like for S3 and ADLS, but I think this API would work better for this particular case. wdyt?

