Posted to common-issues@hadoop.apache.org by GitBox <gi...@apache.org> on 2019/10/28 11:49:05 UTC

[GitHub] [hadoop] steveloughran commented on issue #1679: HDFS-13934. Multipart uploaders to be created through FileSystem/FileContext.

steveloughran commented on issue #1679: HDFS-13934. Multipart uploaders to be created through FileSystem/FileContext.
URL: https://github.com/apache/hadoop/pull/1679#issuecomment-546911202
 
 
   It's too early to worry about checkstyle failures; I'd like API reviews first. Thanks.
   
   One thing we haven't covered here is what to do about parent directories.
   
   Although it is not needed for S3, I would like to say "Parent directory must exist". 
   
   Then the S3A uploader would add a specific option to disable this check. Why so? Because for real file systems you want to specify the permissions of the parent directory, and I don't want to start adding that to the API given that mkdirs is there.
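
To make the proposed contract concrete, here is a minimal sketch of a "parent directory must exist" check with a per-store switch to turn it off. Everything in it is hypothetical: `ParentCheckSketch`, `requireParent` and `startUpload` are illustrative names over an in-memory set of directories, not the API under review in this PR.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory stand-in for a filesystem; NOT the Hadoop API.
class ParentCheckSketch {
    private final Set<String> directories = new HashSet<>();
    // Assumption: an object store such as S3A would construct this with false.
    private final boolean requireParent;

    ParentCheckSketch(boolean requireParent) {
        this.requireParent = requireParent;
        directories.add("/");
    }

    // Parent of an absolute path; the parent of a top-level entry is "/".
    static String parentOf(String path) {
        int i = path.lastIndexOf('/');
        return i <= 0 ? "/" : path.substring(0, i);
    }

    // Registers the directory and all of its ancestors, like mkdirs.
    void mkdirs(String dir) {
        String p = dir;
        while (!p.isEmpty() && !p.equals("/")) {
            directories.add(p);
            p = parentOf(p);
        }
    }

    // Fails fast when the check is enabled and the parent is missing.
    String startUpload(String dest) {
        String parent = parentOf(dest);
        if (requireParent && !directories.contains(parent)) {
            throw new IllegalStateException("Parent directory must exist: " + parent);
        }
        return "upload-" + dest;
    }

    public static void main(String[] args) {
        ParentCheckSketch strict = new ParentCheckSketch(true);
        strict.mkdirs("/data/out");
        System.out.println(strict.startUpload("/data/out/part-0000"));
    }
}
```

With the check enabled, the caller creates the parent through `mkdirs` (and so chooses its permissions there) before starting the upload; with it disabled, the upload starts regardless.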
   
   Note also that while this API would seem sufficient to reimplement the S3A committers, in HADOOP-15183 we added a `BulkOperationState` which a metastore may issue and which, for DynamoDB, keeps track of which part it knows exists already, so as to avoid excessive/duplicate DynamoDB IO.
   
   For this multipart uploader to scale we'd have to call `MetadataStore.initiateBulkWrite()` to get one of these, and use it for the probes for parent dirs existing in both upload and commit operations. The `BulkOperationState` would share the uploader's lifecycle and be closed with it.
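
The lifecycle coupling could look roughly like the sketch below. It does not use the real `BulkOperationState` or `MetadataStore` classes: `SketchBulkState` and `SketchUploader` are hypothetical stand-ins, and the `storeLookups` counter merely represents a metastore read.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the state a metastore would issue.
class SketchBulkState implements AutoCloseable {
    private final Set<String> knownDirs = new HashSet<>();
    private boolean closed;

    boolean isKnown(String dir) { return knownDirs.contains(dir); }
    void remember(String dir) { knownDirs.add(dir); }
    boolean isClosed() { return closed; }

    @Override
    public void close() {
        knownDirs.clear();
        closed = true;
    }
}

// Hypothetical uploader that owns exactly one state for its whole lifetime.
class SketchUploader implements AutoCloseable {
    private final SketchBulkState state = new SketchBulkState();
    private int storeLookups;

    SketchBulkState state() { return state; }
    int storeLookups() { return storeLookups; }

    // Only asks the (imaginary) metastore about directories not yet seen.
    private void probeParent(String parent) {
        if (!state.isKnown(parent)) {
            storeLookups++; // stands in for one metastore read
            state.remember(parent);
        }
    }

    void upload(String parent, String name) { probeParent(parent); }
    void commit(String parent, String name) { probeParent(parent); }

    // Closing the uploader closes the state it owns.
    @Override
    public void close() { state.close(); }

    public static void main(String[] args) {
        try (SketchUploader uploader = new SketchUploader()) {
            uploader.upload("/out", "a");
            uploader.commit("/out", "a");
            System.out.println("lookups: " + uploader.storeLookups());
        }
    }
}
```

Because upload and commit share the one state, the second probe of the same parent costs nothing, and the state cannot outlive the uploader that created it.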
   
   Important: we would need the same for copy operations, again to avoid excessive I/O. If I am copying 100 files, I don't want to make 100 * depth(file) calls to S3Guard.
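
As a rough illustration of that arithmetic, the sketch below counts ancestor-directory lookups for a batch of files with and without state shared across the batch. It is a simplified model, not S3Guard code: `CopyLookupCount` and `lookups` are hypothetical names, and one "lookup" stands for one metastore call per ancestor directory.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of ancestor probes during a bulk copy; NOT S3Guard code.
class CopyLookupCount {

    // Counts one lookup per ancestor directory not already known.
    // Without shared state, knowledge is discarded after every file.
    static int lookups(String[] files, boolean shareState) {
        Set<String> known = new HashSet<>();
        int lookups = 0;
        for (String file : files) {
            if (!shareState) {
                known.clear();
            }
            int slash = file.lastIndexOf('/');
            while (slash > 0) {
                String dir = file.substring(0, slash);
                if (known.add(dir)) {
                    lookups++;
                }
                slash = file.lastIndexOf('/', slash - 1);
            }
        }
        return lookups;
    }

    public static void main(String[] args) {
        // 100 files, each three directories deep.
        String[] files = new String[100];
        for (int i = 0; i < files.length; i++) {
            files[i] = "/a/b/c/file" + i;
        }
        System.out.println("per-file state: " + lookups(files, false));
        System.out.println("shared state:   " + lookups(files, true));
    }
}
```

In this model, 100 files at depth 3 cost 300 lookups when every file starts from scratch and 3 when the state is shared across the whole copy.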
