Posted to common-issues@hadoop.apache.org by GitBox <gi...@apache.org> on 2021/11/03 05:05:08 UTC

[GitHub] [hadoop] sidseth commented on pull request #3597: HADOOP-17981 Support etag-assisted renames in FileOutputCommitter

sidseth commented on pull request #3597:
URL: https://github.com/apache/hadoop/pull/3597#issuecomment-958658416


   > > This mechanism becomes very FileSystem specific. Implemented by Azure right now.
   > 
   > I agree, which is why use of the API is restricted to mr-client-core only, as abfs is the only one which needs it for correctness under load. And I'm not worried about that specificity. Can I point to how much of the hadoop fs api is hdfs-only, and public?
   > 
   > > Other users of rename will not see the benefits without changing interfaces, which in turn requires shimming etc.
   > 
   > Please don't try and use this particular interface in Hive.
   > 
   Was referring to any potential usage - including Hive.
   > > Would it be better for AzureFileSystem rename itself to add a config parameter which can lookup the src etag (at the cost of a performance hit for consistency), so that downstream components / any users of the rename operation can benefit from this change without having to change interfaces.
   > 
   > We are going straight from a listing (1 request/500 entries) to that rename. Doing a HEAD first cuts the throughput in half. So no.
   > 
   That cost would apply only in the scenario where this is encountered. It would not be the default behaviour, and it limits the change to Abfs. There could also be a less consistent version which is not etag based and responds only on failures. Again, limited to Abfs.
   > > Also, if the performance penalty is a big problem - Abfs could create very short-lived caches for FileStatus objects, and handle errors on discrepancies with the cached copy.
   > 
   > Possible but convoluted.
   > 
   Agree. Quite convoluted. Tossing in potential options - to avoid a new public API.
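For illustration only, and not part of this PR: a minimal sketch of what a very short-lived cache of source etags might look like, assuming the listing supplies an etag per path and the rename path consults the cache instead of issuing a HEAD. The class name `ShortLivedEtagCache`, its methods, and the injected clock are all hypothetical; nothing here is an existing Hadoop or ABFS API.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical short-lived etag cache; not an existing Hadoop or ABFS class. */
final class ShortLivedEtagCache {

    /** Cached etag plus the time after which it must not be trusted. */
    private static final class Entry {
        final String etag;
        final long expiresAtMillis;

        Entry(String etag, long expiresAtMillis) {
            this.etag = etag;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected so expiry can be tested

    ShortLivedEtagCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Record the etag seen for a path, e.g. from a directory listing. */
    void put(String path, String etag) {
        entries.put(path, new Entry(etag, clock.getAsLong() + ttlMillis));
    }

    /** Return the cached etag if still fresh; drop and miss if expired. */
    Optional<String> get(String path) {
        Entry e = entries.get(path);
        if (e == null) {
            return Optional.empty();
        }
        if (clock.getAsLong() >= e.expiresAtMillis) {
            entries.remove(path, e); // remove only this exact stale entry
            return Optional.empty();
        }
        return Optional.of(e.etag);
    }

    /** Forget a path, e.g. after a rename or delete changes it. */
    void invalidate(String path) {
        entries.remove(path);
    }
}
```

On a miss or an expired entry the caller would still have to fall back to a HEAD or accept the less consistent behaviour, and any discrepancy between the cached etag and the store would need its own error handling, which is where the convolution comes in.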
   > > Essentially - don't force usage of the new interface to get the benefits.
   > 
   > I understand the interests of the hive team, but this fix is not the place to do a better API.
   > 
   > Briefly caching the source FS entries is something to consider though. Not this week.
   > 
   > What I could do with is some help getting #2735 in, then we can start on a public rename() builder API which will take a file status, as openFile does.
   > 
   This particular change would be FSImpl agnostic, and potentially remove the need for the new interface here?
   > > Side note: The fs.getStatus within ResilientCommitByRenameHelper for FileSystems where this new functionality is not supported will lead to a performance penalty for the other FileSystems (performing a getFileStatus on src).
   > 
   > There is an option to say "i know it is not there"; this skips the check. the committer passes this option down because it issues a delete call first.
   > 
   EOD - this ends up being a new API (almost on the FileSystem), which is used by the committer first; then someone discovers it and decides to make use of it.
   > FWIW the manifest committer will make that pre-rename commit optional, saving that IO request. I am curious as to how well that will work when executed on well-formed tables.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
