Posted to common-issues@hadoop.apache.org by GitBox <gi...@apache.org> on 2020/12/07 12:12:55 UTC

[GitHub] [hadoop] steveloughran commented on pull request #2168: HADOOP-16202. Enhance/Stabilize openFile()

steveloughran commented on pull request #2168:
URL: https://github.com/apache/hadoop/pull/2168#issuecomment-739880426


   The latest patch rounds things off. This thing is ready to go in.
   * We now have the option to specify the start and end of splits; the input formats in the MR client do this.
   * Everywhere in the code where we explicitly download sequential datasets, we now request sequential IO. (Actually, I've just realised `hadoop fs -head <path>` should request random IO as well as declare split lengths...we don't want a full GET.)
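   A sketch of what such a caller looks like with the builder API. The `fs.option.openfile.*` option names and the string-valued `opt()` calls here are my reading of the merged API; verify the exact constants against the Hadoop version in use.

   ```java
   import org.apache.hadoop.fs.FSDataInputStream;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   public class OpenFileSplitExample {
     /**
      * Open one split of a file for sequential IO, declaring the split
      * boundaries so the store can size its first GET appropriately.
      */
     public static FSDataInputStream openSplit(
         FileSystem fs, Path path, long splitStart, long splitEnd)
         throws Exception {
       return fs.openFile(path)
           .opt("fs.option.openfile.read.policy", "sequential")
           .opt("fs.option.openfile.split.start", Long.toString(splitStart))
           .opt("fs.option.openfile.split.end", Long.toString(splitEnd))
           .build()   // returns a CompletableFuture<FSDataInputStream>
           .get();    // FNFE may surface here, or be delayed until a read
     }
   }
   ```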
   
   It's important that FS implementations don't rely on the split length to set the maximum file length, because splits are allowed to overrun so that a whole record/block is read. Apps which pass split info down to worker processes (hive &c) need to pass the file size too if they want to save the HEAD request. The split length could still be used by the input streams if they can think of a way:
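   A hedged sketch of how a worker that already has the file's metadata could hand it over and so skip the HEAD request. `withFileStatus()` is part of the builder; the `fs.option.openfile.length` option name is my assumption from the merged API, so check it before relying on it.

   ```java
   import org.apache.hadoop.fs.FSDataInputStream;
   import org.apache.hadoop.fs.FileStatus;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   public class OpenFileLengthExample {
     /**
      * If the driver passed a full FileStatus down with the split,
      * give it to the builder so the connector can skip its own HEAD.
      */
     public static FSDataInputStream openKnownFile(
         FileSystem fs, Path path, FileStatus status) throws Exception {
       return fs.openFile(path)
           .withFileStatus(status)
           .build()
           .get();
     }

     /**
      * If only the length is known, the (assumed) length option
      * achieves the same thing.
      */
     public static FSDataInputStream openWithLength(
         FileSystem fs, Path path, long knownLength) throws Exception {
       return fs.openFile(path)
           .opt("fs.option.openfile.length", Long.toString(knownLength))
           .build()
           .get();
     }
   }
   ```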
   
   1. For sequential IO: the content length of that initial request = min(split-end, file-length).
   2. For random IO: assume the split end is the initial EOF.
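   The two rules above can be sketched in plain Java (the names here are illustrative, not Hadoop APIs):

   ```java
   /** Illustrative sketch of the initial-EOF rules above; not a Hadoop API. */
   public final class InitialEof {
     /** Marker for "value not known". */
     public static final long UNKNOWN = -1;

     /** Sequential IO: the first request may stop at min(split-end, file-length),
      *  using whichever of the two values are actually known. */
     public static long sequentialInitialEof(long splitEnd, long fileLength) {
       if (splitEnd == UNKNOWN) {
         return fileLength;
       }
       if (fileLength == UNKNOWN) {
         return splitEnd;
       }
       return Math.min(splitEnd, fileLength);
     }

     /** Random IO: treat the declared split end as the assumed EOF
      *  until a read past it forces a re-check. */
     public static long randomInitialEof(long splitEnd) {
       return splitEnd;
     }
   }
   ```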
   
   Because openFile() declares that FNFEs can be delayed until reads, we could also see if we could do an async HEAD request while processing that first GET/HEAD, so we'd have the final file length without blocking. That would make the streams more complex, but at least now we have the option.




