Posted to common-dev@hadoop.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2008/03/26 19:17:25 UTC

[jira] Commented: (HADOOP-3095) Validating input paths and creating splits is slow on S3

    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582386#action_12582386 ] 

Doug Cutting commented on HADOOP-3095:
--------------------------------------

We should fix this generally, not just for S3, since once we remove the HDFS status cache, HDFS will be similarly slow.  A good approach would be to rework FileInputFormat to retain FileStatus instances.  FileInputFormat#listPaths() should be deprecated in favor of a listStatus() method.
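
To make that concrete, a reworked FileInputFormat might expose something along these lines (purely a sketch; the method name listStatus() and its signature here are illustrative, not a committed API):

    // Sketch of a method inside org.apache.hadoop.mapred.FileInputFormat:
    // a listStatus() counterpart to listPaths() that hands the FileStatus
    // objects back to getSplits(), so length, block size and directory-ness
    // can be read from them instead of being re-queried per file.
    protected FileStatus[] listStatus(JobConf job) throws IOException {
      List<FileStatus> result = new ArrayList<FileStatus>();
      for (Path p : job.getInputPaths()) {           // configured input dirs/files
        FileSystem fs = p.getFileSystem(job);
        for (FileStatus stat : fs.listStatus(p)) {   // one listing call per input path
          result.add(stat);
        }
      }
      return result.toArray(new FileStatus[result.size()]);
    }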

As for validateInput, this can be deprecated too, or at least gutted.  Its utility dates from the days when getSplits() was called on the JobTracker, and it was nice to get input-validation error messages in the client before submitting the job.  But now that getSplits() is called in the client, any needed checks can be done there (i.e., getSplits() should throw exceptions for non-existent input paths).  So my vote would be to eliminate validateInput() in the long term and, in the short term, to deprecate it and gut its implementation in FileInputFormat.
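
For the client-side check, something like the following inside getSplits() would keep the helpful early error without a separate validation pass (again just a sketch; the exception type and message are placeholders):

    // Sketch: do the old validateInput() existence check inline in getSplits(),
    // on the client, and fail fast for input paths that don't exist.
    for (Path p : job.getInputPaths()) {
      FileSystem fs = p.getFileSystem(job);
      if (!fs.exists(p)) {                 // one check per input *path*, not per file
        throw new IOException("Input path does not exist: " + p);
      }
    }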


> Validating input paths and creating splits is slow on S3
> --------------------------------------------------------
>
>                 Key: HADOOP-3095
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3095
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: fs, fs/s3
>            Reporter: Tom White
>
> A call to listPaths on S3FileSystem results in an S3 access for each file in the directory being queried. If the input contains hundreds or thousands of files, this is prohibitively slow. This method is called in FileInputFormat.validateInput and FileInputFormat.getSplits. It would be easy to fix by overriding listPaths (all four variants) in S3FileSystem so that it doesn't use listStatus, which creates a FileStatus object for each subpath. However, since listPaths is deprecated in favour of listStatus, this would only be acceptable as a short-term measure, not a longer-term one.
> But it gets worse: FileInputFormat.getSplits goes on to access S3 a further six times for each input file via these calls:
> 1. fs.isDirectory
> 2. fs.exists
> 3. fs.getLength
> 4. fs.getLength
> 5. fs.exists (from fs.getFileBlockLocations)
> 6. fs.getBlockSize
> So it would be best to change getSplits to use listStatus, and access S3 only once per file. (This would help HDFS too.) This change would require some care, since FileInputFormat has a protected method, listPaths, which subclasses can override (although, in passing, I notice that validateInput doesn't use listPaths - is this a bug?).
> For input validation, one approach would be to disable it for S3 by creating a custom FileInputFormat. In this case, missing files would be detected during split generation. Alternatively, it may be possible to cache the input paths between validateInput and getSplits.
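
To make the per-file cost described above concrete: once getSplits holds a FileStatus for each input file (via a listStatus()-style method as sketched earlier), the six per-file calls collapse to fields the status object already carries. This is only an illustration of the idea, not the actual FileInputFormat code:

    // Sketch: answer the per-file questions from the FileStatus itself,
    // instead of six separate FileSystem (and hence S3) round trips.
    for (FileStatus file : listStatus(job)) {   // hypothetical listStatus(), see above
      Path path = file.getPath();
      if (file.isDir()) {                       // replaces the fs.isDirectory and fs.exists calls
        throw new IOException("Not a file: " + path);
      }
      long length = file.getLen();              // replaces both fs.getLength calls
      long blockSize = file.getBlockSize();     // replaces fs.getBlockSize
      // fs.getFileBlockLocations remains one call per file, but its internal
      // fs.exists check becomes redundant once a FileStatus is in hand.
      // ... build InputSplits from length, blockSize and the block locations ...
    }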

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.