Posted to common-issues@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2018/01/25 21:33:00 UTC

[jira] [Commented] (HADOOP-15192) S3A listStatus generates one 404 error for each path

    [ https://issues.apache.org/jira/browse/HADOOP-15192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340116#comment-16340116 ] 

Steve Loughran commented on HADOOP-15192:
-----------------------------------------

I feel your pain, as [getFileStatus()|http://steveloughran.blogspot.com/2016/12/how-long-does-filesystemexists-take.html] is indeed slow.

But the underlying problem isn't just those 404 probes for files, it's the LIST calls and the way a recursive treewalk is done by the client.

HADOOP-13208 added the optimised listFiles(path, recursive=true) call, as you call out. That's in Hadoop 2.8.
 # try upgrading to Hadoop 2.8/2.9; you'll get lots of other improvements (HADOOP-11694). In particular, if you set fs.s3a.fadvise = random you even get an IO policy optimised for columnar data, so the actual read work will be much faster.
 # I don't know whether Spark uses listFiles rather than listStatus; I've been round bits of it optimising out some things (needless exists checks), but this sounds like the (serialized) partitioning process.
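For the fadvise setting in point 1, here is a minimal core-site.xml fragment. Note this assumes Hadoop 2.8+, where the full property name is fs.s3a.experimental.input.fadvise (still flagged experimental in that release):

```xml
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
  <description>Optimise S3A input streams for random IO (columnar formats
  such as Parquet/ORC) instead of sequential whole-file reads.</description>
</property>
```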

There's nothing else we can really do in Hadoop beyond the high-performance full-tree listing HADOOP-13208 offers; with S3Guard (Hadoop 3.0+) you also get all directory lookups offloaded to DynamoDB, which is a significant speedup there.

If Spark isn't using listFiles(), that's something that's going to need fixing there...I'll leave you with that homework. It doesn't need a new API, and the base implementation is just that recursive treewalk, so it shouldn't be any worse.
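To make the treewalk cost concrete, here's a self-contained sketch. Everything in it (the class, the fake in-memory "bucket", the method names) is hypothetical illustration, not S3A code; it just counts how many LIST requests a per-directory recursive walk issues versus one flat paginated listing under the root prefix, which is the essence of the listFiles(recursive=true) optimisation:

```java
import java.util.*;

public class ListCallCounter {
    // Fake bucket: object keys only; "directories" exist purely as key prefixes.
    static final List<String> KEYS = new ArrayList<>();
    static {
        // 4 partitions x 3 files, Hive-style layout.
        for (int p = 0; p < 4; p++)
            for (int f = 0; f < 3; f++)
                KEYS.add("table/year=201" + p + "/part-" + f + ".parquet");
    }

    static int listCalls = 0;

    // Stand-in for one LIST with delimiter="/": files and sub-"dirs" at this level.
    static List<String> listLevel(String prefix) {
        listCalls++;
        Set<String> entries = new TreeSet<>();
        for (String k : KEYS) {
            if (!k.startsWith(prefix)) continue;
            String rest = k.substring(prefix.length());
            int slash = rest.indexOf('/');
            entries.add(slash < 0 ? rest : rest.substring(0, slash + 1));
        }
        return new ArrayList<>(entries);
    }

    // Recursive treewalk: one LIST per directory, which is what naive
    // listStatus recursion costs.
    static void treewalk(String prefix, List<String> out) {
        for (String e : listLevel(prefix)) {
            if (e.endsWith("/")) treewalk(prefix + e, out);
            else out.add(prefix + e);
        }
    }

    // Flat listing: paginated LIST with no delimiter; one call per page of keys.
    static List<String> flatList(String prefix, int pageSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < KEYS.size(); start += pageSize) {
            listCalls++; // one LIST call per page
            for (int i = start; i < Math.min(start + pageSize, KEYS.size()); i++)
                if (KEYS.get(i).startsWith(prefix)) out.add(KEYS.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> walked = new ArrayList<>();
        treewalk("table/", walked);
        int walkCalls = listCalls; // 1 (root) + 4 (one per partition dir) = 5
        listCalls = 0;
        List<String> flat = flatList("table/", 1000);
        System.out.println("treewalk LIST calls: " + walkCalls);
        System.out.println("flat LIST calls: " + listCalls);
        System.out.println("same files: "
                + new TreeSet<>(walked).equals(new TreeSet<>(flat)));
    }
}
```

With deep partition trees the gap grows with the number of directories, while the flat listing only grows with file count / page size.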
h3. Answers
{quote}Why is the trailing slash removed in the first place? (The Hadoop Path class normalises paths by removing trailing slashes when constructed.)
{quote}
There are three probes:
 # is there a file with that name? (entry without /)
 # is there a directory marker? (entry with /)
 # are there child files? (LIST with page size == 1 returning an entry)
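The three-probe sequence can be sketched like this. It's a self-contained illustration against an in-memory key set; `headObject` and `listOneKey` are hypothetical stand-ins for the real S3 HEAD/LIST requests, not the actual S3AFileSystem implementation:

```java
import java.util.*;

public class StatusProbe {
    enum Kind { FILE, DIRECTORY, NOT_FOUND }

    private final NavigableSet<String> keys; // all object keys in the "bucket"

    StatusProbe(Collection<String> bucketKeys) {
        this.keys = new TreeSet<>(bucketKeys);
    }

    // Stand-in for a HEAD request: true iff an object with exactly this key exists.
    private boolean headObject(String key) {
        return keys.contains(key);
    }

    // Stand-in for LIST with max-keys=1: true iff any key starts with the prefix.
    private boolean listOneKey(String prefix) {
        String next = keys.ceiling(prefix);
        return next != null && next.startsWith(prefix);
    }

    // The three probes for a path with no trailing slash; each miss is a 404.
    Kind getFileStatus(String path) {
        if (headObject(path)) return Kind.FILE;            // 1: plain object?
        if (headObject(path + "/")) return Kind.DIRECTORY; // 2: directory marker?
        if (listOneKey(path + "/")) return Kind.DIRECTORY; // 3: any children?
        return Kind.NOT_FOUND;
    }

    public static void main(String[] args) {
        StatusProbe probe = new StatusProbe(Arrays.asList(
                "data/file.parquet",    // a plain file
                "logs/",                // an empty dir kept alive by a marker object
                "tables/p=1/part-0"));  // a dir that exists only via its children
        System.out.println(probe.getFileStatus("data/file.parquet")); // FILE
        System.out.println(probe.getFileStatus("logs"));              // DIRECTORY
        System.out.println(probe.getFileStatus("tables"));            // DIRECTORY
        System.out.println(probe.getFileStatus("missing"));           // NOT_FOUND
    }
}
```

That ordering is why listing a directory like "tables" (no marker object, no trailing slash) racks up two 404s before the LIST finally answers.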

{quote}(or fix listLocatedStatus(recursive=true) unoptimized call to listStatus)
{quote}
done, please upgrade.
{quote}there should be opportunities to have an efficient recursive listStatus implementation for s3 using paginated calls to top level folder only.
{quote}
There is; it's done with listFiles(recursive=true), it just needs to be picked up by downstream projects. That could be somewhere where tangible user experience ("my jobs are slow") can be compelling.

> S3A listStatus generates one 404 error for each path
> ----------------------------------------------------
>
>                 Key: HADOOP-15192
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15192
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>    Affects Versions: 2.7.3
>            Reporter: Michel Lemay
>            Priority: Minor
>
> Symptoms:
>  - CloudWatch Metrics for S3 showing an unexpectedly large number of 4xx errors in our bucket
>  - Performance when listing files recursively is abysmal (15 minutes on our bucket compared to less than 2 minutes using cli `aws s3 ls`)
> Analysis:
>  - In the CloudTrail logs for this bucket, we found that it generates one 404 (NoSuchKey) error per folder listed recursively.
>  - Spark recursively calls FileSystem::listStatus (the S3AFileSystem implementation from hadoop-aws:2.7.3), which in turn calls getFileStatus to determine if it is a directory.
>  - It turns out that this call to getFileStatus yields a 404 when the path is a directory but does not end with a slash. It then retries with a slash appended (incurring one extra, unneeded call to S3).
> Questions:
>  - Why is the trailing slash removed in the first place? (The Hadoop Path class normalises paths by removing trailing slashes when constructed.)
>  - S3AFileSystem::listStatus needs to know whether the path is a directory. However, it’s a common usage pattern to already have that FileStatus object in hand when recursively listing files, so an unneeded performance penalty is incurred. The base FileSystem class could offer an optimized API that uses this assumption (or fix listLocatedStatus(recursive=true)'s unoptimized call to listStatus).
>  - I might be wrong on this last bullet, but I think the S3 object API will fetch every object under a prefix (not just the current level) and filter them out. If that is the case, there should be opportunities to have an efficient recursive listStatus implementation for S3 using paginated calls to the top-level folder only.
>  
> Note, all this is in the context of Spark jobs reading hundreds of thousands of Parquet files organized and partitioned hierarchically as recommended. Every time we read it, Spark recursively lists all files and folders to discover what the partitions (folder names) are.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
