Posted to mapreduce-issues@hadoop.apache.org by "Sumit Kumar (JIRA)" <ji...@apache.org> on 2014/06/02 18:29:01 UTC
[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015515#comment-14015515 ] 

Sumit Kumar commented on MAPREDUCE-5907:
----------------------------------------

Added new changes to:
1. Use the iterator-based listLocatedStatus APIs instead of the listStatus APIs. Removed the unneeded recursive flavors of the listStatus APIs, which could cause the memory concerns raised in HADOOP-10634.
2. Change the s3n implementation to use the listLocatedStatus API abstraction; this also validated the implementation of the new recursive APIs.
3. Add a test case that demonstrates how such a recursive listing benefits s3n. The test case simulates hourly rotated log aggregation and the processing of a year's worth of data. The total number of calls drops to just 10 from 360 (12 months * 30 days).
4. Fix a few bugs in InMemoryNativeFileSystemStore found while validating the test case.
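
To illustrate the effect measured in item 3, here is a minimal, self-contained sketch that counts simulated webservice calls for level-by-level listing versus a paged recursive listing. It is not the test case from the patch: FakeStore, listChildren, listPage, the logs/MM/DD/HH.log key layout and the page size of 1000 are all assumptions made for illustration, so the call counts it prints differ from the figures quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class ListingCallCount {
    /** Hypothetical in-memory stand-in for an object store; counts list requests. */
    static class FakeStore {
        final TreeSet<String> keys = new TreeSet<>();
        int calls = 0;

        /** One simulated webservice call: immediate children of a prefix. */
        List<String> listChildren(String prefix) {
            calls++;
            TreeSet<String> children = new TreeSet<>();
            for (String k : keys.tailSet(prefix)) {
                if (!k.startsWith(prefix)) break;
                int slash = k.indexOf('/', prefix.length());
                // Keys with a further '/' are collapsed into a "directory" entry.
                children.add(slash < 0 ? k : k.substring(0, slash + 1));
            }
            return new ArrayList<>(children);
        }

        /** One simulated webservice call: up to max keys under prefix, after marker. */
        List<String> listPage(String prefix, String marker, int max) {
            calls++;
            List<String> page = new ArrayList<>();
            for (String k : keys.tailSet(marker == null ? prefix : marker, marker == null)) {
                if (!k.startsWith(prefix) || page.size() == max) break;
                page.add(k);
            }
            return page;
        }
    }

    /** Assumed layout: 12 months * 30 days * 24 hourly files. */
    static FakeStore yearOfHourlyLogs() {
        FakeStore s = new FakeStore();
        for (int m = 1; m <= 12; m++)
            for (int d = 1; d <= 30; d++)
                for (int h = 0; h < 24; h++)
                    s.keys.add(String.format("logs/%02d/%02d/%02d.log", m, d, h));
        return s;
    }

    /** Level-by-level discovery: one list call per directory visited. */
    static int walk(FakeStore s, String prefix) {
        int files = 0;
        for (String child : s.listChildren(prefix)) {
            files += child.endsWith("/") ? walk(s, child) : 1;
        }
        return files;
    }

    /** Flat recursive listing: one list call per page of keys. */
    static int flat(FakeStore s, String prefix, int pageSize) {
        int files = 0;
        String marker = null;
        while (true) {
            List<String> page = s.listPage(prefix, marker, pageSize);
            files += page.size();
            if (page.size() < pageSize) return files;
            marker = page.get(page.size() - 1);
        }
    }

    public static void main(String[] args) {
        FakeStore a = yearOfHourlyLogs();
        int n1 = walk(a, "logs/");
        FakeStore b = yearOfHourlyLogs();
        int n2 = flat(b, "logs/", 1000);
        System.out.println(n1 + " files, " + a.calls + " calls level by level");
        System.out.println(n2 + " files, " + b.calls + " calls with paged recursive listing");
    }
}
```

With this assumed layout the level-by-level walk issues one call per directory (the root, each month, each day), while the flat listing issues one call per page of keys, independent of directory depth.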

I did spend some time on the Swift object store implementation, but it doesn't have that iteration-based abstraction (neither at the store level nor at the filesystem level). Looking at the recursive implementation for the Swift fs, it seems that it would try to get all the files/directories from the backend in just one webservice call. I suspect it would suffer from memory issues when such recursive calls are made. Please correct me if I'm wrong.

[~stevel@apache.org] How should we deal with this? Are you aware of an iterative webservice API with which we could list a Swift fs directory recursively, but in batches of, say, 1000 or 10000 entries (whichever seems appropriate)?
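
To make the question concrete, the kind of batched iteration meant here could be sketched roughly as follows. This is only an illustration: PageFetcher, BatchedIterator and the in-memory backend are hypothetical names, not part of the Swift or s3n code, and a real version would more likely follow the shape of Hadoop's RemoteIterator and propagate IOException. The point is that at most one batch of entries is held in memory at a time.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.TreeSet;

public class BatchedListing {
    /** Hypothetical paged backend call; not a real Swift or S3 client API. */
    interface PageFetcher {
        List<String> fetch(String prefix, String marker, int limit);
    }

    /** Lazily pulls one batch at a time, so memory is bounded by batchSize. */
    static class BatchedIterator implements Iterator<String> {
        private final PageFetcher fetcher;
        private final String prefix;
        private final int batchSize;
        private final ArrayDeque<String> buffer = new ArrayDeque<>();
        private String marker = null;
        private boolean exhausted = false;
        int fetches = 0;      // number of simulated webservice calls
        int maxBuffered = 0;  // largest number of entries held at once

        BatchedIterator(PageFetcher fetcher, String prefix, int batchSize) {
            this.fetcher = fetcher;
            this.prefix = prefix;
            this.batchSize = batchSize;
        }

        @Override
        public boolean hasNext() {
            if (buffer.isEmpty() && !exhausted) fill();
            return !buffer.isEmpty();
        }

        @Override
        public String next() {
            if (!hasNext()) throw new NoSuchElementException();
            return buffer.poll();
        }

        private void fill() {
            List<String> page = fetcher.fetch(prefix, marker, batchSize);
            fetches++;
            // A short page means the backend has nothing more under this prefix.
            if (page.size() < batchSize) exhausted = true;
            if (!page.isEmpty()) marker = page.get(page.size() - 1);
            buffer.addAll(page);
            maxBuffered = Math.max(maxBuffered, buffer.size());
        }
    }

    /** In-memory backend used only to exercise the iterator. */
    static PageFetcher inMemory(TreeSet<String> keys) {
        return (prefix, marker, limit) -> {
            List<String> page = new ArrayList<>();
            for (String k : keys.tailSet(marker == null ? prefix : marker, marker == null)) {
                if (!k.startsWith(prefix) || page.size() == limit) break;
                page.add(k);
            }
            return page;
        };
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>();
        for (int i = 0; i < 2500; i++) keys.add(String.format("container/dir/obj-%05d", i));
        BatchedIterator it = new BatchedIterator(inMemory(keys), "container/dir/", 1000);
        int seen = 0;
        while (it.hasNext()) { it.next(); seen++; }
        System.out.println(seen + " entries in " + it.fetches
            + " fetches, at most " + it.maxBuffered + " buffered");
    }
}
```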

> Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5907
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 2.4.0
>            Reporter: Sumit Kumar
>            Assignee: Sumit Kumar
>         Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907.patch
>
>
> FileInputFormat (both the mapreduce and mapred implementations) uses recursive listing while calculating splits. However, it does this by listing level by level. That is, to discover files in /foo/bar it lists /foo/bar first to get the immediate children, then makes the same call on each immediate child of /foo/bar to discover their immediate children, and so on. This doesn't scale well for object-store-based fs implementations like s3 and swift because every listStatus call ends up being a webservice call to the backend. When a large number of files are considered for input, this makes the getSplits() call slow.
> This patch adds a new set of recursive list APIs that give fs implementations the opportunity to optimize. The behavior remains the same for other implementations (that is, a default implementation is provided, so they don't have to implement anything new). For object-store-based fs implementations, however, it is a simple change to pass the recursive flag as true (as shown in the patch) to improve listing performance.


