Posted to common-user@hadoop.apache.org by Ian Soboroff <ia...@nist.gov> on 2009/02/03 22:11:03 UTC

FileInputFormat directory traversal

Is there a reason FileInputFormat only traverses the first level of  
directories in its InputPaths?  (i.e., given an InputPath of 'foo', it  
will get foo/* but not foo/bar/*).

I wrote a full depth-first traversal in my custom InputFormat which I  
can offer as a patch.  But to do it I had to duplicate the PathFilter  
classes in FileInputFormat which are marked private, so a mainline  
patch would also touch FileInputFormat.

Ian
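
For readers following along, here is a minimal sketch of the kind of full depth-first traversal described above. It is not the patch from this thread and does not use the Hadoop API; it walks a local directory with java.nio.file so that it runs on its own. The class name DepthFirstLister and the filter rule (skip names starting with "_" or ".") are assumptions made for illustration, standing in for the private PathFilter classes mentioned in the message.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: depth-first listing of a local tree.
final class DepthFirstLister {

    // Assumed filter rule: skip entries whose name starts with "_" or ".".
    static boolean accept(Path p) {
        String name = p.getFileName().toString();
        return !name.startsWith("_") && !name.startsWith(".");
    }

    // Returns every accepted regular file under root, at any depth.
    static List<Path> listFiles(Path root) throws IOException {
        List<Path> out = new ArrayList<>();
        visit(root, out);
        return out;
    }

    // Descends into each accepted subdirectory before moving to the next sibling.
    private static void visit(Path dir, List<Path> out) throws IOException {
        try (DirectoryStream<Path> children =
                 Files.newDirectoryStream(dir, DepthFirstLister::accept)) {
            for (Path child : children) {
                if (Files.isDirectory(child)) {
                    visit(child, out);
                } else {
                    out.add(child);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : listFiles(root)) {
            System.out.println(p);
        }
    }
}
```

Given an input of 'foo', this returns both foo/* and foo/bar/*, and a filtered directory is skipped along with everything beneath it.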


Re: FileInputFormat directory traversal

Posted by Ian Soboroff <ia...@nist.gov>.
Hmm.  Based on your reasons, an extension to FileInputFormat for the  
lib package seems more in order.

I'll try to hack something up and file a Jira issue.

Ian

On Feb 3, 2009, at 4:28 PM, Doug Cutting wrote:

> Hi, Ian.
>
> One reason is that a MapFile is represented by a directory  
> containing two files named "index" and "data".   
> SequenceFileInputFormat also handles MapFiles: if an input path is  
> a directory containing a data file, it uses that file.
>
> Another reason is that a flat directory of files is what reduces generate.
>
> Neither reason implies that this is the best or only way of doing  
> things.  It would probably be better if FileInputFormat optionally  
> supported recursive file enumeration.  (It would be incompatible and  
> thus cannot be the default mode.)
>
> Please file an issue in Jira for this and attach your patch.
>
> Thanks,
>
> Doug
>
> Ian Soboroff wrote:
>> Is there a reason FileInputFormat only traverses the first level of  
>> directories in its InputPaths?  (i.e., given an InputPath of 'foo',  
>> it will get foo/* but not foo/bar/*).
>> I wrote a full depth-first traversal in my custom InputFormat which  
>> I can offer as a patch.  But to do it I had to duplicate the  
>> PathFilter classes in FileInputFormat which are marked private, so  
>> a mainline patch would also touch FileInputFormat.
>> Ian


Re: FileInputFormat directory traversal

Posted by Doug Cutting <cu...@apache.org>.
Hi, Ian.

One reason is that a MapFile is represented by a directory containing 
two files named "index" and "data".  SequenceFileInputFormat also handles 
MapFiles: if an input path is a directory containing a data file, it 
uses that file.
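
The MapFile rule above can be sketched as follows. This is an illustration only, written against the local filesystem with java.nio.file rather than Hadoop's FileSystem API; the class MapFileResolver and its method are hypothetical names, and only the file name "data" comes from the description above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not Hadoop code: applies the "directory with a data file" rule.
final class MapFileResolver {

    // Name of the data file inside a MapFile directory, as described in the thread.
    static final String DATA_FILE_NAME = "data";

    // If input is a directory containing a regular file named "data",
    // return that file; otherwise return the input unchanged.
    static Path resolve(Path input) {
        if (Files.isDirectory(input)) {
            Path data = input.resolve(DATA_FILE_NAME);
            if (Files.isRegularFile(data)) {
                return data;
            }
        }
        return input;
    }

    public static void main(String[] args) {
        Path input = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(resolve(input));
    }
}
```

Under this rule a first-level directory is treated as a single input, not as a container to descend into, which is why recursing by default would change how existing inputs are read.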

Another reason is that a flat directory of files is what reduces generate.

Neither reason implies that this is the best or only way of doing 
things.  It would probably be better if FileInputFormat optionally 
supported recursive file enumeration.  (It would be incompatible and 
thus cannot be the default mode.)
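
One way to picture the optional mode suggested above is a boolean switch that defaults to the existing one-level behavior. The sketch below is a local-filesystem illustration using java.nio.file, not a proposal for the actual FileInputFormat API; the class InputEnumerator and the recursive parameter are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not Hadoop code: opt-in recursive file enumeration.
final class InputEnumerator {

    // Default mode: only regular files directly under dir (compatible behavior).
    static List<Path> listInputs(Path dir) throws IOException {
        return listInputs(dir, false);
    }

    // With recursive=true, regular files at any depth are returned as well.
    static List<Path> listInputs(Path dir, boolean recursive) throws IOException {
        int maxDepth = recursive ? Integer.MAX_VALUE : 1;
        try (Stream<Path> paths = Files.walk(dir, maxDepth)) {
            return paths.filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        boolean recursive = args.length > 1 && Boolean.parseBoolean(args[1]);
        for (Path p : listInputs(dir, recursive)) {
            System.out.println(p);
        }
    }
}
```

Callers that never pass the flag see the same results as before, so existing jobs are unaffected.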

Please file an issue in Jira for this and attach your patch.

Thanks,

Doug

Ian Soboroff wrote:
> Is there a reason FileInputFormat only traverses the first level of 
> directories in its InputPaths?  (i.e., given an InputPath of 'foo', it 
> will get foo/* but not foo/bar/*).
> 
> I wrote a full depth-first traversal in my custom InputFormat which I 
> can offer as a patch.  But to do it I had to duplicate the PathFilter 
> classes in FileInputFormat which are marked private, so a mainline patch 
> would also touch FileInputFormat.
> 
> Ian
>