Posted to common-user@hadoop.apache.org by Tarandeep Singh <ta...@gmail.com> on 2008/04/01 18:15:03 UTC

Hadoop input path - can it have subdirectories

Hi,

Can I give a directory (containing subdirectories) as the input path to a
Hadoop Map-Reduce job? I tried, but got an error.

Can Hadoop recursively traverse the input directory and collect all the
file names, or does the input path have to be just a directory containing
files (and no sub-directories)?

-Taran

RE: Hadoop input path - can it have subdirectories

Posted by Jeff Eastman <je...@windwardsolutions.com>.
My experience running with the Java API is that subdirectories in the input
path do cause an exception, so the streaming file input processing must be
different. 

Jeff Eastman
 

> -----Original Message-----
> From: Norbert Burger [mailto:norbert.burger@gmail.com]
> Sent: Tuesday, April 01, 2008 9:46 AM
> To: core-user@hadoop.apache.org
> Cc: hadoop-user@lucene.apache.org
> Subject: Re: Hadoop input path - can it have subdirectories
> 
> Yes, this is fine, at least for Hadoop Streaming.  I specify the root of
> my
> logs directory as my -input parameter, and Hadoop correctly finds all of
> child directories.  What's the error you're seeing?  Is a stack trace
> available?
> 
> Norbert
> 
> On Tue, Apr 1, 2008 at 12:15 PM, Tarandeep Singh <ta...@gmail.com>
> wrote:
> 
> > Hi,
> >
> > Can I give a directory (having subdirectories) as input path to Hadoop
> > Map-Reduce Job.
> > I tried, but got error.
> >
> > Can Hadoop recursively traverse the input directory and collect all
> > the file names or the input path has to be just a directory containing
> > files (and no sub-directories) ?
> >
> > -Taran
> >



Re: Hadoop input path - can it have subdirectories

Posted by Norbert Burger <no...@gmail.com>.
Yes, this is fine, at least for Hadoop Streaming.  I specify the root of my
logs directory as my -input parameter, and Hadoop correctly finds all of the
child directories.  What's the error you're seeing?  Is a stack trace
available?

Norbert

On Tue, Apr 1, 2008 at 12:15 PM, Tarandeep Singh <ta...@gmail.com>
wrote:

> Hi,
>
> Can I give a directory (having subdirectories) as input path to Hadoop
> Map-Reduce Job.
> I tried, but got error.
>
> Can Hadoop recursively traverse the input directory and collect all
> the file names or the input path has to be just a directory containing
> files (and no sub-directories) ?
>
> -Taran
>

Re: Hadoop input path - can it have subdirectories

Posted by Norbert Burger <no...@gmail.com>.
Sorry, I was wrong.  I just checked my installation, and by default,
streaming appears to work as people have described -- it doesn't recurse
subdirectories.  If I pass a directory containing only directories as the
-input parameter, I get the following error:

08/04/01 15:00:56 ERROR streaming.StreamJob: Error Launching job : Not a
file: hdfs://10.188.239.122:9000/kpi2/kpi3
Streaming Job Failed!

It will, however, enumerate all files in the input directory, provided the
input directory contains only files.  I think that's what confused me.

On Tue, Apr 1, 2008 at 1:51 PM, Andrzej Bialecki <ab...@getopt.org> wrote:

> Peeyush Bishnoi wrote:
> > Hello ,
> >
> > No Hadoop can't  traverse recursively inside subdirectory with Java
> > Map-Reduce program. It have to be just directory containing files
> > (and no sub-directories).
>
> That's not the case.
>
> This is actually a characteristic of the InputFormat that you're using.
> Hadoop reads data using InputFormat-s, and standard implementations may
> indeed not support subdirectories - but if you need this functionality
> you can implement your own InputFormat.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Hadoop input path - can it have subdirectories

Posted by Andrzej Bialecki <ab...@getopt.org>.
Peeyush Bishnoi wrote:
> Hello ,
> 
> No Hadoop can't  traverse recursively inside subdirectory with Java
> Map-Reduce program. It have to be just directory containing files
> (and no sub-directories).

That's not the case.

This is actually a characteristic of the InputFormat that you're using.
Hadoop reads data using InputFormat-s, and standard implementations may
indeed not support subdirectories - but if you need this functionality
you can implement your own InputFormat.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
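
Andrzej's suggestion can be illustrated without depending on a particular
Hadoop release: the core of a recursion-aware InputFormat is a listing step
that walks the directory tree and keeps only plain files. Below is a minimal,
self-contained sketch of that listing logic in plain Java. The class and
method names are hypothetical, and a real implementation would override the
listing hook of FileInputFormat and use Hadoop's FileSystem/FileStatus API
instead of java.io.File:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: the recursive listing a custom InputFormat would do. */
public class RecursiveLister {

    /** Collects all regular files under dir, descending into subdirectories. */
    public static List<File> listFiles(File dir) {
        List<File> result = new ArrayList<File>();
        File[] entries = dir.listFiles();
        if (entries == null) {
            return result; // not a directory, or unreadable
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                // Recurse into the subdirectory instead of failing on it,
                // which is exactly where the stock implementations stop.
                result.addAll(listFiles(entry));
            } else {
                result.add(entry);
            }
        }
        return result;
    }
}
```

A custom InputFormat performing this walk in its listing step would return
the collected files as input paths, so that monthData/week1 and
monthData/week2 (from the example later in the thread) both contribute
splits even when only monthData is passed as -input.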


Re: Hadoop input path - can it have subdirectories

Posted by Tarandeep Singh <ta...@gmail.com>.
thanks Ted for that info [or hack :) ]

I had this directory structure -
  monthData
   |- week1
   |- week2

if I give monthData directory as input path, I get exception -

08/04/01 14:24:30 INFO mapred.FileInputFormat: Total input paths to process : 2
Exception in thread "main" java.io.IOException: Not a file:
hdfs://master:54310/user/hadoop/monthData/week1
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:170)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:515)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
        at com.bdc.dod.dashboard.BDCQueryStatsViewer.run(BDCQueryStatsViewer.java:829)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at com.bdc.dod.dashboard.BDCQueryStatsViewer.main(BDCQueryStatsViewer.java:796)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)


However, if I give monthData/* as the input path, the job runs :)

But I feel the default behavior should be to recurse into the parent
directory and collect all the files in the subdirectories. It is not
difficult to imagine a situation where one wants to partition the data
into subdirectories under a single base directory, so that the job can
be run either on the base directory or on one subdirectory.

-Taran
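
The monthData/* workaround above relies on glob expansion: the wildcard
expands to the week1 and week2 directories, each of which then contains only
files, so the input splitter never sees a directory entry. The matching
semantics can be sketched with the JDK's built-in glob support (java.nio,
Java 7+; this is an illustration of glob matching only, not Hadoop's own
path-expansion code, and the monthData paths are the ones from the example
above):

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {
    public static void main(String[] args) {
        // "monthData/*" matches the immediate children of monthData
        // (week1 and week2), but not monthData itself: '*' never
        // crosses a directory boundary.
        PathMatcher matcher =
                FileSystems.getDefault().getPathMatcher("glob:monthData/*");

        System.out.println(matcher.matches(Paths.get("monthData/week1"))); // true
        System.out.println(matcher.matches(Paths.get("monthData")));       // false
    }
}
```

Because the glob expands to the subdirectories themselves, each expanded
path satisfies the "directory containing only files" constraint that the
stock input handling enforces.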

On Tue, Apr 1, 2008 at 12:04 PM, Ted Dunning <td...@veoh.com> wrote:
>
>  But wildcards that match directories that contain files work well.
>
>
>
>
>  On 4/1/08 10:41 AM, "Peeyush Bishnoi" <pe...@yahoo-inc.com> wrote:
>
>  > Hello ,
>  >
>  > No Hadoop can't  traverse recursively inside subdirectory with Java Map-Reduce
>  > program. It have to be just directory containing files (and no
>  > sub-directories).
>  >
>  >
>  > ---
>  > Peeyush
>  >
>  >
>  > -----Original Message-----
>  > From: Tarandeep Singh [mailto:tarandeep@gmail.com]
>  > Sent: Tue 4/1/2008 9:15 AM
>  > To: hadoop-user@lucene.apache.org
>  > Subject: Hadoop input path - can it have subdirectories
>  >
>  > Hi,
>  >
>  > Can I give a directory (having subdirectories) as input path to Hadoop
>  > Map-Reduce Job.
>  > I tried, but got error.
>  >
>  > Can Hadoop recursively traverse the input directory and collect all
>  > the file names or the input path has to be just a directory containing
>  > files (and no sub-directories) ?
>  >
>  > -Taran
>  >
>
>

Re: Hadoop input path - can it have subdirectories

Posted by Ted Dunning <td...@veoh.com>.
But wildcards that match directories that contain files work well.


On 4/1/08 10:41 AM, "Peeyush Bishnoi" <pe...@yahoo-inc.com> wrote:

> Hello ,
> 
> No Hadoop can't  traverse recursively inside subdirectory with Java Map-Reduce
> program. It have to be just directory containing files (and no
> sub-directories).
> 
> 
> ---
> Peeyush
> 
> 
> -----Original Message-----
> From: Tarandeep Singh [mailto:tarandeep@gmail.com]
> Sent: Tue 4/1/2008 9:15 AM
> To: hadoop-user@lucene.apache.org
> Subject: Hadoop input path - can it have subdirectories
>  
> Hi,
> 
> Can I give a directory (having subdirectories) as input path to Hadoop
> Map-Reduce Job.
> I tried, but got error.
> 
> Can Hadoop recursively traverse the input directory and collect all
> the file names or the input path has to be just a directory containing
> files (and no sub-directories) ?
> 
> -Taran
> 


RE: Hadoop input path - can it have subdirectories

Posted by Peeyush Bishnoi <pe...@yahoo-inc.com>.
Hello,

No, Hadoop can't traverse recursively inside subdirectories with a Java
Map-Reduce program. The input path has to be just a directory containing
files (and no sub-directories).


---
Peeyush


-----Original Message-----
From: Tarandeep Singh [mailto:tarandeep@gmail.com]
Sent: Tue 4/1/2008 9:15 AM
To: hadoop-user@lucene.apache.org
Subject: Hadoop input path - can it have subdirectories
 
Hi,

Can I give a directory (having subdirectories) as input path to Hadoop
Map-Reduce Job.
I tried, but got error.

Can Hadoop recursively traverse the input directory and collect all
the file names or the input path has to be just a directory containing
files (and no sub-directories) ?

-Taran