Posted to common-dev@hadoop.apache.org by Latha <us...@gmail.com> on 2008/10/18 15:08:27 UTC

supporting WordCount example for multiple level directories

Hi All

Greetings
The WordCount example at
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html works fine
for the following directory structure:

inputdir -> file1
         -> file2
         -> file3

And it does not work for
inputdir -> dir1 -> innerfile1
         -> file1
         -> file2
         -> dir2
For this second scenario, we get an error like:
----
 branch-0.17]$ bin/hadoop jar wordcount.jar org.myorg.WordCount toplevel outlevel
08/10/18 05:58:14 INFO mapred.FileInputFormat: Total input paths to process : 3
java.io.IOException: Not a file: hdfs://localhost:5310/user/username/inputdir/dir1
----


So, when it encounters an entry that is not a file, it bails out by throwing
an IOException.
In FileInputFormat.java, I would like to call a recursive procedure from the
following piece of code, so that all the files at the leaf level of the entire
directory structure are included in the paths to be searched. If anyone has
already done this, please help me achieve the same.
------------------------------------------------------------------------------------------
public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {
    Path[] files = listPaths(job);
    long totalSize = 0;                           // compute total size
    for (int i = 0; i < files.length; i++) {      // check we have valid files
      Path file = files[i];
      FileSystem fs = file.getFileSystem(job);
      if (fs.isDirectory(file) || !fs.exists(file)) {
        throw new IOException("Not a file: "+files[i]);
      }
      totalSize += fs.getLength(files[i]);
    }
....
------------------------------------------------------

Should we reset "mapred.input.dir" to each inner directory and call
getInputPaths recursively?
Please help me get all the file paths, irrespective of their depth.
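
Something along these lines is what I have in mind (only a rough sketch; it
assumes FileSystem.listStatus, which later Hadoop releases provide, and the
helper name collectFiles is made up):

------------------------------------------------------------------------------------------
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecursiveLister {
  // Walk the tree under 'path' and collect every regular file,
  // however deeply it is nested.
  public static void collectFiles(FileSystem fs, Path path, List<Path> result)
      throws IOException {
    FileStatus[] entries = fs.listStatus(path);
    if (entries == null) {
      return;                                       // path does not exist
    }
    for (FileStatus status : entries) {
      if (status.isDir()) {
        collectFiles(fs, status.getPath(), result); // descend into subdirectory
      } else {
        result.add(status.getPath());               // leaf file: keep it
      }
    }
  }
}
------------------------------------------------------

getSplits could then build its Path[] from the collected list instead of the
flat listPaths() result.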

Thank you
Srilatha

Re: supporting WordCount example for multiple level directories

Posted by Owen O'Malley <om...@apache.org>.
On Oct 18, 2008, at 6:08 AM, Latha wrote:

> And it does not work for
> inputdir -> dir1 -> innerfile1
>          -> file1
>          -> file2
>          -> dir2

Typically you don't mix files and directories at the same level. The
easiest way to get the desired result would be to use a pattern to
list the files and directories to read:

inputdir/dir*,inputdir/file*

would glob out to dir1, dir2, file1, and file2. The files would just  
include themselves and the directories would expand one level.

-- Owen
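
In the job driver, that pattern could be wired in roughly as follows (a
sketch only; the static FileInputFormat.setInputPaths shown here is from the
later org.apache.hadoop.mapred API, so a 0.17-era driver may need the
corresponding JobConf setters instead):

------------------------------------------------------------------------------------------
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class GlobInputExample {
  public static void configureInput(JobConf conf) {
    // file* matches the top-level files directly; dir* matches the
    // subdirectories, each of which FileInputFormat expands one level
    // into the files it contains.
    FileInputFormat.setInputPaths(conf,
        new Path("inputdir/file*"),
        new Path("inputdir/dir*"));
  }
}
------------------------------------------------------

Note this only reaches files one level below inputdir; deeper nesting would
need a deeper pattern (e.g. inputdir/dir*/dir*) or a recursive listing like
the one sketched earlier in the thread.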

Re: supporting WordCount example for multiple level directories

Posted by Latha <us...@gmail.com>.
Apologies for pasting a wrong command. Please find the correct command I
used:

----
 branch-0.17]$ bin/hadoop jar wordcount.jar org.myorg.WordCount inputdir outdir
08/10/18 05:58:14 INFO mapred.FileInputFormat: Total input paths to process : 3
java.io.IOException: Not a file: hdfs://localhost:5310/user/username/inputdir/dir1
...
...
----

And inputdir has two subdirectories, "dir1" and "dir2", and a file, "file1".

My requirement is to run WordCount on all the files in all subdirectories.
Please suggest an idea.

Regards,
Srilatha


On Sat, Oct 18, 2008 at 6:38 PM, Latha <us...@gmail.com> wrote:

> Hi All
>
> Greetings
> The WordCount example at
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html works fine
> for the following directory structure:
> [...]
> Please help me get all the file paths, irrespective of their depth.
>
> Thank you
> Srilatha
>
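
As a quick way to verify what a pattern matches before running the job, the
glob can be expanded directly against the filesystem (a sketch; globStatus
and the {a,b} alternation syntax are from releases after 0.17):

------------------------------------------------------------------------------------------
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Expand the same pattern FileInputFormat would see and print the
    // matching paths, so the job's input set can be checked up front.
    FileStatus[] matches = fs.globStatus(new Path("inputdir/{dir*,file*}"));
    if (matches != null) {                        // null when nothing matches
      for (FileStatus status : matches) {
        System.out.println(status.getPath());
      }
    }
  }
}
------------------------------------------------------

For the layout described above, this would print file1, dir1, and dir2; the
files inside dir1 and dir2 are then picked up when FileInputFormat expands
each directory one level, as Owen described.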
