Posted to common-dev@hadoop.apache.org by Latha <us...@gmail.com> on 2008/10/18 15:08:27 UTC
supporting WordCount example for multiple level directories
Hi All
Greetings
The WordCount example at
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html works fine
for the following directory structure:
inputdir -> file1
         -> file2
         -> file3
And it does not work for
inputdir -> dir1 -> innerfile1
         -> file1
         -> file2
         -> dir2
For this second scenario we get an error like:
----
branch-0.17]$ bin/hadoop jar wordcount.jar org.myorg.WordCount toplevel outlevel
08/10/18 05:58:14 INFO mapred.FileInputFormat: Total input paths to process : 3
java.io.IOException: Not a file: hdfs://localhost:5310/user/username/inputdir/dir1
----
So, when it encounters an entry that is not a file, it bails out by throwing
an IOException.
In FileInputFormat.java, I would like to call a recursive procedure in the
following piece of code, so that all files at the leaf level of the entire
directory structure are included in the paths to be processed. If anyone has
already done this, please help me achieve the same.
------------------------------------------------------------------------------------------
public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {
  Path[] files = listPaths(job);
  long totalSize = 0;                        // compute total size
  for (int i = 0; i < files.length; i++) {   // check we have valid files
    Path file = files[i];
    FileSystem fs = file.getFileSystem(job);
    if (fs.isDirectory(file) || !fs.exists(file)) {
      throw new IOException("Not a file: " + files[i]);
    }
    totalSize += fs.getLength(files[i]);
  }
  ....
------------------------------------------------------
Should we reset "mapred.input.dir" to the inner directory and call
getInputPaths recursively?
Please help me to get all the file paths, irrespective of their depth.
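For reference, the recursive descent being asked about is straightforward; below is a minimal sketch using plain java.io.File purely to illustrate the shape (the class and helper names are hypothetical, and HDFS code would use FileSystem/Path/FileStatus instead):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ListLeaves {
  // Collect every regular file beneath 'root', descending into
  // subdirectories instead of rejecting them with an exception.
  public static void collectFiles(File root, List<File> out) {
    File[] entries = root.listFiles();
    if (entries == null) return;       // not a directory, or unreadable
    for (File entry : entries) {
      if (entry.isDirectory()) {
        collectFiles(entry, out);      // recurse one level deeper
      } else {
        out.add(entry);                // leaf-level file: keep it
      }
    }
  }

  public static void main(String[] args) {
    List<File> files = new ArrayList<>();
    collectFiles(new File(args.length > 0 ? args[0] : "."), files);
    for (File f : files) {
      System.out.println(f.getPath());
    }
  }
}
```

The same shape should carry over to the HDFS API, accumulating only leaf files into the array that getSplits iterates over, assuming your Hadoop version exposes an equivalent directory-listing call.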
Thank you
Srilatha
Re: supporting WordCount example for multiple level directories
Posted by Owen O'Malley <om...@apache.org>.
On Oct 18, 2008, at 6:08 AM, Latha wrote:
> And it does not work for
> inputdir -> dir1 -> innerfile1
>          -> file1
>          -> file2
>          -> dir2
Typically you don't mix files and directories in the same level. The
easiest way to get the desired result would be to use a pattern to
list the files and directories to read:
inputdir/dir*,inputdir/file*
would glob out to dir1, dir2, file1, and file2. The files would just
include themselves and the directories would expand one level.
-- Owen
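As a side note, the expansion Owen describes is ordinary glob matching. The java.nio matcher below illustrates the semantics only; it is a stand-in for the pattern handling inside Hadoop, not the actual HDFS call:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {
  public static void main(String[] args) {
    // The input spec is a comma-separated list; each element may glob.
    // "inputdir/dir*" picks up the directories, "inputdir/file*" the files.
    PathMatcher dirPat  = FileSystems.getDefault().getPathMatcher("glob:inputdir/dir*");
    PathMatcher filePat = FileSystems.getDefault().getPathMatcher("glob:inputdir/file*");
    for (String e : new String[] {"inputdir/dir1", "inputdir/dir2",
                                  "inputdir/file1", "inputdir/file2"}) {
      boolean hit = dirPat.matches(Paths.get(e)) || filePat.matches(Paths.get(e));
      System.out.println(e + (hit ? " matches" : " does not match"));
    }
  }
}
```

In job terms this would correspond to passing 'inputdir/dir*,inputdir/file*' as the input argument, which, per Owen's reply, globs to dir1, dir2, file1, and file2, with each directory then expanding one level.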
Re: supporting WordCount example for multiple level directories
Posted by Latha <us...@gmail.com>.
Apologies for pasting a wrong command. Please find the correct command I
used.
----
branch-0.17]$ bin/hadoop jar wordcount.jar org.myorg.WordCount inputdir outdir
08/10/18 05:58:14 INFO mapred.FileInputFormat: Total input paths to process : 3
java.io.IOException: Not a file: hdfs://localhost:5310/user/username/inputdir/dir1
...
...
----
And inputdir has two subdirectories, "dir1" and "dir2", and a file "file1".
My requirement is to run WordCount over all the files in all subdirectories.
Please suggest an idea.
Regards,
Srilatha
On Sat, Oct 18, 2008 at 6:38 PM, Latha <us...@gmail.com> wrote:
> Hi All
>
> Greetings
> The WordCount example at
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html works fine
> for the following directory structure:
>
> inputdir -> file1
>          -> file2
>          -> file3
>
> And it does not work for
> inputdir -> dir1 -> innerfile1
>          -> file1
>          -> file2
>          -> dir2
> For this second scenario we get an error like:
> ----
> branch-0.17]$ bin/hadoop jar wordcount.jar org.myorg.WordCount toplevel outlevel
> 08/10/18 05:58:14 INFO mapred.FileInputFormat: Total input paths to process : 3
> java.io.IOException: Not a file: hdfs://localhost:5310/user/username/inputdir/dir1
> ----
>
>
> So, when it encounters an entry that is not a file, it bails out by throwing
> an IOException.
> In FileInputFormat.java, I would like to call a recursive procedure in the
> following piece of code, so that all files at the leaf level of the entire
> directory structure are included in the paths to be processed. If anyone has
> already done this, please help me achieve the same.
>
> ------------------------------------------------------------------------------------------
> public InputSplit[] getSplits(JobConf job, int numSplits)
>     throws IOException {
>   Path[] files = listPaths(job);
>   long totalSize = 0;                        // compute total size
>   for (int i = 0; i < files.length; i++) {   // check we have valid files
>     Path file = files[i];
>     FileSystem fs = file.getFileSystem(job);
>     if (fs.isDirectory(file) || !fs.exists(file)) {
>       throw new IOException("Not a file: " + files[i]);
>     }
>     totalSize += fs.getLength(files[i]);
>   }
>   ....
> ------------------------------------------------------
>
> Should we reset "mapred.input.dir" to the inner directory and call
> getInputPaths recursively?
> Please help me to get all the file paths, irrespective of their depth.
>
> Thank you
> Srilatha
>