Posted to common-user@hadoop.apache.org by Jim the Standing Bear <st...@gmail.com> on 2007/12/18 02:40:30 UTC

question on file, inputformats and outputformats

Hi,

I have looked at the different file types and input/output formats, but I
got quite confused and am not sure how to connect the pipes from one
format to another.

Here is what I would like to do:

1. Pass in a string to my hadoop program, and it will write this
single key-value pair to a file on the fly.

2. The first job will read from this file, do some processing, and
write more key-value pairs to other files (in the same format as the file
in step 1). Subsequent jobs will read from the files generated by
the first job. This will continue iteratively until some
terminal condition is reached.

3. Both the key and the value in the file should be text (i.e. human-readable
ASCII).

While this sounds simple, I have been having trouble figuring out the
correct formats to use, and here is why:

JobConf.setInputKeyClass and setInputValueClass are both deprecated,
so I am avoiding them.

SequenceFileOutputFormat doesn't work because the key has to be
LongWritable, and a Text key causes the code to blow up (which I still
don't quite understand, because when I use a SequenceFile.Writer,
it can take Text for both keys and values).

KeyValueTextInputFormat looks promising, but I am not sure how to
bootstrap the first file mentioned in step 1, i.e. what format and
writer I should use to create the file that holds the initial argument...

I have a feeling that this is actually a very simple problem, only
that I am not looking in the right direction.  Your help would be
greatly appreciated.

-- Jim

Re: question on file, inputformats and outputformats

Posted by Ted Dunning <td...@veoh.com>.
Just do:

$ echo -e "DIR\t/foo/bar/directory" > file
$ hadoop dfs -put file hfile

And you've got yourself a file.

On 12/17/07 7:10 PM, "Jim the Standing Bear" <st...@gmail.com> wrote:

> Hi Ted,
> 
> I guess I didn't make it clear enough.  I don't have a file to start
> with.  When I run the program, I pass in an argument.  The program,
> before doing its map/red jobs, is supposed to create a file on the
> DFS, and saves whatever I just passed in.  And my trouble is, I am not
> sure how to create such a file so that both the key and values are
> clear Text, and they can subsequently be read by
> KeyValueTextInputFormat.
> 
> On Dec 17, 2007 10:07 PM, Ted Dunning <td...@veoh.com> wrote:
>> 
>> 
>> I thought that is what your input file already was.  The
>> KeyValueTextInputFormat should read your input as-is.
>> 
>> When you write out your intermediate values, just make sure that you use
>> TextOutputFormat and put "DIR" as the key and the directory name as the
>> value (same with files).
>> 
>> 
>> 
>> On 12/17/07 6:46 PM, "Jim the Standing Bear" <st...@gmail.com> wrote:
>> 
>>> With KeyValueTextInputFormat, the problem is not reading it - I know
>>> how to set the separator byte and all that... my problem is with
>>> creating the very first file - I simply don't know how.
>> 
>> 
> 
> 


Re: question on file, inputformats and outputformats

Posted by Jim the Standing Bear <st...@gmail.com>.
Hi Ted,

I guess I didn't make it clear enough.  I don't have a file to start
with.  When I run the program, I pass in an argument.  The program,
before running its map/reduce jobs, is supposed to create a file on the
DFS and save whatever I just passed in.  My trouble is that I am not
sure how to create such a file so that both the keys and values are
plain Text, and they can subsequently be read by
KeyValueTextInputFormat.

On Dec 17, 2007 10:07 PM, Ted Dunning <td...@veoh.com> wrote:
>
>
> I thought that is what your input file already was.  The
> KeyValueTextInputFormat should read your input as-is.
>
> When you write out your intermediate values, just make sure that you use
> TextOutputFormat and put "DIR" as the key and the directory name as the
> value (same with files).
>
>
>
> On 12/17/07 6:46 PM, "Jim the Standing Bear" <st...@gmail.com> wrote:
>
> > With KeyValueTextInputFormat, the problem is not reading it - I know
> > how to set the separator byte and all that... my problem is with
> > creating the very first file - I simply don't know how.
>
>



-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------

Re: question on file, inputformats and outputformats

Posted by Ted Dunning <td...@veoh.com>.

I thought that is what your input file already was.  The
KeyValueTextInputFormat should read your input as-is.

When you write out your intermediate values, just make sure that you use
TextOutputFormat and put "DIR" as the key and the directory name as the
value (same with files).
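
In the map, that boils down to a couple of collect calls, something like this
(just a sketch; entryPath and isDir are placeholders for whatever your
directory-listing code produces):

    // for each entry found under the directory named by the incoming value:
    // tag it DIR or FILE and emit its path as a tab-separated text line
    String tag = isDir ? "DIR" : "FILE";
    output.collect(new Text(tag), new Text(entryPath.toString()));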


On 12/17/07 6:46 PM, "Jim the Standing Bear" <st...@gmail.com> wrote:

> With KeyValueTextInputFormat, the problem is not reading it - I know
> how to set the separator byte and all that... my problem is with
> creating the very first file - I simply don't know how.


Re: question on file, inputformats and outputformats

Posted by Jim the Standing Bear <st...@gmail.com>.
Hi Ted,

Yes, I got quite confused and picked TextInputFormat because I thought
it would be easy to understand.

To be more specific on what I am trying to do:

I pass in the path to a directory (say "/usr/mydir/bigtree").  The
code writes this to a file:  DIR <TAB> /usr/mydir/bigtree

The job will read data from the file, and if it gets a DIR, it will
walk into that directory, list everything it contains, and write the
contents to another file.  Sub-directories will have "DIR" as their
keys, and files will have "FILE".  Then the same job configuration
will read the new data file and do the same thing again and
again, until there are no more directories to walk.  So in the
end, there should be a file listing all the files under a directory
(not necessarily directly under it).
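
Roughly, the driver loop I have in mind would look like this (only a sketch -
DirWalker and stillHasDirectories are made-up names, and the stop test is
hand-waved):

<code>
// (classes from org.apache.hadoop.mapred, org.apache.hadoop.io, org.apache.hadoop.fs)
Path input = seedPath;                          // the file holding the initial DIR<TAB>path line
int round = 0;
boolean moreDirs = true;
while (moreDirs) {
    JobConf walk = new JobConf(DirWalker.class);
    walk.setJobName("walk-" + round);
    walk.setInputFormat(KeyValueTextInputFormat.class);
    walk.setInputPath(input);
    Path output = new Path("walk-out-" + round++);
    walk.setOutputPath(output);
    walk.setOutputFormat(TextOutputFormat.class);
    walk.setOutputKeyClass(Text.class);
    walk.setOutputValueClass(Text.class);
    walk.setMapperClass(DirWalker.class);
    JobClient.runJob(walk);
    input = output;                             // next round reads what this round wrote
    moreDirs = stillHasDirectories(output);     // made-up helper: did this round emit any DIR lines?
}
</code>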

Now that you told me about the generics, I am hoping the reason the
sequence file didn't work for me is that I didn't set the correct
types. I shall try that again.

With KeyValueTextInputFormat, the problem is not reading it - I know
how to set the separator byte and all that... my problem is with
creating the very first file - I simply don't know how.  I can use
SequenceFile.Writer to write the key and value, but the file contains
a header, some funny-looking separator and sync bytes.  If I simply
want a file containing clean Key<Text>\tValue<Text>, I don't know what
kind of Writer to use to create it.  Do you know of a way?  Thanks.

-- Jim

On Dec 17, 2007 9:01 PM, Ted Dunning <td...@veoh.com> wrote:
>
>
> Part of your problem is that you appear to be using a TextInputFormat (the
> default input format).  The TIF produces keys that are LongWritable and
> values that are Text.
>
> Other input formats produce different types.
>
> With recent versions of hadoop, classes that extend InputFormatBase can (and
> I think should) use generics to describe their output types.  Similarly,
> classes extending MapReduceBase and OutputFormat can specify input/output
> classes and output classes respectively.
>
> I have added more specific comments in-line.
>
> On 12/17/07 5:40 PM, "Jim the Standing Bear" <st...@gmail.com> wrote:
>
>
> > 1. Pass in a string to my hadoop program, and it will write this
> > single key-value pair to a file on the fly.
>
> How is your string a key-value pair?
>
> Assuming that you have something as simple as tab-delimited text, you may
> not need to do anything at all other than just copy this data into hadoop.
>
> > 2. The first job will read from this file, do some processing, and
> > write more key-value pairs to other files (the same format as the file
> > in step 1). Subsequent jobs will read from those files generated by
> > the first job. This will continue in an iterative manner until some
> terminal condition is reached.
>
> Can you be more specific?
>
> Let's assume that you are reading tab-delimited data.  You should set the
> input format:
>
>         conf.setInputFormat(TextInputFormat.class);
>
> Then, since the output of your map will have a string key and value, you
> should tell the system this:
>
>        step1.setOutputKeyClass(Text.class);
>        step1.setOutputValueClass(Text.class);
>
> Note that the signature on your map function should be:
>
>    public static class JoinMap extends MapReduceBase
>     implements Mapper<LongWritable, Text, Text, Text> {
>             ...
>
>         public void map(LongWritable k, Text input,
>                         OutputCollector<Text, Text> output,
>                         Reporter reporter) throws IOException {
>             String[] parts = input.toString().split("\t");
>
>             Text key, result;
>                 ...
>             output.collect(key, result);
>         }
>     }
>
> And your reduce should look something like this:
>
>     public static class JoinReduce extends MapReduceBase implements
>             Reducer<Text, Text, Text, Mumble> {
>
>         public void reduce(Text k, Iterator<Text> values,
>                            OutputCollector<Text, Mumble> output,
>                            Reporter reporter) throws IOException {
>             Text key;
>             Mumble result;
>                 ....
>             output.collect(key, result);
>         }
>     }
>
>
> > KeyValueTextInputFormat looks promising
>
> This could work, depending on what data you have for input.  Set the
> separator byte to be whatever separates your key from your value and off you
> go.
>
>
>
>



-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------

Re: question on file, inputformats and outputformats

Posted by Ted Dunning <td...@veoh.com>.
The program can create the file just as easily as the shell commands that I
gave you.  You can open an output stream to a file in the hadoop file system
and write the seed data.
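
Something along these lines should do it (typing from memory, so treat it as a
sketch; the seed path and the use of args[0] are just examples):

    // (classes from org.apache.hadoop.conf and org.apache.hadoop.fs)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path seed = new Path("seed/part-00000");
    FSDataOutputStream out = fs.create(seed);
    // one tab-separated key/value line, which KeyValueTextInputFormat can read back
    out.write(("DIR\t" + args[0] + "\n").getBytes("UTF-8"));
    out.close();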


On 12/17/07 7:33 PM, "Jim the Standing Bear" <st...@gmail.com> wrote:

> Hi Ted,
> 
> I see... I was hoping that the program could create it instead of
> having the user do it, but I guess hadoop is not really meant to be
> interactive/user-friendly.
> 
> About the second step and why I didn't say what input format it
> used... In the code, I did specify the format.  However, it depended
> upon the file output formats I used in the first step.    Because I
> got so confused, I thought it would be more important to nail down the
> correct output format in the first step.
> 
> -- Jim
> 
> On Dec 17, 2007 10:24 PM, Ted Dunning <td...@veoh.com> wrote:
>> 
>> I was saying that you didn't do it and probably should have.
>> 
>> 
>> 
>> On 12/17/07 7:12 PM, "Jim the Standing Bear" <st...@gmail.com> wrote:
>> 
>>> When you said "You never set the input format in the second step",
>>> were you instructing me NOT to set input format in the second step, or
>>> were you asking me why I never set it in the second step?
>>> 
>>> 
>>> On Dec 17, 2007 10:09 PM, Ted Dunning <td...@veoh.com> wrote:
>>>> 
>>>> You never set the input format in the second step.
>>>> 
>>>> But I think you want to stay with your KeyValueTextInputFormat for input
>>>> and
>>>> TextOutputFormat for output.
>>>> 
>>>> 
>>>> 
>>>> On 12/17/07 7:03 PM, "Jim the Standing Bear" <st...@gmail.com>
>>>> wrote:
>>>> 
>>>>> 
>>>>> So that's a part of the reason that I am having trouble connecting the
>>>>> pipes - it seems to me that SequenceFile and SequenceFileOutputFormat
>>>>> are talking about two different kinds of "sequence files"...
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 


Re: question on file, inputformats and outputformats

Posted by Jim the Standing Bear <st...@gmail.com>.
Hi Ted,

I see... I was hoping that the program could create it instead of
having the user do it, but I guess hadoop is not really meant to be
interactive/user-friendly.

About the second step and why I didn't say what input format it
used... In the code, I did specify the format.  However, it depended
upon the file output formats I used in the first step.    Because I
got so confused, I thought it would be more important to nail down the
correct output format in the first step.

-- Jim

On Dec 17, 2007 10:24 PM, Ted Dunning <td...@veoh.com> wrote:
>
> I was saying that you didn't do it and probably should have.
>
>
>
> On 12/17/07 7:12 PM, "Jim the Standing Bear" <st...@gmail.com> wrote:
>
> > When you said "You never set the input format in the second step",
> > were you instructing me NOT to set input format in the second step, or
> > were you asking me why I never set it in the second step?
> >
> >
> > On Dec 17, 2007 10:09 PM, Ted Dunning <td...@veoh.com> wrote:
> >>
> >> You never set the input format in the second step.
> >>
> >> But I think you want to stay with your KeyValueTextInputFormat for input and
> >> TextOutputFormat for output.
> >>
> >>
> >>
> >> On 12/17/07 7:03 PM, "Jim the Standing Bear" <st...@gmail.com> wrote:
> >>
> >>>
> >>> So that's a part of the reason that I am having trouble connecting the
> >>> pipes - it seems to me that SequenceFile and SequenceFileOutputFormat
> >>> are talking about two different kinds of "sequence files"...
> >>
> >>
> >
> >
>
>



-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------

Re: question on file, inputformats and outputformats

Posted by Ted Dunning <td...@veoh.com>.
I was saying that you didn't do it and probably should have.


On 12/17/07 7:12 PM, "Jim the Standing Bear" <st...@gmail.com> wrote:

> When you said "You never set the input format in the second step",
> were you instructing me NOT to set input format in the second step, or
> were you asking me why I never set it in the second step?
> 
> 
> On Dec 17, 2007 10:09 PM, Ted Dunning <td...@veoh.com> wrote:
>> 
>> You never set the input format in the second step.
>> 
>> But I think you want to stay with your KeyValueTextInputFormat for input and
>> TextOutputFormat for output.
>> 
>> 
>> 
>> On 12/17/07 7:03 PM, "Jim the Standing Bear" <st...@gmail.com> wrote:
>> 
>>> 
>>> So that's a part of the reason that I am having trouble connecting the
>>> pipes - it seems to me that SequenceFile and SequenceFileOutputFormat
>>> are talking about two different kinds of "sequence files"...
>> 
>> 
> 
> 


Re: question on file, inputformats and outputformats

Posted by Jim the Standing Bear <st...@gmail.com>.
When you said "You never set the input format in the second step",
were you instructing me NOT to set input format in the second step, or
were you asking me why I never set it in the second step?


On Dec 17, 2007 10:09 PM, Ted Dunning <td...@veoh.com> wrote:
>
> You never set the input format in the second step.
>
> But I think you want to stay with your KeyValueTextInputFormat for input and
> TextOutputFormat for output.
>
>
>
> On 12/17/07 7:03 PM, "Jim the Standing Bear" <st...@gmail.com> wrote:
>
> >
> > So that's a part of the reason that I am having trouble connecting the
> > pipes - it seems to me that SequenceFile and SequenceFileOutputFormat
> > are talking about two different kinds of "sequence files"...
>
>



-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------

Re: question on file, inputformats and outputformats

Posted by Jim the Standing Bear <st...@gmail.com>.
Hi Arun,

I did specify the input format.  The first job's output format is
SequenceFileOutputFormat, and the second job's input format is
SequenceFileInputFormat.  But it seems that the two formats don't
connect.
Is there a reason that setInputKeyClass and setInputValueClass are
being deprecated?  I saw these two being used, even in Nutch.

Please see the code snippet below:

<code>
        JobConf writeJob = new JobConf(SequenceFileIndexer.class);
        writeJob.setJobName("testing");
        writeJob.setInputFormat(SequenceFileInputFormat.class);
        writeJob.setInputPath(path);

        Path outPath = new Path("write-out");
        writeJob.setOutputPath(outPath);
        writeJob.setOutputFormat(SequenceFileOutputFormat.class);
        // note: no setOutputKeyClass/setOutputValueClass here, so this job
        // keeps the framework defaults
        writeJob.setMapperClass(SequenceFileIndexer.class);

        JobClient.runJob(writeJob);  // this job finished correctly

        JobConf secondJob = new JobConf(SequenceFileIndexer.class);
        secondJob.setJobName("second");
        secondJob.setInputFormat(SequenceFileInputFormat.class);
        secondJob.setInputPath(outPath);  // reads the first job's output
        secondJob.setOutputKeyClass(Text.class);
        secondJob.setOutputValueClass(Text.class);
        Path finalPath = new Path("final");
        secondJob.setOutputPath(finalPath);
        secondJob.setMapperClass(SequenceFileIndexer.class);
        // but this job blew up, complaining that the file format is not correct
        JobClient.runJob(secondJob);


        // the map method of SequenceFileIndexer: an identity map over <Text, Text>
        public void map(Text key, Text val,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {

            String x = val.toString();  // unused
            String k = key.toString();  // unused

            output.collect(key, val);
        }
</code>

On Dec 18, 2007 12:27 AM, Arun C Murthy <ar...@yahoo-inc.com> wrote:
> Jim,
>
>    Hopefully you've fixed this and gone ahead; just in case...
>
>    You were right in using SequenceFile with <Text, Text> as the
> key/value types for your first job.
>
>    The problem is that you did not specify an *input-format* for your
> second job. The Hadoop Map-Reduce framework assumes TextInputFormat as
> the default, which is <LongWritable, Text> and hence the
> behaviour/exceptions you ran into...
>
> hth,
> Arun
>
> PS: Do take a look at
> http://lucene.apache.org/hadoop/docs/r0.15.1/mapred_tutorial.html,
> specifically the section titled Job Input
> (http://lucene.apache.org/hadoop/docs/r0.15.1/mapred_tutorial.html#Job+Input).
>
> Do let us know how and where we should improve it... Thanks!
>
>
>
> Jim the Standing Bear wrote:
> > Just an update... my problem seems to be beyond defining generic types.
> >
> > Ted, I dont know if you have the answer for this question, which is
> > regarding SequenceFile.
> >
> > If I am to create a SequenceFile by hand, I can do the following:
> >
> > <code>
> > JobConf jobConf = new JobConf(MyClass.class);
> > JobClient jobClient = new JobClient(jobConf);
> >
> > FileSystem fileSystem = jobClient.getFs();
> > SequenceFile.Writer writer = SequenceFile.createWriter(fileSystem,
> > jobConf, path, Text.class, Text.class);
> >
> > </code>
> >
> > After that, I can write all Text-based keys and values by doing this:
> >
> > <code>
> > Text keyText = new Text();
> > keyText.set("mykey");
> >
> > Text valText = new Text();
> > valText.set("myval");
> >
> > writer.append(keyText, valText);
> > </code>
> >
> > As you can see, there is no LongWritable whatsoever.
> >
> > However, in a map/reduce job, if I am to specify
> > <code>
> > jobConf.setOutputFormat(SequenceFileOutputFormat.class);
> > </code>
> >
> > And later in the mapper, if I am to say
> > <code>
> > Text newkey = new Text();
> > newkey.set("AAA");
> >
> > Text newval = new Text();
> > newval.set("bbb");
> >
> > output.collect(newkey, newval);
> > </code>
> >
> > It would throw an exception, complaining that the key is not LongWritable.
> >
> > So that's a part of the reason that I am having trouble connecting the
> > pipes - it seems to me that SequenceFile and SequenceFileOutputFormat
> > are talking about two different kinds of "sequence files"...
>
>



-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------

Re: question on file, inputformats and outputformats

Posted by Arun C Murthy <ar...@yahoo-inc.com>.
Jim,

   Hopefully you've fixed this and gone ahead; just in case...

   You were right in using SequenceFile with <Text, Text> as the 
key/value types for your first job.

   The problem is that you did not specify an *input-format* for your 
second job. The Hadoop Map-Reduce framework assumes TextInputFormat as 
the default, which is <LongWritable, Text> and hence the 
behaviour/exceptions you ran into...
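
In other words, the settings that have to line up between the two jobs are
roughly these (a sketch, reusing the variable names from your snippet; if I
remember the defaults right, leaving the output key class unset means the
SequenceFile writer expects LongWritable keys):

    // job 1: explicitly declare <Text, Text> records for the SequenceFile it writes
    writeJob.setOutputFormat(SequenceFileOutputFormat.class);
    writeJob.setOutputKeyClass(Text.class);
    writeJob.setOutputValueClass(Text.class);

    // job 2: read those records back with the matching input format,
    // rather than falling back to the TextInputFormat default
    secondJob.setInputFormat(SequenceFileInputFormat.class);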

hth,
Arun

PS: Do take a look at 
http://lucene.apache.org/hadoop/docs/r0.15.1/mapred_tutorial.html, 
specifically the section titled Job Input 
(http://lucene.apache.org/hadoop/docs/r0.15.1/mapred_tutorial.html#Job+Input).

Do let us know how and where we should improve it... Thanks!


Jim the Standing Bear wrote:
> Just an update... my problem seems to be beyond defining generic types.
> 
> Ted, I dont know if you have the answer for this question, which is
> regarding SequenceFile.
> 
> If I am to create a SequenceFile by hand, I can do the following:
> 
> <code>
> JobConf jobConf = new JobConf(MyClass.class);
> JobClient jobClient = new JobClient(jobConf);
> 
> FileSystem fileSystem = jobClient.getFs();
> SequenceFile.Writer writer = SequenceFile.createWriter(fileSystem,
> jobConf, path, Text.class, Text.class);
> 
> </code>
> 
> After that, I can write all Text-based keys and values by doing this:
> 
> <code>
> Text keyText = new Text();
> keyText.set("mykey");
> 
> Text valText = new Text();
> valText.set("myval");
> 
> writer.append(keyText, valText);
> </code>
> 
> As you can see, there is no LongWritable whatsoever.
> 
> However, in a map/reduce job, if I am to specify
> <code>
> jobConf.setOutputFormat(SequenceFileOutputFormat.class);
> </code>
> 
> And later in the mapper, if I am to say
> <code>
> Text newkey = new Text();
> newkey.set("AAA");
> 
> Text newval = new Text();
> newval.set("bbb");
> 
> output.collect(newkey, newval);
> </code>
> 
> It would throw an exception, complaining that the key is not LongWritable.
> 
> So that's a part of the reason that I am having trouble connecting the
> pipes - it seems to me that SequenceFile and SequenceFileOutputFormat
> are talking about two different kinds of "sequence files"...


Re: question on file, inputformats and outputformats

Posted by Ted Dunning <td...@veoh.com>.
You never set the input format in the second step.

But I think you want to stay with your KeyValueTextInputFormat for input and
TextOutputFormat for output.
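
Spelled out, the all-text job configuration is just (sketch; conf is your JobConf):

    conf.setInputFormat(KeyValueTextInputFormat.class);   // reads tab-separated Text key/value lines
    conf.setOutputFormat(TextOutputFormat.class);         // writes key<TAB>value text lines
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);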


On 12/17/07 7:03 PM, "Jim the Standing Bear" <st...@gmail.com> wrote:

> 
> So that's a part of the reason that I am having trouble connecting the
> pipes - it seems to me that SequenceFile and SequenceFileOutputFormat
> are talking about two different kinds of "sequence files"...


Re: question on file, inputformats and outputformats

Posted by Jim the Standing Bear <st...@gmail.com>.
Just an update... my problem seems to be beyond defining generic types.

Ted, I don't know if you have the answer to this question, which is
regarding SequenceFile.

If I am to create a SequenceFile by hand, I can do the following:

<code>
JobConf jobConf = new JobConf(MyClass.class);
JobClient jobClient = new JobClient(jobConf);

FileSystem fileSystem = jobClient.getFs();
SequenceFile.Writer writer = SequenceFile.createWriter(fileSystem,
jobConf, path, Text.class, Text.class);

</code>

After that, I can write all Text-based keys and values by doing this:

<code>
Text keyText = new Text();
keyText.set("mykey");

Text valText = new Text();
valText.set("myval");

writer.append(keyText, valText);
</code>

As you can see, there is no LongWritable whatsoever.

However, in a map/reduce job, if I am to specify
<code>
jobConf.setOutputFormat(SequenceFileOutputFormat.class);
</code>

And later in the mapper, if I am to say
<code>
Text newkey = new Text();
newkey.set("AAA");

Text newval = new Text();
newval.set("bbb");

output.collect(newkey, newval);
</code>

It would throw an exception, complaining that the key is not LongWritable.

So that's a part of the reason that I am having trouble connecting the
pipes - it seems to me that SequenceFile and SequenceFileOutputFormat
are talking about two different kinds of "sequence files"...

Re: question on file, inputformats and outputformats

Posted by Ted Dunning <td...@veoh.com>.

Part of your problem is that you appear to be using a TextInputFormat (the
default input format).  The TIF produces keys that are LongWritable and
values that are Text.

Other input formats produce different types.

With recent versions of Hadoop, classes that extend InputFormatBase can (and
I think should) use generics to describe their output types.  Similarly,
classes extending MapReduceBase and OutputFormat can specify input/output
classes and output classes respectively.

I have added more specific comments in-line.

On 12/17/07 5:40 PM, "Jim the Standing Bear" <st...@gmail.com> wrote:


> 1. Pass in a string to my hadoop program, and it will write this
> single key-value pair to a file on the fly.

How is your string a key-value pair?

Assuming that you have something as simple as tab-delimited text, you may
not need to do anything at all other than just copy this data into hadoop.

> 2. The first job will read from this file, do some processing, and
> write more key-value pairs to other files (the same format as the file
> in step 1). Subsequent jobs will read from those files generated by
> the first job. This will continue in an iterative manner until some
> terminal condition is reached.

Can you be more specific?

Let's assume that you are reading tab-delimited data.  You should set the
input format:

        conf.setInputFormat(TextInputFormat.class);

Then, since the output of your map will have a string key and value, you
should tell the system this:

       step1.setOutputKeyClass(Text.class);
       step1.setOutputValueClass(Text.class);

Note that the signature on your map function should be:

   public static class JoinMap extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
            ...

        public void map(LongWritable k, Text input,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            String[] parts = input.toString().split("\t");  // Text has no split(); convert to String first

            Text key, result;
                ...
            output.collect(key, result);
        }
    }

And your reduce should look something like this:

    public static class JoinReduce extends MapReduceBase implements
            Reducer<Text, Text, Text, Mumble> {  // Mumble is a placeholder for your output value type

        public void reduce(Text k, Iterator<Text> values,
                           OutputCollector<Text, Mumble> output,
                           Reporter reporter) throws IOException {
            Text key;
            Mumble result;
                ....
            output.collect(key, result);
        }
    }


> KeyValueTextInputFormat looks promising

This could work, depending on what data you have for input.  Set the
separator byte to be whatever separates your key from your value and off you
go.
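
If it helps, KeyValueTextInputFormat picks the separator up from the job
configuration; if I remember the property name correctly it is
key.value.separator.in.input.line, and tab is the default, so for tab-separated
data you don't have to set anything:

    // only needed if your key/value separator is something other than a tab
    conf.set("key.value.separator.in.input.line", ",");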