Posted to common-user@hadoop.apache.org by Jason Grey <ja...@gmail.com> on 2008/09/16 18:46:32 UTC

Using JavaSerialization and SequenceFileInput

I'm trying to use JavaSerialization for a series of MapReduce jobs, and when
it comes to reading a SequenceFile using SequenceFileInputFormat with
JavaSerialized objects, something breaks down.

I've added "org.apache.hadoop.io.serializer.JavaSerialization" to the
io.serializations property in my config, and using native java types in my
mapper and reducer implementations, like so:

MyMapper implements Mapper<String,MyObject,String,MyObject>
MyReducer implements Reducer<String,MyObject,String,MyObject>
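
(For context, a minimal sketch of that setup, applied to the JobConf shown
below - the WritableSerialization entry keeps the built-in Writable types
working, and MyObject's fields are made-up stand-ins, not my real class:)

conf.set("io.serializations",
    "org.apache.hadoop.io.serializer.WritableSerialization,"
    + "org.apache.hadoop.io.serializer.JavaSerialization");

class MyObject implements java.io.Serializable {
    private static final long serialVersionUID = 1L;
    String headline;   // hypothetical field
    long timestamp;    // hypothetical field
}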

In my job configuration, I'm doing this:

// read the sequence files produced by the previous job
conf.setInputFormat(SequenceFileInputFormat.class);
FileInputFormat.setInputPaths(conf, path1, path2);
// write another sequence file
conf.setOutputFormat(SequenceFileOutputFormat.class);
FileOutputFormat.setOutputPath(conf, path3);
// key/value types, plus a comparator that can sort Java-serialized keys
conf.setOutputKeyClass(String.class);
conf.setOutputKeyComparatorClass(JavaSerializationComparator.class);
conf.setOutputValueClass(MyObject.class);
conf.setMapperClass(MyMapper.class);
conf.setReducerClass(MyReducer.class);

When I run the job and print the keys & values from the mapper to
System.out, the key and value don't seem to be getting populated
correctly - the key is null, and the value is a new, empty instance of
MyObject.

The files this job is reading were output by another job that used a custom
InputFormat, so that job didn't have the same problem, and I have validated
using a SequenceFile.Reader that the data is actually there and non-null. One
strange thing I had to do to get the reader to work is shown below: I had to
capture the return values of next() and getCurrentValue() in the reassignments
for the keys and values to show up - I think this may have something to do
with why SequenceFileInputFormat is having trouble as well...

String key = new String();
// with JavaSerialization, next() and getCurrentValue() hand back new
// objects instead of filling in the ones passed to them, so the return
// values have to be captured
while ((key = (String) r.next(key)) != null) {
     HeadlineDocument value = new HeadlineDocument();
     value = (HeadlineDocument) r.getCurrentValue(value);
     System.out.println("Key: " + key);
     System.out.println("Value: " + value);
}

Anyone got any hints as to how one uses JavaSerialization properly in the
INPUT phase of a MapReduce job?

Thanks for any help

-jg-

Re: Using JavaSerialization and SequenceFileInput

Posted by Jason Grey <ja...@gmail.com>.
Cool, thanks for the answer.


Re: Using JavaSerialization and SequenceFileInput

Posted by Owen O'Malley <om...@apache.org>.
The problem is that Java serialization works for SequenceFile, but doesn't
work with RecordReader: Java serialization always returns a new object, while
the RecordReader interface looks like:

boolean next(Object key, Object value) throws IOException;

where the outer context needs to pass in the object. Note that we are  
working on fixing it, but it will require HADOOP-1230, which creates a  
completely new API.
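
To make the mismatch concrete, a small made-up illustration (the names here
are hypothetical, not the real Hadoop API; IOException and ObjectInputStream
are from java.io):

// the RecordReader-style contract: the caller owns key and value, and
// next() is expected to fill them in place on every call
interface InPlaceReader<K, V> {
  boolean next(K key, V value) throws IOException;
}

// the Java-serialization-style contract: readObject() always produces a
// brand new instance, so there is no way to deserialize "into" the
// objects the caller passed in
static Object readFresh(ObjectInputStream in)
    throws IOException, ClassNotFoundException {
  return in.readObject(); // a fresh object every time
}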

-- Owen

Re: Using JavaSerialization and SequenceFileInput

Posted by Jason Grey <ja...@gmail.com>.
I read HADOOP-3413 <https://issues.apache.org/jira/browse/HADOOP-3413> a bit
more closely - it updates SequenceFile.Reader, not SequenceFileInputFormat,
which is what the MapReduce framework uses... so it looks like you have to
write your own input format, or have your mappers/reducers take raw bytes and
deserialize within.
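
A rough sketch of the second option, assuming the values arrive as
BytesWritable (how they get there - e.g. via some custom binary input
format - is left open, and MyObject is a stand-in for the real class):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class RawBytesMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, Text, BytesWritable> {

  public void map(Text key, BytesWritable value,
                  OutputCollector<Text, BytesWritable> output,
                  Reporter reporter) throws IOException {
    // do the Java deserialization by hand instead of relying on the
    // framework to populate a value object for us
    ObjectInputStream in = new ObjectInputStream(
        new ByteArrayInputStream(value.get(), 0, value.getSize()));
    try {
      MyObject obj = (MyObject) in.readObject();
      // ... work with obj and collect whatever output the job needs ...
    } catch (ClassNotFoundException e) {
      throw new IOException("cannot deserialize value: " + e.getMessage());
    } finally {
      in.close();
    }
  }

  // stand-in for the real Serializable value class
  static class MyObject implements java.io.Serializable {
  }
}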



Re: Using JavaSerialization and SequenceFileInput

Posted by Jason Grey <ja...@gmail.com>.
I just found this one this morning - it looks like a fix should be in 0.18.0
according to the bug tracker:

https://issues.apache.org/jira/browse/HADOOP-3413

I'm going to go double-check all my code, as I'm pretty sure I'm on 0.18.0
already.

-jg-



Re: Using JavaSerialization and SequenceFileInput

Posted by Alex Loddengaard <al...@google.com>.
Unfortunately I don't know of a solution to your problem, but I've been
experiencing the exact same issues while trying to implement a Protocol
Buffer serialization.  Take a look:

<https://issues.apache.org/jira/browse/HADOOP-3788>

I hope this helps others to diagnose your problem.
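
For reference, the plug-in point for a custom serialization is the
org.apache.hadoop.io.serializer.Serialization interface. A skeletal sketch of
roughly what I'm implementing - the protobuf-specific parts are left as
comments, and MyMessage is a stand-in for a generated message class:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.io.serializer.Deserializer;
import org.apache.hadoop.io.serializer.Serialization;
import org.apache.hadoop.io.serializer.Serializer;

public class MyMessageSerialization implements Serialization<MyMessage> {

  public boolean accept(Class<?> c) {
    return MyMessage.class.isAssignableFrom(c);
  }

  public Serializer<MyMessage> getSerializer(Class<MyMessage> c) {
    return new Serializer<MyMessage>() {
      private OutputStream out;
      public void open(OutputStream out) { this.out = out; }
      public void serialize(MyMessage m) throws IOException {
        // protobuf-specific: write m's bytes to out (length-prefixed,
        // since several records share one stream)
      }
      public void close() throws IOException { out.close(); }
    };
  }

  public Deserializer<MyMessage> getDeserializer(Class<MyMessage> c) {
    return new Deserializer<MyMessage>() {
      private InputStream in;
      public void open(InputStream in) { this.in = in; }
      public MyMessage deserialize(MyMessage reuse) throws IOException {
        // protobuf-specific: parse the next message from in; note this
        // also returns a new object instead of filling in "reuse", which
        // is exactly the reuse problem this thread is about
        return null; // placeholder
      }
      public void close() throws IOException { in.close(); }
    };
  }
}

// stub so the sketch compiles; really a protobuf-generated class
class MyMessage {
}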

Alex


Re: Using JavaSerialization and SequenceFileInput

Posted by Jason Grey <ja...@gmail.com>.
HeadlineDocument in the reader code I posted is equivalent to MyObject - I
forgot to obfuscate that one... oops...
