Posted to common-user@hadoop.apache.org by Kyle Renfro <kr...@real-comp.com> on 2011/12/20 18:31:59 UTC

Custom Writables in MapWritable

Hadoop 0.22.0-RC0

I have the following reducer:
    public static class MergeRecords
            extends Reducer<Text, MapWritable, Text, MapWritable>

The MapWritables handled by the reducer all have Text keys and contain a mix of value classes, including Text, DoubleWritable, and a custom Writable, MapArrayWritable.  The reduce works as expected if every MapWritable contains both a DoubleWritable and a MapArrayWritable.  It fails with the following exception if some of the MapWritables contain only a DoubleWritable value:

-----------
java.lang.IllegalArgumentException: Id 1 exists but maps to com.realcomp.data.hadoop.record.MapArrayWritable and not org.apache.hadoop.io.DoubleWritable
    at org.apache.hadoop.io.AbstractMapWritable.addToMap(AbstractMapWritable.java:75)
    at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:203)
    at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:148)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
    at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:145)
    at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
    at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:292)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:168)
    at org.apache.hadoop.mapred.ReduceTask.
------------
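
For concreteness, the reduce input values look roughly like this ("score" and "matches" are made-up field names, and I am assuming MapArrayWritable's no-arg constructor here):

    // Illustration only: one value carries both a DoubleWritable and the
    // custom MapArrayWritable, the other carries just a DoubleWritable.
    MapWritable full = new MapWritable();
    full.put(new Text("score"), new DoubleWritable(0.9));
    full.put(new Text("matches"), new MapArrayWritable());   // custom Writable value

    MapWritable partial = new MapWritable();
    partial.put(new Text("score"), new DoubleWritable(0.9)); // no custom value here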

Digging into the source a little, I noticed that the default constructor for AbstractMapWritable does not register DoubleWritable the way it does the other base Writables.  This looks like an omission to me; if DoubleWritable were registered, I probably would never have noticed this problem, because there would be only one custom class in the MapWritable.
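
One workaround I am considering, as a rough sketch only and assuming AbstractMapWritable.addToMap(Class) is still accessible to subclasses in 0.22, is to pre-register the value classes in a MapWritable subclass (the class name here is just an example) so every instance builds the same class-to-id table:

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.MapWritable;
    import com.realcomp.data.hadoop.record.MapArrayWritable;

    // Sketch only: pre-register the value classes the stock MapWritable does
    // not know about, so the class-to-id table is identical for every instance
    // no matter which values a particular record happens to hold.
    public class RecordMapWritable extends MapWritable {
        public RecordMapWritable() {
            super();
            addToMap(DoubleWritable.class);    // missing from AbstractMapWritable's defaults
            addToMap(MapArrayWritable.class);  // my custom value class
        }
    }

I have not tried this; I assume it would only help if the mappers emit this subclass and it is set as the job's map output value class, so both sides agree on the ids.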

Question 1:
Should I be able to reduce on MapWritables that contain different
(custom) value classes?

Question 2:
It appears the org.apache.hadoop.io.serializer.WritableSerialization class reuses the first MapWritable instance for each deserialization.  This is probably a performance optimization, and it explains why I am getting the exception.  Is it possible for me to register my own serialization class that would let me deserialize MapWritables with different value classes?  Are there examples of this available?
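
In case it helps the discussion, this is roughly how I imagine registering such a serialization; both class names below are purely hypothetical, and MapWritableSerialization would have to implement org.apache.hadoop.io.serializer.Serialization<MapWritable>:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SerializationSetup {
        // Sketch only: put a custom serialization ahead of the stock
        // WritableSerialization so the SerializationFactory picks it for
        // MapWritable while everything else keeps the default behavior.
        public static void register(Job job) {
            Configuration conf = job.getConfiguration();
            String existing = conf.get("io.serializations");
            conf.set("io.serializations",
                "com.realcomp.data.hadoop.MapWritableSerialization," + existing);
        }
    }

My understanding is that the SerializationFactory walks io.serializations in order and uses the first entry that accepts a given class, so prepending a MapWritable-only serialization should leave Text and the other Writables on the stock path, but please correct me if that is wrong.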


Note: I realize I am running off of a release candidate, but I thought
I would ask here first before I go through the trouble of upgrading
the cluster.

thanks,
Kyle