Posted to common-user@hadoop.apache.org by Stan Rosenberg <sr...@proclivitysystems.com> on 2011/08/13 17:14:31 UTC

WritableComparable and the case of duplicate keys in the reducer

Hi All,

Here is what's happening. I have implemented my own WritableComparable keys
and values.
Inside a reducer I am seeing 'reduce' being invoked with the "same" key
_twice_.
I have checked that context.getKeyComparator() and
context.getSortComparator() are both WritableComparator, which
indicates that my key's 'compareTo' method should be called during the
reduce-side merge.

Indeed, inside the 'reduce' method I captured both key instances and did the
following checks:

((WritableComparator)context.getKeyComparator()).compare((Object)key1,
(Object)key2)
((WritableComparator)context.getSortComparator()).compare((Object)key1,
(Object)key2)

In both calls, the result is '0', confirming that key1 and key2 are
equivalent.

So, what is going on?

Note that key1 and key2 come from different mappers, but they should have
been grouped into a single 'reduce' call since they are equal according to
WritableComparator. Also note that key1 and key2 are not bitwise equivalent,
but that shouldn't matter, or should it?

Many thanks in advance!

stan
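
The key class itself isn't shown in the thread, so for reference here is a
minimal sketch of the kind of WritableComparable key being discussed,
assuming a hypothetical pair of fields (a long id and a String name). The
point worth noting is that write()/readFields(), compareTo(), equals() and
hashCode() must all agree, and that hashCode() must be computed only from
the key's own fields so it is stable across JVMs, because the default
partitioner uses it to pick the reduce task:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key class; the field names and types are illustrative only.
public class EventKey implements WritableComparable<EventKey> {
    private long id;
    private String name;

    public EventKey() {}                    // Hadoop needs a no-arg constructor

    public EventKey(long id, String name) {
        this.id = id;
        this.name = name;
    }

    public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeUTF(name);
    }

    public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        name = in.readUTF();
    }

    public int compareTo(EventKey other) {
        if (id != other.id) {
            return id < other.id ? -1 : 1;
        }
        return name.compareTo(other.name);
    }

    // equals/hashCode must agree with compareTo, and hashCode must be
    // derived from the serialized fields themselves so it is the same in
    // every JVM: the default partitioner picks the reducer from hashCode().
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof EventKey)) return false;
        EventKey k = (EventKey) o;
        return id == k.id && name.equals(k.name);
    }

    @Override
    public int hashCode() {
        return 31 * (int) (id ^ (id >>> 32)) + name.hashCode();
    }
}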

Re: WritableComparable and the case of duplicate keys in the reducer

Posted by William Kinney <wi...@gmail.com>.
Naturally, right after sending that email I found that I was wrong. I was
also using an enum field, which was the culprit.
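
The likely mechanism, assuming the enum field was part of the key (the class
itself isn't shown): java.lang.Enum does not override hashCode(), so an enum
constant's hash is an identity hash that differs from one JVM to the next.
If the key's hashCode() folds that value in, the default partitioner will
send keys that compare equal, but were built in different map-task JVMs, to
different reducers. A tiny demonstration:

public class EnumHashDemo {
    enum EventType { CLICK, VIEW, PURCHASE }   // illustrative enum

    public static void main(String[] args) {
        // Identity hash: typically a different number on every JVM run,
        // which is exactly what a partitioner must not depend on.
        System.out.println("Enum.hashCode()   : " + EventType.CLICK.hashCode());
        // Stable alternatives: the same value in every JVM.
        System.out.println("name().hashCode() : " + EventType.CLICK.name().hashCode());
        System.out.println("ordinal()         : " + EventType.CLICK.ordinal());
    }
}

Hashing the enum's name() (or using its ordinal()) keeps the key's
hashCode() stable; the same goes for serializing the enum by name in
write()/readFields().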

On Tue, Jan 10, 2012 at 6:13 PM, William Kinney <wi...@gmail.com>wrote:

> I'm (unfortunately) aware of this, and it isn't the issue. My key object
> contains only long, int and String values.
>
> The job map output is consistent, but the reduce input groups and values
> for the key vary from one job to the next on the same input. It's like it
> isn't properly comparing and partitioning the keys.
>
> I have properly implemented hashCode(), equals() and the
> WritableComparable methods.
>
> Also, not surprisingly, when I use 1 reduce task the output is correct.
>
>
> On Tue, Jan 10, 2012 at 5:58 PM, W.P. McNeill <bi...@gmail.com> wrote:
>
>> The Hadoop framework reuses Writable objects for key and value arguments,
>> so if your code stores a pointer to that object instead of copying it you
>> can find yourself with mysterious duplicate objects.  This has tripped me
>> up a number of times. Details on what exactly I encountered and how I
>> fixed
>> it are here
>>
>> http://cornercases.wordpress.com/2011/03/14/serializing-complex-mapreduce-keys/
>> and
>> here
>>
>> http://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/
>>
>
>

Re: WritableComparable and the case of duplicate keys in the reducer

Posted by William Kinney <wi...@gmail.com>.
I'm (unfortunately) aware of this, and it isn't the issue. My key object
contains only long, int and String values.

The job map output is consistent, but the reduce input groups and values
for the key vary from one job to the next on the same input. It's like it
isn't properly comparing and partitioning the keys.

I have properly implemented hashCode(), equals() and the
WritableComparable methods.

Also, not surprisingly, when I use 1 reduce task the output is correct.
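
That combination of symptoms (stable map output, reduce input groups that
vary between runs, correct output with a single reducer) is usually a sign
of a partitioning problem rather than a sort or grouping problem. The
default partitioner, HashPartitioner, looks only at key.hashCode(); its
logic is essentially this:

import org.apache.hadoop.mapreduce.Partitioner;

// Equivalent of Hadoop's default HashPartitioner: the target reduce task is
// a function of key.hashCode() alone; compareTo() and equals() never enter
// into it.
public class DefaultStylePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

With one reduce task every key maps to partition 0, which is why the
single-reducer run looks correct. With several reducers, any instability in
hashCode() across map-task JVMs splits equal keys into different partitions,
and each reducer then sees its own copy of the "same" group.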


On Tue, Jan 10, 2012 at 5:58 PM, W.P. McNeill <bi...@gmail.com> wrote:

> The Hadoop framework reuses Writable objects for key and value arguments,
> so if your code stores a pointer to that object instead of copying it you
> can find yourself with mysterious duplicate objects.  This has tripped me
> up a number of times. Details on what exactly I encountered and how I fixed
> it are here
>
> http://cornercases.wordpress.com/2011/03/14/serializing-complex-mapreduce-keys/
> and
> here
>
> http://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/
>

Re: WritableComparable and the case of duplicate keys in the reducer

Posted by "W.P. McNeill" <bi...@gmail.com>.
The Hadoop framework reuses Writable objects for key and value arguments,
so if your code stores a pointer to that object instead of copying it you
can find yourself with mysterious duplicate objects. This has tripped me
up a number of times. Details on what exactly I encountered and how I fixed
it are here
http://cornercases.wordpress.com/2011/03/14/serializing-complex-mapreduce-keys/
and here
http://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/
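
In code, the pitfall looks roughly like this; the LongWritable/Text types
and the reducer itself are assumptions for illustration, not taken from the
thread. The values iterator hands back the same Text instance on every
iteration, so storing the reference keeps N pointers to one object:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CollectValuesReducer
        extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<Text> kept = new ArrayList<Text>();
        for (Text value : values) {
            // Broken: 'value' is the same reused object on each iteration,
            // so the list would end up holding N references to the last value.
            // kept.add(value);

            // Correct: copy before keeping a reference.
            kept.add(new Text(value));
        }
        context.write(key, new Text("values seen: " + kept.size()));
    }
}

For arbitrary Writable types, WritableUtils.clone(writable, conf) does the
copy generically.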

Re: WritableComparable and the case of duplicate keys in the reducer

Posted by William Kinney <wi...@gmail.com>.
I have noticed this too with one job. Keys that are equal (.equals() returns
true, the hashCode() values match, and compareTo() returns 0) are being sent
to multiple reduce tasks, resulting in incorrect output.

Any insight?
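
One way to narrow it down is to check not just that two captured keys
compare equal, but that they produce the same partition number, since that
is what actually decides which reduce task a key reaches. A sketch, reusing
the hypothetical EventKey class sketched under the original question at the
top of this thread (the constants are arbitrary):

import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionCheck {
    public static void main(String[] args) {
        // Two keys built the same way the mappers would build them.
        EventKey k1 = new EventKey(42L, "a");
        EventKey k2 = new EventKey(42L, "a");

        HashPartitioner<EventKey, Object> partitioner =
                new HashPartitioner<EventKey, Object>();
        int reducers = 10;   // arbitrary reduce task count

        System.out.println("compareTo     : " + k1.compareTo(k2));
        System.out.println("partition(k1) : " + partitioner.getPartition(k1, null, reducers));
        System.out.println("partition(k2) : " + partitioner.getPartition(k2, null, reducers));
        // If hashCode() depends on anything JVM-specific (Enum.hashCode(),
        // identity hashes, fields set only on some code paths), the two
        // partition numbers can disagree when the keys are built in
        // different map-task JVMs, even though they agree when both keys
        // are built here in a single JVM.
    }
}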


On Sat, Aug 13, 2011 at 11:14 AM, Stan Rosenberg
<srosenberg@proclivitysystems.com> wrote:

> Hi All,
>
> Here is what's happening. I have implemented my own WritableComparable
> keys and values.
> Inside a reducer I am seeing 'reduce' being invoked with the "same" key
> _twice_.
> I have checked that context.getKeyComparator() and
> context.getSortComparator() are both WritableComparator, which
> indicates that my key's 'compareTo' method should be called during the
> reduce-side merge.
>
> Indeed, inside the 'reduce' method I captured both key instances and did
> the
> following checks:
>
> ((WritableComparator)context.getKeyComparator()).compare((Object)key1,
> (Object)key2)
> ((WritableComparator)context.getSortComparator()).compare((Object)key1,
> (Object)key2)
>
> In both calls, the result is '0', confirming that key1 and key2 are
> equivalent.
>
> So, what is going on?
>
> Note that key1 and key2 come from different mappers, but they should have
> been grouped into a single 'reduce' call since they are equal according to
> WritableComparator. Also note that key1 and key2 are not bitwise
> equivalent, but that shouldn't matter, or should it?
>
> Many thanks in advance!
>
> stan
>