Posted to common-user@hadoop.apache.org by Marco Didonna <m....@gmail.com> on 2011/02/03 18:21:32 UTC

Reducer getting key-value pairs in wrong order

Hello,
I am writing a small Hadoop program to index a large collection of
text files joined together in one large XML file. The mapper performs some
basic text preprocessing and emits key-value pairs like:

(term,document_id) -> (section_of_the_document,positional frequency vector)

For example:

(apple,12) -> (title,[1,3])

The reducer should bring together all entries for the same term and
create a posting list like:

apple -> (12,title,[1,3]) , (14,body,[2,5]) ...

... -> ...
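
To make the target output concrete, the intended grouping can be sketched in plain Java, without Hadoop. The class and method names below are hypothetical, and an in-memory map stands in for the shuffle phase; this only illustrates the desired result, it is not the code from the job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, Hadoop-free sketch of the desired term -> posting list grouping. */
final class PostingListSketch {

    /** One mapper output record: (term, docId) -> (section, positions). */
    static final class Record {
        final String term;
        final int docId;
        final String section;
        final int[] positions;

        Record(String term, int docId, String section, int... positions) {
            this.term = term;
            this.docId = docId;
            this.section = section;
            this.positions = positions;
        }

        /** Formats the record as a posting, e.g. (12,title,[1,3]). */
        String toPosting() {
            return "(" + docId + "," + section + ","
                    + Arrays.toString(positions).replace(" ", "") + ")";
        }
    }

    /** Groups records by term only, keeping the postings in arrival order. */
    static Map<String, List<String>> buildPostingLists(List<Record> records) {
        Map<String, List<String>> index = new TreeMap<>();
        for (Record r : records) {
            index.computeIfAbsent(r.term, t -> new ArrayList<>()).add(r.toPosting());
        }
        return index;
    }

    public static void main(String[] args) {
        List<Record> records = Arrays.asList(
                new Record("apple", 12, "title", 1, 3),
                new Record("apple", 14, "body", 2, 5));
        // Prints one line per term, followed by its posting list.
        buildPostingLists(records).forEach((term, postings) ->
                System.out.println(term + " -> " + String.join(" , ", postings)));
    }
}
```

The key point is that the grouping looks at the term alone, while the document id only travels along inside each posting.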

To accomplish this I have created a custom class PairOfStringInt, which
implements WritableComparable, to hold the mapper's key; a custom
partitioner TermPartioner (https://gist.github.com/809793); and a Reducer
which should bring all values with the same key[1] into the same posting
list, as in the example.

Testing my system on a tiny dataset made up of two documents (with the
same content) I get:

minni	[(1,body,[1,2])]
pippo	[(1,body,[2,0,3])]
pluto	[(1,body,[1,1])]
minni	[(2,body,[1,2])]
pippo	[(2,body,[1,0])]
pluto	[(2,body,[1,1])]

The values from the same key are not brought together. Looking at the
secondary sort example, I also tried to implement a
GroupComparator (https://gist.github.com/809803), set on the job
using job.setGroupingComparatorClass(GroupingComparator.class), but if I
do so I get this output:

minni
[(1,body,[1,2])],[(1,body,[2,0,3])],[(1,body,[1,1])],[(2,body,[1,2])],[(2,body,[1,0])],[(2,body,[1,1])]


One single key (the first one) with all the postings associated with
it. What am I missing?

Thanks for your time

Marco

[1] By "same key" I mean keys that have the same left element.

Re: Reducer getting key-value pairs in wrong order

Posted by Marco Didonna <m....@gmail.com>.

On 02/04/2011 10:11 AM, Marco Didonna wrote:
> On 02/03/2011 07:02 PM, Harsh J wrote:
>> For a ValueGrouping comparator to work, your Partitioner must act in
>> tandem with it. I do not know if you have implemented a custom
>> hashCode() method for your Key class, but your partitioner should look
>> like:
> 
> Yes I did and it works like this return "leftElement.hashCode() +
> rightElement; "
> 
>>
>> return (key.getLeftElement().hashCode() & Integer.MAX_VALUE) % numPartitions;
>>
> 
> This was definitely a bug, the result is always the same though :(
> 
>> This will ensure that the to-be grouped data is actually partitioned
>> properly too.
>>
>> The actual sorting (which ought to occur for the full composite key
>> field-by-field, and is the only real 'sorter') would be handled by the
>> compare() call of your Writable, if you are using a
>> WritableComparable.
> 
> I am using a WritableComparable...here's PairOfStringInt
> https://gist.github.com/810905
> 
> Thanks again
> 
> 


I finally made it work (https://gist.github.com/809803): I now set the
GroupingComparator as the sort comparator, with
job.setSortComparatorClass(GroupingComparator.class).

I still do not understand what was wrong with the old version of the
GroupingComparator, or at which point the keys are ordered according to
the policy encoded in GroupingComparator.

MD
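
On the question of when keys get ordered, a rough Hadoop-free simulation may help. As far as the framework's behaviour is documented, the sort comparator orders all keys that reach a reducer, and the grouping comparator is then used only to decide whether the next key in that sorted sequence belongs to the same reduce() call as the current one. Grouping therefore only ever merges adjacent keys. Everything below (class, field, and method names, and the sample keys) is hypothetical and only approximates that mechanism.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical, Hadoop-free simulation of sort and grouping comparators. */
final class GroupingSimulation {

    /** Stand-in for a (term, docId) composite key. */
    static final class Key {
        final String term;
        final int docId;

        Key(String term, int docId) {
            this.term = term;
            this.docId = docId;
        }

        @Override
        public String toString() {
            return "(" + term + "," + docId + ")";
        }
    }

    /** Sort order: term first, then document id. */
    static final Comparator<Key> BY_TERM_THEN_DOC = (a, b) -> {
        int c = a.term.compareTo(b.term);
        return c != 0 ? c : Integer.compare(a.docId, b.docId);
    };

    /** Sort order: document id first, then term. */
    static final Comparator<Key> BY_DOC_THEN_TERM = (a, b) -> {
        int c = Integer.compare(a.docId, b.docId);
        return c != 0 ? c : a.term.compareTo(b.term);
    };

    /** Grouping rule: two keys belong together when their terms match. */
    static final Comparator<Key> SAME_TERM = (a, b) -> a.term.compareTo(b.term);

    /** Degenerate grouping rule that treats every pair of keys as equal. */
    static final Comparator<Key> ALWAYS_EQUAL = (a, b) -> 0;

    static List<Key> sampleKeys() {
        return Arrays.asList(
                new Key("pluto", 2), new Key("minni", 1), new Key("pippo", 2),
                new Key("minni", 2), new Key("pluto", 1), new Key("pippo", 1));
    }

    /** Sorts the keys, then opens a new group whenever two adjacent keys differ under the grouping rule. */
    static List<List<Key>> group(List<Key> keys, Comparator<Key> sort, Comparator<Key> grouping) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(sort);
        List<List<Key>> groups = new ArrayList<>();
        Key previous = null;
        for (Key key : sorted) {
            if (previous == null || grouping.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> keys = sampleKeys();
        System.out.println(group(keys, BY_TERM_THEN_DOC, SAME_TERM));
        System.out.println(group(keys, BY_DOC_THEN_TERM, SAME_TERM));
        System.out.println(group(keys, BY_TERM_THEN_DOC, ALWAYS_EQUAL));
    }
}
```

In this simulation the first call yields three groups of two keys, the second yields six single-key groups because equal terms are never adjacent, and the third yields one group holding every key. Whether either of the last two cases corresponds to what the original comparators did cannot be determined from the messages alone.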

Re: Reducer getting key-value pairs in wrong order

Posted by Marco Didonna <m....@gmail.com>.

On 02/03/2011 07:02 PM, Harsh J wrote:
> For a ValueGrouping comparator to work, your Partitioner must act in
> tandem with it. I do not know if you have implemented a custom
> hashCode() method for your Key class, but your partitioner should look
> like:

Yes, I did. It works like this:
"return leftElement.hashCode() + rightElement;"

> 
> return (key.getLeftElement().hashCode() & Integer.MAX_VALUE) % numPartitions;
> 

This was definitely a bug, but the result is still the same :(

> This will ensure that the to-be grouped data is actually partitioned
> properly too.
> 
> The actual sorting (which ought to occur for the full composite key
> field-by-field, and is the only real 'sorter') would be handled by the
> compare() call of your Writable, if you are using a
> WritableComparable.

I am using a WritableComparable. Here is PairOfStringInt:
https://gist.github.com/810905

Thanks again

Re: Reducer getting key-value pairs in wrong order

Posted by Harsh J <qw...@gmail.com>.

For a ValueGrouping comparator to work, your Partitioner must act in
tandem with it. I do not know if you have implemented a custom
hashCode() method for your Key class, but your partitioner should look
like:

return (key.getLeftElement().hashCode() & Integer.MAX_VALUE) % numPartitions;

This will ensure that the to-be grouped data is actually partitioned
properly too.
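
A minimal, Hadoop-free sketch of why the partition has to depend on the left element only. The class and method names are hypothetical. In a real job the first expression would sit inside the partitioner's getPartition(key, value, numPartitions) method. The second method is an assumption for contrast: it shows what would happen if the partition were derived from a hash that also includes the document id, such as the leftElement.hashCode() + rightElement mentioned elsewhere in this thread.

```java
/** Hypothetical, Hadoop-free sketch of partitioning a (term, docId) key. */
final class TermPartitionSketch {

    /**
     * Partition chosen from the term alone, mirroring the expression above.
     * Masking with Integer.MAX_VALUE clears the sign bit, so a negative
     * hashCode() cannot produce a negative partition number.
     */
    static int partitionByTerm(String term, int numPartitions) {
        return (term.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /**
     * Contrast case (assumption): partition chosen from a hash that mixes in
     * the document id, so the same term can land on different reducers.
     */
    static int partitionByFullHash(String term, int docId, int numPartitions) {
        return ((term.hashCode() + docId) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 2;
        for (int docId = 1; docId <= 2; docId++) {
            System.out.println("(minni," + docId + ") by term: "
                    + partitionByTerm("minni", numPartitions)
                    + ", by full hash: "
                    + partitionByFullHash("minni", docId, numPartitions));
        }
    }
}
```

With two partitions, consecutive document ids always fall into different partitions under the full-hash variant, so no grouping comparator could bring those records together afterwards.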

The actual sorting (which ought to occur for the full composite key
field-by-field, and is the only real 'sorter') would be handled by the
compare() call of your Writable, if you are using a
WritableComparable.
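
For illustration, a field-by-field ordering of a composite key might look like the sketch below. TermDocKey is a hypothetical plain-Java stand-in, not the PairOfStringInt class from the thread, and it implements java.lang.Comparable rather than Hadoop's WritableComparable so that it runs without any Hadoop dependency.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a (term, docId) composite key. */
final class TermDocKey implements Comparable<TermDocKey> {
    final String term;
    final int docId;

    TermDocKey(String term, int docId) {
        this.term = term;
        this.docId = docId;
    }

    /** Compares the left element first and falls back to the right element on ties. */
    @Override
    public int compareTo(TermDocKey other) {
        int byTerm = term.compareTo(other.term);
        if (byTerm != 0) {
            return byTerm;
        }
        return Integer.compare(docId, other.docId);
    }

    @Override
    public String toString() {
        return "(" + term + "," + docId + ")";
    }

    public static void main(String[] args) {
        List<TermDocKey> keys = new ArrayList<>(Arrays.asList(
                new TermDocKey("pluto", 2), new TermDocKey("minni", 2),
                new TermDocKey("pippo", 1), new TermDocKey("minni", 1)));
        Collections.sort(keys);
        // Keys with the same term end up next to each other, ordered by document id.
        System.out.println(keys);
    }
}
```

Sorting on the term first is what places all keys for one term next to each other, which a term-only grouping comparator can then treat as a single group.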


-- 
Harsh J
www.harshj.com