You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Marco Didonna <m....@gmail.com> on 2011/02/03 18:21:32 UTC
Reducer getting key-value pairs in wrong order
Hello,
I am writing a little hadoop program to index a bunch (large bunch) of
text files joined together in a large xml file. The mapper execute some
basic text preprocessing and emits key-value pair like:
(term,document_id) -> (section_of_the_document,positional frequency vector)
example
(apple,12) -> (title,[1,3])
The reducer should bring together the same terms and create a posting
list like:
apple -> (12,title,[1,3]) , (14,body,[2,5]) ...
... -> ...
To accomplish this I have created a custom class PairOfStringInt to hold
mapper's key which implements writableComparable, a custom partitioner
TermPartioner (https://gist.github.com/809793) and a Reducer which
should bring all values from the same key[1] into the same posting list
as in the example.
Testing my system on a tiny dataset made up of two document (same
content) I get:
minni [(1,body,[1,2])]
pippo [(1,body,[2,0,3])]
pluto [(1,body,[1,1])]
minni [(2,body,[1,2])]
pippo [(2,body,[1,0])]
pluto [(2,body,[1,1])]
The values from the same key are not brought together...Looking at the
secondary sort example I also tried to implement a
GroupComparator(https://gist.github.com/809803) to be set on the job
using job.setGroupingComparatorClass(GroupingComparator.class) but if I
do so I get in the output:
minni
[(1,body,[1,2])],[(1,body,[2,0,3])],[(1,body,[1,1])],[(2,body,[1,2])],[(2,body,[1,0])],[(2,body,[1,1])]
One single key (the first one) and all postings associated with
it...what do I miss??
Thanks for your time
Marco
[1] by "same key" I mean those who have the same left element
Re: Reducer getting key-value pairs in wrong order
Posted by Marco Didonna <m....@gmail.com>.
On 02/04/2011 10:11 AM, Marco Didonna wrote:
> On 02/03/2011 07:02 PM, Harsh J wrote:
>> For a ValueGrouping comparator to work, your Partitioner must act in
>> tandem with it. I do not know if you have implemented a custom
>> hashCode() method for your Key class, but your partitioner should look
>> like:
>
> Yes I did and it works like this return "leftElement.hashCode() +
> rightElement; "
>
>>
>> return (key.getLeftElement().hashCode() & Integer.MAX_VALUE) % numPartitions;
>>
>
> This was definitely a bug, the result is always the same though :(
>
>> This will ensure that the to-be grouped data is actually partitioned
>> properly too.
>>
>> The actual sorting (which ought to occur for the full composite key
>> field-by-field, and is the only real 'sorter') would be handled by the
>> compare() call of your Writable, if you are using a
>> WritableComparable.
>
> I am using a WritableComparable...here's PairOfStringInt
> https://gist.github.com/810905
>
> Thanks again
>
>
I finally made it https://gist.github.com/809803 I use the
groupingComparator as job.setSortComparatorClass(GroupingComparator.class)
I still do not understand what was wrong with the old version of the
GroupingComparator and when the key are ordered according to the policy
encoded in GroupingComparator.
MD
Re: Reducer getting key-value pairs in wrong order
Posted by Marco Didonna <m....@gmail.com>.
On 02/03/2011 07:02 PM, Harsh J wrote:
> For a ValueGrouping comparator to work, your Partitioner must act in
> tandem with it. I do not know if you have implemented a custom
> hashCode() method for your Key class, but your partitioner should look
> like:
Yes I did and it works like this return "leftElement.hashCode() +
rightElement; "
>
> return (key.getLeftElement().hashCode() & Integer.MAX_VALUE) % numPartitions;
>
This was definitely a bug, the result is always the same though :(
> This will ensure that the to-be grouped data is actually partitioned
> properly too.
>
> The actual sorting (which ought to occur for the full composite key
> field-by-field, and is the only real 'sorter') would be handled by the
> compare() call of your Writable, if you are using a
> WritableComparable.
I am using a WritableComparable...here's PairOfStringInt
https://gist.github.com/810905
Thanks again
Re: Reducer getting key-value pairs in wrong order
Posted by Harsh J <qw...@gmail.com>.
For a ValueGrouping comparator to work, your Partitioner must act in
tandem with it. I do not know if you have implemented a custom
hashCode() method for your Key class, but your partitioner should look
like:
return (key.getLeftElement().hashCode() & Integer.MAX_VALUE) % numPartitions;
This will ensure that the to-be grouped data is actually partitioned
properly too.
The actual sorting (which ought to occur for the full composite key
field-by-field, and is the only real 'sorter') would be handled by the
compare() call of your Writable, if you are using a
WritableComparable.
On Thu, Feb 3, 2011 at 10:51 PM, Marco Didonna <m....@gmail.com> wrote:
> Hello,
> I am writing a little hadoop program to index a bunch (large bunch) of
> text files joined together in a large xml file. The mapper execute some
> basic text preprocessing and emits key-value pair like:
>
> (term,document_id) -> (section_of_the_document,positional frequency vector)
>
> example
>
> (apple,12) -> (title,[1,3])
>
> The reducer should bring together the same terms and create a posting
> list like:
>
> apple -> (12,title,[1,3]) , (14,body,[2,5]) ...
>
> ... -> ...
>
> To accomplish this I have created a custom class PairOfStringInt to hold
> mapper's key which implements writableComparable, a custom partitioner
> TermPartioner (https://gist.github.com/809793) and a Reducer which
> should bring all values from the same key[1] into the same posting list
> as in the example.
>
> Testing my system on a tiny dataset made up of two document (same
> content) I get:
>
> minni [(1,body,[1,2])]
> pippo [(1,body,[2,0,3])]
> pluto [(1,body,[1,1])]
> minni [(2,body,[1,2])]
> pippo [(2,body,[1,0])]
> pluto [(2,body,[1,1])]
>
> The values from the same key are not brought together...Looking at the
> secondary sort example I also tried to implement a
> GroupComparator(https://gist.github.com/809803) to be set on the job
> using job.setGroupingComparatorClass(GroupingComparator.class) but if I
> do so I get in the output:
>
> minni
> [(1,body,[1,2])],[(1,body,[2,0,3])],[(1,body,[1,1])],[(2,body,[1,2])],[(2,body,[1,0])],[(2,body,[1,1])]
>
>
> One single key (the first one) and all postings associated with
> it...what do I miss??
>
> Thanks for your time
>
> Marco
>
> [1] by "same key" I mean those who have the same left element
>
>
--
Harsh J
www.harshj.com