You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by "Zhang, jian" <jz...@freewheel.tv> on 2008/04/11 13:16:29 UTC

答复: Problem with key aggregation when number of reduce tasks is more than 1

Hi,

Please read this, you need to implement partitioner.
It controls which key is sent to which reducer, if u want to get unique key result, you need to implement partitioner and the compareTO function should work properly. 
[WIKI]
Partitioner

Partitioner partitions the key space.

Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.

HashPartitioner is the default Partitioner.



Best Regards

Jian Zhang


-----邮件原件-----
发件人: Harish Mallipeddi [mailto:harish.mallipeddi@gmail.com] 
发送时间: 2008年4月11日 19:06
收件人: core-user@hadoop.apache.org
主题: Problem with key aggregation when number of reduce tasks is more than 1

Hi all,

I wrote a custom key class (implements WritableComparable) and implemented
the compareTo() method inside this class. Everything works fine when I run
the m/r job with 1 reduce task (via setNumReduceTasks). Keys are sorted
correctly in the output files.

But when I increase the number of reduce tasks, keys don't get aggregated
properly; same keys seem to end up in separate output files
(output/part-00000, output/part-00001, etc). This should not happen because
right before reduce() gets called, all (k,v) pairs from all map outputs with
the same 'k' are aggregated and the reduce function just iterates over the
values (v1, v2, etc)?

Do I need to implement anything else inside my custom key class other than
compareTo? I also tried implementing equals() but that didn't help either.
Then I came across setOutputKeyComparator(). So I added a custom Comparator
class inside the key class and tried setting this on the JobConf object. But
that didn't work either. What could be wrong?

Cheers,

-- 
Harish Mallipeddi
circos.com : poundbang.in/blog/

Re: 答复: Problem with key aggregation when number of reduce tasks is more than 1

Posted by Harish Mallipeddi <ha...@gmail.com>.
Hey thanks a lot. That's basically what I needed.

2008/4/11 Zhang, jian <jz...@freewheel.tv>:

> Hi,
>
> Please read this, you need to implement partitioner.
> It controls which key is sent to which reducer, if u want to get unique
> key result, you need to implement partitioner and the compareTO function
> should work properly.
> [WIKI]
> Partitioner
>
> Partitioner partitions the key space.
>
> Partitioner controls the partitioning of the keys of the intermediate
> map-outputs. The key (or a subset of the key) is used to derive the
> partition, typically by a hash function. The total number of partitions is
> the same as the number of reduce tasks for the job. Hence this controls
> which of the m reduce tasks the intermediate key (and hence the record) is
> sent to for reduction.
>
> HashPartitioner is the default Partitioner.
>
>
>
> Best Regards
>
> Jian Zhang
>
>
> -----邮件原件-----
> 发件人: Harish Mallipeddi [mailto:harish.mallipeddi@gmail.com]
> 发送时间: 2008年4月11日 19:06
> 收件人: core-user@hadoop.apache.org
> 主题: Problem with key aggregation when number of reduce tasks is more than
> 1
>
> Hi all,
>
> I wrote a custom key class (implements WritableComparable) and implemented
> the compareTo() method inside this class. Everything works fine when I run
> the m/r job with 1 reduce task (via setNumReduceTasks). Keys are sorted
> correctly in the output files.
>
> But when I increase the number of reduce tasks, keys don't get aggregated
> properly; same keys seem to end up in separate output files
> (output/part-00000, output/part-00001, etc). This should not happen
> because
> right before reduce() gets called, all (k,v) pairs from all map outputs
> with
> the same 'k' are aggregated and the reduce function just iterates over the
> values (v1, v2, etc)?
>
> Do I need to implement anything else inside my custom key class other than
> compareTo? I also tried implementing equals() but that didn't help either.
> Then I came across setOutputKeyComparator(). So I added a custom
> Comparator
> class inside the key class and tried setting this on the JobConf object.
> But
> that didn't work either. What could be wrong?
>
> Cheers,
>
> --
> Harish Mallipeddi
> circos.com : poundbang.in/blog/
>



-- 
Harish Mallipeddi
circos.com : poundbang.in/blog/

Re: 答复: Problem with key aggregation when number of reduce tasks is more than 1

Posted by Pete Wyckoff <pw...@facebook.com>.
Yes and as such, we've found better load balancing when the #of reduces is a
prime #.  Although the string.hashCode isn't great for short strings.


On 4/11/08 4:16 AM, "Zhang, jian" <jz...@freewheel.tv> wrote:

> Hi,
> 
> Please read this, you need to implement partitioner.
> It controls which key is sent to which reducer, if u want to get unique key
> result, you need to implement partitioner and the compareTO function should
> work properly. 
> [WIKI]
> Partitioner
> 
> Partitioner partitions the key space.
> 
> Partitioner controls the partitioning of the keys of the intermediate
> map-outputs. The key (or a subset of the key) is used to derive the partition,
> typically by a hash function. The total number of partitions is the same as
> the number of reduce tasks for the job. Hence this controls which of the m
> reduce tasks the intermediate key (and hence the record) is sent to for
> reduction.
> 
> HashPartitioner is the default Partitioner.
> 
> 
> 
> Best Regards
> 
> Jian Zhang
> 
> 
> -----邮件原件-----
> 发件人: Harish Mallipeddi [mailto:harish.mallipeddi@gmail.com]
> 发送时间: 2008年4月11日 19:06
> 收件人: core-user@hadoop.apache.org
> 主题: Problem with key aggregation when number of reduce tasks is more than 1
> 
> Hi all,
> 
> I wrote a custom key class (implements WritableComparable) and implemented
> the compareTo() method inside this class. Everything works fine when I run
> the m/r job with 1 reduce task (via setNumReduceTasks). Keys are sorted
> correctly in the output files.
> 
> But when I increase the number of reduce tasks, keys don't get aggregated
> properly; same keys seem to end up in separate output files
> (output/part-00000, output/part-00001, etc). This should not happen because
> right before reduce() gets called, all (k,v) pairs from all map outputs with
> the same 'k' are aggregated and the reduce function just iterates over the
> values (v1, v2, etc)?
> 
> Do I need to implement anything else inside my custom key class other than
> compareTo? I also tried implementing equals() but that didn't help either.
> Then I came across setOutputKeyComparator(). So I added a custom Comparator
> class inside the key class and tried setting this on the JobConf object. But
> that didn't work either. What could be wrong?
> 
> Cheers,