Posted to common-user@hadoop.apache.org by Anh Nguyen <ng...@gmail.com> on 2009/09/16 09:24:00 UTC

Need help - Uneven workload distributed to Reducers

Hi all,

I am having some trouble distributing the workload evenly across reducers.

I have 25 reducers, and I intentionally created 25 different Map output keys
so that each output set would go to its own Reducer.

But in practice, some Reducers get 2 sets and some do not get anything.

I wonder if there is a way to fix this. Perhaps a custom Map output class?

Any help is greatly appreciated.

Thanks,

-- 
----------------------------
Anh Nguyen
http://www.im-nguyen.com

Re: Need help - Uneven workload distributed to Reducers

Posted by Ted Dunning <te...@gmail.com>.
A better solution is to output a large number of records with keys drawn
pseudo-randomly from a large domain of values.  Then records will balance
across however many reducers you have.  Because you are balancing a larger
number of records, the degree of imbalance that happens at random will be
much smaller than if you have just a few records.

The original poster might not have that option in his problem as stated, but
it sounded like he might be able to restructure the computation a bit.
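
Something like this on the map side, for example (a sketch only; the salt
size and the "#" key format are placeholders, not anything Hadoop
requires):

    import java.util.Random;

    // Sketch: salt each logical key with a pseudo-random suffix drawn from
    // a large domain so the default HashPartitioner spreads the records
    // across all reducers instead of piling them onto a few.
    public class KeySalter {
        private static final Random RANDOM = new Random();
        // Keep this large relative to the number of reducers.
        private static final int SALT_BUCKETS = 1024;

        public static String salt(String logicalKey) {
            return logicalKey + "#" + RANDOM.nextInt(SALT_BUCKETS);
        }
    }

The reducer then strips everything after the "#" to recover the original
key; if you need per-key aggregates, a small second pass can re-combine
them.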

On Wed, Sep 16, 2009 at 4:08 AM, <am...@students.iiit.ac.in> wrote:

> Hi,
>
> Depending on the data distribution, the hash codes generated by
> key.hashCode() can produce a large skew in the data handed to each
> reducer: one reducer might get a very large dataset while the others get
> small ones, making the whole job wait until the busiest reducer finishes.
>
> Is there a way to split the partitions based on the size of each
> partition?
>
> Thanks!
> Amol.


-- 
Ted Dunning, CTO
DeepDyve

Re: Need help - Uneven workload distributed to Reducers

Posted by am...@students.iiit.ac.in.
Hi,

Depending on the data distribution, the hash codes generated by
key.hashCode() can produce a large skew in the data handed to each
reducer: one reducer might get a very large dataset while the others get
small ones, making the whole job wait until the busiest reducer finishes.
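
For instance, this quick standalone check (the key strings are made up;
substitute the real map-output keys from the job) shows how easily
distinct keys can collide under the default (key.hashCode() &
Integer.MAX_VALUE) % numReduceTasks formula:

    // Prints which reducer each map-output key would land on under the
    // default HashPartitioner formula. The key names are placeholders.
    public class PartitionCheck {
        public static void main(String[] args) {
            int numReduceTasks = 25;
            for (int i = 0; i < 25; i++) {
                String key = "key-" + i;
                int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
                System.out.println(key + " -> reducer " + partition);
            }
        }
    }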

Is there a way to split the partitions based on the size of each
partition?

Thanks!
Amol.


> Thanks,
>
> I will try what you suggested.
>
> Best,


Re: Need help - Uneven workload distributed to Reducers

Posted by Anh Nguyen <ng...@gmail.com>.
Thanks,

I will try what you suggested.

Best,

On Wed, Sep 16, 2009 at 2:59 AM, Harish Mallipeddi <
harish.mallipeddi@gmail.com> wrote:

> The default HashPartitioner computes:
>
>     (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
>
> So there's no guarantee your 25 different map-output keys would in fact
> end up in 25 different partitions.
> Btw, if you want custom partitioning behavior, just implement the
> Partitioner interface in your own Partitioner class and supply it to
> Hadoop (via JobConf.setPartitionerClass).
>
>
> --
> Harish Mallipeddi
> http://blog.poundbang.in
>



-- 
----------------------------
Anh Nguyen
http://www.im-nguyen.com

Re: Need help - Uneven workload distributed to Reducers

Posted by Harish Mallipeddi <ha...@gmail.com>.
On Wed, Sep 16, 2009 at 12:54 PM, Anh Nguyen <ng...@gmail.com>wrote:

> Hi all,
>
> I am having some trouble distributing the workload evenly across reducers.
>
> I have 25 reducers, and I intentionally created 25 different Map output keys
> so that each output set would go to its own Reducer.
>
> But in practice, some Reducers get 2 sets and some do not get anything.
>
> I wonder if there is a way to fix this. Perhaps a custom Map output class?
>
> Any help is greatly appreciated.
>
>
The default HashPartitioner computes:

    (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks

So there's no guarantee your 25 different map-output keys would in fact
end up in 25 different partitions.
Btw, if you want custom partitioning behavior, just implement the
Partitioner interface in your own Partitioner class and supply it to
Hadoop (via JobConf.setPartitionerClass).
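
For example, if the 25 map-output keys can carry their own partition
number (an assumption on my part; adapt the key and value types to your
job), a direct Partitioner against the old mapred API could look like
this (untested sketch):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Sketch: sends key i straight to reducer i. Assumes the map-output
    // keys are IntWritables in the range 0..numReduceTasks-1 and that the
    // value type is Text.
    public class DirectPartitioner implements Partitioner<IntWritable, Text> {

        public void configure(JobConf job) {
            // nothing to configure
        }

        public int getPartition(IntWritable key, Text value, int numPartitions) {
            return key.get() % numPartitions;
        }
    }

Then in the driver:

    conf.setNumReduceTasks(25);
    conf.setPartitionerClass(DirectPartitioner.class);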


-- 
Harish Mallipeddi
http://blog.poundbang.in