Posted to user@pig.apache.org by Tamir Kamara <ta...@gmail.com> on 2009/07/02 13:55:14 UTC

Uneven Reduce Issue

Hi,

Recently my cluster configuration changed from 11 reducers to 12. Since then,
on every job using Pig only 3 reducers do actual work and output results,
while the others finish quickly and output zero-byte files. The result of
the entire job is OK, but getting there takes longer because of the uneven
work distribution. Plain MR jobs are fine and do have an even work
distribution. Can this be something in Pig?
(I'm using the latest trunk.)

Thanks,
Tamir

Re: Uneven Reduce Issue

Posted by Ted Dunning <te...@gmail.com>.
Pretty much what you did is the right thing to do.

Bad hashes are just bad.  They do what you saw when you have particular
numbers of reducers.  You might convert to a long by some means other than
casting, but the basic fix is to not use bad hashes.
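
As a concrete illustration of the "particular numbers of reducers" effect,
here is a minimal sketch that assumes, hypothetically, that the long keys
are all multiples of 4 (not Tamir's actual data): Long.hashCode() is
essentially the value itself, so key % 12 can only ever be 0, 4 or 8
(three reducers), while key % 11 cycles through all eleven residues.

import java.util.Arrays;

public class BadHashDemo {
    public static void main(String[] args) {
        int[] with12 = new int[12];
        int[] with11 = new int[11];
        // Hypothetical key pattern: multiples of 4.
        for (long key = 0; key < 1200; key += 4) {
            // For small longs, Long.hashCode() is just the value itself.
            int hash = Long.valueOf(key).hashCode() & Integer.MAX_VALUE;
            with12[hash % 12]++;   // only buckets 0, 4 and 8 ever fill
            with11[hash % 11]++;   // all eleven buckets fill
        }
        System.out.println("12 reducers: " + Arrays.toString(with12));
        System.out.println("11 reducers: " + Arrays.toString(with11));
    }
}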

On Tue, Jul 7, 2009 at 12:46 AM, Tamir Kamara <ta...@gmail.com> wrote:

>  I dropped the casting of my key to long to force a real hash of the key
> values and got a nice spread of the work to all reducers.
> What should normally be done to avoid this problem?
>

Re: Uneven Reduce Issue

Posted by Tamir Kamara <ta...@gmail.com>.
Hi Alan,

The distribution is not uniform, as can be seen from the image.
Long.hashCode() is simply the original number, and if the key values aren't
distributed well the "hash" won't be either. I dropped the casting of my key
to long to force a real hash of the key values and got a nice spread of the
work to all reducers.
What should normally be done to avoid this problem?

Thanks,
Tamir
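
One generic way to force a better spread, sketched below under the
assumption that the raw long keys share a common factor with the reducer
count (this is not something Pig does automatically): run the key through
a bit mixer, here the 64-bit finalizer from MurmurHash3, before taking the
modulus.

public final class LongKeyMixer {
    // Mix the bits of a long key so that structured key values (e.g. all
    // multiples of some constant) no longer collapse onto a few partitions.
    public static int spread(long key, int numReduceTasks) {
        key ^= (key >>> 33);
        key *= 0xff51afd7ed558ccdL;
        key ^= (key >>> 33);
        key *= 0xc4ceb9fe1a85ec53L;
        key ^= (key >>> 33);
        return (int) ((key & Long.MAX_VALUE) % numReduceTasks);
    }
}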


On Mon, Jul 6, 2009 at 10:41 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> What is the distribution of the key?  Is it fairly uniform, a gaussian
> distribution, or a power-law distribution?  It seems like the hash function
> is not well chosen for 12 reducers.  We use Long.hashCode() to get hash
> values, so as long as the keys are well distributed the hash code should be
> as well.
>
> Can you attach a sample of the data (or at least the keys)?
>
> Alan.
>
>
> On Jul 5, 2009, at 2:18 AM, Tamir Kamara wrote:
>
>  Hi,
>>
>> The config files are up to date and have the 12 reducers.
>> I've been able to verify that this only happens when a UDF is used. I
>> mostly
>> use eval functions. An example script:
>>
>> a01 = load 'file' as (key: long, value: int);
>> the same for a02-a31;
>> b = cogroup a01 by key, a02 by key, ..., a31 by key;
>> DEFINE MEDMAD14 pigUDF.MedMad('14');
>> c = foreach b generate group as key, flatten(MEDMAD14(a01, a02, ...,
>> a31));
>>
>> The MEDMAD function iterates over the input and produces a bag of a
>> rolling
>> median and mad according to the window value passed in the definition.
>>
>> If cogroup is followed by parallel 11 then all is fine (11 equal result
>> parts) but if I use parallel 12 then I get only 3 large files and 9 zero
>> bytes files.
>>
>> What do you think?
>>
>>
>> Thanks,
>> Tamir
>>
>>
>> On Thu, Jul 2, 2009 at 6:23 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
>>
>>  For most operations Pig uses the default Hadoop partitioner.  We do set
>>> our
>>> own partitioner for order by, but at the moment I believe that's it.  How
>>> are you launching Pig?  Is it possible it's picking up an old
>>> hadoop-site.xml file or something?
>>>
>>> Alan.
>>>
>>>
>>> On Jul 2, 2009, at 4:55 AM, Tamir Kamara wrote:
>>>
>>> Hi,
>>>
>>>>
>>>> Recently my cluster configuration changed from 11 reducers to 12. Since
>>>> then
>>>> on every job using pig only 3 reducers do actual work and output results
>>>> while the others quickly finish and output zero byte files. The result
>>>> of
>>>> the entire job is OK but getting there takes longer because of the
>>>> uneven
>>>> work distribution. Plain MR jobs are fine and do have even work
>>>> distribution. Can this be something in pig ?
>>>> (I'm using latest trunk)
>>>>
>>>> Thanks,
>>>> Tamir
>>>>
>>>>
>>>
>>>
>

Re: Uneven Reduce Issue

Posted by Alan Gates <ga...@yahoo-inc.com>.
What is the distribution of the key?  Is it fairly uniform, a Gaussian
distribution, or a power-law distribution?  It seems like the hash
function is not well chosen for 12 reducers.  We use Long.hashCode()
to get hash values, so as long as the keys are well distributed the
hash code should be as well.

Can you attach a sample of the data (or at least the keys)?

Alan.

On Jul 5, 2009, at 2:18 AM, Tamir Kamara wrote:

> Hi,
>
> The config files are up to date and have the 12 reducers.
> I've been able to verify that this only happens when a UDF is used.  
> I mostly
> use eval functions. An example script:
>
> a01 = load 'file' as (key: long, value: int);
> the same for a02-a31;
> b = cogroup a01 by key, a02 by key, ..., a31 by key;
> DEFINE MEDMAD14 pigUDF.MedMad('14');
> c = foreach b generate group as key, flatten(MEDMAD14(a01, a02, ...,  
> a31));
>
> The MEDMAD function iterates over the input and produces a bag of a  
> rolling
> median and mad according to the window value passed in the definition.
>
> If cogroup is followed by parallel 11 then all is fine (11 equal  
> result
> parts) but if I use parallel 12 then I get only 3 large files and 9  
> zero
> bytes files.
>
> What do you think?
>
>
> Thanks,
> Tamir
>
>
> On Thu, Jul 2, 2009 at 6:23 PM, Alan Gates <ga...@yahoo-inc.com>  
> wrote:
>
>> For most operations Pig uses the default Hadoop partitioner.  We do  
>> set our
>> own partitioner for order by, but at the moment I believe that's  
>> it.  How
>> are you launching Pig?  Is it possible it's picking up an old
>> hadoop-site.xml file or something?
>>
>> Alan.
>>
>>
>> On Jul 2, 2009, at 4:55 AM, Tamir Kamara wrote:
>>
>> Hi,
>>>
>>> Recently my cluster configuration changed from 11 reducers to 12.  
>>> Since
>>> then
>>> on every job using pig only 3 reducers do actual work and output  
>>> results
>>> while the others quickly finish and output zero byte files. The  
>>> result of
>>> the entire job is OK but getting there takes longer because of the  
>>> uneven
>>> work distribution. Plain MR jobs are fine and do have even work
>>> distribution. Can this be something in pig ?
>>> (I'm using latest trunk)
>>>
>>> Thanks,
>>> Tamir
>>>
>>
>>


Re: Uneven Reduce Issue

Posted by Tamir Kamara <ta...@gmail.com>.
Please disregard my comment about the UDF. Even without it - just storing b
from the query below (store b into ...) - the same behavior with 11 vs. 12
reducers is seen.


On Sun, Jul 5, 2009 at 12:18 PM, Tamir Kamara <ta...@gmail.com> wrote:

> Hi,
>
> The config files are up to date and have the 12 reducers.
> I've been able to verify that this only happens when a UDF is used. I
> mostly use eval functions. An example script:
>
> a01 = load 'file' as (key: long, value: int);
> the same for a02-a31;
> b = cogroup a01 by key, a02 by key, ..., a31 by key;
> DEFINE MEDMAD14 pigUDF.MedMad('14');
> c = foreach b generate group as key, flatten(MEDMAD14(a01, a02, ..., a31));
>
> The MEDMAD function iterates over the input and produces a bag of a rolling
> median and mad according to the window value passed in the definition.
>
> If cogroup is followed by parallel 11 then all is fine (11 equal result
> parts) but if I use parallel 12 then I get only 3 large files and 9 zero
> bytes files.
>
> What do you think?
>
>
> Thanks,
> Tamir
>
>
>
> On Thu, Jul 2, 2009 at 6:23 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
>
>> For most operations Pig uses the default Hadoop partitioner.  We do set
>> our own partitioner for order by, but at the moment I believe that's it.
>>  How are you launching Pig?  Is it possible it's picking up an old
>> hadoop-site.xml file or something?
>>
>> Alan.
>>
>>
>> On Jul 2, 2009, at 4:55 AM, Tamir Kamara wrote:
>>
>>  Hi,
>>>
>>> Recently my cluster configuration changed from 11 reducers to 12. Since
>>> then
>>> on every job using pig only 3 reducers do actual work and output results
>>> while the others quickly finish and output zero byte files. The result of
>>> the entire job is OK but getting there takes longer because of the uneven
>>> work distribution. Plain MR jobs are fine and do have even work
>>> distribution. Can this be something in pig ?
>>> (I'm using latest trunk)
>>>
>>> Thanks,
>>> Tamir
>>>
>>
>>
>

Re: Uneven Reduce Issue

Posted by Tamir Kamara <ta...@gmail.com>.
Hi,

The config files are up to date and have the 12 reducers.
I've been able to verify that this only happens when a UDF is used. I mostly
use eval functions. An example script:

a01 = load 'file' as (key: long, value: int);
the same for a02-a31;
b = cogroup a01 by key, a02 by key, ..., a31 by key;
DEFINE MEDMAD14 pigUDF.MedMad('14');
c = foreach b generate group as key, flatten(MEDMAD14(a01, a02, ..., a31));

The MEDMAD function iterates over the input bags and produces a bag of a
rolling median and MAD (median absolute deviation) according to the window
size passed in the definition ('14' above).
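
For readers unfamiliar with parameterized Pig UDFs, here is a minimal
sketch of that shape. The real MedMad logic is not shown; the class name,
fields and output below are assumptions, used only to illustrate how the
'14' from the DEFINE reaches the constructor and how the cogrouped bags
arrive in exec().

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MedMadSketch extends EvalFunc<DataBag> {
    private final int window;

    // DEFINE MEDMAD14 pigUDF.MedMad('14') passes "14" to this constructor.
    public MedMadSketch(String window) {
        this.window = Integer.parseInt(window);
    }

    @Override
    public DataBag exec(Tuple input) throws IOException {
        DataBag out = BagFactory.getInstance().newDefaultBag();
        if (input == null) {
            return out;
        }
        // Each field of the input tuple (a01, a02, ..., a31) is a bag.
        for (int i = 0; i < input.size(); i++) {
            DataBag bag = (DataBag) input.get(i);
            long count = (bag == null) ? 0 : bag.size();
            // The real UDF would maintain a rolling window of 'window'
            // values here and emit a (median, mad) tuple per position;
            // this sketch just emits the bag index and its size.
            Tuple t = TupleFactory.getInstance().newTuple(2);
            t.set(0, i);
            t.set(1, count);
            out.add(t);
        }
        return out;
    }
}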

If the cogroup is followed by parallel 11 then all is fine (11 equal result
parts), but if I use parallel 12 then I get only 3 large files and 9
zero-byte files.

What do you think?


Thanks,
Tamir


On Thu, Jul 2, 2009 at 6:23 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> For most operations Pig uses the default Hadoop partitioner.  We do set our
> own partitioner for order by, but at the moment I believe that's it.  How
> are you launching Pig?  Is it possible it's picking up an old
> hadoop-site.xml file or something?
>
> Alan.
>
>
> On Jul 2, 2009, at 4:55 AM, Tamir Kamara wrote:
>
>  Hi,
>>
>> Recently my cluster configuration changed from 11 reducers to 12. Since
>> then
>> on every job using pig only 3 reducers do actual work and output results
>> while the others quickly finish and output zero byte files. The result of
>> the entire job is OK but getting there takes longer because of the uneven
>> work distribution. Plain MR jobs are fine and do have even work
>> distribution. Can this be something in pig ?
>> (I'm using latest trunk)
>>
>> Thanks,
>> Tamir
>>
>
>

Re: Uneven Reduce Issue

Posted by Alan Gates <ga...@yahoo-inc.com>.
For most operations Pig uses the default Hadoop partitioner.  We do  
set our own partitioner for order by, but at the moment I believe  
that's it.  How are you launching Pig?  Is it possible it's picking up  
an old hadoop-site.xml file or something?

Alan.
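
For reference, the default partitioner mentioned above boils down to
roughly the following (a sketch, not the actual Hadoop class): the reducer
index is the key's hashCode, masked to non-negative, modulo the reducer
count, so with long keys the key values and the reducer count together
determine how evenly the work spreads.

public class DefaultPartitionSketch {
    // Roughly the logic of Hadoop's default HashPartitioner.
    public static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Example: a Long key of 1000 lands on reducer 1000 % 12 = 4.
        System.out.println(partitionFor(Long.valueOf(1000L), 12));
    }
}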

On Jul 2, 2009, at 4:55 AM, Tamir Kamara wrote:

> Hi,
>
> Recently my cluster configuration changed from 11 reducers to 12.  
> Since then
> on every job using pig only 3 reducers do actual work and output  
> results
> while the others quickly finish and output zero byte files. The  
> result of
> the entire job is OK but getting there takes longer because of the  
> uneven
> work distribution. Plain MR jobs are fine and do have even work
> distribution. Can this be something in pig ?
> (I'm using latest trunk)
>
> Thanks,
> Tamir


Re: Uneven Reduce Issue

Posted by Dmitriy Ryaboy <dv...@cloudera.com>.
Tamir, can you provide example queries that result in this behavior, and
describe or provide the input data?

-D

On Thu, Jul 2, 2009 at 4:55 AM, Tamir Kamara <ta...@gmail.com> wrote:

> Hi,
>
> Recently my cluster configuration changed from 11 reducers to 12. Since
> then
> on every job using pig only 3 reducers do actual work and output results
> while the others quickly finish and output zero byte files. The result of
> the entire job is OK but getting there takes longer because of the uneven
> work distribution. Plain MR jobs are fine and do have even work
> distribution. Can this be something in pig ?
> (I'm using latest trunk)
>
> Thanks,
> Tamir
>