Posted to user@pig.apache.org by Rob Stewart <ro...@googlemail.com> on 2010/01/18 11:35:46 UTC

skewed optimizations

Hi again,

I would like to know whether the "skewed" optimization is only
applicable to JOINs?

Is it (or will it be) available for GROUP BYs and ORDER BYs? Or is it not
logically possible?

thanks,

Rob Stewart

Re: skewed optimizations

Posted by Alan Gates <ga...@yahoo-inc.com>.

Sorry, I should have taken the time to figure out that Rekha is a  
feminine name.

Alan.

On Jan 19, 2010, at 8:04 PM, Rekha Joshi wrote:

> Thx Alan...and on a lighter note, Rekha is a she, not a he...though  
> she would have liked to be a he, could play football matches, drink  
> beers etc..:-)
>
> Cheers,
> /R

Re: skewed optimizations

Posted by Rekha Joshi <re...@yahoo-inc.com>.

Thx Alan...and on a lighter note, Rekha is a she, not a he...though she would have liked to be a he, could play football matches, drink beers etc..:-)

Cheers,
/R

On 1/20/10 1:41 AM, "Alan Gates" <ga...@yahoo-inc.com> wrote:

Let me elaborate on what Rekha said.  He's correct that Pig does it
automatically for order by.  It has to sample the input to the order
by to decide how to distribute the keys.  As part of this it notices
any skew and spreads skewed keys across multiple reducers.

Skew is harder to handle in group by than in join and order by because
by definition grouping requires that all equal keys be collected
together.  So Pig cannot "cheat" and split up keys on the reducers
like it does in join and order by.  Again, Rekha is correct that the
combiner helps here.  If the operation is algebraic then using the
combiner guarantees that the number of occurrences of a given key in a
given reducer is bounded by the number of maps and the fan-in factor
of the reduce side merge.  This will not eliminate skew, but it will
bound it.  In practice we see that this greatly increases
performance.  So whenever possible aggregate functions should
implement the Algebraic interface.

For operations that are not algebraic there is not a lot that can be
done.  The accumulator interface helps Pig survive these situations
without running out of memory.  But it does not remove or even bound
the skew.  This means that if there is skew in the data, group will
see the lagging reducers issue.

Alan.

Re: skewed optimizations

Posted by Alan Gates <ga...@yahoo-inc.com>.

Let me elaborate on what Rekha said.  He's correct that Pig does it  
automatically for order by.  It has to sample the input to the order  
by to decide how to distribute the keys.  As part of this it notices
any skew and spreads skewed keys across multiple reducers.
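
A rough way to picture this sampling step is the Python sketch below. It is not Pig's actual partitioner, only an illustration under simple assumptions: the function names `build_boundaries` and `choose_reducer` are made up for this example, cut points are taken as plain quantiles of a sample, and a skewed key is spread uniformly at random over the reducers whose ranges it fills.

```python
import bisect
import random


def build_boundaries(sample, num_reducers):
    """Pick num_reducers - 1 cut points as quantiles of a sample of the keys."""
    ordered = sorted(sample)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]


def choose_reducer(key, boundaries, rng):
    """Range-partition a key so that reducer outputs stay globally ordered.

    A heavily skewed key shows up as several equal cut points. Such a key
    may go to any reducer between its first and last matching range.
    """
    lo = bisect.bisect_left(boundaries, key)   # cut points strictly below key
    hi = bisect.bisect_right(boundaries, key)  # cut points below or equal to key
    return rng.randint(lo, hi)                 # lo == hi unless key equals a cut point


# Hypothetical data: key 5 makes up 64% of 1000 records.
keys = [5] * 600 + [i % 10 for i in range(400)]
boundaries = build_boundaries(keys, num_reducers=4)
rng = random.Random(0)
loads = [0, 0, 0, 0]
for key in keys:
    loads[choose_reducer(key, boundaries, rng)] += 1
print(boundaries, loads)
```

Keys other than 5 always land on a single reducer here, so reading the reducer outputs one after another still gives a total order, while the records for key 5 are shared among all four reducers.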

Skew is harder to handle in group by than in join and order by because
by definition grouping requires that all equal keys be collected
together.  So Pig cannot "cheat" and split up keys on the reducers
like it does in join and order by.  Again, Rekha is correct that the
combiner helps here.  If the operation is algebraic then using the
combiner guarantees that the number of occurrences of a given key in a
given reducer is bounded by the number of maps and the fan-in factor
of the reduce side merge.  This will not eliminate skew, but it will  
bound it.  In practice we see that this greatly increases  
performance.  So whenever possible aggregate functions should  
implement the Algebraic interface.
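
To illustrate why the combiner bounds the skew, here is a small Python sketch of an average computed in three stages. The stage names loosely follow the initial/intermediate/final split of Pig's Algebraic interface, but the functions, the `run_map_with_combiner` and `run_reduce` helpers, and the data are assumptions made for this example, not Pig code. The sketch also ignores the reduce side merge.

```python
from collections import defaultdict


def initial(value):
    """Turn one input value into a partial (sum, count)."""
    return (value, 1)


def intermediate(partials):
    """Merge partial (sum, count) pairs into one partial."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return (total, count)


def final(partials):
    """Merge the remaining partials and produce the average."""
    total, count = intermediate(partials)
    return total / count


def run_map_with_combiner(split):
    """One map task: group locally and emit a single partial per key."""
    local = defaultdict(list)
    for key, value in split:
        local[key].append(initial(value))
    return {key: intermediate(parts) for key, parts in local.items()}


def run_reduce(map_outputs):
    """Collect the partials for each key and finish the aggregate."""
    shuffled = defaultdict(list)
    for output in map_outputs:
        for key, partial in output.items():
            shuffled[key].append(partial)
    return shuffled, {key: final(parts) for key, parts in shuffled.items()}


# Hypothetical input: three map splits, each with 1000 records for one hot key.
splits = [[("hot", 2)] * 1000 + [("cold", 4)] for _ in range(3)]
shuffled, averages = run_reduce([run_map_with_combiner(s) for s in splits])
print(len(shuffled["hot"]), averages)
```

In this toy run the reducer receives 3 partials for the hot key, one per map, instead of 3000 raw records. The hot key still costs more than the cold one, but the cost no longer grows with the number of records.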

For operations that are not algebraic there is not a lot that can be  
done.  The accumulator interface helps Pig survive these situations  
without running out of memory.  But it does not remove or even bound  
the skew.  This means that if there is skew in the data, group will  
see the lagging reducers issue.
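
The difference from the algebraic case can be sketched in Python as follows. The class and method names (`LongestRun`, `accumulate`, `get_value`, `cleanup`), the `feed_in_batches` driver, and the batch size are assumptions for illustration, loosely modeled on the idea of handing a function its input in chunks; this is not Pig's Accumulator API itself. The function computes the longest run of equal consecutive values, which depends on record order and is used here as a stand-in for a non-algebraic operation.

```python
from itertools import islice


class LongestRun:
    """Longest run of equal consecutive values, fed one batch at a time."""

    def __init__(self):
        self.cleanup()

    def accumulate(self, batch):
        """Process one batch; only small running state is kept between batches."""
        for value in batch:
            if value == self.previous:
                self.current += 1
            else:
                self.previous = value
                self.current = 1
            self.best = max(self.best, self.current)

    def get_value(self):
        return self.best

    def cleanup(self):
        """Reset state so the object can be reused for the next key."""
        self.previous = None
        self.current = 0
        self.best = 0


def feed_in_batches(values, accumulator, batch_size):
    """Drive an accumulator without ever holding more than one batch in memory."""
    iterator = iter(values)
    records_seen = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        accumulator.accumulate(batch)
        records_seen += len(batch)
    result = accumulator.get_value()
    accumulator.cleanup()
    return result, records_seen


# A generator stands in for the values of one key arriving at a reducer.
values = (v for v in [1, 1, 2, 2, 2, 3])
result, records_seen = feed_in_batches(values, LongestRun(), batch_size=2)
print(result, records_seen)
```

The memory held at any moment is limited to one batch, but a single reducer still has to walk through every record of the key (`records_seen`), which is why the skew itself is untouched.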

Alan.

On Jan 18, 2010, at 3:09 AM, Rekha Joshi wrote:

> Hi,
>
> AFAIK, skewed is only for join; it basically handles skew in the
> input files by splitting one of the inputs on the join predicate and
> streaming the other input.
>
> So skewed works when there is more than one input dataset, as in join,
> and the processing differs from that required for order/group by.
>
> For operations like order by, Pig will distribute skewed keys across
> multiple reducers, and group by might be using the combiner/accumulator.
> So there seems to be no need for a separate skewed option there.
>
> Cheers,
> /R

Re: skewed optimizations

Posted by Rekha Joshi <re...@yahoo-inc.com>.

Hi,

AFAIK, skewed is only for join; it basically handles skew in the input files by splitting one of the inputs on the join predicate and streaming the other input.

So skewed works when there is more than one input dataset, as in join, and the processing differs from that required for order/group by.

For operations like order by, Pig will distribute skewed keys across multiple reducers, and group by might be using the combiner/accumulator. So there seems to be no need for a separate skewed option there.
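
As a rough illustration of the split-and-stream idea, here is a Python sketch of a two-input join where one hot key is handled specially. It is a simplification, not Pig's implementation: the `skewed_join` function, the round-robin routing, and the `hot_keys` argument (which stands in for whatever the system learns about key frequencies) are assumptions for this example.

```python
from collections import defaultdict
from itertools import count


def skewed_join(left, right, hot_keys, num_reducers):
    """Join two lists of (key, value) pairs, splitting hot keys of the left input."""
    reducers = [
        {"left": defaultdict(list), "right": defaultdict(list)}
        for _ in range(num_reducers)
    ]

    # Left input: a hot key is split round-robin, other keys are hash-partitioned.
    round_robin = count()
    for key, value in left:
        if key in hot_keys:
            target = next(round_robin) % num_reducers
        else:
            target = hash(key) % num_reducers
        reducers[target]["left"][key].append(value)

    # Right input: records for a hot key are copied to every reducer that
    # holds a piece of that key (all of them in this sketch).
    for key, value in right:
        if key in hot_keys:
            targets = range(num_reducers)
        else:
            targets = [hash(key) % num_reducers]
        for target in targets:
            reducers[target]["right"][key].append(value)

    # Each reducer joins only the pieces it received.
    joined, loads = [], []
    for reducer in reducers:
        before = len(joined)
        for key, left_values in reducer["left"].items():
            for left_value in left_values:
                for right_value in reducer["right"].get(key, []):
                    joined.append((key, left_value, right_value))
        loads.append(len(joined) - before)
    return joined, loads


# Hypothetical data: key 1 is the hot key on the left side.
left = [(1, i) for i in range(8)] + [(2, "x")]
right = [(1, "a"), (2, "b")]
joined, loads = skewed_join(left, right, hot_keys={1}, num_reducers=4)
print(len(joined), loads)
```

This only works because a join has two inputs: the pieces of the split key can each be matched against a copy of the other side. A group by or order by has no second input to copy, which is why the processing there has to differ.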

Cheers,
/R

On 1/18/10 4:05 PM, "Rob Stewart" <ro...@googlemail.com> wrote:

Hi again,

I would like to know whether the "skewed" optimization is only
applicable to JOINs?

Is it (or will it be) available for GROUP BYs and ORDER BYs? Or is it not
logically possible?

thanks,

Rob Stewart