Posted to user@pig.apache.org by Jie Li <ji...@cs.duke.edu> on 2012/06/21 20:14:09 UTC

Some proposals for Pig performance optimization

Hello everyone,

I compiled a list of possible optimizations for Pig's performance.

https://cwiki.apache.org/confluence/display/PIG/Pig+Performance+Optimization

As I am not yet very familiar with the codebase, I'm likely to
underestimate the complexity involved, so any input will be
appreciated.

Thanks,
Jie

Re: Some proposals for Pig performance optimization

Posted by Jie Li <ji...@cs.duke.edu>.

Thanks Thejas for the comments! See my answers inline.

> 1. Order-by
> The comparison against hive order-by is misleading. Hive does not do total
> ordering, unless you use a single reducer.
> But yes, in case of pig, the sampling phase is unnecessary, if you use a
> single reducer. A single reducer can make sense if the data you are sorting
> is small. I agree that it makes sense to remove the sampling phase in pig in
> such cases.

Yes, the environment setup uses only 1GB of data, so there is only 1
reducer for the order-by. I've also updated the doc to note that Hive
always uses 1 reducer for the order-by.

I'll also make sure Pig and Hive use the same number of maps/reduces
if possible and update the doc.

> 2. Lazy type conversion
> Can you add a note about how many records are there in input vs output ?
> In this example, we can improve by using the logical optimizer, so only
> necessary parts are typecast before the filter.
>

I've purposely filtered out all the input records. According to the
logical plan, the filter is not pushed above the foreach, which may be
a separate issue that needs investigating. Therefore, each record is
fully deserialized and then thrown away.

> One problem in pig is that it uses java objects like Integer, String etc
> which are final types. Which means that we can't create a subclass that
> delays the conversion until it actually gets used.  The types are part of
> the udf interface. We should consider if we want to do something like this,
> when we add new udf interfaces.
>
> Some thoughts on serialization/deserialization improvements that i had
> written earlier - http://wiki.apache.org/pig/AvoidingSedes
>

Thanks for sharing these thoughts! I'll incorporate them into the doc
and discuss more details later.

Jie

> Thanks,
> Thejas

Re: Some proposals for Pig performance optimization

Posted by Thejas Nair <th...@hortonworks.com>.

bcc'ing the user list.

1. Order-by
The comparison against Hive order-by is misleading. Hive does not do
total ordering unless you use a single reducer.
But yes, in the case of Pig, the sampling phase is unnecessary if you
use a single reducer. A single reducer can make sense if the data you
are sorting is small. I agree that it makes sense to remove the
sampling phase in Pig in such cases.
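
To illustrate why sampling only matters with more than one reducer,
here is a minimal Java sketch. It is not Pig's actual planner or
partitioner code; the class and method names (OrderByPlanSketch,
splitPoints, partition, needsSamplingJob) are hypothetical and keys
are assumed to be plain ints. With several reducers, a total order
needs range split points taken from a sample; with one reducer there
are no split points, so a sampling job has nothing to compute.

```java
import java.util.Arrays;

// Hypothetical sketch; not Pig's actual implementation.
final class OrderByPlanSketch {

    // Sampling is only useful when keys must be range-partitioned
    // across more than one reducer.
    static boolean needsSamplingJob(int numReducers) {
        return numReducers > 1;
    }

    // Picks numReducers - 1 split points from a sample of the keys.
    // Returns an empty array for a single reducer or an empty sample.
    static int[] splitPoints(int[] sample, int numReducers) {
        if (numReducers <= 1 || sample.length == 0) {
            return new int[0];
        }
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] splits = new int[numReducers - 1];
        for (int i = 1; i < numReducers; i++) {
            splits[i - 1] = sorted[i * sorted.length / numReducers];
        }
        return splits;
    }

    // Range partitioner: reducer r receives the keys below splits[r]
    // that no earlier reducer took, so concatenating the sorted
    // reducer outputs in order yields a total order.
    static int partition(int key, int[] splits) {
        int r = 0;
        while (r < splits.length && key >= splits[r]) {
            r++;
        }
        return r;
    }
}
```

With numReducers set to 1, splitPoints returns an empty array and
partition sends every key to reducer 0, which is the case where the
sampling phase could be skipped.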

2. Lazy type conversion
Can you add a note about how many records there are in the input vs.
the output? In this example, we can improve things by using the logical
optimizer, so that only the necessary parts are typecast before the
filter.

One problem in Pig is that it uses Java objects like Integer, String,
etc., which are final types. This means that we can't create a subclass
that delays the conversion until the value actually gets used. The
types are part of the UDF interface. We should consider whether we want
to do something like this when we add new UDF interfaces.
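
As a rough illustration of the idea, here is a minimal Java sketch of
a wrapper that holds the raw bytes and converts them only on first
use. Since java.lang.Integer is final, the wrapper has to be a
separate type rather than a subclass. LazyInt and its methods are
hypothetical names, not part of Pig's API, and the sketch assumes the
field is stored as UTF-8 decimal text.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper; not part of Pig's API.
final class LazyInt {
    private final byte[] raw;   // serialized form, kept as read
    private Integer value;      // cached result, null until first use
    private int conversions;    // counts how often parsing actually ran

    LazyInt(byte[] raw) {
        this.raw = raw;
    }

    // Parses the raw bytes on the first call and reuses the result
    // afterwards. Throws NumberFormatException on malformed input.
    int get() {
        if (value == null) {
            String text = new String(raw, StandardCharsets.UTF_8).trim();
            value = Integer.parseInt(text);
            conversions++;
        }
        return value;
    }

    int conversions() {
        return conversions;
    }
}
```

A record that is dropped by a filter on some other field would never
call get() on this one, so its conversion cost is never paid.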

Some thoughts on serialization/deserialization improvements that I had
written up earlier: http://wiki.apache.org/pig/AvoidingSedes

Thanks,
Thejas
