Posted to user@pig.apache.org by abhishek dodda <ab...@gmail.com> on 2012/10/16 05:47:41 UTC

Pig optimization rules

Hi all,

I am trying to learn and apply Pig optimization rules. Can anyone help me
understand the properties below?

The amount of memory allocated to bags is determined by
pig.cachedbag.memusage; the default is set to 20% (0.2) of available memory.
Note that this memory is shared across all large bags used by the
application.

*Which memory is this? 20% of which memory is allocated?*

Which factors should be considered when setting the number of reducers for
an outer join query using 'replicated'? Will increasing the number of
reducers in an outer join query improve performance?

Regards
Abhi

Re: Pig optimization rules

Posted by abhishek dodda <ab...@gmail.com>.

Hi Thejas,

Thanks for the reply.

>> Which factors should be considered when setting the number of reducers
>> for an outer join query using 'replicated'? Will increasing the number of
>> reducers in an outer join query improve performance?

> Yes, it should increase performance. One thing to watch for is skew among
> the reduce runtimes. If the reduce runtimes are very skewed, you might want
> to consider a skewed join.

I have some follow-up questions.

1) I understand your answer, but my question is about the *number of
reducers*: what should I consider when choosing how many reducers to run?

Example:

 A = JOIN B BY ID LEFT OUTER, C BY ID USING 'replicated' PARALLEL ??;  -- <-- How do I select this number?
2) What is the importance of the property
mapred.job.reduce.markreset.buffer.percent? How does it affect performance,
and what is the optimal value for this parameter?

3) I have read that the Bloom filter in Pig 0.10 affects join performance.
How efficient is a Bloom filter compared to a replicated join? Can a Bloom
filter be applied to an outer join?

Regards
Abhishek

On Tue, Oct 16, 2012 at 10:04 PM, Thejas Nair <th...@hortonworks.com>wrote:

> On 10/15/12 8:47 PM, abhishek dodda wrote:
>
>> Hi all,
>>
>> I am trying to learn and apply Pig optimization rules. Can anyone help
>> me understand the properties below?
>>
>> The amount of memory allocated to bags is determined by
>> pig.cachedbag.memusage; the default is set to 20% (0.2) of available
>> memory. Note that this memory is shared across all large bags used by
>> the application.
>>
>> *Which memory is this? 20% of which memory is allocated?*
>>
>
> This is 20% of the memory available to the map/reduce task, i.e. the JVM
> maximum memory limit.
>
>
>> Which factors should be considered when setting the number of reducers
>> for an outer join query using 'replicated'? Will increasing the number
>> of reducers in an outer join query improve performance?
>>
>>
> Yes, it should increase performance. One thing to watch for is skew among
> the reduce runtimes. If the reduce runtimes are very skewed, you might want
> to consider a skewed join.
>
> -Thejas
>
>

Re: Pig optimization rules

Posted by Thejas Nair <th...@hortonworks.com>.

On 10/15/12 8:47 PM, abhishek dodda wrote:
> Hi all,
>
> I am trying to learn and apply Pig optimization rules. Can anyone help me
> understand the properties below?
>
> The amount of memory allocated to bags is determined by
> pig.cachedbag.memusage; the default is set to 20% (0.2) of available
> memory. Note that this memory is shared across all large bags used by the
> application.
>
> *Which memory is this? 20% of which memory is allocated?*

This is 20% of the memory available to the map/reduce task, i.e. the JVM
maximum memory limit.
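
As a rough illustration of the answer above: if the task JVMs were started
with a 1 GB maximum heap (an assumed value, not something stated in this
thread), the default of 0.2 would leave roughly 200 MB to be shared by all
large bags in a task before they spill to disk. A minimal sketch of setting
both values from a Pig script, assuming the Hadoop 1.x property
mapred.child.java.opts is what controls the task heap on your cluster:

```pig
-- Assumption: each map/reduce task JVM gets a 1 GB maximum heap.
SET mapred.child.java.opts '-Xmx1024m';

-- Fraction of that heap shared by all large bags in a task.
-- 0.2 is the default; with a 1 GB heap this is roughly 200 MB.
SET pig.cachedbag.memusage '0.2';
```

Raising the fraction gives bags more room before spilling, at the cost of
heap available to the rest of the task, so any change is worth testing on
representative data.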

> Which factors should be considered when setting the number of reducers for
> an outer join query using 'replicated'? Will increasing the number of
> reducers in an outer join query improve performance?
>

Yes, it should increase performance. One thing to watch for is skew among
the reduce runtimes. If the reduce runtimes are very skewed, you might want
to consider a skewed join.
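
A minimal sketch of the skewed join mentioned above, using hypothetical
input paths and field names. PARALLEL sets the number of reducers for the
join; the value 20 is only a placeholder to tune against your data size and
cluster. Note that Pig documents the 'replicated' join as a map-side join,
so PARALLEL is unlikely to matter for that form of the query.

```pig
-- Hypothetical inputs; replace paths and schemas with your own.
B = LOAD 'big_input'   AS (id:int, val:chararray);
C = LOAD 'other_input' AS (id:int, info:chararray);

-- Skewed left outer join; PARALLEL 20 is a placeholder reducer count.
A = JOIN B BY id LEFT OUTER, C BY id USING 'skewed' PARALLEL 20;
```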

-Thejas