Posted to user@pig.apache.org by Pankaj Gupta <pa...@brightroll.com> on 2012/06/01 20:45:53 UTC

Number of reduce tasks

Hi,

I just realized that one of my large-scale Pig jobs, which has 100K map tasks, has only one reduce task. Reading the documentation, I see that the number of reduce tasks is defined by the PARALLEL clause, whose default value is 1. I have a few questions around this:

# Why is the default number of reduce tasks 1?
# (Related to the first question) Why aren't reduce tasks parallelized automatically in Pig?
# How do I choose a good number of reduce tasks for my Pig jobs?

Thanks in Advance,
Pankaj

Re: Number of reduce tasks

Posted by Pankaj Gupta <pa...@brightroll.com>.
Aniket: No, I am not using HCatalog.

To follow up on this thread, I was indeed able to run multiple reduce tasks using the PARALLEL clause. Thanks, everyone, for helping out. Unfortunately, I then ran into an out-of-memory error, which I'm debugging now (I've created a separate thread for advice).

Thanks,
Pankaj
On Jun 18, 2012, at 1:29 AM, Aniket Mokashi wrote:

> Pankaj, are you using hcatalog?


Re: Number of reduce tasks

Posted by Aniket Mokashi <an...@gmail.com>.
Pankaj, are you using hcatalog?

On Fri, Jun 1, 2012 at 5:24 PM, Prashant Kommireddi <pr...@gmail.com> wrote:

> Right. And the documentation provides a list of operations that can be
> parallelized.

-- 
"...:::Aniket:::... Quetzalco@tl"

Re: Number of reduce tasks

Posted by Prashant Kommireddi <pr...@gmail.com>.
Right. And the documentation provides a list of operations that can be
parallelized.
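Those are the operators that actually trigger a reduce phase. A rough sketch of where the clause goes (aliases, paths, and the numbers are made up):

logs   = LOAD '/data/logs'  AS (user:chararray, bytes:long);
users  = LOAD '/data/users' AS (user:chararray, region:chararray);

byUser = GROUP logs BY user PARALLEL 40;                -- GROUP/COGROUP
joined = JOIN logs BY user, users BY user PARALLEL 40;  -- JOIN
uniq   = DISTINCT users PARALLEL 10;                    -- DISTINCT
sorted = ORDER logs BY bytes DESC PARALLEL 10;          -- ORDER BY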

On Jun 1, 2012, at 4:50 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> That being said, some operators such as "group all" and limit, do require using only 1 reducer, by nature. So it depends on what your script is doing.

Re: Number of reduce tasks

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
That being said, some operators, such as "group all" and limit, inherently require a single reducer. So it depends on what your script is doing.
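For example (made-up aliases), a script like this will always funnel through a single reducer regardless of PARALLEL:

logs  = LOAD '/data/logs' AS (user:chararray, bytes:long);
all_g = GROUP logs ALL;                       -- a single group, so a single reducer
total = FOREACH all_g GENERATE COUNT(logs);
top   = LIMIT logs 100;                       -- the final LIMIT also runs on one reducer

whereas something like GROUP logs BY user can be spread across many reducers.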

On Jun 1, 2012, at 12:26 PM, Prashant Kommireddi <pr...@gmail.com> wrote:

> Automatic Heuristic works the same in 0.9.1
> http://pig.apache.org/docs/r0.9.1/perf.html#parallel, but you might be
> better off setting it manually looking at job tracker counters.
> 
> You should be fine with using PARALLEL for any of the operators mentioned
> on the doc.

Re: Number of reduce tasks

Posted by Prashant Kommireddi <pr...@gmail.com>.
The automatic heuristic works the same way in 0.9.1
(http://pig.apache.org/docs/r0.9.1/perf.html#parallel), but you might be
better off setting the parallelism manually based on the JobTracker counters.

You should be fine using PARALLEL with any of the operators mentioned
in the doc.
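For example, in a 0.9.1 script you just tack the clause onto the reduce-side operator (the names and the number 100 here are only illustrative):

clicks  = LOAD '/data/clicks' AS (user:chararray, url:chararray);
byUser  = GROUP clicks BY user PARALLEL 100;   -- 100 reduce tasks for this job
perUser = FOREACH byUser GENERATE group AS user, COUNT(clicks) AS n;
STORE perUser INTO '/out/clicks_per_user';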

-Prashant


On Fri, Jun 1, 2012 at 12:19 PM, Pankaj Gupta <pa...@brightroll.com> wrote:

> Hi Prashant,
>
> Thanks for the tips. We haven't moved to Pig 0.10.0 yet, but seems like a
> very useful upgrade. For the moment though it seems that I should be able
> to use the 1GB per reducer heuristic and specify the number of reducers in
> Pig 0.9.1 by using the PARALLEL clause in the Pig script. Does this sound
> right?
>
> Thanks,
> Pankaj

Re: Number of reduce tasks

Posted by Pankaj Gupta <pa...@brightroll.com>.
Hi Prashant,

Thanks for the tips. We haven't moved to Pig 0.10.0 yet, but it seems like a very useful upgrade. For the moment, though, it seems that I should be able to use the 1 GB-per-reducer heuristic and specify the number of reducers in Pig 0.9.1 by using the PARALLEL clause in the Pig script. Does this sound right?

Thanks,
Pankaj


On Jun 1, 2012, at 12:03 PM, Prashant Kommireddi wrote:

> Also, please note default number of reducers are based on input dataset. In
> the basic case, Pig will "automatically" spawn a reducer for each GB of
> input, so if your input dataset size is 500 GB you should see 500 reducers
> being spawned (though this is excessive in a lot of cases).


Re: Number of reduce tasks

Posted by Prashant Kommireddi <pr...@gmail.com>.
Also, please note that the default number of reducers is based on the input dataset. In
the basic case, Pig will "automatically" spawn a reducer for each GB of
input, so if your input dataset is 500 GB you should see 500 reducers
being spawned (though this is excessive in a lot of cases).

This document talks about parallelism:
http://pig.apache.org/docs/r0.10.0/perf.html#parallel

Setting the right number of reducers (PARALLEL or set default_parallel)
depends on what you are doing. If the reducer is CPU-intensive (maybe
a complex UDF running on the reduce side), you would probably spawn more
reducers. Otherwise (in most cases), the suggestion in the doc (1 GB per
reducer) works well for regular aggregations (SUM, COUNT, ...).
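To be concrete, the two knobs look like this (the numbers and names are just illustrations, not recommendations):

set default_parallel 50;                         -- script-wide default for all reduce stages
logs   = LOAD '/data/logs' AS (user:chararray, bytes:long);
byUser = GROUP logs BY user;                     -- picks up the default of 50
usage  = FOREACH byUser GENERATE group, SUM(logs.bytes) AS total;
sorted = ORDER usage BY total DESC PARALLEL 20;  -- per-statement PARALLEL overrides the default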


   1. Take a look at Reduce Shuffle Bytes for the job on the JobTracker.
   2. Re-run the job with default_parallel set to 1 reducer per 1 GB
   of reduce shuffle bytes and see if it performs well.
   3. If not, adjust it according to your reducer heap size: the more
   heap, the less data is spilled to disk.

There are a few more properties on the reduce side (buffer size etc.), but
those probably aren't required to start with.
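As a worked example of step 2 (the numbers are invented): if the JobTracker shows roughly 40 GB of Reduce Shuffle Bytes, start around 40 reducers and adjust from there:

-- ~40 GB shuffled / ~1 GB per reducer => start with ~40
set default_parallel 40;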

Thanks,

Prashant




On Fri, Jun 1, 2012 at 11:49 AM, Jonathan Coveney <jc...@gmail.com> wrote:

> Pankaj,
>
> What version of pig are you using? In later versions of pig, it should have
> some logic around automatically setting parallelisms (though sometimes
> these heuristics will be wrong).
>
> There are also some operations which will force you to use 1 reducer. It
> depends on what your script is doing.

Re: Number of reduce tasks

Posted by Pankaj Gupta <pa...@brightroll.com>.
I am using Pig version 0.9.1.

On Jun 1, 2012, at 11:49 AM, Jonathan Coveney wrote:

> Pankaj,
> 
> What version of pig are you using? In later versions of pig, it should have
> some logic around automatically setting parallelisms (though sometimes
> these heuristics will be wrong).
> 
> There are also some operations which will force you to use 1 reducer. It
> depends on what your script is doing.

Re: Number of reduce tasks

Posted by Jonathan Coveney <jc...@gmail.com>.
Pankaj,

What version of Pig are you using? Later versions of Pig have some logic
for automatically setting parallelism (though sometimes these heuristics
will be wrong).
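If I remember right, that automatic estimate is driven by a couple of properties (roughly 1 GB of input per reducer, with a cap on the total), which you can override in the script, e.g.:

set pig.exec.reducers.bytes.per.reducer 2000000000;  -- aim for ~2 GB of input per reducer
set pig.exec.reducers.max 200;                       -- cap the auto-estimated reducer count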

There are also some operations which will force you to use 1 reducer. It
depends on what your script is doing.

2012/6/1 Pankaj Gupta <pa...@brightroll.com>

> Hi,
>
> I just realized that one of my large scale pig jobs that has 100K map jobs
> actually only has one reduce task. Reading the documentation I see that the
> number of reduce tasks is defined by the PARALLEL clause whose default
> value is 1. I have a few questions around this:
>
> # Why is the default value of reduce tasks 1?
> # (Related to first question) Why aren't reduce tasks parallelized
> automatically in Pig?
> # How do I choose a good value of reduce tasks for my pig jobs?
>
> Thanks in Advance,
> Pankaj