Posted to user@crunch.apache.org by Ravi Kolluri <ra...@nuna.com> on 2015/10/12 23:31:00 UTC

crunch planner parameters

Hello Crunch users,

I have a question about what parameters go into the Crunch planner.

Let's say I have a crunch job with a set of input tables, and a fixed set of
calls to parallelDo and groupBy operations. Does the crunch execution plan
stay fixed independent of the size distribution of the inputs?

thanks,
Ravi

-- 
*DISCLAIMER:* The contents of this email, including any attachments, may 
contain information that is confidential, proprietary in nature, protected 
health information (PHI), or otherwise protected by law from disclosure, 
and is solely for the use of the intended recipient(s). If you are not the 
intended recipient, you are hereby notified that any use, disclosure or 
copying of this email, including any attachments, is unauthorized and 
strictly prohibited. If you have received this email in error, please 
notify the sender of this email. Please delete this and all copies of this 
email from your system. Any opinions either expressed or implied in this 
email and all attachments, are those of its author only, and do not 
necessarily reflect those of Nuna Health, Inc.

Re: crunch planner parameters

Posted by Josh Wills <jo...@gmail.com>.
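It is the latter approach, yes: the calculation is made at the very
beginning of the pipeline, after reading the sources. The former (doing it
at the start of each MR job) would be better.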

J

On Mon, Oct 12, 2015 at 3:56 PM, Everett Anderson <ev...@nuna.com> wrote:

> Hey Josh,
>
> Somewhat related question -- when computing the number of reducers, is the
> planner doing that at the start of each MR job, estimating the size of the
> map output and then calculating number of reducers based on the input data
> size going into the job?
>
> Or does it make the calculation at the very beginning of the pipeline
> after reading the sources?
>
> The former might be more accurate, with the latter suffering a compounding
> effect from poor estimation at any step.
>
>
>
> On Mon, Oct 12, 2015 at 3:46 PM, Josh Wills <jo...@gmail.com> wrote:
>
>> No, just the number of tasks involved in each job. The structure should
>> remain the same.
>>
>> J
>>
>> On Mon, Oct 12, 2015 at 3:44 PM, Ravi Kolluri <ra...@nuna.com> wrote:
>>
>>>
>>> Thanks Josh!
>>>
>>> My question was more about how the planner organizes the map-reduce
>>> computation. Would the crunch job composition change based on input size?
>>>
>>> thanks,
>>> Ravi
>>>
>>>
>>> On Mon, Oct 12, 2015 at 3:38 PM, Josh Wills <jo...@gmail.com>
>>> wrote:
>>>
>>>> Hey Ravi,
>>>>
>>>> The number of reducers used in the various stages of the MR job can
>>>> change if you don't hard-code them using groupByKey(int numReducers) or
>>>> groupByKey(GroupingOptions) (or the equivalent settings via the
>>>> JoinStrategy classes for joins). The planner will try to estimate the
>>>> number of bytes to be processed and aims to process 1GB of data per
>>>> reducer. If you do hard-code the number of reduce tasks, the planner will
>>>> respect your wishes no matter what the input size is.
>>>>
>>>> Josh
>>>>
>>>> On Mon, Oct 12, 2015 at 2:31 PM, Ravi Kolluri <ra...@nuna.com> wrote:
>>>>
>>>>> Hello Crunch users,
>>>>>
>>>>> I have a question about what parameters go into the Crunch planner.
>>>>>
>>>>> Let's say I have a crunch job with a set of input tables, and a fixed
>>>>> set of calls to parallelDo and groupBy operations. Does the crunch
>>>>> execution plan stay fixed independent of the size distribution of the
>>>>> inputs?
>>>>>
>>>>> thanks,
>>>>> Ravi

Re: crunch planner parameters

Posted by Everett Anderson <ev...@nuna.com>.
Hey Josh,

Somewhat related question -- when computing the number of reducers, is the
planner doing that at the start of each MR job, estimating the size of the
map output and then calculating number of reducers based on the input data
size going into the job?

Or does it make the calculation at the very beginning of the pipeline after
reading the sources?

The former might be more accurate, with the latter suffering a compounding
effect from poor estimation at any step.
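
To make the compounding concern concrete, here is a minimal sketch in plain
Java. It does not use Crunch or reproduce its planner. The class name, the
helper methods, and the per-stage scale factors are all hypothetical, and
the 1 GB per reducer target is borrowed from the figure mentioned elsewhere
in this thread. It compares reducer counts projected once from the source
size with counts computed from the (assumed) real input size of each job.

```java
/**
 * Illustrative sketch only, not Crunch planner code.
 * All names and numbers here are hypothetical.
 */
final class CompoundingEstimate {
    // Assumed target: roughly 1 GB of data per reducer.
    static final long BYTES_PER_REDUCER = 1L << 30;

    /** Reducers for a given input size, rounded up, never fewer than one. */
    static long reducersFor(double bytes) {
        return Math.max(1L, (long) Math.ceil(bytes / BYTES_PER_REDUCER));
    }

    /** Size of the data entering job number `stage`, given per-stage scale factors. */
    static double project(double sourceBytes, double[] scaleFactors, int stage) {
        double size = sourceBytes;
        for (int i = 0; i < stage; i++) {
            size *= scaleFactors[i];
        }
        return size;
    }

    public static void main(String[] args) {
        double source = 100.0 * BYTES_PER_REDUCER;   // hypothetical 100 GB of source data
        double[] guessed = {1.0, 1.0, 1.0};          // up-front guess: no stage shrinks the data
        double[] actual = {0.5, 0.5, 0.5};           // assumed reality: each stage halves it

        for (int stage = 0; stage < 3; stage++) {
            long upFront = reducersFor(project(source, guessed, stage));
            long perJob = reducersFor(project(source, actual, stage));
            System.out.println("job " + stage + ": up-front=" + upFront + ", per-job=" + perJob);
        }
    }
}
```

With these made-up factors the two approaches agree for the first job and
drift further apart with every later job, which is the compounding effect
described above.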



On Mon, Oct 12, 2015 at 3:46 PM, Josh Wills <jo...@gmail.com> wrote:

> No, just the number of tasks involved in each job. The structure should
> remain the same.
>
> J
>
> On Mon, Oct 12, 2015 at 3:44 PM, Ravi Kolluri <ra...@nuna.com> wrote:
>
>>
>> Thanks Josh!
>>
>> My question was more about how the planner organizes the map-reduce
>> computation. Would the crunch job composition change based on input size?
>>
>> thanks,
>> Ravi
>>
>>
>> On Mon, Oct 12, 2015 at 3:38 PM, Josh Wills <jo...@gmail.com> wrote:
>>
>>> Hey Ravi,
>>>
>>> The number of reducers used in the various stages of the MR job can
>>> change if you don't hard-code them using groupByKey(int numReducers) or
>>> groupByKey(GroupingOptions) (or the equivalent settings via the
>>> JoinStrategy classes for joins). The planner will try to estimate the
>>> number of bytes to be processed and aims to process 1GB of data per
>>> reducer. If you do hard-code the number of reduce tasks, the planner will
>>> respect your wishes no matter what the input size is.
>>>
>>> Josh
>>>
>>> On Mon, Oct 12, 2015 at 2:31 PM, Ravi Kolluri <ra...@nuna.com> wrote:
>>>
>>>> Hello Crunch users,
>>>>
>>>> I have a question about what parameters go into the Crunch planner.
>>>>
>>>> Let's say I have a crunch job with a set of input tables, and a fixed
>>>> set of calls to parallelDo and groupBy operations. Does the crunch
>>>> execution plan stay fixed independent of the size distribution of the
>>>> inputs?
>>>>
>>>> thanks,
>>>> Ravi

Re: crunch planner parameters

Posted by Josh Wills <jo...@gmail.com>.
No, just the number of tasks involved in each job. The structure should
remain the same.

J

On Mon, Oct 12, 2015 at 3:44 PM, Ravi Kolluri <ra...@nuna.com> wrote:

>
> Thanks Josh!
>
> My question was more about how the planner organizes the map-reduce
> computation. Would the crunch job composition change based on input size?
>
> thanks,
> Ravi
>
>
> On Mon, Oct 12, 2015 at 3:38 PM, Josh Wills <jo...@gmail.com> wrote:
>
>> Hey Ravi,
>>
>> The number of reducers used in the various stages of the MR job can
>> change if you don't hard-code them using groupByKey(int numReducers) or
>> groupByKey(GroupingOptions) (or the equivalent settings via the
>> JoinStrategy classes for joins). The planner will try to estimate the
>> number of bytes to be processed and aims to process 1GB of data per
>> reducer. If you do hard-code the number of reduce tasks, the planner will
>> respect your wishes no matter what the input size is.
>>
>> Josh
>>
>> On Mon, Oct 12, 2015 at 2:31 PM, Ravi Kolluri <ra...@nuna.com> wrote:
>>
>>> Hello Crunch users,
>>>
>>> I have a question about what parameters go into the Crunch planner.
>>>
>>> Let's say I have a crunch job with a set of input tables, and a fixed set
>>> of calls to parallelDo and groupBy operations. Does the crunch execution
>>> plan stay fixed independent of the size distribution of the inputs?
>>>
>>> thanks,
>>> Ravi

Re: crunch planner parameters

Posted by Ravi Kolluri <ra...@nuna.com>.
Thanks Josh!

My question was more about how the planner organizes the map-reduce
computation. Would the crunch job composition change based on input size?

thanks,
Ravi


On Mon, Oct 12, 2015 at 3:38 PM, Josh Wills <jo...@gmail.com> wrote:

> Hey Ravi,
>
> The number of reducers used in the various stages of the MR job can change
> if you don't hard-code them using groupByKey(int numReducers) or
> groupByKey(GroupingOptions) (or the equivalent settings via the
> JoinStrategy classes for joins). The planner will try to estimate the
> number of bytes to be processed and aims to process 1GB of data per
> reducer. If you do hard-code the number of reduce tasks, the planner will
> respect your wishes no matter what the input size is.
>
> Josh
>
> On Mon, Oct 12, 2015 at 2:31 PM, Ravi Kolluri <ra...@nuna.com> wrote:
>
>> Hello Crunch users,
>>
>> I have a question about what parameters go into the Crunch planner.
>>
>> Let's say I have a crunch job with a set of input tables, and a fixed set
>> of calls to parallelDo and groupBy operations. Does the crunch execution
>> plan stay fixed independent of the size distribution of the inputs?
>>
>> thanks,
>> Ravi

Re: crunch planner parameters

Posted by Josh Wills <jo...@gmail.com>.
Hey Ravi,

The number of reducers used in the various stages of the MR job can change
if you don't hard-code them using groupByKey(int numReducers) or
groupByKey(GroupingOptions) (or the equivalent settings via the
JoinStrategy classes for joins). The planner will try to estimate the
number of bytes to be processed and aims to process 1GB of data per
reducer. If you do hard-code the number of reduce tasks, the planner will
respect your wishes no matter what the input size is.
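
As a rough illustration of the heuristic described above, the following
plain-Java sketch picks a reducer count from an estimated byte count, aiming
at about 1 GB per reducer, and lets a hard-coded value override the
estimate. It is not the planner's actual code: the class and method names
are hypothetical, and rounding up with a minimum of one reducer is an
assumption made for the example.

```java
/**
 * Illustrative sketch only, not Crunch planner code.
 * Class and method names are hypothetical.
 */
final class ReducerEstimate {
    // Target taken from the description above: about 1 GB per reducer.
    static final long BYTES_PER_REDUCER = 1024L * 1024L * 1024L;

    /** Reducers for an estimated size; rounds up and never returns less than 1 (assumed). */
    static int estimateReducers(long estimatedBytes) {
        if (estimatedBytes <= 0) {
            return 1;
        }
        long whole = estimatedBytes / BYTES_PER_REDUCER;
        long reducers = (estimatedBytes % BYTES_PER_REDUCER == 0) ? whole : whole + 1;
        return (int) Math.min(Integer.MAX_VALUE, reducers);
    }

    /** A hard-coded reducer count, when given, wins regardless of input size. */
    static int chooseReducers(long estimatedBytes, Integer hardCoded) {
        return hardCoded != null ? hardCoded : estimateReducers(estimatedBytes);
    }

    public static void main(String[] args) {
        long fiveAndAHalfGb = 5L * BYTES_PER_REDUCER + BYTES_PER_REDUCER / 2;
        // No hard-coded value: the count follows the estimated size.
        System.out.println(chooseReducers(fiveAndAHalfGb, null));
        // Hard-coded value: the estimate is ignored.
        System.out.println(chooseReducers(fiveAndAHalfGb, 20));
    }
}
```

In a real pipeline the hard-coded case corresponds to passing the count to
groupByKey(int numReducers) or to groupByKey(GroupingOptions), as mentioned
above.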

Josh

On Mon, Oct 12, 2015 at 2:31 PM, Ravi Kolluri <ra...@nuna.com> wrote:

> Hello Crunch users,
>
> I have a question about what parameters go into the Crunch planner.
>
> Let's say I have a crunch job with a set of input tables, and a fixed set
> of calls to parallelDo and groupBy operations. Does the crunch execution
> plan stay fixed independent of the size distribution of the inputs?
>
> thanks,
> Ravi