Posted to dev@pig.apache.org by Prashant Kommireddi <pr...@gmail.com> on 2012/12/04 07:32:47 UTC

Reducer estimation

I have been thinking about using Pig statistics to estimate the number
of reducers. Though the current heuristic approach works fine, it
requires an admin or the programmer to understand what the best number
would be for the job. We are seeing a large number of jobs
over-utilizing resources, and there is obviously no single default that
works well for all kinds of Pig scripts; a few non-technical users find
it difficult to estimate the best number for their jobs. It would be
great if we could use stats from previous runs of a job to set the
number of reducers for future runs.

This would be a nice feature for jobs running in production, where the
job or the dataset size does not fluctuate much.


   1. Set a config param in the script:
      - set script.unique.id prashant.1111222111.demo_script
   2. If the above is not set, we fall back on the current implementation.
   3. If the above is set:
      - At the end of the job, persist PigStats (namely Reduce Shuffle
      Bytes) to the FS (HDFS, local, S3, ...). The file would be named
      "${script.unique.id}_YYYYMMDDHHmmss"; let's call this stats_on_hdfs.
      - Read stats_on_hdfs for previous runs and, based on the number of
      such stats to read (controlled by
      script.reducer.estimation.num.stats), calculate an average number
      of reducers for the current run.
      - If no stats_on_hdfs exists, we fall back on the current
      implementation.

It would be advisable not to retain these stats for too long, and Pig
can make sure to clear up old files that are no longer required.
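
To make the steps above concrete, here is a minimal sketch of the
read-and-estimate side, assuming each prior run left a one-line file with
its reduce shuffle bytes under an illustrative /pig/stats directory. None
of the class or method names below exist in Pig today; only the property
names come from the proposal:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical helper: averages the reduce shuffle bytes persisted by
    // previous runs of the same script and turns that into a reducer count.
    public class StatsBasedReducerEstimate {

        public static int estimate(Configuration conf, long bytesPerReducer,
                                    int fallback) throws Exception {
            String scriptId = conf.get("script.unique.id");
            if (scriptId == null) {
                return fallback; // step 2: no id set, use the current implementation
            }
            int numStats = conf.getInt("script.reducer.estimation.num.stats", 3);
            FileSystem fs = FileSystem.get(conf);
            Path statsDir = new Path("/pig/stats"); // assumed location of stats_on_hdfs
            if (!fs.exists(statsDir)) {
                return fallback; // no stats_on_hdfs yet
            }

            // Collect files named "<script.unique.id>_YYYYMMDDHHmmss".
            List<String> runs = new ArrayList<String>();
            for (FileStatus st : fs.listStatus(statsDir)) {
                if (st.getPath().getName().startsWith(scriptId + "_")) {
                    runs.add(st.getPath().getName());
                }
            }
            if (runs.isEmpty()) {
                return fallback;
            }

            // The timestamp suffix sorts lexicographically, newest runs last.
            Collections.sort(runs);
            List<String> newest =
                runs.subList(Math.max(0, runs.size() - numStats), runs.size());

            long totalShuffleBytes = 0;
            for (String name : newest) {
                BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(new Path(statsDir, name))));
                try {
                    // Assumed format: one line holding that run's reduce shuffle bytes.
                    totalShuffleBytes += Long.parseLong(in.readLine().trim());
                } finally {
                    in.close();
                }
            }
            long avgShuffleBytes = totalShuffleBytes / newest.size();
            return (int) Math.max(1, avgShuffleBytes / bytesPerReducer);
        }
    }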

What do you guys think?

-Prashant

Re: Reducer estimation

Posted by Prashant Kommireddi <pr...@gmail.com>.
You could store stats for the last X runs and take an average for the next run.
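
For example, the write side could drop a fresh timestamped file per run
instead of updating any shared file, which sidesteps the concern quoted
below. A sketch under the proposal's naming scheme (the /pig/stats
location is invented, and this is not an existing Pig hook):

    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical write side: at the end of a run, write a new
    // "<script.unique.id>_YYYYMMDDHHmmss" file, so nothing is ever updated
    // in place and concurrent runs don't clobber each other.
    public class StatsWriter {
        public static void persist(Configuration conf, long reduceShuffleBytes)
                throws Exception {
            String scriptId = conf.get("script.unique.id");
            if (scriptId == null) {
                return; // no stable script identity, nothing to persist
            }
            String ts = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
            FileSystem fs = FileSystem.get(conf);
            Path statsFile = new Path("/pig/stats", scriptId + "_" + ts);
            FSDataOutputStream out = fs.create(statsFile, false); // fail, don't overwrite
            try {
                out.writeBytes(reduceShuffleBytes + "\n");
            } finally {
                out.close();
            }
        }
    }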

On Dec 6, 2012, at 8:09 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> How would flat files work? The data needs to be updated by every Pig run.

Re: Reducer estimation

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
How would flat files work? The data needs to be updated by every Pig run.


Re: Reducer estimation

Posted by Prashant Kommireddi <pr...@gmail.com>.
Awesome! It would be good to have a flat-file based impl as there will
probably be a lot of Pig users who don't have an HBase instance set up
for stats persistence. Let me know if I can help in any way.

Is there a timeframe you are looking at for open-sourcing this?

Re: Reducer estimation

Posted by Bill Graham <bi...@gmail.com>.
We do basically what you're describing. Each of our scripts has a
logical name which defines the workflow. For each job in the workflow we
persist the job stats, counters and conf in HBase via an implementation
of PigProgressNotificationListener. We can then correlate jobs in a run
of the workflow based on the pig.script.start.time and
pig.job.start.time properties. We use the logical plan script signature
to determine whether the script version has changed.
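
For illustration only, one way such rows might be keyed so that all jobs
of a run group together (the property names are the ones above; the key
layout itself is a guess, not their actual schema):

    import org.apache.hadoop.conf.Configuration;

    // Illustrative HBase row key: workflow name, then the run, then the job.
    public class WorkflowRowKey {
        public static String of(String workflowName, Configuration jobConf) {
            // pig.script.start.time is shared by every job in a run;
            // pig.job.start.time then distinguishes jobs within the run.
            return workflowName + "/"
                    + jobConf.get("pig.script.start.time") + "/"
                    + jobConf.get("pig.job.start.time");
        }
    }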

During job execution we query the service in an impl of
PigReducerEstimator for matching workflows.

One simple estimation algo is to multiply Pig's default estimated
reducers (which are based on mapper input bytes) by the ratio of mapper
output bytes to mapper input bytes from previous runs. The same could
also be done with slot time, but we haven't tried that yet.
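
As a toy example of that ratio (the numbers are invented;
pig.exec.reducers.bytes.per.reducer is the real property behind Pig's
default estimate):

    public class RatioExample {
        public static void main(String[] args) {
            // Suppose previous runs show mappers reading 10 GB but emitting
            // 40 GB (say, a FLATTEN blows the data up).
            long mapInputBytes = 10L << 30;
            long mapOutputBytes = 40L << 30;

            // Pig's default estimate divides mapper input bytes by
            // pig.exec.reducers.bytes.per.reducer (1 GB by default): ~10 reducers.
            int defaultEstimate = 10;

            // Scaling by output/input sizes the job for the bytes that actually
            // reach the shuffle: 10 * (40 / 10) = 40 reducers.
            int adjusted = (int) Math.ceil(
                defaultEstimate * ((double) mapOutputBytes / mapInputBytes));
            System.out.println(adjusted); // prints 40
        }
    }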

We plan to open source parts of this at some point.