Posted to dev@pig.apache.org by "Laukik Chitnis (JIRA)" <ji...@apache.org> on 2011/03/04 18:23:37 UTC

[jira] Created: (PIG-1883) Pig's progress estimation should account for parallel job executions

Pig's progress estimation should account for parallel job executions
--------------------------------------------------------------------

                 Key: PIG-1883
                 URL: https://issues.apache.org/jira/browse/PIG-1883
             Project: Pig
          Issue Type: Improvement
            Reporter: Laukik Chitnis
            Assignee: Laukik Chitnis
             Fix For: 0.9.0


Currently, Pig's progress estimation is based on the percentage of jobs completed out of the total number of MR jobs. However, since the MR operators are arranged in a DAG (and hence more than 1 job might be submitted for execution in parallel), the progress estimation can be improved by considering the number of jobs in the critical path, instead of just the total number of jobs.
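
To illustrate the difference between the total number of jobs and the number of jobs on the critical path, here is a minimal sketch. It is not taken from Pig or from any patch on this issue: the class name CriticalPathSketch, the method criticalPathLength, and the job names are all hypothetical, and the plan is assumed to be acyclic and given as a simple map from each job to the jobs that depend on it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Pig: longest chain of dependent jobs in a DAG. */
final class CriticalPathSketch {

    /**
     * @param successors job name -> jobs that depend on it (assumed acyclic);
     *                   a job with no dependents maps to an empty list
     * @return number of jobs on the longest dependency chain (0 for an empty plan)
     */
    static int criticalPathLength(Map<String, List<String>> successors) {
        Map<String, Integer> memo = new HashMap<>();
        int longest = 0;
        for (String job : successors.keySet()) {
            longest = Math.max(longest, depth(job, successors, memo));
        }
        return longest;
    }

    // Number of jobs on the longest chain starting at `job`; memoized so
    // each job is expanded only once.
    private static int depth(String job, Map<String, List<String>> successors,
                             Map<String, Integer> memo) {
        Integer cached = memo.get(job);
        if (cached != null) {
            return cached;
        }
        int best = 0;
        for (String next : successors.getOrDefault(job, List.of())) {
            best = Math.max(best, depth(next, successors, memo));
        }
        memo.put(job, best + 1);
        return best + 1;
    }

    public static void main(String[] args) {
        // Two independent loads feed a join, which feeds a final sort.
        Map<String, List<String>> plan = Map.of(
            "loadA", List.of("join"),
            "loadB", List.of("join"),
            "join", List.of("sort"),
            "sort", List.of());
        System.out.println("total jobs    = " + plan.size());               // 4
        System.out.println("critical path = " + criticalPathLength(plan));  // 3
    }
}
```

In this made-up plan, loadA and loadB can run in parallel, so the plan has 4 jobs but only 3 sequential steps.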

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-1883) Pig's progress estimation should account for parallel job executions

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1883:
-------------------------------

    Release Note: 
The critical-path-based progress estimation is used to log progress when the property pig.paratimer is set to "true". If it is set to "both", both the old and the new progress indicator algorithms are used.


The paratimer is based on ideas proposed in http://www.cs.washington.edu/homes/kmorton/camera-ready.pdf .
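
As a rough sketch of how the three settings described above could be told apart, the following reads pig.paratimer from a java.util.Properties object. Only the property name and the values "true" and "both" come from the release note; the class ParaTimerModeSketch, the Mode enum, the case-insensitive matching, and the fallback to the old indicator for any other value are assumptions made for illustration, not Pig's actual code.

```java
import java.util.Properties;

/** Hypothetical sketch, not Pig's implementation. */
final class ParaTimerModeSketch {

    enum Mode { OLD_ONLY, NEW_ONLY, BOTH }

    /** Maps the value of pig.paratimer to a progress-reporting mode. */
    static Mode fromProperties(Properties props) {
        // Assumption: an unset or unrecognized value keeps the old indicator.
        String value = props.getProperty("pig.paratimer", "false").trim();
        if (value.equalsIgnoreCase("true")) {
            return Mode.NEW_ONLY;
        }
        if (value.equalsIgnoreCase("both")) {
            return Mode.BOTH;
        }
        return Mode.OLD_ONLY;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("pig.paratimer", "both");
        System.out.println(fromProperties(props));
    }
}
```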

> Pig's progress estimation should account for parallel job executions
> --------------------------------------------------------------------
>
>                 Key: PIG-1883
>                 URL: https://issues.apache.org/jira/browse/PIG-1883
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Laukik Chitnis
>            Assignee: Laukik Chitnis
>         Attachments: PIG-1883-2.patch, PIG-1883-3.patch, PIG-1883.4.patch
>
>
> Currently, Pig's progress estimation is based on the percentage of jobs completed out of the total number of MR jobs. However, since the MR operators are arranged in a DAG (and hence more than 1 job might be submitted for execution in parallel), the progress estimation can be improved by considering the number of jobs in the critical path, instead of just the total number of jobs.


        

[jira] [Updated] (PIG-1883) Pig's progress estimation should account for parallel job executions

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1883:
--------------------------------

    Fix Version/s:     (was: 0.9.0)

> Pig's progress estimation should account for parallel job executions
> --------------------------------------------------------------------
>
>                 Key: PIG-1883
>                 URL: https://issues.apache.org/jira/browse/PIG-1883
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Laukik Chitnis
>            Assignee: Laukik Chitnis
>
> Currently, Pig's progress estimation is based on the percentage of jobs completed out of the total number of MR jobs. However, since the MR operators are arranged in a DAG (and hence more than 1 job might be submitted for execution in parallel), the progress estimation can be improved by considering the number of jobs in the critical path, instead of just the total number of jobs.


[jira] [Commented] (PIG-1883) Pig's progress estimation should account for parallel job executions

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027863#comment-13027863 ] 

jiraposter@reviews.apache.org commented on PIG-1883:
----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/683/#review631
-----------------------------------------------------------


This doesn't lend itself well to automated testing.  Any thoughts on how to test how the new progress indicator does versus the existing one?  Have you run any initial tests to measure this?


trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/Launcher.java
<https://reviews.apache.org/r/683/#comment1275>

    I don't understand the logic here.  Why is it 0% done if ANY job is waiting, etc.?  Some of the jobs may be done and some partially done and some not even started.



trunk/src/org/apache/pig/impl/plan/OperatorPlan.java
<https://reviews.apache.org/r/683/#comment1276>

    This code shouldn't be in OperatorPlan.  We want to keep that as clean as possible.  Instead you should build a new Walker type that can do this calculation.



trunk/src/org/apache/pig/impl/plan/OperatorPlan.java
<https://reviews.apache.org/r/683/#comment1277>

    You have tabs here and some other spots.  Please make sure you use 4 spaces rather than tabs.



trunk/src/org/apache/pig/tools/pigstats/PigProgressNotificationListener.java
<https://reviews.apache.org/r/683/#comment1278>

    Why is a separate method needed here?  When users turn on the new progress indicator I assume they don't get the old one too.  Given that the interfaces are the same it seems one method should suffice here.



trunk/src/org/apache/pig/tools/pigstats/ScriptState.java
<https://reviews.apache.org/r/683/#comment1279>

    It looks like this comment got attached to the run method.  Also, the method has only one parameter, but two are listed in the comment.


- Alan


On 2011-05-02 20:41:04, Alan Gates wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/683/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-05-02 20:41:04)
bq.  
bq.  
bq.  Review request for pig.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  This is Laukik's patch for PIG-1883
bq.  
bq.  
bq.  This addresses bug PIG-1883.
bq.      https://issues.apache.org/jira/browse/PIG-1883
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/src/org/apache/pig/Main.java 1097661 
bq.    trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/Launcher.java 1097661 
bq.    trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1097661 
bq.    trunk/src/org/apache/pig/impl/plan/OperatorPlan.java 1097661 
bq.    trunk/src/org/apache/pig/scripting/SyncProgressNotificationAdaptor.java 1097661 
bq.    trunk/src/org/apache/pig/tools/pigstats/PigProgressNotificationListener.java 1097661 
bq.    trunk/src/org/apache/pig/tools/pigstats/ScriptState.java 1097661 
bq.    trunk/test/org/apache/pig/test/TestOperatorPlan.java 1097661 
bq.  
bq.  Diff: https://reviews.apache.org/r/683/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Alan
bq.  
bq.



> Pig's progress estimation should account for parallel job executions
> --------------------------------------------------------------------
>
>                 Key: PIG-1883
>                 URL: https://issues.apache.org/jira/browse/PIG-1883
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Laukik Chitnis
>            Assignee: Laukik Chitnis
>         Attachments: PIG-1883-2.patch
>
>
> Currently, Pig's progress estimation is based on the percentage of jobs completed out of the total number of MR jobs. However, since the MR operators are arranged in a DAG (and hence more than 1 job might be submitted for execution in parallel), the progress estimation can be improved by considering the number of jobs in the critical path, instead of just the total number of jobs.


[jira] [Updated] (PIG-1883) Pig's progress estimation should account for parallel job executions

Posted by "Laukik Chitnis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laukik Chitnis updated PIG-1883:
--------------------------------

    Attachment: PIG-1883-2.patch

This patch adds methods to compute the critical path in an operator plan based on the number of nodes. It then uses the minimum progress along that many jobs to calculate the total progress. A new command-line option is also added to enable this for progress reporting in place of the old estimation technique, which is based purely on the total number of jobs.

> Pig's progress estimation should account for parallel job executions
> --------------------------------------------------------------------
>
>                 Key: PIG-1883
>                 URL: https://issues.apache.org/jira/browse/PIG-1883
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Laukik Chitnis
>            Assignee: Laukik Chitnis
>         Attachments: PIG-1883-2.patch
>
>
> Currently, Pig's progress estimation is based on the percentage of jobs completed out of the total number of MR jobs. However, since the MR operators are arranged in a DAG (and hence more than 1 job might be submitted for execution in parallel), the progress estimation can be improved by considering the number of jobs in the critical path, instead of just the total number of jobs.


[jira] [Commented] (PIG-1883) Pig's progress estimation should account for parallel job executions

Posted by "Laukik Chitnis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029677#comment-13029677 ] 

Laukik Chitnis commented on PIG-1883:
-------------------------------------

> This doesn't lend itself well to automated testing. Any thoughts on how to test how the new progress indicator does versus the existing one? Have you run any initial tests to measure this?

That's correct; it is difficult to test in an automated fashion. One metric for the performance of the progress estimator would be similar to the one used in the paratimer paper, maybe an RMS of the difference from linear time (assuming "ideal" estimates run from 0.0 to 1.0 between start time and finish time).
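
A minimal sketch of such a metric follows. It assumes the "ideal" progress at time t is simply t divided by the total runtime, and takes the RMS of the difference between each reported value and that ideal. The class and method names are hypothetical, and the sample values in main are made up for illustration, not measurements.

```java
/** Hypothetical sketch of an RMS-error metric for a progress indicator. */
final class ProgressErrorSketch {

    /**
     * @param timesSec sample times in seconds since the script started
     * @param reported progress reported at each sample time, in [0.0, 1.0]
     * @param totalSec total runtime of the script in seconds
     * @return root-mean-square difference from the ideal linear progress t / totalSec
     */
    static double rmsError(double[] timesSec, double[] reported, double totalSec) {
        if (timesSec.length == 0 || timesSec.length != reported.length || totalSec <= 0) {
            throw new IllegalArgumentException("need matching, non-empty samples and a positive runtime");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < timesSec.length; i++) {
            double ideal = timesSec[i] / totalSec;
            double diff = reported[i] - ideal;
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares / timesSec.length);
    }

    public static void main(String[] args) {
        // Made-up samples: an indicator that jumps ahead early, then crawls.
        double[] times = {0, 25, 50, 75, 100};
        double[] jumpy = {0.0, 0.5, 0.6, 0.8, 1.0};
        System.out.println("RMS error = " + rmsError(times, jumpy, 100));
    }
}
```

A lower value means the indicator tracked elapsed time more closely; a perfectly linear indicator scores 0.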

I manually tested it with various Pig scripts that generated different kinds of physical plans. In most cases, I observed that the progress report was the same for both the old and new methods. One simple case where the new method does better is when a very small and a very large job are executed in parallel. In this case, the old estimate shoots up to 50% very early, and then moves slowly to 100%, whereas the new estimate grows more gradually from 0 to 100% as the bigger job executes. I haven't yet automated capturing these results and analyzing the metric.


> I don't understand the logic here. Why is it 0% done if ANY job is waiting, etc.? Some of the jobs may be done and some partially done and some not even started.

The 0% applies only to the set of jobs that are currently executing in parallel. Jobs that finished in previous rounds of parallel execution contribute 1/#rounds to the total estimate per round of execution, i.e. per JobControl object (so #rounds is the length of the critical path along the operator plan tree).
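
The calculation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the patch: the class and method names are hypothetical, the new estimate is taken to be (completed rounds + minimum progress among the jobs of the current round) / #rounds, and the old estimate is assumed to average the progress of all jobs in the plan.

```java
/** Hypothetical sketch of the two progress estimates; not the patch's code. */
final class ProgressEstimateSketch {

    /** Assumed old estimate: average progress over all jobs in the plan. */
    static double oldEstimate(double[] progressOfAllJobs) {
        if (progressOfAllJobs.length == 0) {
            throw new IllegalArgumentException("plan has no jobs");
        }
        double sum = 0.0;
        for (double p : progressOfAllJobs) {
            sum += p;
        }
        return sum / progressOfAllJobs.length;
    }

    /**
     * Assumed new estimate: each finished round contributes 1/totalRounds, and
     * the current round contributes the minimum progress of its jobs, so a
     * waiting job (progress 0.0) holds the current round at 0%.
     */
    static double newEstimate(int completedRounds, double[] currentRound, int totalRounds) {
        if (totalRounds <= 0) {
            throw new IllegalArgumentException("totalRounds must be positive");
        }
        if (completedRounds >= totalRounds) {
            return 1.0;
        }
        double min = currentRound.length == 0 ? 0.0 : 1.0;
        for (double p : currentRound) {
            min = Math.min(min, p);
        }
        return (completedRounds + min) / totalRounds;
    }

    public static void main(String[] args) {
        // One round with a small job already finished and a large job 10% done.
        double[] jobs = {1.0, 0.1};
        System.out.printf("old = %.2f, new = %.2f%n",
                oldEstimate(jobs), newEstimate(0, jobs, 1));
    }
}
```

With a finished small job and a large job at 10%, the old estimate is already about 55% while the new one stays at about 10%.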


> This code shouldn't be in OperatorPlan. We want to keep that as clean as possible. Instead you should build a new Walker type that can do this calculation.

Ah, OK; will do that.


> You have tabs here and some other spots. Please make sure you use 4 spaces rather than tabs.

I need to change my editor's auto-indentation formatting :)

> Why is a separate method needed here? When users turn on the new progress indicator I assume they don't get the old one too. Given that the interfaces are the same it seems one method should suffice here.

Initially, I put in a separate method assuming that users could have listeners for either of them. For example, we could use these separate listeners for the performance comparison between the old and new methods. Later on, however, when I added the command-line option to choose, I made the old and new methods an either-or choice. Perhaps I should make it possible to have both indicators turned on at the same time?

> It looks like this comment got attached to the run method. Also, the method has only one parameter, but two are listed in the comment.

I will fix this.


> Pig's progress estimation should account for parallel job executions
> --------------------------------------------------------------------
>
>                 Key: PIG-1883
>                 URL: https://issues.apache.org/jira/browse/PIG-1883
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Laukik Chitnis
>            Assignee: Laukik Chitnis
>         Attachments: PIG-1883-2.patch
>
>
> Currently, Pig's progress estimation is based on the percentage of jobs completed out of the total number of MR jobs. However, since the MR operators are arranged in a DAG (and hence more than 1 job might be submitted for execution in parallel), the progress estimation can be improved by considering the number of jobs in the critical path, instead of just the total number of jobs.


[jira] [Updated] (PIG-1883) Pig's progress estimation should account for parallel job executions

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1883:
-------------------------------

    Attachment: PIG-1883.4.patch

I have made some fixes to the way the progress logging is handled when both old and new timers are enabled.
I have also removed the -s command-line option that controlled this behavior and replaced it with a new property, because the -s option will be deprecated in the future.

> Pig's progress estimation should account for parallel job executions
> --------------------------------------------------------------------
>
>                 Key: PIG-1883
>                 URL: https://issues.apache.org/jira/browse/PIG-1883
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Laukik Chitnis
>            Assignee: Laukik Chitnis
>         Attachments: PIG-1883-2.patch, PIG-1883-3.patch, PIG-1883.4.patch
>
>
> Currently, Pig's progress estimation is based on the percentage of jobs completed out of the total number of MR jobs. However, since the MR operators are arranged in a DAG (and hence more than 1 job might be submitted for execution in parallel), the progress estimation can be improved by considering the number of jobs in the critical path, instead of just the total number of jobs.


        

[jira] [Updated] (PIG-1883) Pig's progress estimation should account for parallel job executions

Posted by "Laukik Chitnis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laukik Chitnis updated PIG-1883:
--------------------------------

    Attachment: PIG-1883-3.patch

I have reintroduced a test file which was probably deleted when the test cases related to the old logical plan were ported over. However, the old OperatorPlan is still in use via the MROperPlan.

Also, this org.apache.pig.impl.plan.OperatorPlan is an abstract class (newplan.OperatorPlan is an interface) and already has methods such as size(); so I have retained the getHeight() method in this class.

I have also switched the option that chooses the new progress indication technique to take an optional parameter with which one can have both estimates (this was also useful for comparing performance).


> Pig's progress estimation should account for parallel job executions
> --------------------------------------------------------------------
>
>                 Key: PIG-1883
>                 URL: https://issues.apache.org/jira/browse/PIG-1883
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Laukik Chitnis
>            Assignee: Laukik Chitnis
>         Attachments: PIG-1883-2.patch, PIG-1883-3.patch
>
>
> Currently, Pig's progress estimation is based on the percentage of jobs completed out of the total number of MR jobs. However, since the MR operators are arranged in a DAG (and hence more than 1 job might be submitted for execution in parallel), the progress estimation can be improved by considering the number of jobs in the critical path, instead of just the total number of jobs.


[jira] [Commented] (PIG-1883) Pig's progress estimation should account for parallel job executions

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027830#comment-13027830 ] 

jiraposter@reviews.apache.org commented on PIG-1883:
----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/683/
-----------------------------------------------------------

Review request for pig.


Summary
-------

This is Laukik's patch for PIG-1883


This addresses bug PIG-1883.
    https://issues.apache.org/jira/browse/PIG-1883


Diffs
-----

  trunk/src/org/apache/pig/Main.java 1097661 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/Launcher.java 1097661 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1097661 
  trunk/src/org/apache/pig/impl/plan/OperatorPlan.java 1097661 
  trunk/src/org/apache/pig/scripting/SyncProgressNotificationAdaptor.java 1097661 
  trunk/src/org/apache/pig/tools/pigstats/PigProgressNotificationListener.java 1097661 
  trunk/src/org/apache/pig/tools/pigstats/ScriptState.java 1097661 
  trunk/test/org/apache/pig/test/TestOperatorPlan.java 1097661 

Diff: https://reviews.apache.org/r/683/diff


Testing
-------


Thanks,

Alan



> Pig's progress estimation should account for parallel job executions
> --------------------------------------------------------------------
>
>                 Key: PIG-1883
>                 URL: https://issues.apache.org/jira/browse/PIG-1883
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Laukik Chitnis
>            Assignee: Laukik Chitnis
>         Attachments: PIG-1883-2.patch
>
>
> Currently, Pig's progress estimation is based on the percentage of jobs completed out of the total number of MR jobs. However, since the MR operators are arranged in a DAG (and hence more than 1 job might be submitted for execution in parallel), the progress estimation can be improved by considering the number of jobs in the critical path, instead of just the total number of jobs.


[jira] [Commented] (PIG-1883) Pig's progress estimation should account for parallel job executions

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112265#comment-13112265 ] 

Daniel Dai commented on PIG-1883:
---------------------------------

+1 for Thejas's additional change in MapReduceLauncher.java

> Pig's progress estimation should account for parallel job executions
> --------------------------------------------------------------------
>
>                 Key: PIG-1883
>                 URL: https://issues.apache.org/jira/browse/PIG-1883
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Laukik Chitnis
>            Assignee: Laukik Chitnis
>         Attachments: PIG-1883-2.patch, PIG-1883-3.patch, PIG-1883.4.patch
>
>
> Currently, Pig's progress estimation is based on the percentage of jobs completed out of the total number of MR jobs. However, since the MR operators are arranged in a DAG (and hence more than 1 job might be submitted for execution in parallel), the progress estimation can be improved by considering the number of jobs in the critical path, instead of just the total number of jobs.
