Posted to dev@pig.apache.org by "Bill Graham (Created) (JIRA)" <ji...@apache.org> on 2012/03/14 06:39:43 UTC

[jira] [Created] (PIG-2587) Compute LogicalPlan signature and store in job conf

Compute LogicalPlan signature and store in job conf
---------------------------------------------------

                 Key: PIG-2587
                 URL: https://issues.apache.org/jira/browse/PIG-2587
             Project: Pig
          Issue Type: Improvement
            Reporter: Bill Graham
            Assignee: Bill Graham


We'd like to be able to uniquely identify a re-executed script (possibly with different inputs/outputs) by creating a signature of the {{LogicalPlan}}. Here's the proposal:

# Add a new method {{LogicalPlan.getSignature()}} that returns a hash of its {{LogicalPlanPrinter}} output.
# In {{PigServer.execute()}} set the signature on the job conf after the LP is compiled, but before it's executed.

(1) would allow an implementation of {{PigProgressNotificationListener.setScriptPlan()}} to save the LP signature with the script metadata. On subsequent runs, (2) would allow an implementation of {{PigReducerEstimator}} (see PIG-2574) to retrieve the current LP signature and fetch the historical data for the script. It could then use the data from previous runs to better estimate the number of reducers.
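
A minimal sketch of the flow described above, not the code from the patch. It assumes the printed plan is available as a string and uses a plain {{Properties}} object in place of the job conf. The class name, the conf key {{example.logical.plan.signature}}, and the in-memory history map are all hypothetical stand-ins.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Illustrative sketch only: these members stand in for Pig's LogicalPlan,
// job conf, and reducer estimator. None of the names below are Pig's API.
final class PlanSignatureSketch {

    // Hypothetical conf key; the key used by the actual patch may differ.
    static final String SIGNATURE_KEY = "example.logical.plan.signature";

    // Step (1): derive a signature from the printed form of the plan.
    // String.hashCode() is specified by the Java API, so the same text
    // yields the same value on every JVM.
    static String signatureOf(String printedPlan) {
        return Integer.toString(printedPlan.hashCode());
    }

    // Step (2): store the signature in the job conf before execution.
    static void storeSignature(Properties jobConf, String printedPlan) {
        jobConf.setProperty(SIGNATURE_KEY, signatureOf(printedPlan));
    }

    // An estimator could read the signature back and look up the reducer
    // count recorded for an earlier run, falling back to a default.
    static int estimateReducers(Properties jobConf,
                                Map<String, Integer> history,
                                int defaultReducers) {
        String signature = jobConf.getProperty(SIGNATURE_KEY);
        if (signature == null) {
            return defaultReducers;
        }
        return history.getOrDefault(signature, defaultReducers);
    }

    public static void main(String[] args) {
        // Made-up plan text, not real LogicalPlanPrinter output.
        String printedPlan = "LOStore <- LOForEach <- LOFilter <- LOLoad";

        Map<String, Integer> history = new HashMap<>();
        history.put(signatureOf(printedPlan), 12); // recorded by an earlier run

        Properties jobConf = new Properties();
        storeSignature(jobConf, printedPlan);
        System.out.println(estimateReducers(jobConf, history, 1));
    }
}
```

In this sketch, a script whose printed plan is unchanged finds its earlier reducer count, while any change to the printed text produces a different key and falls back to the default.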

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2587) Compute LogicalPlan signature and store in job conf

Posted by "Julien Le Dem (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240993#comment-13240993 ] 

Julien Le Dem commented on PIG-2587:
------------------------------------

@Jonathan I think getting the signature exactly right would be hard, with the added problem that every change made to improve the signature instantly invalidates any cache keyed on it. The case where the script is modified in a way that doesn't change the physical plan seems marginal.

This looks good to me.

Outside the scope of this patch: things that also affect the physical plan and should probably be part of the lookup:
 - version of Pig
 - optimizer flags
 - version of registered jars


                
> Compute LogicalPlan signature and store in job conf
> ---------------------------------------------------
>
>                 Key: PIG-2587
>                 URL: https://issues.apache.org/jira/browse/PIG-2587
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>              Labels: 0.10_blocker
>             Fix For: 0.10, 0.11
>
>         Attachments: pig-2587_1.patch


[jira] [Updated] (PIG-2587) Compute LogicalPlan signature and store in job conf

Posted by "Jonathan Coveney (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Coveney updated PIG-2587:
----------------------------------

    Fix Version/s: 0.11
                   0.10
    
> Compute LogicalPlan signature and store in job conf
> ---------------------------------------------------
>
>                 Key: PIG-2587
>                 URL: https://issues.apache.org/jira/browse/PIG-2587
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>             Fix For: 0.10, 0.11
>
>         Attachments: pig-2587_1.patch


[jira] [Commented] (PIG-2587) Compute LogicalPlan signature and store in job conf

Posted by "Bill Graham (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241007#comment-13241007 ] 

Bill Graham commented on PIG-2587:
----------------------------------

I agree that if cosmetic changes are made to the script, all bets are off and you'll get a different signature.

I also agree about the 3 items that are out of scope here. The version of registered jars would be ugly to handle, since transitive dependencies could change without being detected.
                
> Compute LogicalPlan signature and store in job conf
> ---------------------------------------------------
>
>                 Key: PIG-2587
>                 URL: https://issues.apache.org/jira/browse/PIG-2587
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>              Labels: 0.10_blocker
>             Fix For: 0.10, 0.11
>
>         Attachments: pig-2587_1.patch


[jira] [Updated] (PIG-2587) Compute LogicalPlan signature and store in job conf

Posted by "Bill Graham (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Graham updated PIG-2587:
-----------------------------

    Attachment: pig-2587_1.patch

Here's a first pass of a proposed implementation.
                
> Compute LogicalPlan signature and store in job conf
> ---------------------------------------------------
>
>                 Key: PIG-2587
>                 URL: https://issues.apache.org/jira/browse/PIG-2587
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>         Attachments: pig-2587_1.patch


[jira] [Commented] (PIG-2587) Compute LogicalPlan signature and store in job conf

Posted by "Jonathan Coveney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239263#comment-13239263 ] 

Jonathan Coveney commented on PIG-2587:
---------------------------------------

Bill,

There are a couple of ways to implement a signature like this. One is to just use the hashCode, which is what you did; that will work well for identical scripts. I wonder if it might be worth thinking about some sort of value that wouldn't change with cosmetic changes to the script (i.e., alias changes and the like)? A signature is one thing, and the hashCode would be adequate for that, but ideally, as long as the sources and transformations are the same, you'd want cosmetic changes not to throw out the tuning you've done.

Is that crazy talk? The 80/20 rule may dictate just going with this approach, since it is so simple, and leaving the bigger optimization to external systems.
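
To make the alias-insensitive idea concrete, here is a rough sketch. It is not part of the patch, and it assumes a made-up plan text format of one "alias: operator(inputs)" line per operator. Real {{LogicalPlanPrinter}} output looks different, so the parsing is purely illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: rewrites aliases to positional placeholders before
// hashing, so renaming an alias does not change the signature.
final class AliasInsensitiveSignature {

    private static final Pattern IDENTIFIER =
            Pattern.compile("\\b[A-Za-z_][A-Za-z0-9_]*\\b");

    static String normalize(String printedPlan) {
        // Pass 1: collect aliases in order of definition (text before ':').
        Map<String, String> placeholders = new LinkedHashMap<>();
        for (String line : printedPlan.split("\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                String alias = line.substring(0, colon).trim();
                placeholders.putIfAbsent(alias, "@" + placeholders.size());
            }
        }

        // Pass 2: replace every identifier that is a known alias.
        Matcher m = IDENTIFIER.matcher(printedPlan);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String token = m.group();
            String replacement = placeholders.getOrDefault(token, token);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    static String signatureOf(String printedPlan) {
        return Integer.toString(normalize(printedPlan).hashCode());
    }
}
```

Under these assumptions, two plans that differ only in alias names normalize to the same text and share a signature, while a change to an operator still yields different text. The trade-off raised elsewhere in this thread still applies: any later change to the normalization invalidates everything stored under the old signatures.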
                
> Compute LogicalPlan signature and store in job conf
> ---------------------------------------------------
>
>                 Key: PIG-2587
>                 URL: https://issues.apache.org/jira/browse/PIG-2587
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>             Fix For: 0.10, 0.11
>
>         Attachments: pig-2587_1.patch


[jira] [Commented] (PIG-2587) Compute LogicalPlan signature and store in job conf

Posted by "Julien Le Dem (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249057#comment-13249057 ] 

Julien Le Dem commented on PIG-2587:
------------------------------------

+1
                
> Compute LogicalPlan signature and store in job conf
> ---------------------------------------------------
>
>                 Key: PIG-2587
>                 URL: https://issues.apache.org/jira/browse/PIG-2587
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>              Labels: 0.10_blocker
>             Fix For: 0.10.0, 0.11
>
>         Attachments: pig-2587_1.patch


[jira] [Resolved] (PIG-2587) Compute LogicalPlan signature and store in job conf

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Graham resolved PIG-2587.
------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 0.10.0)

Committed.
                
> Compute LogicalPlan signature and store in job conf
> ---------------------------------------------------
>
>                 Key: PIG-2587
>                 URL: https://issues.apache.org/jira/browse/PIG-2587
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>             Fix For: 0.11
>
>         Attachments: pig-2587_1.patch
