Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2010/02/03 02:15:18 UTC

[jira] Created: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front

New load store design does not allow Pig to validate inputs and outputs up front
--------------------------------------------------------------------------------

                 Key: PIG-1216
                 URL: https://issues.apache.org/jira/browse/PIG-1216
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.7.0
            Reporter: Alan Gates


In Pig 0.6 and earlier, Pig attempted to verify the existence of inputs and the non-existence of outputs during parsing, to avoid run-time failures when inputs don't exist or outputs can't be overwritten.  The downside was that Pig assumed all inputs and outputs were HDFS files, which made implementation harder for non-HDFS based load and store functions.  In the load store redesign (PIG-966) this check was delegated to InputFormats and OutputFormats to avoid this problem and to make use of the checks already being done in those implementations.  Unfortunately, this does not work well for Pig Latin scripts that run more than one MR job.  MR does not do input/output verification on all the jobs at once; it verifies them one at a time.  So if a Pig Latin script results in 10 MR jobs and the file to store to at the end already exists, the first 9 jobs will run before the 10th job discovers that the whole thing was doomed from the beginning.

To avoid this, a validate call needs to be added to the new LoadFunc and StoreFunc interfaces.  Pig needs to pass this method enough information that the load function implementer can delegate to InputFormat.getSplits(), and the store function implementer to OutputFormat.checkOutputSpecs(), if he or she decides to.  Since 90% of all load and store functions use HDFS, and PigStorage will also need this check, the Pig team should implement a default file existence check on HDFS and make it available as a static method to other load/store function implementers.
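
A minimal sketch of the flow proposed above, not the actual Pig API. The names ValidatingStore, FileExistenceCheck, FileStore and UpfrontValidator are hypothetical, the local filesystem (java.nio.file) stands in for HDFS, and an unchecked exception is used only to keep the sketch short. It shows every store location being checked before any job would be launched, so a doomed script fails before job 1 rather than at job 10.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical stand-in for a store function that can be validated up front.
interface ValidatingStore {
    // Throws IllegalStateException if the output location cannot be used.
    void validate(String location);
}

// Hypothetical shared default check: fail if the output path already exists.
// The local filesystem stands in for HDFS in this sketch.
class FileExistenceCheck {
    static void checkOutputDoesNotExist(String location) {
        if (Files.exists(Paths.get(location))) {
            throw new IllegalStateException("Output location already exists: " + location);
        }
    }
}

// A file-based store simply delegates to the shared static helper.
class FileStore implements ValidatingStore {
    @Override
    public void validate(String location) {
        FileExistenceCheck.checkOutputDoesNotExist(location);
    }
}

class UpfrontValidator {
    // Validates every store location before any job would be launched.
    // Returns the number of locations checked; throws on the first failure.
    static int validateAll(ValidatingStore store, List<String> locations) {
        int checked = 0;
        for (String location : locations) {
            store.validate(location);
            checked++;
        }
        return checked;
    }
}
```

A non-file store would implement validate() differently, or as a no-op, which is the point of leaving the decision to the implementer.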

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1216:
----------------------------------

    Status: Patch Available  (was: Open)



[jira] Reopened: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan reopened PIG-1216:
-----------------------------------


Reopening as the assumption made for the patch doesn't hold.



[jira] Commented: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835459#action_12835459 ] 

Ashutosh Chauhan commented on PIG-1216:
---------------------------------------

Referring to point # 1 of Pradeep's review comment:

For Zebra, it is not safe to call checkOutputSpecs() multiple times, because Zebra creates indices in this call. So this approach doesn't work. The proposal is to introduce validate() in the StoreFunc API, which a storer can implement in whatever way it wants, thus getting rid of this restriction. In PigStorage's validate() we will call checkOutputSpecs(), since it is safe to do so there.
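
To illustrate the distinction, here is a sketch under stated assumptions: IndexCreatingFormat, Storer and IndexAwareStorer are hypothetical classes written for this example and are not Zebra or Pig code. The format's check has a side effect and fails on a second call, while the storer-level validate() is read-only and can be called any number of times before the job is launched.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical output format whose check has side effects, loosely modelled
// on the behaviour described above (an index is created during the check).
class IndexCreatingFormat {
    private final Set<String> indices = new HashSet<>();

    // Not idempotent: the first call creates an index, a second call fails.
    void checkOutputSpecs(String location) {
        if (!indices.add(location)) {
            throw new IllegalStateException("Index already exists for " + location);
        }
    }

    // Side-effect-free query used by validate() below.
    boolean hasIndex(String location) {
        return indices.contains(location);
    }
}

// Hypothetical storer-level hook: each storer decides how to validate.
interface Storer {
    void validate(String location);
}

// A storer whose validate() performs a read-only check, so calling it up
// front does not disturb the later checkOutputSpecs() call made at job launch.
class IndexAwareStorer implements Storer {
    private final IndexCreatingFormat format;

    IndexAwareStorer(IndexCreatingFormat format) {
        this.format = format;
    }

    @Override
    public void validate(String location) {
        if (format.hasIndex(location)) {
            throw new IllegalStateException("Output already exists: " + location);
        }
    }
}
```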



[jira] Commented: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834637#action_12834637 ] 

Hadoop QA commented on PIG-1216:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12436067/pig-1216_1.patch
  against trunk revision 909921.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/207/console

This message is automatically generated.



[jira] Updated: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-1216:
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.7.0
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

Patch committed to load-store-redesign branch - Thanks Ashutosh!

Note that only outputs will be validated up front (in line with Pig 0.6.0) - inputs will not be validated up front since for the following case validating inputs is not easy:
{code}
...
store into 'foo'...
load 'foo'...
...
{code}
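
As a rough illustration of why this case is awkward, the sketch below runs a naive up-front input check against a snapshot of the filesystem taken before the script starts. NaiveInputCheck and its method are hypothetical and the filesystem is simulated with a Set of names. The location 'foo' is reported missing even though an earlier store in the same script would create it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class NaiveInputCheck {
    // Returns the load locations that are absent from the snapshot of the
    // (simulated) filesystem taken before the script runs.
    static List<String> missingInputs(Set<String> existingBeforeRun, List<String> loadLocations) {
        List<String> missing = new ArrayList<>();
        for (String location : loadLocations) {
            if (!existingBeforeRun.contains(location)) {
                missing.add(location);
            }
        }
        return missing;
    }
}
```

A check that wanted to avoid this false failure would have to take into account the outputs produced by earlier statements in the script.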



[jira] Updated: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1216:
----------------------------------

    Attachment: pig-1216_1.patch

bq. Is it ok to call outputSpecs multiple times [...]
Talked with Arun regarding this. In a user-supplied OutputFormat, the implementation of checkOutputSpecs() is also provided by the user, so the user needs to make sure this call is idempotent. PigStorage uses TextOutputFormat, for which checkOutputSpecs() is idempotent. We need to document this in the user manual.

bq. the test case for validation failure [...]
Done.

bq. import [...]
Done.

Result of test-patch.sh on the patch:
     [exec] +1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.



[jira] Closed: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai closed PIG-1216.
---------------------------




[jira] Assigned: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan reassigned PIG-1216:
-------------------------------------

    Assignee: Ashutosh Chauhan



[jira] Commented: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834411#action_12834411 ] 

Pradeep Kamath commented on PIG-1216:
-------------------------------------

Review comments:
 * Is it ok to call outputSpecs multiple times (since we will now be calling it in the visitor and Hadoop will be calling it later when the job is launched) - hope that does not break the contract per Hadoop's OutputFormat interface
 * The test case for validation failure should ensure that PlanValidationException is indeed thrown (through some boolean flag?) - currently the code has :
{code}
} catch (PlanValidationException pve){
+           // We expect this to happen.
+        }
{code}
 * import org.omg.PortableInterceptor.SUCCESSFUL; in TestStore.java seems accidental - if you will be submitting a new patch for above comment, you can remove this import also.

Otherwise looks good.
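
One possible shape for the boolean-flag approach suggested in the second comment, as a sketch only: the PlanValidationException class below is a local stand-in so the example compiles on its own, and validateOutput() is a hypothetical operation under test, not the code in the patch.

```java
// Local stand-in for Pig's PlanValidationException so the sketch is self-contained.
class PlanValidationException extends Exception {
    PlanValidationException(String message) {
        super(message);
    }
}

class ValidationFailureCheck {
    // Hypothetical operation under test: rejects an output that already exists.
    static void validateOutput(boolean outputExists) throws PlanValidationException {
        if (outputExists) {
            throw new PlanValidationException("Output location already exists");
        }
    }

    // Returns true only if the expected exception was actually thrown.
    static boolean throwsOnExistingOutput() {
        boolean sawException = false;
        try {
            validateOutput(true);
        } catch (PlanValidationException pve) {
            // We expect this to happen; record it so the test can assert on it.
            sawException = true;
        }
        return sawException;
    }
}
```

In a JUnit test the flag would be asserted after the try/catch, so the test fails if the exception is never thrown instead of passing silently.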



[jira] Updated: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1216:
----------------------------------

    Attachment: pig-1216.patch



[jira] Commented: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834608#action_12834608 ] 

Ashutosh Chauhan commented on PIG-1216:
---------------------------------------

Thinking more about this. We don't do validation on the input side because the input location (or files) may get created over the course of execution of the Pig script, rendering such validation for input not only useless but incorrect. But a similar situation may exist for output validation as well. Assume the simple case of HDFS as storage, where the output location exists in HDFS. The user may have rmf statements within the script, so the output location is actually deleted before that job is executed; but if we do upfront validation, Pig will fail and refuse to run the script, because OutputFormat.checkOutputSpecs() reports at compile time that the output location exists.
In the more general case, invariants that are true at the compile time of a Pig script may no longer hold at runtime, which makes this kind of compile-time validation dangerous.
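
The rmf scenario can be simulated in a few lines. This is a sketch with a Set standing in for HDFS; CompileTimeVsRuntime and its methods are hypothetical names. The up-front check rejects the script because the output exists before the run, yet the same script succeeds when executed in order, since rmf removes the location first.

```java
import java.util.Set;

class CompileTimeVsRuntime {
    // Up-front check against the (simulated) filesystem as it looks before
    // the script runs: passes only if the output does not exist yet.
    static boolean passesUpfrontCheck(Set<String> fs, String output) {
        return !fs.contains(output);
    }

    // Simulates running "rmf <output>" followed by "store ... into <output>".
    // Returns true if the store succeeds at run time.
    static boolean runRmfThenStore(Set<String> fs, String output) {
        fs.remove(output);          // rmf: remove the location, no error if absent
        if (fs.contains(output)) {
            return false;           // store would fail: output still exists
        }
        fs.add(output);             // store creates the output location
        return true;
    }
}
```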

> New load store design does not allow Pig to validate inputs and outputs up front
> --------------------------------------------------------------------------------
>
>                 Key: PIG-1216
>                 URL: https://issues.apache.org/jira/browse/PIG-1216
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>         Attachments: pig-1216.patch, pig-1216_1.patch



[jira] Resolved: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan resolved PIG-1216.
-----------------------------------

    Resolution: Fixed

> New load store design does not allow Pig to validate inputs and outputs up front
> --------------------------------------------------------------------------------
>
>                 Key: PIG-1216
>                 URL: https://issues.apache.org/jira/browse/PIG-1216
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.7.0
>
>         Attachments: pig-1216.patch, pig-1216_1.patch
