Posted to dev@pig.apache.org by "Mithun Radhakrishnan (Created) (JIRA)" <ji...@apache.org> on 2012/03/10 03:24:57 UTC

[jira] [Created] (PIG-2578) Multiple Store-commands mess up mapred.output.dir.

Multiple Store-commands mess up mapred.output.dir.
--------------------------------------------------

                 Key: PIG-2578
                 URL: https://issues.apache.org/jira/browse/PIG-2578
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.9.2, 0.8.1
            Reporter: Mithun Radhakrishnan


When one runs a pig-script with multiple storers, one sees the following:
1. When run as a script, Pig launches a single job.
2. PigOutputCommitter::setupJob() calls the underlying OutputCommitter::setupJob() once for each storer. But mapred.output.dir is the same for both calls, even though the storers write to different locations.
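
The symptom in step 2 can be pictured with a small, self-contained Java sketch. It does not use Pig or Hadoop classes: SharedConfDemo, setStoreLocation and dirsSeenBySetupJob are hypothetical stand-ins, and a plain Map plays the role of the job configuration. The assumption is that every storer writes its location into one shared configuration, so the last writer wins.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the reported behaviour; not Pig's actual code.
final class SharedConfDemo {
    static final String OUTPUT_DIR = "mapred.output.dir";

    // Each storer records its location in the one shared configuration.
    static void setStoreLocation(Map<String, String> conf, String location) {
        conf.put(OUTPUT_DIR, location);
    }

    // Returns the output dir that each per-storer setupJob() call would observe.
    static List<String> dirsSeenBySetupJob(String... locations) {
        Map<String, String> sharedConf = new HashMap<>();
        for (String location : locations) {
            setStoreLocation(sharedConf, location);
        }
        List<String> seen = new ArrayList<>();
        for (int i = 0; i < locations.length; i++) {
            seen.add(sharedConf.get(OUTPUT_DIR)); // same value for every storer
        }
        return seen;
    }

    public static void main(String[] args) {
        // Prints [keyvals_ge200, keyvals_ge200]
        System.out.println(dirsSeenBySetupJob("keyvals_lt200", "keyvals_ge200"));
    }
}
```

Whether Pig actually shares the configuration this way is exactly what the rest of the thread debates; the sketch only shows what the report describes.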

This was originally seen in HCATALOG-276, when HCatalog's end-to-end tests were run against Pig.
(https://issues.apache.org/jira/browse/HCATALOG-276)

Sample pig-script (near identical to HCatalog's Pig_Checkin_4 test):

a = load 'keyvals' using org.apache.hcatalog.pig.HCatLoader();
split a into b if key<200, c if key >=200;
store b into 'keyvals_lt200' using org.apache.hcatalog.pig.HCatStorer();
store c into 'keyvals_ge200' using org.apache.hcatalog.pig.HCatStorer();

I've suggested a workaround in HCat for the time being, but I think this might be something that needs fixing in Pig.

Thanks.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2578) Multiple Store-commands mess up mapred.output.dir.

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438929#comment-13438929 ] 

Raghu Angadi commented on PIG-2578:
-----------------------------------

Thanks for the analysis, Rohini. +1 for reverting this patch.

For the larger issue, I think Pig should clearly define the contract for the job/conf passed to setLocation() and setStoreLocation() so that a user's StoreFunc can be implemented properly. I would suggest resisting the temptation to say "this method might be called any number of times" (a variant of this appears in multiple places in Pig's interfaces). While this made UDF implementors think twice about what they are doing, it also allowed Pig to implement workarounds rather than proper fixes (e.g. why is setStoreLocation() called in so many places?).

                

[jira] [Commented] (PIG-2578) Multiple Store-commands mess up mapred.output.dir.

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433834#comment-13433834 ] 

Bill Graham commented on PIG-2578:
----------------------------------

Regarding the wrapper job conf, in some cases I'm sure it's justified to set a conf value. What if we throw an exception only when an attempt is made to set a value where a different value already exists? We could include messaging about how UDFContext is probably what they want. This approach would be backward compatible with jobs that use the conf properly, such as a single-store job.
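
A minimal sketch of that suggestion, assuming a plain Map in place of the real job configuration. ConflictCheckingConf and its methods are hypothetical names for illustration, not part of Pig or Hadoop. Setting a new key, or repeating the same value, is allowed; setting a different value for an existing key is rejected with a hint about UDFContext.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper; a plain Map stands in for the job configuration.
final class ConflictCheckingConf {
    private final Map<String, String> values = new HashMap<>();

    // Accepts a new key or a repeat of the same value; rejects a different value.
    void set(String key, String value) {
        String existing = values.get(key);
        if (existing != null && !existing.equals(value)) {
            throw new IllegalStateException("Conflicting value for " + key + ": '" + existing
                    + "' vs '" + value + "'. Per-store settings probably belong in UDFContext.");
        }
        values.put(key, value);
    }

    String get(String key) {
        return values.get(key);
    }

    public static void main(String[] args) {
        ConflictCheckingConf conf = new ConflictCheckingConf();
        conf.set("mapred.output.dir", "keyvals_lt200");
        conf.set("mapred.output.dir", "keyvals_lt200"); // same value: allowed
        try {
            conf.set("mapred.output.dir", "keyvals_ge200"); // different value: rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A single-store job never hits the exception, which is the backward-compatibility property the comment is after.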
                

[jira] [Commented] (PIG-2578) Multiple Store-commands mess up mapred.output.dir.

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432880#comment-13432880 ] 

Bill Graham commented on PIG-2578:
----------------------------------

This patch has caused some issues, see PIG-2870.
                

[jira] [Commented] (PIG-2578) Multiple Store-commands mess up mapred.output.dir.

Posted by "Daniel Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253654#comment-13253654 ] 

Daniel Dai commented on PIG-2578:
---------------------------------

There are two parts to this issue:
1. I found a bug in Pig: we didn't make a copy of the Hadoop configuration before invoking the various StoreFunc hooks. This is more obvious under Hadoop 23 with HCat, where Hadoop needs "mapred.output.dir" in the OutputCommitter to move (promote) the output.
2. There is also a fix on the HCat side: setStoreLocation is supposed to set up the right "mapred.output.dir".
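
The first point can be illustrated as follows. This is a sketch under stated assumptions, not the contents of the patch: CopyBeforeHookDemo, setStoreLocation and confForStore are hypothetical names, and a HashMap copy stands in for copying a Hadoop Configuration (which has a copy constructor for this purpose).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "copy the configuration before invoking a StoreFunc hook".
final class CopyBeforeHookDemo {
    static final String OUTPUT_DIR = "mapred.output.dir";

    // Stand-in for a store's setStoreLocation(): writes to whatever conf it is given.
    static void setStoreLocation(Map<String, String> conf, String location) {
        conf.put(OUTPUT_DIR, location);
    }

    // Gives the hook a private copy, so its writes never reach the shared job conf.
    static Map<String, String> confForStore(Map<String, String> jobConf, String location) {
        Map<String, String> copy = new HashMap<>(jobConf);
        setStoreLocation(copy, location);
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        jobConf.put("mapred.job.name", "multi-store");
        Map<String, String> lt = confForStore(jobConf, "keyvals_lt200");
        Map<String, String> ge = confForStore(jobConf, "keyvals_ge200");
        // Prints: keyvals_lt200 keyvals_ge200 shared=false
        System.out.println(lt.get(OUTPUT_DIR) + " " + ge.get(OUTPUT_DIR)
                + " shared=" + jobConf.containsKey(OUTPUT_DIR));
    }
}
```

The trade-off of this design is visible in the last printed field: anything a store writes into its copy is invisible to the real job.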
                

[jira] [Commented] (PIG-2578) Multiple Store-commands mess up mapred.output.dir.

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433810#comment-13433810 ] 

Dmitriy V. Ryaboy commented on PIG-2578:
----------------------------------------

I am aware of many StoreFunc implementations that rely on being able to mess with the JobConf. This is an undocumented and backwards-incompatible change. I can see why we need it, but the proper way to do this would be to document it, provide explicit instructions on using UDFContext (and how/where/when to get it), and migrate the piggybank and builtin StoreFuncs that rely on mutable JobConfs. Further, to make the contract clear, rather than passing in a dummy new JobConf that gets GC'd immediately, we should pass in a wrapper JobConf which throws exceptions on any set() call, to prevent surprises.
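
The wrapper idea could look roughly like this. It is only a sketch: ReadOnlyConf is a hypothetical class built on a plain Map, not a subclass of Hadoop's JobConf, and the error message wording is invented. Reads pass through; every set() fails loudly instead of being silently discarded.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical read-only view of a configuration; not a real Pig or Hadoop class.
final class ReadOnlyConf {
    private final Map<String, String> values;

    ReadOnlyConf(Map<String, String> source) {
        this.values = new HashMap<>(source); // defensive copy of the job settings
    }

    String get(String key) {
        return values.get(key);
    }

    // Any write is an error, so a StoreFunc finds out immediately that it will not stick.
    void set(String key, String value) {
        throw new UnsupportedOperationException("Configuration is read-only here; tried to set "
                + key + ". Use UDFContext for per-store state.");
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        jobConf.put("mapred.output.dir", "keyvals_lt200");
        ReadOnlyConf conf = new ReadOnlyConf(jobConf);
        System.out.println(conf.get("mapred.output.dir")); // reads still work
        try {
            conf.set("mapred.output.dir", "keyvals_ge200");
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```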
                

[jira] [Commented] (PIG-2578) Multiple Store-commands mess up mapred.output.dir.

Posted by "Ashutosh Chauhan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228506#comment-13228506 ] 

Ashutosh Chauhan commented on PIG-2578:
---------------------------------------

I find this a bit surprising. If this were true, multi-query could not work effectively, since both stores using FileOutputFormat would effectively write to the same directory, messing up each other's outputs. I suspect the problem may lie in HCatalog. A test case independent of HCatalog that demonstrates the Pig bug would be highly appreciated.
                

[jira] [Commented] (PIG-2578) Multiple Store-commands mess up mapred.output.dir.

Posted by "Daniel Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253661#comment-13253661 ] 

Daniel Dai commented on PIG-2578:
---------------------------------

The HCat-side fix is part of HCATALOG-375.
                

[jira] [Commented] (PIG-2578) Multiple Store-commands mess up mapred.output.dir.

Posted by "Rohini Palaniswamy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433772#comment-13433772 ] 

Rohini Palaniswamy commented on PIG-2578:
-----------------------------------------

Spoke with Daniel. He said it was intentional to make the JobConf read-only so that one store does not override another. But the problem with that is that it does not allow adding Credentials or setting JobTracker-specific config, such as distributed cache configuration, on the Job. We need a cleaner solution that solves this while preventing StoreFunc implementations from putting something in the JobConf that will not work correctly with multiple stores. We got rid of the multiple-stores problem in HCat by putting the properties in UDFContext instead of the Job. But we cannot expect all StoreFunc implementations to do that unless they are forced to, which was the intention of this JIRA.
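
The idea of keeping per-store properties outside the job configuration can be sketched as below. This does not reproduce Pig's UDFContext API: PerStoreProperties, forStore and the signature strings "store-b" and "store-c" are hypothetical, chosen only to show that two stores can hold different values for the same key without touching a shared conf.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in for a per-store property registry; not Pig's UDFContext.
final class PerStoreProperties {
    private final Map<String, Properties> bySignature = new HashMap<>();

    // Returns the property bag for one store instance, creating it on first use.
    Properties forStore(String signature) {
        return bySignature.computeIfAbsent(signature, s -> new Properties());
    }

    public static void main(String[] args) {
        PerStoreProperties ctx = new PerStoreProperties();
        ctx.forStore("store-b").setProperty("mapred.output.dir", "keyvals_lt200");
        ctx.forStore("store-c").setProperty("mapred.output.dir", "keyvals_ge200");
        System.out.println(ctx.forStore("store-b").getProperty("mapred.output.dir"));
        System.out.println(ctx.forStore("store-c").getProperty("mapred.output.dir"));
    }
}
```

Job-wide settings such as credentials do not fit this per-store model, which is the gap the comment points out.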
                

[jira] [Commented] (PIG-2578) Multiple Store-commands mess up mapred.output.dir.

Posted by "Thejas M Nair (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253662#comment-13253662 ] 

Thejas M Nair commented on PIG-2578:
------------------------------------

+1
                

[jira] [Resolved] (PIG-2578) Multiple Store-commands mess up mapred.output.dir.

Posted by "Daniel Dai (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai resolved PIG-2578.
-----------------------------

      Resolution: Fixed
    Hadoop Flags: Reviewed

Patch committed to 0.10/trunk.
                

[jira] [Updated] (PIG-2578) Multiple Store-commands mess up mapred.output.dir.

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2578:
----------------------------

    Attachment: PIG-2578-1.patch

Attaching the patch for the Pig side.
                

[jira] [Updated] (PIG-2578) Multiple Store-commands mess up mapred.output.dir.

Posted by "Daniel Dai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2578:
----------------------------

    Fix Version/s: 0.11
                   0.10.0
         Assignee: Daniel Dai
    

[jira] [Commented] (PIG-2578) Multiple Store-commands mess up mapred.output.dir.

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438959#comment-13438959 ] 

Bill Graham commented on PIG-2578:
----------------------------------

I think the problem is not so much that setStoreLocation can get called multiple times, but that the Javadocs don't make clear what effects (or side effects) will occur when setStoreLocation sets values in the Config.

+1 on reverting this patch and adding better Javadocs for starters, since the build is currently broken for a number of common use cases. We can then add examples and safeguards to illustrate proper usage in a multi-store environment.
                
> Multiple Store-commands mess up mapred.output.dir.
> --------------------------------------------------
>
>                 Key: PIG-2578
>                 URL: https://issues.apache.org/jira/browse/PIG-2578
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.1, 0.9.2
>            Reporter: Mithun Radhakrishnan
>            Assignee: Daniel Dai
>             Fix For: 0.10.0, 0.11
>
>         Attachments: PIG-2578-1.patch
>
>
> When one runs a pig-script with multiple storers, one sees the following:
> 1. When run as a script, Pig launches a single job.
> 2. PigOutputCommitter::setupJob() calls the underlyingOutputCommitter::setupJob(), once for each storer. But the mapred.output.dir is the same for both calls, even though the storers write to different locations. 
> This was originally seen in HCATALOG-276, when HCatalog's end-to-end tests are run against Pig.
> (https://issues.apache.org/jira/browse/HCATALOG-276)
> Sample pig-script (near identical to HCatalog's Pig_Checkin_4 test):
> a = load 'keyvals' using org.apache.hcatalog.pig.HCatLoader();
> split a into b if key<200, c if key >=200;
> store b into 'keyvals_lt200' using org.apache.hcatalog.pig.HCatStorer();
> store c into 'keyvals_ge200' using org.apache.hcatalog.pig.HCatStorer();
> I've suggested a workaround in HCat for the time being, but I think this might be something that needs fixing in Pig.
> Thanks.


[jira] [Commented] (PIG-2578) Multiple Store-commands mess up mapred.output.dir.

Posted by "Rohini Palaniswamy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434548#comment-13434548 ] 

Rohini Palaniswamy commented on PIG-2578:
-----------------------------------------

Did some debugging with and without the PIG-2578 patch. Multiple stores using PigStorage worked fine in both cases. This is because before every getOutputFormat call there is a setLocation call with a copy of the JobContext or TaskAttemptContext, and that copy is what gets passed to the getOutputCommitter(), getRecordWriter() or checkOutputSpecs() calls. So the output format actually runs with the correct configuration, and multiple store commands don't always get messed up. The corner case I see (without PIG-2578) is this: if one instance of the store sets a config entry to a specific value and another instance sets no value at all for that entry, the second instance still gets the value put there by the first instance, carried over in its copy of the job.
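
A minimal sketch of that corner case, under stated assumptions: a plain java.util.Map stands in for the Hadoop Configuration, the first store's value is assumed to have landed in the shared job configuration, and the class, interface and field names are hypothetical (they are not Pig's API).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: a plain Map plays the role of the job Configuration.
final class StoreConfigLeakSketch {

    // Mimics a store's location hook: it receives a per-store copy of the
    // job configuration and may write into it.
    interface Store {
        void setStoreLocation(String location, Map<String, String> conf);
    }

    // Always sets its own output directory, so it never sees a stale value.
    static final Store SETS_DIR =
            (location, conf) -> conf.put("mapred.output.dir", location);

    // Sets nothing, so it inherits whatever the job configuration already holds.
    static final Store SETS_NOTHING = (location, conf) -> { };

    // Copies the job configuration, lets the store adjust the copy, and returns
    // the value the store's output format would read from that copy.
    static String effectiveOutputDir(Map<String, String> jobConf, Store store, String location) {
        Map<String, String> copy = new HashMap<>(jobConf);
        store.setStoreLocation(location, copy);
        return copy.get("mapred.output.dir");
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        // Assumption: the first store's value ended up in the shared job configuration.
        SETS_DIR.setStoreLocation("keyvals_lt200", jobConf);

        // A store that sets the key on its copy sees its own location.
        System.out.println(effectiveOutputDir(jobConf, SETS_DIR, "keyvals_ge200"));     // keyvals_ge200
        // A store that sets nothing still sees the first store's value.
        System.out.println(effectiveOutputDir(jobConf, SETS_NOTHING, "keyvals_ge200")); // keyvals_lt200
    }
}
```

The second call is the corner case: the copy protects a store only for the entries it actually sets.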

The actual problem when this jira was filed was in the hcat code. During setStoreLocation it set mapred.output.dir and a lot of other properties in the frontend but not in the backend.
http://svn.apache.org/viewvc/incubator/hcatalog/branches/branch-0.4/src/java/org/apache/hcatalog/pig/HCatStorer.java?revision=1325867&view=markup
If it had set the mapred.output.dir in the backend also, it would have worked fine. It was later fixed to do so.
                

[jira] [Commented] (PIG-2578) Multiple Store-commands mess up mapred.output.dir.

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439103#comment-13439103 ] 

Daniel Dai commented on PIG-2578:
---------------------------------

I am fine with reverting the patch. The underlying problem is that setStoreLocation is the only hook StoreFunc offers, and it serves multiple purposes. In the javadoc, we should make clear:
1. Implementations need to distinguish frontend from backend (using UDFContext.isFrontend()): the user can set up global configuration in the frontend, but only store-specific configuration in the backend.
2. When setting up global configuration, bear in mind that there could be multiple stores, so config entries can overwrite each other.
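
The two points above can be sketched roughly as follows. This is an illustration only, not Pig's API: a java.util.Map stands in for the Hadoop Configuration, a boolean parameter stands in for the frontend check, and the key prefix and the "schema-for-" values are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: Map stands in for the Hadoop Configuration and the
// boolean flag stands in for the frontend check mentioned above.
final class FrontendBackendSketch {

    // Made-up key prefix. Namespacing a global entry by store location keeps
    // two stores in the same job from overwriting each other (point 2).
    static final String GLOBAL_PREFIX = "example.store.schema.";

    static void setStoreLocation(String location, Map<String, String> conf, boolean isFrontend) {
        if (isFrontend) {
            // Frontend only: global configuration shared by every store in the job.
            conf.put(GLOBAL_PREFIX + location, "schema-for-" + location);
        }
        // Frontend and backend: store-specific configuration, set on every call
        // so the copy handed to this store points at this store's location.
        conf.put("mapred.output.dir", location);
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        setStoreLocation("keyvals_lt200", jobConf, true);
        setStoreLocation("keyvals_ge200", jobConf, true);

        // The un-namespaced key was overwritten by the second store.
        System.out.println(jobConf.get("mapred.output.dir")); // keyvals_ge200

        // Backend: the first store gets its own copy and re-sets its location.
        Map<String, String> copy = new HashMap<>(jobConf);
        setStoreLocation("keyvals_lt200", copy, false);
        System.out.println(copy.get("mapred.output.dir"));    // keyvals_lt200
    }
}
```

Re-setting the store-specific key in the backend is what makes the per-store copy correct; the namespaced global keys survive both frontend calls untouched.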
                

[jira] [Commented] (PIG-2578) Multiple Store-commands mess up mapred.output.dir.

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440912#comment-13440912 ] 

Dmitriy V. Ryaboy commented on PIG-2578:
----------------------------------------

Reverted in PIG-2890. I don't see a way to reopen this jira and change it to Won't Fix.
                