Posted to dev@pig.apache.org by "Viraj Bhat (JIRA)" <ji...@apache.org> on 2010/02/23 03:15:27 UTC

[jira] Created: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Diamond splitter does not generate correct results when using Multi-query optimization
--------------------------------------------------------------------------------------

                 Key: PIG-1252
                 URL: https://issues.apache.org/jira/browse/PIG-1252
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.6.0
            Reporter: Viraj Bhat
             Fix For: 0.7.0


I have a script which uses SPLIT but somehow does not use one of the split branches. The skeleton of the script is as follows:

{code}

loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7, col7');

prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;

SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), falseDataTmp IF (validRec == '1' AND splitcond == '');

grpData = GROUP trueDataTmp BY splitcond;

finalData = FOREACH grpData {
                               orderedData = ORDER trueDataTmp BY col1,col2;
                               GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
                              }

dump finalData;

{code}


You can see that "falseDataTmp" is untouched.

When I run this script with the no-multiquery (-M) option, I get the right result. This could be caused by the complex BinConds in the POLoad. We can work around this error by using FILTER instead of SPLIT.
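
A minimal sketch of the FILTER workaround mentioned above, assuming the same aliases and conditions as in the skeleton script. It has not been run against the original data, so treat it as illustrative only:

{code}
-- Illustrative rewrite of the SPLIT statement as two FILTER statements.
-- The aliases and conditions are copied from the skeleton script above.
trueDataTmp  = FILTER prjData BY (validRec == '1' AND splitcond != '');
falseDataTmp = FILTER prjData BY (validRec == '1' AND splitcond == '');
{code}

The rest of the script (GROUP, nested ORDER, dump) would stay unchanged.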

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840883#action_12840883 ] 

Dmitriy V. Ryaboy commented on PIG-1252:
----------------------------------------

Richard,
Is there any documentation on what the secondary key optimization does, when it kicks in, benchmarks of how much improvement it provides, and hints on what the expected tradeoffs would be?


> Diamond splitter does not generate correct results when using Multi-query optimization
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-1252
>                 URL: https://issues.apache.org/jira/browse/PIG-1252
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>         Attachments: PIG-1252.patch
>
>
> I have a script which uses SPLIT but somehow does not use one of the split branches. The skeleton of the script is as follows:
> {code}
> loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7');
> prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;
> SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), falseDataTmp IF (validRec == '1' AND splitcond == '');
> grpData = GROUP trueDataTmp BY splitcond;
> finalData = FOREACH grpData {
>                                orderedData = ORDER trueDataTmp BY col1,col2;
>                                GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
>                               }
> dump finalData;
> {code}
> You can see that "falseDataTmp" is untouched.
> When I run this script with the no-multiquery (-M) option, I get the right result. This could be caused by the complex BinConds in the POLoad. We can work around this error by using FILTER instead of SPLIT.
> Viraj

[jira] Updated: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1252:
------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

[jira] Updated: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1252:
------------------------------

    Attachment: PIG-1252.patch

[jira] Updated: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1252:
----------------------------

    Attachment: PIG-1252-2.patch

The root cause of this problem appears to be a faulty plan cloner. Attaching a preliminary patch to see if it works.

[jira] Commented: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843760#action_12843760 ] 

Hadoop QA commented on PIG-1252:
--------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12438263/PIG-1252-2.patch
  against trunk revision 921185.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/232/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/232/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/232/console

This message is automatically generated.

[jira] Updated: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1252:
------------------------------

    Attachment:     (was: PIG-1252.patch)

[jira] Updated: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1252:
------------------------------

    Attachment: PIG-1252.patch

This is the result of the diamond query optimizer merging a job that has secondary key optimization applied. This patch disallows such merges.

In practice, users should weigh the performance trade-off between multiquery optimization and secondary key optimization. Currently the secondary key optimizer runs before the multiquery optimizer, and with this patch the multiquery optimizer no longer merges any job that has secondary key optimization applied.

To disable multiquery optimization, use option -M. To disable secondary key optimization, use option -Dpig.exec.nosecondarykey=true.

[jira] Updated: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1252:
----------------------------

    Status: Patch Available  (was: Open)

[jira] Commented: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843764#action_12843764 ] 

Richard Ding commented on PIG-1252:
-----------------------------------

+1 for Daniel's patch

[jira] Updated: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated PIG-1252:
----------------------------

    Description: 
I have a script which uses SPLIT but somehow does not use one of the split branches. The skeleton of the script is as follows:

{code}

loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7');

prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;

SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), falseDataTmp IF (validRec == '1' AND splitcond == '');

grpData = GROUP trueDataTmp BY splitcond;

finalData = FOREACH grpData {
                               orderedData = ORDER trueDataTmp BY col1,col2;
                               GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
                              }

dump finalData;

{code}


You can see that "falseDataTmp" is untouched.

When I run this script with the no-multiquery (-M) option, I get the right result. This could be caused by the complex BinConds in the POLoad. We can work around this error by using FILTER instead of SPLIT.

Viraj

  was:
I have a script which uses SPLIT but somehow does not use one of the split branches. The skeleton of the script is as follows:

{code}

loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7, col7');

prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;

SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), falseDataTmp IF (validRec == '1' AND splitcond == '');

grpData = GROUP trueDataTmp BY splitcond;

finalData = FOREACH grpData {
                               orderedData = ORDER trueDataTmp BY col1,col2;
                               GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
                              }

dump finalData;

{code}


You can see that "falseDataTmp" is untouched.

When I run this script with the no-multiquery (-M) option, I get the right result. This could be caused by the complex BinConds in the POLoad. We can work around this error by using FILTER instead of SPLIT.

Viraj


[jira] Updated: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1252:
------------------------------

    Status: Patch Available  (was: Open)

> Diamond splitter does not generate correct results when using Multi-query optimization
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-1252
>                 URL: https://issues.apache.org/jira/browse/PIG-1252
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>         Attachments: PIG-1252.patch, PIG-1252.patch
>
>
> I have script which uses split but somehow does not use one of the split branch. The skeleton of the script is as follows
> {code}
> loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7');
> prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;
> SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), falseDataTmp IF (validRec == '1' AND splitcond == '');
> grpData = GROUP trueDataTmp BY splitcond;
> finalData = FOREACH grpData {
>                                orderedData = ORDER trueDataTmp BY col1,col2;
>                                GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
>                               }
> dump finalData;
> {code}
> You can see that "falseDataTmp" is untouched.
> When I run this script with the no-multiquery (-M) option I get the right result. This could be the result of complex BinConds in the POLoad. We can get rid of this error by using FILTER instead of SPLIT.
> Viraj


[jira] Updated: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1252:
------------------------------

    Status: Patch Available  (was: Open)

> Diamond splitter does not generate correct results when using Multi-query optimization
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-1252
>                 URL: https://issues.apache.org/jira/browse/PIG-1252
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>         Attachments: PIG-1252.patch
>
>

[jira] Closed: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai closed PIG-1252.
---------------------------


> Diamond splitter does not generate correct results when using Multi-query optimization
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-1252
>                 URL: https://issues.apache.org/jira/browse/PIG-1252
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>         Attachments: PIG-1252-2.patch, PIG-1252.patch
>
>

[jira] Updated: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1252:
----------------------------

    Status: Open  (was: Patch Available)

> Diamond splitter does not generate correct results when using Multi-query optimization
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-1252
>                 URL: https://issues.apache.org/jira/browse/PIG-1252
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>         Attachments: PIG-1252-2.patch, PIG-1252.patch
>
>

[jira] Commented: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840339#action_12840339 ] 

Viraj Bhat commented on PIG-1252:
---------------------------------

A modified version of the script works. Does this have to do with the nested FOREACH?

{code}
loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7');

prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;

SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), falseDataTmp IF (validRec == '1' AND splitcond == '');

grpData = GROUP trueDataTmp BY splitcond;

finalData = FOREACH grpData GENERATE FLATTEN ( MYUDF (trueDataTmp, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
                             
dump finalData;
{code}
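
For comparison, the FILTER-based workaround mentioned in the issue description might look roughly like the sketch below. It keeps the nested FOREACH and only replaces SPLIT with a single FILTER on the branch that is actually used. This is an untested sketch, not the committed fix; IS_VALID and MYUDF are the same UDFs as in the skeleton above and are assumed to be registered elsewhere.

{code}
-- Sketch only: same skeleton as the issue description, with SPLIT replaced by FILTER.
loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7');

prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;

-- Only the branch that feeds the GROUP is materialized; falseDataTmp is never defined.
trueDataTmp = FILTER prjData BY (validRec == '1' AND splitcond != '');

grpData = GROUP trueDataTmp BY splitcond;

finalData = FOREACH grpData {
    orderedData = ORDER trueDataTmp BY col1, col2;
    GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat', '20100222', '5', 'debug_on')) as (s,m,l);
}

dump finalData;
{code}

The other workaround from the description, running the original script with multi-query optimization turned off (the -M option), leaves the script itself unchanged.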

> Diamond splitter does not generate correct results when using Multi-query optimization
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-1252
>                 URL: https://issues.apache.org/jira/browse/PIG-1252
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>

[jira] Commented: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841010#action_12841010 ] 

Hadoop QA commented on PIG-1252:
--------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12437777/PIG-1252.patch
  against trunk revision 917827.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/232/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/232/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/232/console

This message is automatically generated.

> Diamond splitter does not generate correct results when using Multi-query optimization
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-1252
>                 URL: https://issues.apache.org/jira/browse/PIG-1252
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>         Attachments: PIG-1252.patch
>
>

[jira] Updated: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1252:
------------------------------

    Status: Open  (was: Patch Available)

> Diamond splitter does not generate correct results when using Multi-query optimization
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-1252
>                 URL: https://issues.apache.org/jira/browse/PIG-1252
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>         Attachments: PIG-1252.patch, PIG-1252.patch
>
>

[jira] Commented: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840889#action_12840889 ] 

Richard Ding commented on PIG-1252:
-----------------------------------

The secondary key optimization is documented in PIG-1038.   
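
For readers unfamiliar with it: PIG-1038 appears to cover the case where a nested ORDER (or DISTINCT) inside a FOREACH follows a GROUP, which is the shape of the last two statements of the script in this issue. A minimal sketch of that shape is below. The input path, schema, and alias names are hypothetical and are not taken from the patch.

{code}
-- Hypothetical input and schema, for illustration only.
A = LOAD 'input.txt' AS (k:chararray, a:chararray, b:chararray);
B = GROUP A BY k;
C = FOREACH B {
    sorted = ORDER A BY a, b;  -- nested sort within each group
    GENERATE group, sorted;
}
DUMP C;
{code}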

> Diamond splitter does not generate correct results when using Multi-query optimization
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-1252
>                 URL: https://issues.apache.org/jira/browse/PIG-1252
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>         Attachments: PIG-1252.patch
>
>

[jira] Assigned: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding reassigned PIG-1252:
---------------------------------

    Assignee: Richard Ding

> Diamond splitter does not generate correct results when using Multi-query optimization
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-1252
>                 URL: https://issues.apache.org/jira/browse/PIG-1252
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>

[jira] Commented: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842155#action_12842155 ] 

Hadoop QA commented on PIG-1252:
--------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12438039/PIG-1252.patch
  against trunk revision 919628.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/235/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/235/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/235/console

This message is automatically generated.

> Diamond splitter does not generate correct results when using Multi-query optimization
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-1252
>                 URL: https://issues.apache.org/jira/browse/PIG-1252
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>         Attachments: PIG-1252.patch, PIG-1252.patch
>
>