Posted to dev@pig.apache.org by "Thejas M Nair (JIRA)" <ji...@apache.org> on 2011/04/05 00:19:05 UTC

[jira] [Created] (PIG-1963) in nested foreach, accumutive udf taking input from order-by does not get results in order

in nested foreach, accumutive udf taking input from order-by does not get results in order
------------------------------------------------------------------------------------------

                 Key: PIG-1963
                 URL: https://issues.apache.org/jira/browse/PIG-1963
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.8.0, 0.9.0
            Reporter: Thejas M Nair


This happens only when secondary sort is not being used for the order-by. 
For example -
{code}
a1 = load 'fruits.txt' as (f1:int,f2);
a2 = load 'fruits.txt' as (f1:int,f2);

b = cogroup a1 by f1, a2 by f1;

d = foreach b {
   sort1 = order a1 by f2;
   sort2 = order a2 by f2; -- secondary sort not getting used here, MYCONCATBAG gets results in wrong order
   generate group, MYCONCATBAG(sort1.f1), MYCONCATBAG(sort2.f2);
}

-- explain d;
dump d;
{code}



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-1963) in nested foreach, accumutive udf taking input from order-by does not get results in order

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017500#comment-13017500 ] 

Thejas M Nair commented on PIG-1963:
------------------------------------

bq. This prevents other nested relational operators, right? From the way checkUDFInput handles PORelationToExprProject, it seems the original code intended to prevent other relational operators (except foreach) as well, but missed the case of POProject following PORelationToExprProject. +1 for the patch if this is the case.

Yes, the AccumulatorOptimizer code intends to turn off accumulative mode if it sees any relational operator other than POForEach and POSortedDistinct as input to accumulative udf. The case of POProject should have been handled like PORelationToExprProject. 
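
The rule described above can be sketched as a simple allowlist check. This is an illustration in Python, not the AccumulatorOptimizer code: `can_use_accumulative_mode` is a hypothetical helper, and representing the UDF's relational inputs as a list of operator class names is an assumption made for the example.

```python
# Operators the comment above names as acceptable inputs to an accumulative UDF.
ALLOWED_RELATIONAL_INPUTS = {"POForEach", "POSortedDistinct"}

def can_use_accumulative_mode(relational_inputs):
    """Return True only if every relational operator feeding the UDF is allowed.

    relational_inputs is a hypothetical list of operator class names; the real
    optimizer inspects the physical plan rather than strings.
    """
    return all(op in ALLOWED_RELATIONAL_INPUTS for op in relational_inputs)

# A plan with only a nested foreach keeps accumulative mode.
print(can_use_accumulative_mode(["POForEach"]))
# A plan that still contains POSort (no secondary sort) turns it off.
print(can_use_accumulative_mode(["POForEach", "POSort"]))
```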



[jira] [Updated] (PIG-1963) in nested foreach, accumutive udf taking input from order-by does not get results in order

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1963:
-------------------------------

    Attachment: PIG-1963.1.1.patch

PIG-1963.1.1.patch - patch to remove a test case added in PIG-1911, as that query will no longer run in accumulative mode.


[jira] [Commented] (PIG-1963) in nested foreach, accumutive udf taking input from order-by does not get results in order

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015661#comment-13015661 ] 

Thejas M Nair commented on PIG-1963:
------------------------------------

The MYCONCATBAG udf in the query in the description concatenates the entries in the bag, in the order they are received.
When the query is run with the property pig.accumulative.batchsize=2 ,
and this input -
{code}
100     apple
200     orange
300     strawberry
300     pear
100     apple
300     pear
400     apple
{code}

gives output -
{code}
(100,(100)(100),(apple)(apple))
(200,(200),(orange))
(300,(300)(300)(300),(pear)(strawberry)(pear)) -- this should be (300,(300)(300)(300),(pear)(pear)(strawberry))
(400,(400),(apple))
{code}
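
The wrong ordering for key 300 can be reproduced outside Pig with a small simulation. This is only a sketch in Python, not Pig's implementation: `sort_in_batches` is a hypothetical helper that sorts each batch of `batch_size` tuples on its own, which mimics the sort operator being handed one portion of the bag at a time. The arrival order of the bag for key 300 is assumed to follow the input file.

```python
def sort_in_batches(bag, batch_size):
    """Sort each batch independently and concatenate the results."""
    out = []
    for start in range(0, len(bag), batch_size):
        out.extend(sorted(bag[start:start + batch_size]))
    return out

# Fruits for key 300, in the order they appear in the input (assumed).
bag_300 = ["strawberry", "pear", "pear"]

# With a batch size of 2, the sort sees [strawberry, pear] and then [pear].
batched = sort_in_batches(bag_300, 2)
# Sorting the whole bag at once gives the order the query should produce.
expected = sorted(bag_300)

print(batched)   # order seen by the UDF in accumulative mode
print(expected)  # order of a full sort
```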


[jira] [Updated] (PIG-1963) in nested foreach, accumutive udf taking input from order-by does not get results in order

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1963:
-------------------------------

    Attachment: MYCONCATBAG.java

attaching udf used in the example.


[jira] [Resolved] (PIG-1963) in nested foreach, accumutive udf taking input from order-by does not get results in order

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair resolved PIG-1963.
--------------------------------

    Resolution: Fixed
      Assignee: Thejas M Nair

Patch committed to trunk and 0.8 branch.



[jira] [Commented] (PIG-1963) in nested foreach, accumutive udf taking input from order-by does not get results in order

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015665#comment-13015665 ] 

Thejas M Nair commented on PIG-1963:
------------------------------------

Note that the issue is seen only when the bag used by the nested order-by statement has more than 20000 tuples, or more than the value of the pig.accumulative.batchsize property if it is set.

This is happening because in accumulative mode the nested relational operator is passed only a portion of the bag at a time. That works fine for operations such as filter or limit. If secondary sort is used for the ordering, there is no POSort in the plan, so it works fine.

This issue might exist in case of nested distinct as well, because it is also supposed to be a blocking operation.
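
The difference between blocking and non-blocking operators under batching can be illustrated with a short sketch. It is written in Python purely as a simulation; `apply_in_batches`, `keep_pears` and `distinct` are hypothetical helpers, not Pig operators, and the sketch does not confirm whether Pig's nested distinct is actually affected.

```python
def apply_in_batches(op, bag, batch_size):
    """Apply op to each batch separately and concatenate the outputs."""
    out = []
    for start in range(0, len(bag), batch_size):
        out.extend(op(bag[start:start + batch_size]))
    return out

def keep_pears(tuples):
    """A per-tuple filter: it never needs to see the whole bag."""
    return [t for t in tuples if t == "pear"]

def distinct(tuples):
    """A blocking operation: it needs the whole bag to drop all duplicates."""
    return sorted(set(tuples))

bag = ["strawberry", "pear", "pear"]

# The filter gives the same answer whether it sees batches or the full bag.
filter_ok = apply_in_batches(keep_pears, bag, 2) == keep_pears(bag)
# Distinct applied per batch lets a duplicate through across the batch boundary.
distinct_ok = apply_in_batches(distinct, bag, 2) == distinct(bag)

print(filter_ok, distinct_ok)
```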

Another query which demonstrates this issue (when property pig.accumulative.batchsize=2 is set)

{code}
a1 = load 'fruits.txt' as (cid:int,fruit : chararray);

b = group a1 by cid;

d = foreach b {
  sort1 = order a1 by fruit ;
  sort2 = order a1 by fruit desc;
  generate group as cid, MYCONCATBAG(sort1.fruit), MYCONCATBAG(sort2.fruit); -- The second instance of the udf does not get sorted results
}

explain d;
dump d;
{code}

To fix this, if such blocking relational operators exist in the plan after secondary-sort optimization, accumulative mode should be disabled by the optimizer.



[jira] [Commented] (PIG-1963) in nested foreach, accumutive udf taking input from order-by does not get results in order

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017339#comment-13017339 ] 

Daniel Dai commented on PIG-1963:
---------------------------------

This prevents other nested relational operators, right? From the way checkUDFInput handles PORelationToExprProject, it seems the original code intended to prevent other relational operators (except foreach) as well, but missed the case of POProject following PORelationToExprProject. +1 for the patch if this is the case.


[jira] [Updated] (PIG-1963) in nested foreach, accumutive udf taking input from order-by does not get results in order

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1963:
-------------------------------

    Fix Version/s: 0.8.0
                   0.9.0


[jira] [Commented] (PIG-1963) in nested foreach, accumutive udf taking input from order-by does not get results in order

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017503#comment-13017503 ] 

Thejas M Nair commented on PIG-1963:
------------------------------------

The AccumulatorOptimizer should allow accumulative mode to be used if the input relation is a non-blocking relation like filter or limit. I have created PIG-1980 to address that.



[jira] [Updated] (PIG-1963) in nested foreach, accumutive udf taking input from order-by does not get results in order

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1963:
-------------------------------

    Attachment: PIG-1963.1.patch

PIG-1963.1.patch - unit tests, test-patch with 0.8 branch. Running them for trunk.
