Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2008/08/05 19:46:44 UTC

[jira] Created: (PIG-359) Semantics of generate * have changed

Semantics of generate * have changed
------------------------------------

                 Key: PIG-359
                 URL: https://issues.apache.org/jira/browse/PIG-359
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: types_branch
            Reporter: Alan Gates
            Priority: Minor
             Fix For: types_branch


In the main trunk, the script

A = load 'myfile';
B = foreach A generate *;

returns:

(x, y, z)

In the types branch, it returns:

((x, y, z))

There is an extra level of tuple in it.  In the main branch generate * seems to include an implicit flatten.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Created: (PIG-359) Semantics of generate * have changed

Posted by Alan Gates <ga...@yahoo-inc.com>.
I think we're saying the same thing. 

In the UDF case, both result in the UDF getting a tuple with two fields.

In the non-UDF case, both should result in a tuple with two fields.  At 
the moment generate * results in a tuple with one field, which is a 
tuple that has two fields.  It should not.  That's the bug.

Alan.

Mridul Muralidharan wrote:
>
> Assuming a two-field schema for A, shouldn't
>
> B = foreach A generate $0, $1;
> and
> B = foreach A generate *;
>
> be the same?
>
> This is similar to
>
> B = foreach A generate myFunc($0, $1);
> and
> B = foreach A generate myFunc(*);
>
> The UDF gets a tuple as ($0, $1) in both cases, and not (($0, $1)) in
> the second case.
>
>
> Regards,
> Mridul

Re: [jira] Created: (PIG-359) Semantics of generate * have changed

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Assuming a two-field schema for A, shouldn't

B = foreach A generate $0, $1;
and
B = foreach A generate *;

be the same?

This is similar to

B = foreach A generate myFunc($0, $1);
and
B = foreach A generate myFunc(*);

The UDF gets a tuple as ($0, $1) in both cases, and not (($0, $1)) in
the second case.


Regards,
Mridul




[jira] Updated: (PIG-359) Semantics of generate * have changed

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-359:
-------------------------------

    Priority: Major  (was: Minor)

Bumping the priority, since users who are trying this code are running into this issue.


[jira] Commented: (PIG-359) Semantics of generate * have changed

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626779#action_12626779 ] 

Santhosh Srinivasan commented on PIG-359:
-----------------------------------------

+1 for PIG-359-2.patch. Looks good.


[jira] Updated: (PIG-359) Semantics of generate * have changed

Posted by "Shravan Matthur Narayanamurthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-359:
-----------------------------------------------

    Attachment: 359.patch


[jira] Updated: (PIG-359) Semantics of generate * have changed

Posted by "Shravan Matthur Narayanamurthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-359:
-----------------------------------------------

    Attachment: 359-1.patch

You are right, Olga. This is * specific. I changed the patch to include the following:
In foreach, when the operator gets created, it also creates a list of the leaves of its inner plans for optimization. Here I also check whether the leaf of an inner plan is a project(*). If so, I set flatten to true for that plan. This causes the foreach logic to flatten tuples.

The same was the case in POUserFunc when processing * as an input: the semantics were different from trunk. So I changed it in a similar way to ensure the trunk behaviour.
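
As a rough illustration of the foreach change described above, the sketch below models a tuple as a java.util.List and shows how a per-plan flatten flag changes the output row. ForEachSketch and buildRow are hypothetical names used only for this example; they are not Pig's POForEach or any part of the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a foreach operator; a tuple is modelled as a List.
class ForEachSketch {
    // Builds one output row from the results of the inner plans.
    // flatten[i] == true means the fields of results.get(i) are spliced into
    // the row instead of being added as a single nested field.
    static List<Object> buildRow(List<Object> results, boolean[] flatten) {
        List<Object> row = new ArrayList<>();
        for (int i = 0; i < results.size(); i++) {
            Object r = results.get(i);
            if (flatten[i] && r instanceof List) {
                row.addAll((List<?>) r);
            } else {
                row.add(r);
            }
        }
        return row;
    }

    public static void main(String[] args) {
        // A plan whose leaf is project(*) yields the whole input tuple.
        List<Object> results = new ArrayList<>();
        results.add(Arrays.<Object>asList("x", "y", "z"));

        System.out.println(buildRow(results, new boolean[] {false})); // [[x, y, z]]
        System.out.println(buildRow(results, new boolean[] {true}));  // [x, y, z]
    }
}
```

With the flag off, the row has the extra level of tuple reported in this issue; with the flag on, it matches the trunk output.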

Because of these changes, I needed to change a test case and a golden file.

All of them are included in 359-1.patch. Thanks, Olga, for reviewing.


[jira] Updated: (PIG-359) Semantics of generate * have changed

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-359:
---------------------------

    Attachment: PIG-359-2.patch

This patch addresses the issues Santhosh identified with the patch 359-1.


[jira] Commented: (PIG-359) Semantics of generate * have changed

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624859#action_12624859 ] 

Olga Natkovich commented on PIG-359:
------------------------------------

Shravan, why is it always a good idea to do this? Isn't this specific to *?


[jira] Commented: (PIG-359) Semantics of generate * have changed

Posted by "Shravan Matthur Narayanamurthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625967#action_12625967 ] 

Shravan Matthur Narayanamurthy commented on PIG-359:
----------------------------------------------------

Alan, two things. 
1) The current code isn't enough because of the following:
A = load 'file:/etc/passwd' using PigStorage(':');
B = foreach A generate ARITY(*,*);
dump B;

Trunk emits 14 (2 times the arity of each tuple in A, which is 7). The current code would emit 2. Another example of what the current code doesn't handle is:

A = load 'file:/etc/passwd' using PigStorage(':');
B = foreach A generate ARITY($0, '---', *);
Trunk emits 9 (2 + 7). The current code would emit 3.

2) You are right in saying that 'a' will be double wrapped. But that's how trunk works right now, and I think it's right. Consider this script:

A = load 'myfile' as (a:tuple(...), b:tuple(...));
B = foreach A generate udf(a,b);

We want 'a' and 'b' to be intact inside the tuple input that is passed to the UDF. So we would expect the arity to be two, rather than the sum of the arities of 'a' and 'b'. Generalizing this, I think double wrapping should be OK. I tested this behaviour in trunk by writing a UDF, say TupleOutputUDF, that returns a Tuple and just copies the input tuple to the output. I tried the following script in trunk:
A = load 'file:/etc/passwd' using PigStorage(':');
B = foreach A generate ARITY(TupleOutputUDF(*));
dump B;

Trunk returns 1. The current code returns 7.
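
The trunk behaviour described in both points can be sketched as follows, again modelling a tuple as a java.util.List. UdfArgsSketch, STAR, buildArgs and arity are hypothetical stand-ins for illustration, not Pig's POUserFunc or its ARITY builtin. The assumption is that a * argument contributes every field of the input row, while any other argument contributes exactly one field, even if that argument is itself a tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UdfArgsSketch {
    // Marker object standing in for a project(*) argument (hypothetical).
    static final Object STAR = new Object();

    // Builds the single tuple handed to a UDF. STAR is expanded to all fields
    // of the input row; every other argument is added as one field.
    static List<Object> buildArgs(List<Object> row, Object... args) {
        List<Object> out = new ArrayList<>();
        for (Object a : args) {
            if (a == STAR) {
                out.addAll(row);
            } else {
                out.add(a);
            }
        }
        return out;
    }

    // Stand-in for ARITY: the number of fields in the argument tuple.
    static int arity(List<Object> tuple) {
        return tuple.size();
    }

    public static void main(String[] args) {
        // Seven fields, like one line of /etc/passwd split on ':'.
        List<Object> row = Arrays.<Object>asList(
                "root", "x", "0", "0", "root", "/root", "/bin/sh");

        System.out.println(arity(buildArgs(row, STAR, STAR)));              // 14
        System.out.println(arity(buildArgs(row, row.get(0), "---", STAR))); // 9

        // Tuple-valued fields passed by name stay intact: udf(a, b).
        List<Object> a = Arrays.<Object>asList(1, 2, 3);
        List<Object> b = Arrays.<Object>asList(4, 5);
        System.out.println(arity(buildArgs(row, a, b)));                    // 2

        // ARITY(TupleOutputUDF(*)): the inner UDF returns one tuple, which
        // the outer call receives as a single field.
        List<Object> copied = buildArgs(row, STAR);
        System.out.println(arity(buildArgs(row, copied)));                  // 1
    }
}
```

The last two cases are the "double wrapping": a tuple-valued argument is kept as one field of the UDF's input tuple rather than being expanded.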


[jira] Updated: (PIG-359) Semantics of generate * have changed

Posted by "Shravan Matthur Narayanamurthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-359:
-----------------------------------------------

    Status: Patch Available  (was: Open)

Added a check in CreateTuple to see whether we have a single tuple inside a tuple, and added logic to return the inner tuple if so.
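
A minimal sketch of such a check, with a tuple modelled as a java.util.List. CreateTupleSketch and unwrapSingleTuple are hypothetical names; this is not the code in the attached patch.

```java
import java.util.Arrays;
import java.util.List;

class CreateTupleSketch {
    // If the tuple holds exactly one field and that field is itself a tuple,
    // return the inner tuple; otherwise return the tuple unchanged.
    static List<Object> unwrapSingleTuple(List<Object> tuple) {
        if (tuple.size() == 1 && tuple.get(0) instanceof List) {
            @SuppressWarnings("unchecked")
            List<Object> inner = (List<Object>) tuple.get(0);
            return inner;
        }
        return tuple;
    }

    public static void main(String[] args) {
        List<Object> inner = Arrays.<Object>asList("x", "y", "z");
        List<Object> wrapped = Arrays.<Object>asList(inner);

        System.out.println(unwrapSingleTuple(wrapped)); // [x, y, z]
        System.out.println(unwrapSingleTuple(inner));   // [x, y, z]
    }
}
```

Note that a check like this unwraps any single tuple-valued field, not only the result of *, so a row whose only field is genuinely a tuple would lose a level of nesting as well.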


[jira] Updated: (PIG-359) Semantics of generate * have changed

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-359:
---------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I still don't like the double wrapping.  But Shravan is correct that this matches the previous behavior, and there's no good reason to change it, so we shouldn't.  The patch has been checked in.


[jira] Commented: (PIG-359) Semantics of generate * have changed

Posted by "Santhosh Srinivasan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626647#action_12626647 ] 

Santhosh Srinivasan commented on PIG-359:
-----------------------------------------

In POUserFunc.java, the following code assumes that project( * ) always returns a tuple. In the foreach nested block, we could be projecting bags, at which point the code will fail with a ClassCastException. For example: testNestedPlan in TestEvalPipeline.java.

{code}
                
+                if(op instanceof POProject){
+                    POProject projOp = (POProject)op;
+                    if(projOp.isStar()){
+                        Tuple trslt = (Tuple) temp.result;
+                        Tuple rslt = (Tuple) res.result;
+                        for(int i=0;i<trslt.size();i++)
+                            rslt.append(trslt.get(i));
+                        continue;
+                    }
+                }
{code}
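
One way to avoid the unchecked cast is to test the runtime type before splicing. The sketch below is only an illustration under stated assumptions: SketchTuple, SketchBag and appendStarResult are hypothetical stand-ins rather than Pig classes, appending a non-tuple result as a single field is an assumed fallback, and this is not necessarily how PIG-359-2.patch resolves the problem.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StarAppendSketch {
    // Hypothetical stand-in for a tuple: an ordered list of fields.
    static final class SketchTuple {
        final List<Object> fields = new ArrayList<>();

        SketchTuple(Object... values) {
            fields.addAll(Arrays.asList(values));
        }
    }

    // Hypothetical stand-in for a bag: a collection of tuples.
    static final class SketchBag {
        final List<SketchTuple> tuples = new ArrayList<>();
    }

    // Appends the result of a project(*) to the argument tuple being built.
    // A tuple result is spliced in field by field; any other result (a bag in
    // this model) is appended as one field instead of being cast to a tuple.
    static void appendStarResult(SketchTuple target, Object projected) {
        if (projected instanceof SketchTuple) {
            target.fields.addAll(((SketchTuple) projected).fields);
        } else {
            target.fields.add(projected);
        }
    }

    public static void main(String[] args) {
        SketchTuple udfInput = new SketchTuple();
        appendStarResult(udfInput, new SketchTuple("x", "y")); // adds 2 fields
        appendStarResult(udfInput, new SketchBag());           // adds 1 field
        System.out.println(udfInput.fields.size()); // 3
    }
}
```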


[jira] Commented: (PIG-359) Semantics of generate * have changed

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625885#action_12625885 ] 

Alan Gates commented on PIG-359:
--------------------------------

I don't think we want the changes to POUserFunc.  In the case of udf(*), the right thing will happen in the existing code, because lines 159-161 make sure we don't double wrap tuples.  And removing these lines causes problems for scripts like this:

A = load 'myfile' as a:tuple (...);
B = foreach A generate udf(a);

Now 'a' will be double wrapped (that is, there will be a tuple containing just the tuple 'a').  This isn't what we want.

The changes to POForEach look good.


[jira] Assigned: (PIG-359) Semantics of generate * have changed

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-359:
----------------------------------

    Assignee: Shravan Matthur Narayanamurthy

Shravan, could you take a look, please?
