Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2011/03/14 23:05:29 UTC

[jira] Created: (PIG-1904) Default split destination

Default split destination
-------------------------

                 Key: PIG-1904
                 URL: https://issues.apache.org/jira/browse/PIG-1904
             Project: Pig
          Issue Type: Bug
            Reporter: Daniel Dai


The "split" statement should support a default destination, e.g.:
{code}
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6), OTHER otherwise; -- OTHER has all tuples with f1>=7 && f2!=5 && f3==6
{code}

This is a candidate project for Google Summer of Code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-1904) Default split destination

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016017#comment-13016017 ] 

Daniel Dai commented on PIG-1904:
---------------------------------

Yes, this is a neat way to do it. It is also a good opportunity for LogicalExpressionSimplifier to simplify the "otherwise" expression.

> Default split destination
> -------------------------
>
>                 Key: PIG-1904
>                 URL: https://issues.apache.org/jira/browse/PIG-1904
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Daniel Dai
>              Labels: gsoc2011
>
> The "split" statement should support a default destination, e.g.:
> {code}
> SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6), OTHER otherwise; -- OTHER has all tuples with f1>=7 && f2!=5 && f3==6
> {code}
> This is a candidate project for Google Summer of Code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011


[jira] [Updated] (PIG-1904) Default split destination

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1904:
------------------------------------------------

    Attachment: PIG-1904.1.patch

PIG-1904.1.patch contains the first working implementation of the feature.

The grammar now recognizes statements like:
    SPLIT a INTO b IF x1 < 0, c OTHERWISE;
but also like:
    SPLIT a INTO b IF x1 < 0;
This is a side-effect of making the otherwise branch optional and is a change from past behavior.
It shouldn't be a problem as the Split maps to a Filter in any case.

Implemented by copying the other LOSplitOutput plans and building a negated disjunction (NOT of the OR) of their expressions.

Added a unit test for Split-Otherwise.
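
The negated disjunction described above can be sketched outside Pig. This is only an illustration, not the code in the patch: plain Python predicates stand in for the LOSplitOutput plans, the helper names (negated_disjunction, split) are made up for the example, and null handling is ignored. The data and conditions mirror the example in the issue description.

```python
def negated_disjunction(predicates):
    """Return a predicate equal to NOT (p1 OR p2 OR ...)."""
    def otherwise(row):
        return not any(p(row) for p in predicates)
    return otherwise


def split(rows, branches, otherwise_alias=None):
    """Apply every branch as an independent filter.

    A row may land in several outputs; the optional otherwise output
    receives the rows that no branch condition accepts.
    """
    filters = dict(branches)
    if otherwise_alias is not None:
        filters[otherwise_alias] = negated_disjunction([p for _, p in branches])
    return {alias: [r for r in rows if keep(r)] for alias, keep in filters.items()}


rows = [
    {"f1": 1, "f2": 5, "f3": 6},  # accepted by X and Y
    {"f1": 9, "f2": 5, "f3": 6},  # accepted by Y only
    {"f1": 9, "f2": 0, "f3": 2},  # accepted by Z only
    {"f1": 7, "f2": 4, "f3": 6},  # accepted by no branch
]
branches = [
    ("X", lambda r: r["f1"] < 7),
    ("Y", lambda r: r["f2"] == 5),
    ("Z", lambda r: r["f3"] < 6 or r["f3"] > 6),
]
result = split(rows, branches, otherwise_alias="OTHER")
```

Only the last row ends up in OTHER, since it is the only one with f1>=7, f2!=5 and f3==6.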

TODO:
Disable the feature if the expression contains a @NonDeterministic UDF.
I plan to do it by spawning a visitor on the expression.
The visitor will throw an error and explain the reason in the error message.
Is this a reasonable approach?


[jira] [Updated] (PIG-1904) Default split destination

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1904:
-------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Patch committed to trunk. Thanks Gianmarco!


[jira] [Commented] (PIG-1904) Default split destination

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015862#comment-13015862 ] 

Gianmarco De Francisci Morales commented on PIG-1904:
-----------------------------------------------------

From my understanding, this can be done by inserting another LOSplitOutput in the LOSplit plan such that the condition for the output is an AND of all the negated conditions specified on the command line.
Would this be a reasonable way to proceed?
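
As a rough check of this idea, the sketch below builds the AND of the negated conditions as a small expression tree and compares it with the NOT of the OR of the same conditions (De Morgan's law). The tuple-based tree and the evaluate helper are invented for this illustration and are not Pig's logical expression classes; nulls are ignored.

```python
from itertools import product


def evaluate(expr, row):
    """Evaluate a tiny expression tree against a row (a dict of fields)."""
    op = expr[0]
    if op == "lt":
        return row[expr[1]] < expr[2]
    if op == "gt":
        return row[expr[1]] > expr[2]
    if op == "eq":
        return row[expr[1]] == expr[2]
    if op == "not":
        return not evaluate(expr[1], row)
    if op == "and":
        return all(evaluate(e, row) for e in expr[1])
    if op == "or":
        return any(evaluate(e, row) for e in expr[1])
    raise ValueError("unknown operator: %r" % (op,))


def otherwise_condition(conditions):
    """AND of all the negated branch conditions."""
    return ("and", [("not", c) for c in conditions])


# Conditions from the example in the issue description.
conditions = [
    ("lt", "f1", 7),
    ("eq", "f2", 5),
    ("or", [("lt", "f3", 6), ("gt", "f3", 6)]),
]
other = otherwise_condition(conditions)
negated_or = ("not", ("or", conditions))

# Exhaustive comparison over a small grid of integer values 4..8.
rows = [dict(zip(("f1", "f2", "f3"), v)) for v in product(range(4, 9), repeat=3)]
equivalent = all(evaluate(other, r) == evaluate(negated_or, r) for r in rows)
matching = [r for r in rows if evaluate(other, r)]
```

On this grid both forms agree, and the rows they select are exactly those with f1>=7, f2!=5 and f3==6.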


[jira] Updated: (PIG-1904) Default split destination

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1904:
----------------------------

    Issue Type: New Feature  (was: Bug)


[jira] [Updated] (PIG-1904) Default split destination

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1904:
------------------------------------------------

    Assignee: Gianmarco De Francisci Morales
      Status: Patch Available  (was: Open)


[jira] [Updated] (PIG-1904) Default split destination

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1904:
------------------------------------------------

    Attachment: PIG-1904.2.patch

Attaching PIG-1904.2.patch.
Added unit tests for Split-Otherwise.
Added a check for Nondeterministic UDFs. There was no need to create my own visitor; I reused the one available in Utils.
Fixed the issue with a Split that has only one branch. The solution proposed by Thejas does not work directly because the '*' is greedy, but I worked around it.

I think it is ready for review.


[jira] [Commented] (PIG-1904) Default split destination

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069073#comment-13069073 ] 

Thejas M Nair commented on PIG-1904:
------------------------------------

Looks good, +1. I will commit it after running the unit tests. test-patch was successful.



[jira] [Commented] (PIG-1904) Default split destination

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066046#comment-13066046 ] 

Gianmarco De Francisci Morales commented on PIG-1904:
-----------------------------------------------------

Created PIG-2169 for this.
Given the benefit/cost ratio, though, I wouldn't try to fix it.
A nondeterministic UDF in a Split is probably better expressed as a Sample.
In any case, I think this simple workaround should work:
{code}
a = LOAD 'a.txt' AS (f1,f2,f3);
-- evaluate the UDF once and keep its result as an ordinary field
b = FOREACH a GENERATE f1, f2, f3, NonDetUDF(f1,f2,f3) AS f4;
SPLIT b INTO c IF f4 < 0.5, d OTHERWISE;
{code}
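
To illustrate why evaluating the UDF once matters, here is a small simulation. It is an assumption-laden sketch, not a description of how Pig executes the script: make_flaky_udf is a made-up stand-in that alternates between 0.25 and 0.75 so the behavior is reproducible, and each output is modeled as a separate filter pass that calls the UDF again.

```python
import itertools


def make_flaky_udf():
    """Stand-in for a nondeterministic UDF: alternates 0.25, 0.75 per call."""
    values = itertools.cycle([0.25, 0.75])
    return lambda row: next(values)


rows = ["r1", "r2", "r3"]

# Naive version: the UDF is called again when the otherwise branch
# re-evaluates the (negated) condition, so the two filters disagree.
udf = make_flaky_udf()
c_naive = [r for r in rows if udf(r) < 0.5]
d_naive = [r for r in rows if not (udf(r) < 0.5)]

# Workaround: compute the value once, store it next to the row (f4),
# then split on the stored value.
udf = make_flaky_udf()
b = [(r, udf(r)) for r in rows]
c = [r for r, f4 in b if f4 < 0.5]
d = [r for r, f4 in b if not (f4 < 0.5)]
```

In the naive version some rows appear in both outputs and others in none; with the stored value every row lands in exactly one output.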


[jira] [Commented] (PIG-1904) Default split destination

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016035#comment-13016035 ] 

Gianmarco De Francisci Morales commented on PIG-1904:
-----------------------------------------------------

Here is an updated proposal for the GSoC that includes this feature as well.

http://socghop.appspot.com/gsoc/proposal/review/google/gsoc2011/azaroth/1


[jira] [Commented] (PIG-1904) Default split destination

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065966#comment-13065966 ] 

Dmitriy V. Ryaboy commented on PIG-1904:
----------------------------------------

Nice catch about @NonDeterministic. It seems this doesn't work because of implementation details; the issue isn't fundamental. I'm cool with the partial solution for now, but please file a jira to fix this later.


[jira] [Commented] (PIG-1904) Default split destination

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066287#comment-13066287 ] 

Thejas M Nair commented on PIG-1904:
------------------------------------

The approach you are proposing for the @NonDeterministic UDF sounds good.

PIG-1904.1.patch looks good. Some comments:

I think it is better to retain the restriction that a split needs at least two output aliases. This will prevent split from being used instead of filter, and pig from becoming perl ;).

Maybe something like:
split_clause : SPLIT rel INTO split_branch ( COMMA split_branch )* ( ( COMMA split_branch ) | ( COMMA split_otherwise ) )


In LogicalPlanBuilder.java, I think it is better to change the assertion to an if (root == null) { throw exception; }, as assertions are not enabled by default.
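
The "at least two output aliases" rule can be illustrated with a toy checker. This is a hypothetical sketch, not the ANTLR grammar from the patch: parse_split is an invented helper, it splits outputs on commas (so conditions containing commas are not supported), and it only accepts OTHERWISE as the last output.

```python
import re


def parse_split(statement):
    """Parse a simplified 'SPLIT rel INTO a IF cond, ..., z OTHERWISE;'."""
    text = statement.strip().rstrip(";")
    head = re.match(r"SPLIT\s+(\w+)\s+INTO\s+", text, flags=re.IGNORECASE)
    if head is None:
        raise ValueError("expected: SPLIT <rel> INTO ...")
    outputs = [part.strip() for part in text[head.end():].split(",")]
    branches, otherwise = [], None
    for position, part in enumerate(outputs):
        cond = re.fullmatch(r"(\w+)\s+IF\s+(.+)", part, flags=re.IGNORECASE)
        other = re.fullmatch(r"(\w+)\s+OTHERWISE", part, flags=re.IGNORECASE)
        if other:
            if position != len(outputs) - 1:
                raise ValueError("OTHERWISE must be the last output")
            otherwise = other.group(1)
        elif cond:
            branches.append((cond.group(1), cond.group(2).strip()))
        else:
            raise ValueError("cannot parse output: %r" % (part,))
    # The restriction discussed above: a split with a single output is
    # really a filter, so it is rejected.
    if len(outputs) < 2:
        raise ValueError("SPLIT needs at least two output aliases")
    return head.group(1), branches, otherwise


rel, branches, otherwise = parse_split("SPLIT a INTO b IF x1 < 0, c OTHERWISE;")
```

A statement such as SPLIT a INTO b IF x1 < 0; raises ValueError in this sketch, which is the behavior the restriction asks for.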




[jira] [Updated] (PIG-1904) Default split destination

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1904:
--------------------------------

    Fix Version/s: 0.10


[jira] [Updated] (PIG-1904) Default split destination

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1904:
-------------------------------

    Release Note: This feature introduces a new keyword, OTHERWISE, which is not backward compatible: it can break scripts that use it as an alias.

Adding a release note about how this feature affects backward compatibility.

