You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2011/03/21 20:28:05 UTC

[jira] [Created] (PIG-1926) Sample/Limit should take scalar

Sample/Limit should take scalar
-------------------------------

                 Key: PIG-1926
                 URL: https://issues.apache.org/jira/browse/PIG-1926
             Project: Pig
          Issue Type: Improvement
            Reporter: Daniel Dai


Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:

{code}
a = load 'a.txt';
b = group a by all;
c = foreach b generate COUNT(*) as sum;
d = order a by $0;
e = limit d c.sum/100;
{code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Attachment: PIG-1926.10.patch

Attaching PIG-1926.10.patch:
Implemented support and tests for Sample.
Fixed a problem with syntactic predicates in Limit, now it does not break on inline_op.

All tests pass on my machine.
We can call this a release candidate.
I think it is ready for review.

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.10.patch, PIG-1926.7.patch, PIG-1926.8.patch, PIG-1926.9.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-1926) Sample/Limit should take scalar

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043181#comment-13043181 ] 

Thejas M Nair commented on PIG-1926:
------------------------------------

The new patch without the formatting changes is good, it is more readable. I am reviewing it.

bq. Test whether it works in non-local mode (how should I do that without a cluster?)
MiniCluster is used to test non-local mode. You can look at the test in - http://svn.apache.org/viewvc/pig/trunk/test/org/apache/pig/test/TestLimitAdjuster.java


> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-1926) Sample/Limit should take scalar

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058984#comment-13058984 ] 

Thejas M Nair commented on PIG-1926:
------------------------------------

PIG-1926.11.patch - Sample also needs a check similar to that for Limit in LimitVariableValidator, to give error for statements such as - "f = sample l $0;"
LimitVariableValidator could be replaced by new class that does the check for both limit and sample operations. Can you also add a negative test case for this? Everything else looks good. 

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.10.patch, PIG-1926.11.patch, PIG-1926.7.patch, PIG-1926.8.patch, PIG-1926.9.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Description: 
Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:

{code}
a = load 'a.txt';
b = group a all;
c = foreach b generate COUNT(a) as sum;
d = order a by $0;
e = limit d c.sum/100;
{code}

This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

  was:
Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:

{code}
a = load 'a.txt';
b = group a by all;
c = foreach b generate COUNT(*) as sum;
d = order a by $0;
e = limit d c.sum/100;
{code}

This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011


> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>              Labels: gsoc2011
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Status: Patch Available  (was: Open)

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1926:
----------------------------

    Description: 
Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:

{code}
a = load 'a.txt';
b = group a by all;
c = foreach b generate COUNT(*) as sum;
d = order a by $0;
e = limit d c.sum/100;
{code}

This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

  was:
Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:

{code}
a = load 'a.txt';
b = group a by all;
c = foreach b generate COUNT(*) as sum;
d = order a by $0;
e = limit d c.sum/100;
{code}

         Labels: gsoc2011  (was: )

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>              Labels: gsoc2011
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a by all;
> c = foreach b generate COUNT(*) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1926:
-------------------------------

    Attachment: PIG-1926.12.1.patch

+1 for PIG-1926.12.patch. Two new files were missing apache license header files, I have added that (PIG-1926.12.1.patch) and committed it to trunk.

Gianmarco, thanks for the contribution!

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>             Fix For: 0.10
>
>         Attachments: PIG-1926.10.patch, PIG-1926.11.patch, PIG-1926.12.1.patch, PIG-1926.12.patch, PIG-1926.7.patch, PIG-1926.8.patch, PIG-1926.9.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Attachment: PIG-1926.11.patch

Attaching PIG-1926.11.patch

Rebased my work on latest trunk to solve the conflict.
Refined exception handling for LimitVariableValidator.
Added a test for exception thrown by TypeCheckingRelVisitor in TestLimitVariable (which I had forgot).

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.10.patch, PIG-1926.11.patch, PIG-1926.7.patch, PIG-1926.8.patch, PIG-1926.9.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061260#comment-13061260 ] 

Gianmarco De Francisci Morales commented on PIG-1926:
-----------------------------------------------------

Thank you!
It was a pleasure.

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>             Fix For: 0.10
>
>         Attachments: PIG-1926.10.patch, PIG-1926.11.patch, PIG-1926.12.1.patch, PIG-1926.12.patch, PIG-1926.7.patch, PIG-1926.8.patch, PIG-1926.9.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-1926) Sample/Limit should take scalar

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13042205#comment-13042205 ] 

Thejas M Nair commented on PIG-1926:
------------------------------------

bq. Is there a consistent formatting template for Pig in order to make the patch more readable? (half of the modified lines are from the code formatter).
Officially, pig is supposed to follow the Oracle/Sun coding conventions - http://wiki.apache.org/pig/HowToContribute. But in that large list, only things like use of 4 tabs for indentation and limiting length of a line to around 80 chars is what is almost always followed. 
Can you please try to disable auto-formatting and format only the sections of code that you are changing (cmd-i in eclipse on mac)? Otherwise, it is difficult to distinguish between a real change and formatting change.


> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>              Labels: gsoc2011
>         Attachments: PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-1926) Sample/Limit should take scalar

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054764#comment-13054764 ] 

Thejas M Nair commented on PIG-1926:
------------------------------------

There is a conflict in LogicalPlanBuilder.java, when the patch is applied to latest trunk. LimitVariableValidator.java is missing from the patch. Can you please update your svn tree and regenerate the patch ? 


> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.10.patch, PIG-1926.7.patch, PIG-1926.8.patch, PIG-1926.9.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Release Note: 
Limit  and Sample now accept a variable (scalar) as argument. Using a variable instead of a constant in Limit automatically disables most of the optimizations (only push-before-foreach is performed). More work is needed to enable optimizations for limit-after-sort, limit duplication before cross/union and limit merging.

The new Limit command allows the following syntax to get the top 1% of a sorted file:
[ a = LOAD 'a.txt'; b = GROUP a all; c = FOREACH b GENERATE COUNT(a) AS sum; d = ORDER a BY $0; e = LIMIT d c.sum/100; ]
Only aggregate variables may be used as argument for limit. A statement like [ e = LIMIT d $0; ] is invalid.
The new Sample command allows for the same syntax.

  was:Sample and Limit now accept a variable (scalar) as argument. Using a variable instead of a constant in Limit automatically disables most of the optimizations (only push-before-foreach is performed). More work is needed to enable optimizations for limit-after-sort, limit duplication before cross/union and limit merging.


> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>             Fix For: 0.10
>
>         Attachments: PIG-1926.10.patch, PIG-1926.11.patch, PIG-1926.12.1.patch, PIG-1926.12.patch, PIG-1926.7.patch, PIG-1926.8.patch, PIG-1926.9.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Attachment: PIG-1926.patch

Ported TestLimitVariable to use MiniCluster.
The test works also in MR mode.

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Attachment: PIG-1926.patch

Modified TypeCheckingRelVisitor in order to correctly cast expression in LOLimit.
Now the visitor takes into account that LOLimit has an expression.
The return type of the expression must be Integer or Long.


> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>              Labels: gsoc2011
>         Attachments: PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Attachment: PIG-1926.patch

First implementation of the feature.
Focusing on LIMIT for now.

- Modified the grammar to *also* allow an expression in LIMIT.
    Retained old behavior and code-path for constant expressions.
    The rule is split into 2 fragments to solve problems with ANTLR's syntactic predicates and local variables. (see The Definitive ANTLR Reference, 14.7 Issues with Actions and Syntactic Predicates, p.353)
- Modified LOLimit to keep also an expression plan.
    Constant expressions have priority over expression plans. (even though both should never be set at the same time).
- Modified LogicalPlanBuilder to allow creation of LOLimit in 2 steps.

The expression evaluation does not work yet because I need to modify POLimit accordingly, but it compiles, the changes do not disrupt the old behaviour and no runtime exception is thrown when using the new one.

Next steps:
Modify POLimit to evaluate the expression.
Evaluate which changes are required in LogToPhyTranslationVisitor to correctly create POLimit.
Evaluate which changes are required in TypeCheckingRelVisitor to ensure type safety.
Evaluate which changes are required in MRCompiler to retain correct POLimit compilation
  (LimitOptimizer breaks the current modifications).

There are some whitespace/format changes in the current patch due to automatic formatting.
Should I try to remove them?

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>              Labels: gsoc2011
>         Attachments: PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Attachment: PIG-1926.7.patch

bq. The expression in limit should be allowed to refer to columns only in scalar context, ie pig should give a proper error message when such a statement is created. A Right now it allows the statement "lim = limit l $0;", and that results in NPE . You can create a validation Visitor to check this, that gets called from PigServer.Graph.compile(lp).

Thanks for pointing this out. I hadn't thought about this.
I suppose I should check that there is a scalar expression in the expression plan.
But this would disallow using UDFs, wouldn't it?
Is there another way to check the context of an expression?

bq. If the expression in limit evaluates to bytearray, it can be implicitly cast to a long. This can be done in the typechecker.

Isn't this dangerous? We would arbitrarily cast the first 8 bytes of the array to long.
What if the byte array is larger/smaller than this?
What would be the use case for such a feature?

bq. The test files are missing. Maybe you forgot to include test dir in diff?

Yes I forgot to add the file to svn. I re.added it in PIG-1926.7.patch.

bq. The following visitors also should visit the expression in limit - DotLOPrinter,LogicalPlanPrinter,SchemaResetter,UidResetter.

Addedd in PIG-1926.7.patch

bq. MRCompiler changes look good, some additional changes would be required (for sort with limit) there once optimizations in LimitOptimizer are enabled.

Sort with limit is used only in MRCompiler, right?

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.7.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Attachment: PIG-1926.patch

Addressed TestLogToPhyCompiler failure.
    TestLogToPhyCompiler.testLimit():
      Modified grammar file LogicalPlanGenerator.g to preserve previous AST structure.
      The previous modification disrupted tests that directly interacted with ASTs.
      The new grammar just adds expr() as a side option to the original limit_clause() rule.

Addressed TestGrunt failure.
    Modified TypeCheckingRelVisitor in order to correctly cast expression in LOLimit.
    Visit the expression plan only if present.


Disabled optimizer for variable LIMIT case.


Added simple test for variable LIMIT.


Attempt to fix things in MR layer:
    Added a limitPlan to MapReduceOper (not sure it's useful).
    Adapted the check on the existence of a limit in MROper
    Copy also the plan when creating a new POLimit in MRCompiler.


TODO:
Test whether it works in non-local mode (how should I do that without a cluster?)
Evaluate what else to change in MR layer.

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>              Labels: gsoc2011
>         Attachments: PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-1926) Sample/Limit should take scalar

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13045044#comment-13045044 ] 

Thejas M Nair commented on PIG-1926:
------------------------------------

{quote}
I suppose I should check that there is a scalar expression in the expression plan.
But this would disallow using UDFs, wouldn't it?
Is there another way to check the context of an expression?
{quote}
The columns will map to a ProjectExpression. If the visitor finds a ProjectExpression, it can throw an error.

{quote}
Isn't this dangerous? We would arbitrarily cast the first 8 bytes of the array to long.
What if the byte array is larger/smaller than this?
What would be the use case for such a feature?
{quote}

In pig the types are optional, and the default type is bytearray. bytearray is cast to other types based on the context in Typechecker. For example, see - http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Arithmetic+Operators 
If the bytearray cannot be cast to long, you will get an error in Result.returnStatus . 

for example - 
{code}
a = load 'x' as (id, num);
fil = filter a by id == '123'; -- assuming this returns one row
b = load 'y';
lim = limit b fil.num ; -- fil.num should be cast to long
{code}

In POLimit.getNext(Tuple t), while computing the value of the limit expression, the value of Result.returnStatus needs to be checked. If it is not STATUS_OK, it should give an error. The assert will give an error only if the assertions are explicitly enabled at runtime, so the assert should be replaced by a code that always throws an exception if the condition is not satisfied.

bq. Sort with limit is used only in MRCompiler, right?
Are you asking if the changes for use of limit expression in sort are required only in MRCompiler ? Yes, thats the only place where change should be needed, apart from LimitOptimizer.



> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.7.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Attachment: PIG-1926.patch

I tried to clean up the patch in order to make it more readable.
Please let me know if it is OK.

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>              Labels: gsoc2011
>         Attachments: PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Attachment: PIG-1926.patch

First working version for LIMIT.
POLimit correctly evaluates the limit at runtime.
AllExpressionVisitor visits LOLimit, discovers the scalar expression and correctly sets the plan.
Visitors do the visit only if the plan has been set, so the normal code path is not disrupted.


Is there a consistent formatting template for Pig in order to make the patch more readable? (half of the modified lines are from the code formatter).

Next steps:
Evaluate the interaction with optimizers.
Translate the script used for testing into a unit test for the feature.

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>              Labels: gsoc2011
>         Attachments: PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Status: Patch Available  (was: Open)

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.10.patch, PIG-1926.11.patch, PIG-1926.7.patch, PIG-1926.8.patch, PIG-1926.9.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012424#comment-13012424 ] 

Gianmarco De Francisci Morales commented on PIG-1926:
-----------------------------------------------------

Hi, I have drafted my proposal for this project.
I assumed Daniel would be the mentor.

http://socghop.appspot.com/gsoc/proposal/review/google/gsoc2011/azaroth/1

Any comment is highly appreciated.

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>              Labels: gsoc2011
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a by all;
> c = foreach b generate COUNT(*) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-1926) Sample/Limit should take scalar

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044164#comment-13044164 ] 

Thejas M Nair commented on PIG-1926:
------------------------------------

Comments on the most recent patch -
- The expression in limit should be allowed to refer to columns only in scalar context, ie pig should give a proper error message when such a statement is created. A  Right now it allows the statement "lim = limit l $0;", and that results in NPE . You can create a validation Visitor to check this, that gets called from PigServer.Graph.compile(lp).
- If the expression in limit evaluates to bytearray, it can be implicitly cast to a long. This can be done in the typechecker.  
- The test files are missing. Maybe you forgot to include test dir in diff?

I still need to review the MRCompiler changes. 

Can you use a different file name for each patch (eg PIG-1926.5.patch), and also refer to that file name in your comments? That way it is easier to identify the downloaded patch files and associate them with the comments. 


> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Attachment: PIG-1926.12.patch

PIG-1926.12.patch addresses Thejas' comments:
 * Rebased my work on current trunk.
 * Renamed LimitVariableValidator to ScalarVariableValidator.
    The new validator checks for out-of-context references to scalar variables both in Limit and Sample. To make it work for Sample, I modified LOFilter to keep track of whether the operator is the result of a Sample command (otherwise I would disallow projections in Filter).
 * Added a test for the exception in TestSample

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.10.patch, PIG-1926.11.patch, PIG-1926.12.patch, PIG-1926.7.patch, PIG-1926.8.patch, PIG-1926.9.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009714#comment-13009714 ] 

Gianmarco De Francisci Morales commented on PIG-1926:
-----------------------------------------------------

This looks quite useful.
If Ciemiewicz is already working on Reservoir sampling, I think I will apply for this one.
I assume there will probably be some integration point between the two projects.

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>              Labels: gsoc2011
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a by all;
> c = foreach b generate COUNT(*) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1926:
-------------------------------

    Release Note: 
Limit  and Sample now accept a variable (scalar) as argument. 

For example, the new Limit command allows the following syntax to get the top 1% of a sorted file:
[ a = LOAD 'a.txt'; b = GROUP a all; c = FOREACH b GENERATE COUNT(a) AS sum; d = ORDER a BY $0; e = LIMIT d c.sum/100; ]

Only scalar variables may be used in the expression in limit or sample, columns in the input relation for the operation cannot be used in the expression. A statement like [ e = LIMIT d $0; ] is invalid.
The new Sample command allows for the same syntax.

Using a variable instead of a constant in Limit automatically disables most of the optimizations (only push-before-foreach is performed). More work is needed to enable optimizations for limit-after-sort, limit duplication before cross/union and limit merging.

  was:
Limit  and Sample now accept a variable (scalar) as argument. Using a variable instead of a constant in Limit automatically disables most of the optimizations (only push-before-foreach is performed). More work is needed to enable optimizations for limit-after-sort, limit duplication before cross/union and limit merging.

The new Limit command allows the following syntax to get the top 1% of a sorted file:
[ a = LOAD 'a.txt'; b = GROUP a all; c = FOREACH b GENERATE COUNT(a) AS sum; d = ORDER a BY $0; e = LIMIT d c.sum/100; ]
Only aggregate variables may be used as argument for limit. A statement like [ e = LIMIT d $0; ] is invalid.
The new Sample command allows for the same syntax.


> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>             Fix For: 0.10
>
>         Attachments: PIG-1926.10.patch, PIG-1926.11.patch, PIG-1926.12.1.patch, PIG-1926.12.patch, PIG-1926.7.patch, PIG-1926.8.patch, PIG-1926.9.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Status: Open  (was: Patch Available)

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.10.patch, PIG-1926.7.patch, PIG-1926.8.patch, PIG-1926.9.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-1926) Sample/Limit should take scalar

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047811#comment-13047811 ] 

Thejas M Nair commented on PIG-1926:
------------------------------------

I agree that limit optimizations can be supported for this new feature at a later time.

PIG-1926.8.patch looks good. In testLimitVariableException1, can you also check for the error message in exception using Util.checkMessageInException ? 
Can you also add two additional test cases ? -
- test that checks for the error message when the limit expression evaluates to wrong expression type
- test with limit in a nested foreach statement


> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.7.patch, PIG-1926.8.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1926:
-------------------------------

       Resolution: Fixed
    Fix Version/s: 0.10
           Status: Resolved  (was: Patch Available)

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>             Fix For: 0.10
>
>         Attachments: PIG-1926.10.patch, PIG-1926.11.patch, PIG-1926.12.1.patch, PIG-1926.12.patch, PIG-1926.7.patch, PIG-1926.8.patch, PIG-1926.9.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Release Note: Sample and Limit now accept a variable (scalar) as argument. Using a variable instead of a constant in Limit automatically disables most of the optimizations (only push-before-foreach is performed). More work is needed to enable optimizations for limit-after-sort, limit duplication before cross/union and limit merging.

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>             Fix For: 0.10
>
>         Attachments: PIG-1926.10.patch, PIG-1926.11.patch, PIG-1926.12.1.patch, PIG-1926.12.patch, PIG-1926.7.patch, PIG-1926.8.patch, PIG-1926.9.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Attachment: PIG-1926.8.patch

bq. The expression in limit should be allowed to refer to columns only in scalar context, ie pig should give a proper error message when such a statement is created. A Right now it allows the statement "lim = limit l $0;", and that results in NPE . You can create a validation Visitor to check this, that gets called from PigServer.Graph.compile(lp).

Addressed in PIG-1926.8.patch

Added LimitVariableValidator visitor to validate variable limit expression.
      -- If a ProjectExpression is found (intead of a ScalarExpression) it throws a FrontendException
      -- Modified PigServer to call the visitor
      -- Added a test for this case in TestLimitVariable


bq. If the expression in limit evaluates to bytearray, it can be implicitly cast to a long. This can be done in the typechecker.

Addressed in PIG-1926.8.patch

   -- Added implicit cast from bytearray to long in TypeCheckingRelVisitor
   -- Added test for it in TestLimitVariable


bq. In POLimit.getNext(Tuple t), while computing the value of the limit expression, the value of Result.returnStatus needs to be checked. If it is not STATUS_OK, it should give an error. The assert will give an error only if the assertions are explicitly enabled at runtime, so the assert should be replaced by a code that always throws an exception if the condition is not satisfied.

Addressed in PIG-1926.8.patch

  -- Added runtime check of expression result in POLimit

TODO:
Exception refinement. Right now I throw mostly RuntimeExceptions.

Discuss which modifications to MRCompiler are still needed.

In my opinion optimizations at this stage are not mandatory, we can postpone them for later. As stated in the project, the important thing is that optimizers continue working for the constant case and do not break for the variable case.

If you agree I would start working on SAMPLE.


> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.7.patch, PIG-1926.8.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1926:
------------------------------------------------

    Attachment: PIG-1926.9.patch

Added support and test for nested limit variable in PIG-1926.9.patch.

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.7.patch, PIG-1926.8.patch, PIG-1926.9.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-1926) Sample/Limit should take scalar

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044183#comment-13044183 ] 

Thejas M Nair commented on PIG-1926:
------------------------------------

The following visitors also should visit the expression in limit -  DotLOPrinter,LogicalPlanPrinter,SchemaResetter,UidResetter.
MRCompiler changes look good, some additional changes would be required (for sort with limit) there once optimizations in LimitOptimizer are enabled.


> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (PIG-1926) Sample/Limit should take scalar

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales reassigned PIG-1926:
---------------------------------------------------

    Assignee: Gianmarco De Francisci Morales

> Sample/Limit should take scalar
> -------------------------------
>
>                 Key: PIG-1926
>                 URL: https://issues.apache.org/jira/browse/PIG-1926
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>              Labels: gsoc2011
>         Attachments: PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit, Sample only takes a constant. It would be better we can use a scalar in the place of constant. Eg:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira