Posted to dev@pig.apache.org by "Eric Gaudet (JIRA)" <ji...@apache.org> on 2009/05/01 01:16:30 UTC

[jira] Created: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT

Command that selects a random sample of the rows, similar to LIMIT
------------------------------------------------------------------

                 Key: PIG-795
                 URL: https://issues.apache.org/jira/browse/PIG-795
             Project: Pig
          Issue Type: New Feature
          Components: impl
            Reporter: Eric Gaudet
            Priority: Trivial


When working with very large data sets (imagine that!), running a Pig script can take time. In some situations it may be useful to run on a small subset of the data (e.g., for debugging and testing, or to get fast results even if they are less accurate).

The command "LIMIT N" selects the first N rows of the data, but these are not necessarily randomized. A command "SAMPLE X" would retain each row only with probability X%.

Note: it is possible to implement this feature with FILTER BY and a UDF, but the same is true of LIMIT, and LIMIT is built in.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705739#action_12705739 ] 

Olga Natkovich edited comment on PIG-795 at 5/4/09 1:58 PM:
------------------------------------------------------------

Patch committed. Thanks, Eric, for contributing.

      was (Author: olgan):
    Patch committed. Thanks, Eric, for sontributing.
  



[jira] Commented: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT

Posted by "Eric Gaudet (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705072#action_12705072 ] 

Eric Gaudet commented on PIG-795:
---------------------------------

Thanks for your feedback. (BTW, should these issues be discussed in a different place?)

Here are my comments:

1) I agree that the 1% minimum looks arbitrary and annoying, but I decided to keep it like this for several reasons. Most importantly, I didn't want to disturb the syntax of LIMIT, which expects an integer. Secondly, 1% is a reasonable minimum if you want a statistically significant result. And finally, you can work around the limitation by adding a 2nd level of sample (or more): b = SAMPLE a 1; c = SAMPLE b 1;   gives you 0.01%.

Now that I think about it, it's easy to change the syntax and use a float for SAMPLE. The value would be a probability between 0.0 and 1.0. It's cleaner this way, and I will send a new patch for that.

2) I implemented it in limit because they are both specialized filters in a way, with a similar syntax. This way the code changes are very small.

It already exists as a filter without any coding needed:

    b = FILTER a BY org.apache.pig.piggybank.evaluation.math.RANDOM()<0.01;

The syntax is not very user friendly, though.

3) Will add unit tests in the new patch with floats.


I will produce a new patch with the float syntax and unit tests in the next few days, unless you tell me you prefer FILTER BY.
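
The arithmetic in point 1 (two chained 1% samples giving 0.01%) can be checked with a small simulation. This is a sketch in Python rather than Pig, and it is not taken from any of the patches: the helper name `sample` and the seed are assumptions made for the illustration, and the probability is written as a float between 0.0 and 1.0, as proposed above.

```python
import random

def sample(rows, p, rng):
    """Keep each row independently with probability p, where 0.0 <= p <= 1.0."""
    return [row for row in rows if rng.random() < p]

rng = random.Random(7)       # seeded only to make the run repeatable
a = range(1_000_000)
b = sample(a, 0.01, rng)     # first level: about 1% of a
c = sample(b, 0.01, rng)     # second level: about 1% of b, i.e. 0.01% of a
print(len(b), len(c))        # roughly 10000 and 100; exact counts depend on the seed
```

Because each row is kept independently, the two probabilities multiply: 0.01 * 0.01 = 0.0001, which is 0.01%. A single float argument of 0.0001 would express the same thing in one step.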





[jira] Assigned: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates reassigned PIG-795:
------------------------------

    Assignee: Eric Gaudet




[jira] Commented: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705134#action_12705134 ] 

Alan Gates commented on PIG-795:
--------------------------------

I think it's fine to have sample as a keyword.  It's valuable not just because it's easier syntax, but because in the future it could be expanded to more sophisticated sampling techniques beyond just taking a percentage of the data.  For example:

B = SAMPLE A 1 USING 'mywhizbangnewsamplingalgorithm';

What I meant was that your patch could translate SAMPLE underneath into a filter.  Then, instead of making changes in the limit code, all you need to do is move RANDOM from piggybank into pig's builtins, and change QueryParser.jjt to do the translation from SAMPLE to FILTER.
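
The translation described here can be illustrated with a toy rewrite on statement text. This is only a sketch in Python and says nothing about how QueryParser.jjt actually does it: the regular expression, the function name `rewrite_sample`, and the assumption that the argument is a probability between 0.0 and 1.0 are all made up for the example, and the hypothetical USING clause is not handled.

```python
import re

# Matches statements such as "B = SAMPLE A 0.01;" (output alias, input alias, probability).
SAMPLE_STMT = re.compile(
    r"^\s*(?P<out>\w+)\s*=\s*SAMPLE\s+(?P<src>\w+)\s+(?P<p>\d+(?:\.\d+)?)\s*;\s*$",
    re.IGNORECASE,
)

def rewrite_sample(statement):
    """Rewrite a SAMPLE statement as a FILTER; return any other statement unchanged."""
    m = SAMPLE_STMT.match(statement)
    if m is None:
        return statement
    return f"{m['out']} = FILTER {m['src']} BY RANDOM() < {m['p']};"

print(rewrite_sample("B = SAMPLE A 0.01;"))
# prints: B = FILTER A BY RANDOM() < 0.01;
```

The point of the design is that SAMPLE stays a keyword for the user while the execution path is an ordinary filter, so no sampling-specific operator has to be maintained.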




[jira] Updated: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT

Posted by "Eric Gaudet (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Gaudet updated PIG-795:
----------------------------

    Attachment: sample2.diff

This patch implements the SAMPLE command. It basically adds a random sample mode to the LIMIT class.

The syntax is like LIMIT: "b = SAMPLE a x", where x is an integer and 0<=x<=100. Each row is selected if rand()<(x/100).

Example:

    a = LOAD 'mybigdata';
    b = SAMPLE a 5;
    ...

will select 5% of the data.
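
The per-row rule rand()<(x/100) can be sketched outside Pig as follows. This Python snippet is an illustration only and is not taken from sample2.diff; the function name `sample_percent` and the seeded generator are assumptions.

```python
import random

def sample_percent(rows, x, rng):
    """Keep each row independently with probability x/100 (x an integer, 0..100)."""
    if not (isinstance(x, int) and 0 <= x <= 100):
        raise ValueError("x must be an integer between 0 and 100")
    threshold = x / 100
    # rng.random() returns a float in [0.0, 1.0), so x=0 keeps nothing
    # and x=100 keeps every row.
    return [row for row in rows if rng.random() < threshold]

rows = list(range(100_000))
kept = sample_percent(rows, 5, random.Random(42))
print(len(kept) / len(rows))  # close to 0.05; the exact value depends on the seed
```

Note that the result is a sample of approximately 5% of the rows, not exactly 5%, because every row is decided independently.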






[jira] Commented: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705368#action_12705368 ] 

Olga Natkovich commented on PIG-795:
------------------------------------

I tested the patch and it works very well. Thanks Eric for contributing!

We are trying to commit multiquery changes at the moment. I will commit this patch in a day or two once the multiquery code is in.




[jira] Commented: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705104#action_12705104 ] 

Olga Natkovich commented on PIG-795:
------------------------------------

Can we implement SAMPLE the same way we implement JOIN, as a macro?

This way we will achieve the readability without many code changes.




[jira] Updated: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT

Posted by "Eric Gaudet (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Gaudet updated PIG-795:
----------------------------

    Attachment: sample3.diff

This is the implementation of the SAMPLE operator rewritten as FILTER by the query parser, as suggested by Olga and Alan. It uses a new built-in function RANDOM(), copied from piggybank. This patch also adds the unit test TestSample. 

I am unfamiliar with LogicalPlan crafting, so the code might not be the best. Please feel free to clean it up.






[jira] Commented: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705037#action_12705037 ] 

Alan Gates commented on PIG-795:
--------------------------------

Eric,

Thanks for the patch.  I agree this is a feature that people will find useful.  I have a few questions and comments:

1) Is 1% the minimum sample size people will want to work with?  Given that data in the grid can be on the order of terabytes, I can see people wanting a 0.1% sample, or even 0.01% sample.  Maybe that's too hard to specify nicely in the syntax, or maybe people will be happy with 1% minimum.  I'm not sure, but it's worth thinking about.

2) Sample and limit aren't really related, so implementing this in limit seems artificial.  Could it instead be implemented as a filter with a random function?  So the grammar production would look like:

X = SAMPLE Y a% => X = FILTER Y BY RANDOM() < a/100.0;

with RANDOM being a function you added to return a random number.

The advantage of this is that we hope, in the future, to push filter operators down into the load functions themselves.  Intelligent load functions could then take this filter and not even deserialize a record until they had decided whether it was going to be kept.

3) The patch should include unit tests.
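
The push-down idea in point 2 can be sketched as a loader that makes the keep/skip decision before it deserializes anything. This is a Python illustration under stated assumptions, not Pig's load-function API: the name `load_sampled`, the JSON line format, and the counting wrapper are all hypothetical.

```python
import json
import random

def load_sampled(lines, p, rng, parse=json.loads):
    """Yield parsed records, deciding whether to keep each raw line before parsing it."""
    for line in lines:
        # The decision needs no field of the record, so it can be made on the
        # raw line; rejected lines are never deserialized.
        if rng.random() < p:
            yield parse(line)

raw = [json.dumps({"id": i}) for i in range(10_000)]
parse_calls = []

def counting_parse(line):
    """Wrapper that records every line actually handed to the parser."""
    parse_calls.append(line)
    return json.loads(line)

records = list(load_sampled(raw, 0.01, random.Random(3), counting_parse))
print(len(records), len(parse_calls))  # the two counts are equal: only kept rows were parsed
```

A random filter is a good candidate for this kind of push-down precisely because its predicate does not depend on the record's contents.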




[jira] Resolved: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich resolved PIG-795.
--------------------------------

    Resolution: Fixed

Patch committed. Thanks, Eric, for contributing.

