Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2009/05/01 18:24:30 UTC

[jira] Commented: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT

    [ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705037#action_12705037 ] 

Alan Gates commented on PIG-795:
--------------------------------

Eric,

Thanks for the patch.  I agree this is a feature that people will find useful.  I have a few questions and comments:

1) Is 1% the minimum sample size people will want to work with?  Given that data in the grid can be on the order of terabytes, I can see people wanting a 0.1% sample, or even a 0.01% sample.  Maybe that's too hard to specify nicely in the syntax, or maybe people will be happy with a 1% minimum.  I'm not sure, but it's worth thinking about.

2) Sample and limit aren't really related, so implementing this in limit seems artificial.  Could it instead be implemented as a filter with a random function?  So the grammar production would look like:

X = SAMPLE Y a% => X = FILTER Y BY RANDOM() < a;

with RANDOM being a function you added to return a random number.
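
As a rough illustration of that rewrite (a sketch in Python rather than Pig Latin, not the patch's implementation): the names `sample_filter` and `fraction` are made up for this example, and `random.random()` stands in for the proposed RANDOM function, assumed here to return a value uniformly distributed in [0, 1).  Each row is kept independently when the random draw falls below the sampling fraction:

```python
import random


def sample_filter(rows, fraction, rng=None):
    """Yield each row independently with probability `fraction` (0.0 to 1.0).

    Hypothetical helper for illustration only; it is not part of Pig.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if rng is None:
        rng = random.Random()
    for row in rows:
        # rng.random() is uniform on [0.0, 1.0), so this comparison
        # succeeds with probability `fraction`.
        if rng.random() < fraction:
            yield row
```

Expressing the sample size as a fraction rather than a whole percentage would also cover the small samples raised in point 1: a 0.01% sample is simply `fraction=0.0001`.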

The advantage of this is that we hope in the future to push filter operators down into the load functions themselves.  Intelligent load functions could then take this filter and not even deserialize a record until they had decided whether it was going to be kept or not.
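
A minimal sketch of that push-down idea, again in Python and under stated assumptions: `load_sampled` is a hypothetical loader, not Pig's load function interface, and JSON lines are used only as a stand-in for whatever serialized format the real loader reads.  The point is the ordering: the sampling decision is made on the raw record, so rejected records are never parsed.

```python
import json
import random


def load_sampled(lines, fraction, rng=None, parse=json.loads):
    """Parse only the raw records that survive the sampling decision.

    Hypothetical loader for illustration; `parse` is whatever
    deserializer the real load function would use.
    """
    if rng is None:
        rng = random.Random()
    for raw in lines:
        # Decide first: a rejected record is skipped without ever
        # paying the deserialization cost.
        if rng.random() < fraction:
            yield parse(raw)
```

With a small sampling fraction, most of the per-record parsing work is skipped, which is where the saving over filtering after the load would come from.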

3) The patch should include unit tests.

> Command that selects a random sample of the rows, similar to LIMIT
> ------------------------------------------------------------------
>
>                 Key: PIG-795
>                 URL: https://issues.apache.org/jira/browse/PIG-795
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Eric Gaudet
>            Priority: Trivial
>         Attachments: sample2.diff
>
>
> When working with very large data sets (imagine that!), running a pig script can take time. It may be useful to run on a small subset of the data in some situations (e.g. debugging / testing, or to get fast results even if less accurate).
> The command "LIMIT N" selects the first N rows of the data, but these are not necessarily randomized. A command "SAMPLE X" would retain each row only with probability X%.
> Note: it is possible to implement this feature with FILTER BY and a UDF, but so is LIMIT, and LIMIT is built-in.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.