Posted to dev@pig.apache.org by "David Ciemiewicz (JIRA)" <ji...@apache.org> on 2011/01/25 18:16:44 UTC

[jira] Commented: (PIG-1713) SAMPLE command should accept parameters

    [ https://issues.apache.org/jira/browse/PIG-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986521#action_12986521 ] 

David Ciemiewicz commented on PIG-1713:
---------------------------------------

An alternative might be to implement SAMPLE using reservoir sampling. That way you never have to adjust the sampling probability: as long as the number of input elements N is at least the sample size K, you always get exactly K elements.

http://en.wikipedia.org/wiki/Reservoir_sampling
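
For reference, here is a minimal single-pass sketch of the basic technique (the classic "Algorithm R" variant). It is written in Python purely for illustration; the function name and the rng parameter are my own assumptions and not part of Pig or any proposed patch.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Illustrative sketch only: 'reservoir_sample' and 'rng' are hypothetical names.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

If the input has fewer than k elements, the sketch simply returns all of them.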

Actually, to implement a scalable, parallel version of Reservoir Sampling that would work with Accumulator and Combiner interfaces, Weighted Reservoir Sampling (WRS) is required:

http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf
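
To show why the weighted variant fits a combiner, here is a rough sketch assuming the key-based scheme usually attributed to Efraimidis and Spirakis: each item gets a random key u^(1/w) and the K largest keys win. Because "keep the K largest keys" can be applied to partial results and then again to their union, partitions can be sampled independently and merged. The function names below are hypothetical and are not Pig's Accumulator or Algebraic API; with all weights equal to 1 this reduces to a uniform sample.

```python
import heapq
import random

def partial_sample(items, k, rng, weight=lambda item: 1.0):
    """Sample one partition; returns up to k (key, index, item) entries.

    Hypothetical helper. Weights are assumed to be strictly positive.
    """
    heap = []  # min-heap ordered by key; index breaks ties so items are never compared
    for index, item in enumerate(items):
        key = rng.random() ** (1.0 / weight(item))
        entry = (key, index, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            # New key beats the smallest key kept so far, so swap it in.
            heapq.heapreplace(heap, entry)
    return heap

def merge_samples(partials, k):
    """Combine partial samples by keeping the k entries with the largest keys."""
    merged = [entry for part in partials for entry in part]
    top = heapq.nlargest(k, merged, key=lambda entry: entry[0])
    return [item for _key, _index, item in top]
```

A possible use: call partial_sample once per input split (the map or combine side), then call merge_samples over the collected partial results (the reduce side) to obtain the final K elements.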

> SAMPLE command should accept parameters
> ---------------------------------------
>
>                 Key: PIG-1713
>                 URL: https://issues.apache.org/jira/browse/PIG-1713
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Viraj Bhat
>             Fix For: 0.9.0
>
>
> I have a script which takes in a command line parameter.
> {code}
> pig -p number=100 script.pig
> {code}
> The script contains the following parameters:
> {code}
> A = load '/user/viraj/test' using PigStorage() as (a,b,c);
> B = SAMPLE A 1/$number;
> dump B;
> {code}
> Realistic use cases of SAMPLE require statisticians to calculate SAMPLE data on demand.
> Ideally I would like to calculate SAMPLE from within a Pig script, without having to run one Pig script first, get its results, and then run another to pass the results in.
> Ideal use case:
> {code}
> A = load '/user/viraj/input' using PigStorage() as (col1, col2, col3);
> ...
> ...
> W = group X by col1;
> Z = foreach Y generate AVG(X);
> AA = load '/user/viraj/test' using PigStorage() as (a,b,c);
> BB = SAMPLE AA 1/Z;
> dump BB;
> {code}
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.