Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2011/03/21 20:30:06 UTC

[jira] [Updated] (PIG-1713) SAMPLE command should accept parameters to specify alternative sampling algorithm

     [ https://issues.apache.org/jira/browse/PIG-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1713:
----------------------------

    Description: 
I have a script which takes in a command line parameter.

{code}
pig -p number=100 script.pig
{code}

The script uses the parameter in the following statements:

{code}
A = load '/user/viraj/test' using PigStorage() as (a,b,c);

B = SAMPLE A 1/$number;

dump B;
{code}

In realistic use cases of SAMPLE, statisticians need to calculate the sample size on demand from the data.

Ideally, I would like to calculate the SAMPLE size from within a single Pig script, without having to run one Pig script first to get its results and then a second one that takes those results as a parameter.

Ideal use case:

{code}
A = load '/user/viraj/input' using PigStorage() as (col1, col2, col3);

...
...

W = group X by col1;

Z = foreach Y generate AVG(X);

AA = load '/user/viraj/test' using PigStorage() as (a,b,c);

BB = SAMPLE AA 1/Z;

dump BB;
{code}

Viraj

This Jira is changed to track only the sampling algorithm. PIG-1926 has been opened to track LIMIT/SAMPLE taking a scalar.
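
To make the phrase "alternative sampling algorithm" concrete, here is a minimal sketch in Python rather than Pig Latin. It is an illustration only: the issue does not name a specific algorithm, and reservoir sampling is an assumption chosen as one common alternative. The function names `bernoulli_sample` and `reservoir_sample` are hypothetical and are not part of Pig. The first function mirrors the probabilistic behaviour of SAMPLE with a fraction, where the number of returned rows varies from run to run. The second returns a fixed number of rows in a single pass.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def bernoulli_sample(rows: Iterable[T], fraction: float,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Keep each row independently with probability `fraction`.

    The output size is random; only its expected value is
    fraction * (number of rows).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    rng = rng or random.Random()
    return [row for row in rows if rng.random() < fraction]


def reservoir_sample(rows: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return min(k, n) rows chosen uniformly at random in one pass.

    This is the classic "Algorithm R": the first k rows fill the
    reservoir, and row i (0-based, i >= k) replaces a random slot
    with probability k / (i + 1).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir
```

With a fraction such as 1/100, `bernoulli_sample` returns roughly one row in a hundred but never a guaranteed count, whereas `reservoir_sample(rows, 10)` always returns exactly 10 rows when at least 10 are available. Which trade-off an alternative algorithm in Pig should make is left open by this issue.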

This is a candidate project for Google Summer of Code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

  was:
I have a script which takes in a command line parameter.

{code}
pig -p number=100 script.pig
{code}

The script uses the parameter in the following statements:

{code}
A = load '/user/viraj/test' using PigStorage() as (a,b,c);

B = SAMPLE A 1/$number;

dump B;
{code}

In realistic use cases of SAMPLE, statisticians need to calculate the sample size on demand from the data.

Ideally, I would like to calculate the SAMPLE size from within a single Pig script, without having to run one Pig script first to get its results and then a second one that takes those results as a parameter.

Ideal use case:

{code}
A = load '/user/viraj/input' using PigStorage() as (col1, col2, col3);

...
...

W = group X by col1;

Z = foreach Y generate AVG(X);

AA = load '/user/viraj/test' using PigStorage() as (a,b,c);

BB = SAMPLE AA 1/Z;

dump BB;
{code}

Viraj

LIMIT should handle the same case.
This is a candidate project for Google Summer of Code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

        Summary: SAMPLE command should accept parameters to specify alternative sampling algorithm  (was: SAMPLE command should accept parameters)

> SAMPLE command should accept parameters to specify alternative sampling algorithm
> ---------------------------------------------------------------------------------
>
>                 Key: PIG-1713
>                 URL: https://issues.apache.org/jira/browse/PIG-1713
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Viraj Bhat
>              Labels: gsoc2011
>             Fix For: 0.10
>
>
> I have a script which takes in a command line parameter.
> {code}
> pig -p number=100 script.pig
> {code}
> The script uses the parameter in the following statements:
> {code}
> A = load '/user/viraj/test' using PigStorage() as (a,b,c);
> B = SAMPLE A 1/$number;
> dump B;
> {code}
> In realistic use cases of SAMPLE, statisticians need to calculate the sample size on demand from the data.
> Ideally, I would like to calculate the SAMPLE size from within a single Pig script, without having to run one Pig script first to get its results and then a second one that takes those results as a parameter.
> Ideal use case:
> {code}
> A = load '/user/viraj/input' using PigStorage() as (col1, col2, col3);
> ...
> ...
> W = group X by col1;
> Z = foreach Y generate AVG(X);
> AA = load '/user/viraj/test' using PigStorage() as (a,b,c);
> BB = SAMPLE AA 1/Z;
> dump BB;
> {code}
> Viraj
> This Jira is changed to track only the sampling algorithm. PIG-1926 has been opened to track LIMIT/SAMPLE taking a scalar.
> This is a candidate project for Google Summer of Code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011
