Posted to dev@pig.apache.org by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org> on 2011/06/22 05:34:47 UTC
[jira] [Updated] (PIG-2137) SAMPLE should not be pushed above DISTINCT

     [ https://issues.apache.org/jira/browse/PIG-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-2137:
-----------------------------------

    Description: 
I have an input file that contains 50,000 distinct integers. Each integer is repeated twice, for a total of 100,000 lines.

Script 1, using GROUP BY to get distinct entries in the data, works:
{code}

grunt> f = load 'tmp/dupnumbers.txt';              
grunt> d = foreach (group f by $0) generate group; 
grunt> s = sample d 0.01;                          
grunt> n = foreach (group s all) generate COUNT(s);
grunt> dump n;
(493)
{code}

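Script 2, using DISTINCT for the same purpose, returns roughly twice as many rows, which suggests that sampling is done before DISTINCT (about 1% of the 100,000 input lines rather than 1% of the 50,000 distinct values):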

{code}
grunt> f = load 'tmp/dupnumbers.txt';              
grunt> d = distinct f;
grunt> s = sample d 0.01;                          
grunt> n = foreach (group s all) generate COUNT(s);
grunt> dump n;
(980)
{code}
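
The gap between the two counts can be reproduced outside Pig. The following is only a rough Python simulation, not Pig's implementation: it assumes SAMPLE keeps each row independently with the given probability, and the helper names (sample, distinct) and the seed are made up for illustration. It compares DISTINCT followed by SAMPLE with SAMPLE followed by DISTINCT on data shaped like the input file described above.

```python
import random


def sample(rows, fraction, rng):
    """Keep each row independently with probability `fraction`.

    Assumption: this approximates SAMPLE; it is not Pig's actual code.
    """
    return [r for r in rows if rng.random() < fraction]


def distinct(rows):
    """Drop duplicate rows (order is not preserved)."""
    return list(set(rows))


rng = random.Random(2137)  # arbitrary fixed seed so reruns are repeatable

# 50,000 distinct integers, each repeated twice: 100,000 rows in total.
rows = [i for i in range(50_000) for _ in range(2)]

# Order written in the script: DISTINCT first, then SAMPLE.
intended = sample(distinct(rows), 0.01, rng)

# Order the report suspects is executed: SAMPLE first, then DISTINCT.
pushed = distinct(sample(rows, 0.01, rng))

print("distinct then sample:", len(intended))
print("sample then distinct:", len(pushed))
```

Under these assumptions the first count should land near 500 and the second near 1,000, because very few values have both of their copies sampled. That matches the 493 and 980 reported by the two scripts.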


  was:
I have an input file that contains 50,000 distinct integers. Each integer is repeated twice, for a total of 100,000 lines.

Script 1, using GROUP BY to get distinct entries in the data, works:
{code}

grunt> f = load 'tmp/dupnumbers.txt';              
grunt> d = foreach (group f by $0) generate group; 
grunt> s = sample d 0.01;                          
grunt> n = foreach (group s all) generate COUNT(s);
grunt> dump n;
(493)
{code}

Script 2, using DISTINCT for the same purpose, allows sampling to be done before DISTINCT:

grunt> f = load 'tmp/dupnumbers.txt';              
grunt> d = distinct f;
grunt> s = sample d 0.01;                          
grunt> n = foreach (group s all) generate COUNT(s);
(980)
{code}



> SAMPLE should not be pushed above DISTINCT
> ------------------------------------------
>
>                 Key: PIG-2137
>                 URL: https://issues.apache.org/jira/browse/PIG-2137
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.8.1, 0.9.0, 0.10
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>            Priority: Critical
>
> I have an input file that contains 50,000 distinct integers. Each integer is repeated twice, for a total of 100,000 lines.
> Script 1, using GROUP BY to get distinct entries in the data, works:
> {code}
> grunt> f = load 'tmp/dupnumbers.txt';              
> grunt> d = foreach (group f by $0) generate group; 
> grunt> s = sample d 0.01;                          
> grunt> n = foreach (group s all) generate COUNT(s);
> grunt> dump n;
> (493)
> {code}
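> Script 2, using DISTINCT for the same purpose, returns roughly twice as many rows, which suggests that sampling is done before DISTINCT (about 1% of the 100,000 input lines rather than 1% of the 50,000 distinct values):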
> {code}
> grunt> f = load 'tmp/dupnumbers.txt';              
> grunt> d = distinct f;
> grunt> s = sample d 0.01;                          
> grunt> n = foreach (group s all) generate COUNT(s);
> grunt> dump n;
> (980)
> {code}
