Posted to dev@pig.apache.org by "Viraj Bhat (JIRA)" <ji...@apache.org> on 2009/03/30 22:43:50 UTC

[jira] Updated: (PIG-739) Filter in foreach seems to drop records resulting in decreased count of records

     [ https://issues.apache.org/jira/browse/PIG-739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Bhat updated PIG-739:
---------------------------

    Attachment: filter_distinctbug.pig
                testdata

Testdata and Pig script

> Filter in foreach seems to drop records resulting in decreased count of records
> -------------------------------------------------------------------------------
>
>                 Key: PIG-739
>                 URL: https://issues.apache.org/jira/browse/PIG-739
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.3.0
>            Reporter: Viraj Bhat
>             Fix For: 0.3.0
>
>         Attachments: filter_distinctbug.pig, testdata
>
>
> I have a Pig script that counts the number of distinct records produced by a filter; the filter and distinct statements are nested inside a foreach. With this script, the record count I get for alias TESTDATA_AGG_2 is 1.
> {code}
> TESTDATA =  load 'testdata' using PigStorage() as (timestamp:chararray, testid:chararray, userid: chararray, sessionid:chararray, value:long, flag:int);
> TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and timestamp lt '1230804000000' and value != 0);
> TESTDATA_GROUP = group TESTDATA_FILTERED by testid;
> TESTDATA_AGG = foreach TESTDATA_GROUP {
>                         A = filter TESTDATA_FILTERED by (userid eq sessionid);
>                         C = distinct A.userid;
>                         generate group as testid, COUNT(TESTDATA_FILTERED) as counttestdata, COUNT(C) as distcount, SUM(TESTDATA_FILTERED.flag) as total_flags;
>                 }
> TESTDATA_AGG_1 = group TESTDATA_AGG ALL;
> -- count records generated through nested foreach which contains distinct
> TESTDATA_AGG_2 = foreach TESTDATA_AGG_1 generate COUNT(TESTDATA_AGG);
> --explain TESTDATA_AGG_2;
> dump TESTDATA_AGG_2;
> --RESULT (1L)
> {code}
> But when I count the records without the filter and distinct in the foreach, I get a different value (20L):
> {code}
> TESTDATA =  load 'testdata' using PigStorage() as (timestamp:chararray, testid:chararray, userid: chararray, sessionid:chararray, value:long, flag:int);
> TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and timestamp lt '1230804000000' and value != 0);
> TESTDATA_GROUP = group TESTDATA_FILTERED by testid;
> -- count records generated through simple foreach
> TESTDATA_AGG2 = foreach TESTDATA_GROUP generate group as testid, COUNT(TESTDATA_FILTERED) as counttestid, SUM(TESTDATA_FILTERED.flag) as total_flags;
> TESTDATA_AGG2_1 = group TESTDATA_AGG2 ALL;
> TESTDATA_AGG2_2 = foreach TESTDATA_AGG2_1 generate COUNT(TESTDATA_AGG2);
> dump TESTDATA_AGG2_2;
> --RESULT (20L)
> {code}
> Attaching testdata
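
The two scripts quoted above group the same filtered data by testid, so one would expect both to emit one row per group. The following is a minimal sketch of that expectation in plain Python, not Pig. The sample rows, the helper names (in_window, aggregate), and the treatment of an empty nested filter result as a distinct count of 0 are assumptions made for illustration; they are not taken from the attached testdata or from Pig's implementation.

```python
from collections import defaultdict

# Hypothetical rows: (timestamp, testid, userid, sessionid, value, flag).
# These are NOT the attached testdata; they only illustrate the shape.
rows = [
    ("1230800400001", "t1", "u1", "u1", 5, 1),
    ("1230800400002", "t1", "u2", "s9", 3, 0),
    ("1230800400003", "t2", "u3", "s7", 2, 1),  # t2 has no row with userid == sessionid
    ("1230800400004", "t3", "u4", "u4", 0, 1),  # dropped: value == 0
    ("1230900000000", "t3", "u5", "u5", 4, 1),  # dropped: timestamp outside the window
]

def in_window(row):
    """Mirror the top-level filter: string comparison on timestamp, value != 0."""
    ts, _, _, _, value, _ = row
    return "1230800400000" <= ts < "1230804000000" and value != 0

def aggregate(rows, with_nested_filter):
    """Group filtered rows by testid and emit one output row per group."""
    groups = defaultdict(list)
    for row in filter(in_window, rows):
        groups[row[1]].append(row)
    out = []
    for testid, bag in sorted(groups.items()):
        total_flags = sum(r[5] for r in bag)
        if with_nested_filter:
            # Nested filter (userid == sessionid) followed by distinct on userid.
            # Assumption: an empty result counts as 0 and the group is still emitted.
            distinct_users = {r[2] for r in bag if r[2] == r[3]}
            out.append((testid, len(bag), len(distinct_users), total_flags))
        else:
            out.append((testid, len(bag), total_flags))
    return out

nested = aggregate(rows, with_nested_filter=True)
simple = aggregate(rows, with_nested_filter=False)

# Under this reading, both variants yield the same number of rows.
print(len(nested), len(simple))
```

Under these assumptions the nested variant still emits a row for a group whose inner filter matches nothing, which is why the counts 1 and 20 reported above look inconsistent.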
