Posted to dev@pig.apache.org by "Viraj Bhat (JIRA)" <ji...@apache.org> on 2009/03/30 22:43:50 UTC
[jira] Updated: (PIG-739) Filter in foreach seems to drop records resulting in decreased count of records
[ https://issues.apache.org/jira/browse/PIG-739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Bhat updated PIG-739:
---------------------------
Attachment: filter_distinctbug.pig
            testdata
Test data and Pig script
> Filter in foreach seems to drop records resulting in decreased count of records
> -------------------------------------------------------------------------------
>
> Key: PIG-739
> URL: https://issues.apache.org/jira/browse/PIG-739
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.3.0
> Reporter: Viraj Bhat
> Fix For: 0.3.0
>
> Attachments: filter_distinctbug.pig, testdata
>
>
> I have a Pig script in which a nested foreach filters each group and counts the distinct userid values that pass the filter. Counting the records produced by this foreach (alias TESTDATA_AGG_2) returns 1.
> {code}
> TESTDATA = load 'testdata' using PigStorage() as (timestamp:chararray, testid:chararray, userid: chararray, sessionid:chararray, value:long, flag:int);
> TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and timestamp lt '1230804000000' and value != 0);
> TESTDATA_GROUP = group TESTDATA_FILTERED by testid;
> TESTDATA_AGG = foreach TESTDATA_GROUP {
> A = filter TESTDATA_FILTERED by (userid eq sessionid);
> C = distinct A.userid;
> generate group as testid, COUNT(TESTDATA_FILTERED) as counttestdata, COUNT(C) as distcount, SUM(TESTDATA_FILTERED.flag) as total_flags;
> }
> TESTDATA_AGG_1 = group TESTDATA_AGG ALL;
> -- count records generated through nested foreach which contains distinct
> TESTDATA_AGG_2 = foreach TESTDATA_AGG_1 generate COUNT(TESTDATA_AGG);
> --explain TESTDATA_AGG_2;
> dump TESTDATA_AGG_2;
> --RESULT (1L)
> {code}
> But when I count the records without the filter and distinct in the foreach, I get a different value (20L):
> {code}
> TESTDATA = load 'testdata' using PigStorage() as (timestamp:chararray, testid:chararray, userid: chararray, sessionid:chararray, value:long, flag:int);
> TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and timestamp lt '1230804000000' and value != 0);
> TESTDATA_GROUP = group TESTDATA_FILTERED by testid;
> -- count records generated through simple foreach
> TESTDATA_AGG2 = foreach TESTDATA_GROUP generate group as testid, COUNT(TESTDATA_FILTERED) as counttestid, SUM(TESTDATA_FILTERED.flag) as total_flags;
> TESTDATA_AGG2_1 = group TESTDATA_AGG2 ALL;
> TESTDATA_AGG2_2 = foreach TESTDATA_AGG2_1 generate COUNT(TESTDATA_AGG2);
> dump TESTDATA_AGG2_2;
> --RESULT (20L)
> {code}
> Attaching testdata
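
The expectation behind this report can be sketched outside Pig. Both scripts group by testid, so each foreach should emit exactly one record per group; the nested filter and distinct should only change the distcount value, not the number of output records. The Python below is a minimal model of that expected behavior, not Pig's implementation. The sample rows are hypothetical and are not taken from the attached testdata, and the function names are illustrative only.

```python
from collections import defaultdict

# Hypothetical rows: (testid, userid, sessionid, flag).
# These are NOT from the attached testdata; they only illustrate the shape.
rows = [
    ("t1", "u1", "u1", 1),
    ("t1", "u2", "s9", 0),
    ("t2", "u3", "s3", 1),  # no row in group t2 has userid == sessionid
    ("t3", "u4", "u4", 1),
    ("t3", "u4", "u4", 0),
]


def group_by_testid(rows):
    """Model of: group TESTDATA_FILTERED by testid."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return groups


def aggregate_with_nested_filter(rows):
    """Model of the nested foreach: one output record per group."""
    out = []
    for testid, bag in sorted(group_by_testid(rows).items()):
        a = [r for r in bag if r[1] == r[2]]  # A = filter by (userid eq sessionid)
        c = {r[1] for r in a}                 # C = distinct A.userid
        out.append((testid, len(bag), len(c), sum(r[3] for r in bag)))
    return out


def aggregate_simple(rows):
    """Model of the simple foreach without filter and distinct."""
    out = []
    for testid, bag in sorted(group_by_testid(rows).items()):
        out.append((testid, len(bag), sum(r[3] for r in bag)))
    return out


nested = aggregate_with_nested_filter(rows)
simple = aggregate_simple(rows)

# Under the expected semantics both record counts match the number of groups,
# even when the nested filter leaves a group's inner bag empty.
print(len(nested), len(simple))
```

In this model a group whose nested filter matches nothing still yields a record, with a distinct count of 0. The two Pig scripts above disagree on the record count (1L versus 20L), which is the discrepancy this issue reports.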