Posted to user@pig.apache.org by Richipal Singh <ri...@gmail.com> on 2012/12/13 01:35:54 UTC

Group Count question

Hello All,
I have four datasets:

Dataset1
uid, metric, key1 , key2, key3

Dataset2
key1 , key1Category

Dataset3
key2 , key2Category

Dataset4
key3 , key3Category

The JoinedRecord dataset looks like:
uid, metric, key1 , key2, key3, key1Category, key2Category,key3Category

Sample joined data:
{"uid": "E752C", "key1": 11345, "key2": 56793493, "metric": "xy",
"key1Category": "Automotive", "key3": "64674", "key2Category":
"Automotive", "key3Category": "Finance"}
{"uid": "E752C", "key1": 11345, "key2": 56793493, "metric": "xy",
"key1Category": "Automotive", "key3": "64674", "key2Category":
"Automotive", "key3Category": "Finance"}
{"uid": "E752C", "key1": 11345, "key2": 56793493, "metric": "xy",
"key1Category": "Automotive", "key3": "64674", "key2Category":
"Automotive", "key3Category": "Finance"}

I tried the following

A = load 'joinedRecords.avro' using
org.apache.pig.piggybank.storage.avro.AvroStorage();

B = group A by (metric,key1Category,key2Category,key3Category);

C = foreach B {
    D = distinct A.uid;
    generate flatten(group), COUNT(D);
}

dump C;

Here is what I got
(xy,Automotive,Finance,Health,1)
(xy,Automotive,Finance,Finance,1)
(xy,Automotive,Finance,Automotive,4)
(xy,Automotive,Finance,Technology,1)
(xy,Automotive,Finance,,2)


What I need is the following

(metric, Category, count (uid) from Dataset2 only, count (uid) from
Dataset3 only, count (uid) from ml only, count (uid) from (ipid OR cv OR
ml), count (uid) from (ipid AND cv AND ml))
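One possible shape for this, as an untested sketch rather than a definitive answer. It assumes that "ipid", "cv" and "ml" refer to the three lookup sources behind key1Category, key2Category and key3Category, that "from X only" means the uid is counted through that source's category column, and that AND means the uid reaches the category through all three keys (not necessarily in the same record). The aliases K1 through Counts and the fields src, has1, has2, has3 and hasAll are made up for illustration.

```pig
A = LOAD 'joinedRecords.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- Emit one row per (record, source key), tagging which key the category came from.
K1 = FOREACH A GENERATE metric, key1Category AS category, uid, 1 AS src;
K2 = FOREACH A GENERATE metric, key2Category AS category, uid, 2 AS src;
K3 = FOREACH A GENERATE metric, key3Category AS category, uid, 3 AS src;
U  = UNION K1, K2, K3;
V  = FILTER U BY category IS NOT NULL;

-- Collapse to one row per (metric, category, uid) with a 0/1 flag per source.
G = GROUP V BY (metric, category, uid);
P = FOREACH G {
    s1 = FILTER V BY src == 1;
    s2 = FILTER V BY src == 2;
    s3 = FILTER V BY src == 3;
    GENERATE FLATTEN(group) AS (metric, category, uid),
             (IsEmpty(s1) ? 0 : 1) AS has1,
             (IsEmpty(s2) ? 0 : 1) AS has2,
             (IsEmpty(s3) ? 0 : 1) AS has3;
};

-- hasAll is 1 only when the uid was seen through all three sources.
Q = FOREACH P GENERATE metric, category, has1, has2, has3,
        ((has1 + has2 + has3) == 3 ? 1 : 0) AS hasAll;

-- Each uid appears once per (metric, category) in Q, so sums of flags are distinct-uid counts.
R = GROUP Q BY (metric, category);
Counts = FOREACH R GENERATE FLATTEN(group),
        SUM(Q.has1), SUM(Q.has2), SUM(Q.has3),
        COUNT(Q),
        SUM(Q.hasAll);

DUMP Counts;
```

Here COUNT(Q) gives the OR count (uid seen through any source) and SUM(Q.hasAll) gives the AND count. If "only" is meant exclusively (seen through one source and neither of the others), the same flags could be combined in Q, for example has1 == 1 AND has2 == 0 AND has3 == 0.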

Can someone please help?

Thanks.
--
Richipal Singh