Posted to user@pig.apache.org by Richipal Singh <ri...@gmail.com> on 2012/12/13 01:35:54 UTC
Group Count question
Hello All,
I have 4 datasets
Dataset1
uid, metric, key1 , key2, key3
Dataset2
key1 , key1Category
Dataset3
key2 , key2Category
Dataset4
key3 , key3Category
The joined dataset (JoinedRecord) looks like:
uid, metric, key1, key2, key3, key1Category, key2Category, key3Category
Sample joined data:
{"uid": "E752C", "key1": 11345, "key2": 56793493, "metric": "xy",
"key1Category": "Automotive", "key3": "64674", "key2Category":
"Automotive", "key3Category": "Finance"}
{"uid": "E752C", "key1": 11345, "key2": 56793493, "metric": "xy",
"key1Category": "Automotive", "key3": "64674", "key2Category":
"Automotive", "key3Category": "Finance"}
{"uid": "E752C", "key1": 11345, "key2": 56793493, "metric": "xy",
"key1Category": "Automotive", "key3": "64674", "key2Category":
"Automotive", "key3Category": "Finance"}
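For reference, a minimal sketch of how a joined dataset with this layout might be built from the four inputs. The load paths, the default tab-delimited storage, and the field types are all assumptions, and a left outer join is assumed so that rows without a matching category are kept:

```pig
-- Hypothetical paths and schemas; adjust to the real files.
D1 = load 'dataset1' as (uid:chararray, metric:chararray, key1:long, key2:long, key3:chararray);
D2 = load 'dataset2' as (key1:long, key1Category:chararray);
D3 = load 'dataset3' as (key2:long, key2Category:chararray);
D4 = load 'dataset4' as (key3:chararray, key3Category:chararray);

-- Attach key1Category, keeping rows that have no match.
J1 = join D1 by key1 left outer, D2 by key1;
P1 = foreach J1 generate D1::uid as uid, D1::metric as metric,
                         D1::key1 as key1, D1::key2 as key2, D1::key3 as key3,
                         D2::key1Category as key1Category;

-- Attach key2Category.
J2 = join P1 by key2 left outer, D3 by key2;
P2 = foreach J2 generate P1::uid as uid, P1::metric as metric,
                         P1::key1 as key1, P1::key2 as key2, P1::key3 as key3,
                         P1::key1Category as key1Category,
                         D3::key2Category as key2Category;

-- Attach key3Category.
J3 = join P2 by key3 left outer, D4 by key3;
JoinedRecord = foreach J3 generate P2::uid as uid, P2::metric as metric,
                         P2::key1 as key1, P2::key2 as key2, P2::key3 as key3,
                         P2::key1Category as key1Category,
                         P2::key2Category as key2Category,
                         D4::key3Category as key3Category;
```

Projecting after each join renames the fields back to plain names, so the next join does not have to use the `alias::field` form for its key.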
I tried the following:
-- Load the joined records.
A = load 'joinedRecords.avro' using org.apache.pig.piggybank.storage.avro.AvroStorage();
-- Group by metric and the three category columns.
B = group A by (metric, key1Category, key2Category, key3Category);
-- Count distinct uids in each group.
C = foreach B {
    D = distinct A.uid;
    generate flatten(group), COUNT(D);
}
dump C;
Here is what I got:
(xy,Automotive,Finance,Health,1)
(xy,Automotive,Finance,Finance,1)
(xy,Automotive,Finance,Automotive,4)
(xy,Automotive,Finance,Technology,1)
(xy,Automotive,Finance,,2)
What I need is the following:
(metric, Category, count(uid) from Dataset2 only, count(uid) from Dataset3 only,
count(uid) from ml only, count(uid) from (ipid OR cv OR ml),
count(uid) from (ipid AND cv AND ml))
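To make the requirement concrete, here is a rough sketch of one way it might be computed. It rests on assumptions: that ipid, cv and ml correspond to key1, key2 and key3, that a uid counts for a Category when at least one of its key categories equals it, that missing categories are null, and that counts are of distinct uids. The aliases and flag names (in1, in2, in3, only1, ...) are made up for illustration:

```pig
A = load 'joinedRecords.avro' using org.apache.pig.piggybank.storage.avro.AvroStorage();

-- One row per (metric, uid, category), flagged by the key it came from.
K1 = foreach A generate metric, uid, key1Category as category, 1 as in1, 0 as in2, 0 as in3;
K2 = foreach A generate metric, uid, key2Category as category, 0 as in1, 1 as in2, 0 as in3;
K3 = foreach A generate metric, uid, key3Category as category, 0 as in1, 0 as in2, 1 as in3;
U  = union K1, K2, K3;
V  = filter U by category is not null;

-- Collapse to one row per (metric, category, uid); a flag is 1 if the uid
-- was seen with that category under the corresponding key.
G = group V by (metric, category, uid);
F = foreach G generate flatten(group) as (metric, category, uid),
        MAX(V.in1) as in1, MAX(V.in2) as in2, MAX(V.in3) as in3;

-- Turn the flags into 0/1 indicators for each requested column.
S = foreach F generate metric, category,
        in1 * (1 - in2) * (1 - in3) as only1,
        (1 - in1) * in2 * (1 - in3) as only2,
        (1 - in1) * (1 - in2) * in3 as only3,
        1 as anyKey,
        in1 * in2 * in3 as allKeys;

-- Sum the indicators per (metric, category).
H = group S by (metric, category);
R = foreach H generate flatten(group) as (metric, category),
        SUM(S.only1), SUM(S.only2), SUM(S.only3),
        SUM(S.anyKey), SUM(S.allKeys);
dump R;
```

Because F holds exactly one row per (metric, category, uid), summing the indicators gives distinct uid counts without a nested distinct. If "only" or "AND" is meant differently, for example per record rather than per uid, the grouping in G would need to change.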
Can someone please help?
Thanks.
--
Richipal Singh