You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Min Zhou <co...@gmail.com> on 2013/10/29 00:15:05 UTC
count distinct on multiple columns
Hi all,
Below script is how we count distinct on columns jid and mid
sjv = LOAD '/path/of/the/data' USING AvroStorage();
jv = FOREACH sjv GENERATE TOTUPLE(jid, mid) AS jid_mid, time;
groupv = GROUP jv ALL;
countv = FOREACH groupv {
unique = DISTINCT jv.jid_mid;
GENERATE COUNT(unique);
};
dump countv;
The result is 2302351.
If I use code below, got another result
sjv = LOAD '/path/of/the/data' USING AvroStorage();
groupv = GROUP sjv ALL;
countv = FOREACH groupv {
jid_mid = sjv.(jid, mid);
unique = DISTINCT jid_mid;
GENERATE COUNT(unique);
};
dump countv;
The result is 2290003.
If I concat the two columns with a delimiter never exists in jid and mid, I
got another result which I think is the correct answer of this aggregation.
sjv = LOAD '/path/of/the/data' USING AvroStorage();
jv = FOREACH sjv GENERATE CONCAT(jid, CONCAT(':', mid)) AS jid_mid, time;
groupv = GROUP jv ALL;
countv = FOREACH groupv {
unique = DISTINCT jv.jid_mid;
GENERATE COUNT(unique);
};
dump countv;
The result is 2386385.
I did a test with below script
sjv = LOAD '/path/of/the/data' USING AvroStorage();
groupv = GROUP jv BY (jid, mid);
unique = FOREACH groupv GENERATE FLATTEN(group), MIN(time);
store unique ....;
The hadoop counters showed that there are 2386385 records written into
HDFS. The number is as same as the 3rd pig script I list above.
Can anyone explain the difference among those three? Whey they lead to
different results?
Regards,
Min
--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
Re: count distinct on multiple columns
Posted by Pradeep Gollakota <pr...@gmail.com>.
Great question. There seems to be some confusion about how DISTINCT
operates. I remembered (and thankfully found) this
message<http://mail-archives.apache.org/mod_mbox/pig-user/201309.mbox/%3CCAE7pYjar3hX4Kp%2B5SQz3sr%3DvjxfQDVq_6Yi4vh9KgfOj3dzTGw%40mail.gmail.com%3E>
that
explains the behavior.
As per the other post, it looks like what you've documented is expected
behavior.
On Mon, Oct 28, 2013 at 4:15 PM, Min Zhou <co...@gmail.com> wrote:
> Hi all,
>
> Below script is how we count distinct on columns jid and mid
>
> sjv = LOAD '/path/of/the/data' USING AvroStorage();
> jv = FOREACH sjv GENERATE TOTUPLE(jid, mid) AS jid_mid, time;
> groupv = GROUP jv ALL;
> countv = FOREACH groupv {
> unique = DISTINCT jv.jid_mid;
> GENERATE COUNT(unique);
> };
> dump countv;
> The result is 2302351.
>
> If I use code below, got another result
> sjv = LOAD '/path/of/the/data' USING AvroStorage();
> groupv = GROUP sjv ALL;
> countv = FOREACH groupv {
> jid_mid = sjv.(jid, mid);
> unique = DISTINCT jid_mid;
> GENERATE COUNT(unique);
> };
> dump countv;
> The result is 2290003.
>
> If I concat the two columns with a delimiter never exists in jid and mid, I
> got another result which I think is the correct answer of this aggregation.
> sjv = LOAD '/path/of/the/data' USING AvroStorage();
> jv = FOREACH sjv GENERATE CONCAT(jid, CONCAT(':', mid)) AS jid_mid, time;
> groupv = GROUP jv ALL;
> countv = FOREACH groupv {
> unique = DISTINCT jv.jid_mid;
> GENERATE COUNT(unique);
> };
> dump countv;
> The result is 2386385.
>
> I did a test with below script
> sjv = LOAD '/path/of/the/data' USING AvroStorage();
> groupv = GROUP jv BY (jid, mid);
> unique = FOREACH groupv GENERATE FLATTEN(group), MIN(time);
> store unique ....;
> The hadoop counters showed that there are 2386385 records written into
> HDFS. The number is as same as the 3rd pig script I list above.
>
> Can anyone explain the difference among those three? Whey they lead to
> different results?
>
> Regards,
> Min
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>