You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by Min Zhou <co...@gmail.com> on 2013/10/29 00:15:05 UTC

count distinct on multiple columns

Hi all,

Below script is how we count distinct on columns jid and mid

sjv =  LOAD '/path/of/the/data' USING AvroStorage();
jv = FOREACH sjv GENERATE TOTUPLE(jid, mid) AS jid_mid, time;
groupv = GROUP jv ALL;
countv = FOREACH groupv {
        unique = DISTINCT jv.jid_mid;
        GENERATE COUNT(unique);
        };
dump countv;
The result is 2302351.

If I use code below, got another result
sjv =  LOAD '/path/of/the/data' USING AvroStorage();
groupv = GROUP sjv ALL;
countv = FOREACH groupv {
        jid_mid = sjv.(jid, mid);
        unique = DISTINCT jid_mid;
        GENERATE COUNT(unique);
        };
dump countv;
The result is 2290003.

If I concat the two columns with a delimiter never exists in jid and mid, I
got another result which I think is the correct answer of this aggregation.
sjv =  LOAD '/path/of/the/data' USING AvroStorage();
jv = FOREACH sjv GENERATE CONCAT(jid, CONCAT(':', mid)) AS jid_mid, time;
groupv = GROUP jv ALL;
countv = FOREACH groupv {
        unique = DISTINCT jv.jid_mid;
        GENERATE COUNT(unique);
        };
dump countv;
The result is 2386385.

I did a test with below script
sjv =  LOAD '/path/of/the/data' USING AvroStorage();
groupv = GROUP jv BY (jid, mid);
unique = FOREACH groupv GENERATE FLATTEN(group), MIN(time);
store unique ....;
The hadoop counters showed that there are 2386385 records written into
HDFS.  The number is as same as the 3rd pig script I list above.

Can anyone explain the difference among those three?  Whey they lead to
different results?

Regards,
Min
-- 
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com

Re: count distinct on multiple columns

Posted by Pradeep Gollakota <pr...@gmail.com>.

Great question. There seems to be some confusion about how DISTINCT
operates. I remembered (and thankfully found) this
message<http://mail-archives.apache.org/mod_mbox/pig-user/201309.mbox/%3CCAE7pYjar3hX4Kp%2B5SQz3sr%3DvjxfQDVq_6Yi4vh9KgfOj3dzTGw%40mail.gmail.com%3E>
that
explains the behavior.

As per the other post, it looks like what you've documented is expected
behavior.


On Mon, Oct 28, 2013 at 4:15 PM, Min Zhou <co...@gmail.com> wrote:

> Hi all,
>
> Below script is how we count distinct on columns jid and mid
>
> sjv =  LOAD '/path/of/the/data' USING AvroStorage();
> jv = FOREACH sjv GENERATE TOTUPLE(jid, mid) AS jid_mid, time;
> groupv = GROUP jv ALL;
> countv = FOREACH groupv {
>         unique = DISTINCT jv.jid_mid;
>         GENERATE COUNT(unique);
>         };
> dump countv;
> The result is 2302351.
>
> If I use code below, got another result
> sjv =  LOAD '/path/of/the/data' USING AvroStorage();
> groupv = GROUP sjv ALL;
> countv = FOREACH groupv {
>         jid_mid = sjv.(jid, mid);
>         unique = DISTINCT jid_mid;
>         GENERATE COUNT(unique);
>         };
> dump countv;
> The result is 2290003.
>
> If I concat the two columns with a delimiter never exists in jid and mid, I
> got another result which I think is the correct answer of this aggregation.
> sjv =  LOAD '/path/of/the/data' USING AvroStorage();
> jv = FOREACH sjv GENERATE CONCAT(jid, CONCAT(':', mid)) AS jid_mid, time;
> groupv = GROUP jv ALL;
> countv = FOREACH groupv {
>         unique = DISTINCT jv.jid_mid;
>         GENERATE COUNT(unique);
>         };
> dump countv;
> The result is 2386385.
>
> I did a test with below script
> sjv =  LOAD '/path/of/the/data' USING AvroStorage();
> groupv = GROUP jv BY (jid, mid);
> unique = FOREACH groupv GENERATE FLATTEN(group), MIN(time);
> store unique ....;
> The hadoop counters showed that there are 2386385 records written into
> HDFS.  The number is as same as the 3rd pig script I list above.
>
> Can anyone explain the difference among those three?  Whey they lead to
> different results?
>
> Regards,
> Min
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>