Posted to user@pig.apache.org by jamal sasha <ja...@gmail.com> on 2013/04/02 20:05:48 UTC

count duplicate entries

Hi,
I have data in HDFS like:

id1,field1,field2
1,2,3
1,2,3
1,2,4
1,2,5
I want to find the number of unique entries using Pig.
So here, the number of unique entries is 3 (as 1,2,3 is repeated twice).

How do I find this?

Thanks

Re: count duplicate entries

Posted by Arun Ahuja <aa...@gmail.com>.
You can use the DISTINCT operator for this: it gives you only the unique
entries, and then you can count them.

Example:

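-- the sample data is comma-delimited; PigStorage() defaults to tab
data = LOAD '...' USING PigStorage(',') AS (id:int, field1:chararray,
field2:chararray);
unique_data = DISTINCT data;
-- $1 is the bag of distinct rows produced by GROUP ... ALL
unique_count = FOREACH (GROUP unique_data ALL) GENERATE COUNT($1);
DUMP unique_count;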
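
If you also want to know how many times each entry is repeated, one option is
to group on all the fields and count each group. This is only a minimal
sketch, assuming the same comma-delimited layout as in the question; the
alias names (grouped, row_counts, occurrences, duplicates) are just
illustrative.

```pig
-- load the comma-delimited rows (path elided as in the example above)
data = LOAD '...' USING PigStorage(',') AS (id:int, field1:chararray,
field2:chararray);
-- one group per distinct (id, field1, field2) combination
grouped = GROUP data BY (id, field1, field2);
-- COUNT skips tuples whose first field is null
row_counts = FOREACH grouped GENERATE FLATTEN(group) AS (id, field1, field2),
    COUNT(data) AS occurrences;
-- keep only the rows that appear more than once
duplicates = FILTER row_counts BY occurrences > 1;
DUMP duplicates;
```

For the sample data, only the 1,2,3 row should pass the filter, since it is
the only one that appears more than once.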

