Posted to user@pig.apache.org by jamal sasha <ja...@gmail.com> on 2013/04/02 20:05:48 UTC
count duplicate entries
Hi,
I have data in HDFS like:
id1,field1,field2
1,2,3
1,2,3
1,2,4
1,2,5
I want to find the number of unique entries using Pig.
So here, the number of unique entries is 3 (as 1,2,3 is repeated twice).
How do I find this?
Thanks
Re: count duplicate entries
Posted by Arun Ahuja <aa...@gmail.com>.
You can solve this with the DISTINCT operator: it gives you only the
unique entries, which you can then count.
Example:
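-- PigStorage(',') because the sample rows are comma-separated (the default delimiter is tab)
data = LOAD '...' USING PigStorage(',') AS (id:int, field1:chararray, field2:chararray);
-- Drop duplicate rows
unique_data = DISTINCT data;
-- Group the remaining rows into a single bag and count them
unique_count = FOREACH (GROUP unique_data ALL) GENERATE COUNT(unique_data);
DUMP unique_count;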
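If you also want to see how many times each entry occurs (which ones are
duplicated, not just how many distinct ones there are), a rough sketch is to
group by all three fields and count each group. The path 'input.csv' and the
aliases grouped, dup_counts and n are placeholders I made up, and this assumes
the file is comma-delimited with no header line (a header line, if present,
would be counted as a row of its own):

```pig
-- Hypothetical input path; replace with your own HDFS location
data = LOAD 'input.csv' USING PigStorage(',') AS (id:int, field1:chararray, field2:chararray);

-- One group per distinct (id, field1, field2) combination
grouped = GROUP data BY (id, field1, field2);

-- COUNT_STAR counts every tuple in the group's bag, including ones with null fields
dup_counts = FOREACH grouped GENERATE FLATTEN(group) AS (id, field1, field2), COUNT_STAR(data) AS n;

DUMP dup_counts;
```

For the four sample rows, 1,2,3 should come out with a count of 2 and the other
two entries with a count of 1. The number of rows in dup_counts is the same as
the number of unique entries.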