Posted to user@pig.apache.org by Marco Cadetg <ma...@zattoo.com> on 2012/08/24 11:35:01 UTC

filter duplicates from a bag

Hi there,

What is the best way to retrieve duplicates from a bag? I basically would
like to do something like the opposite of DISTINCT.

A: {userid: long,foo: long,bar: long}

dump A
(1,2,3)
(1,2,3)
(1,3,2)
(2,3,1)

Now I would like to have a bag which contains
(1,2,3)
(1,2,3)

Thanks,
-Marco

Re: filter duplicates from a bag

Posted by Marco Cadetg <ma...@zattoo.com>.
Thanks Gianmarco, that is what I was looking for!
-Marco

On Fri, Aug 24, 2012 at 12:19 PM, Gianmarco De Francisci Morales <
gdfm@apache.org> wrote:

> I would say something along these lines:
>
> B = group A by *;
> C = foreach B generate group, COUNT(A) as count;
> D = filter C by count > 1;
> E = foreach D generate group;
>
> Disclaimer: untested code.
>
> Cheers,
> --
> Gianmarco

Re: filter duplicates from a bag

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.
I would say something along these lines:

B = group A by *;
C = foreach B generate group, COUNT(A) as count;
D = filter C by count > 1;
E = foreach D generate group;

Disclaimer: untested code.

Cheers,
--
Gianmarco
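
The snippet above ends with one row per duplicated key, not every copy of it. A minimal and equally untested sketch that keeps each copy, matching the expected output in the question, could carry the grouped bag along and flatten it at the end. The explicit key list (userid, foo, bar) and the alias cnt are assumptions chosen for illustration. COUNT skips tuples whose first field is null, so COUNT_STAR may be the safer choice if userid can be null.

```pig
-- Assumes A is already loaded with schema {userid: long, foo: long, bar: long}.
-- Group identical tuples together by using every field as the key.
B = GROUP A BY (userid, foo, bar);

-- Keep the bag of original tuples next to its size (cnt is an illustrative alias).
C = FOREACH B GENERATE A, COUNT(A) AS cnt;

-- Retain only the groups that contain more than one tuple.
D = FILTER C BY cnt > 1;

-- Flatten the bag so every copy of a duplicated tuple becomes its own row.
E = FOREACH D GENERATE FLATTEN(A);
```

With the sample data from the question, E should hold the two (1,2,3) tuples and nothing else.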
