Posted to user@pig.apache.org by dhaval deshpande <dh...@gmail.com> on 2010/09/05 22:12:56 UTC

Filtering on bag

Hello,
         I am trying to filter tuples in a bag that is generated by a
sequence of operations in Pig. My data looks like this:
        (0,{(0,8),(0,1),(0,6),(0,7),(0,4)})
        (1,{(1,6),(1,7),(1,8),(1,4)})
        (4,{(4,6),(4,8),(4,7)})
        (6,{(6,8),(6,7)})
        (7,{(7,8)})
        This relation is stored in R4. When I run DESCRIBE on this relation,
it prints:
        R4: {group: int,R3: {R::b: int,R1::b1: int}}

        I was trying to filter the data in each inner bag so that only the
tuple with the smallest difference (b1 - b) stays and all the others are
filtered out. For example, the desired output would be:
        (0,{(0,1)})
        (1,{(1,4)})
        (4,{(4,6)})
        (6,{(6,7)})
        (7,{(7,8)})
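
To pin down the intended result independently of Pig, here is a minimal
plain-Python sketch of the same selection. It is only an illustration of the
semantics, not Pig code: the names r4, r5 and keep_smallest_difference are
made up for this example, and the difference is assumed to be b1 - b, as in
the attempted snippet.

```python
# Plain-Python model of R4: group key -> inner bag of (b, b1) tuples.
# The names r4, r5 and keep_smallest_difference are hypothetical.
r4 = {
    0: [(0, 8), (0, 1), (0, 6), (0, 7), (0, 4)],
    1: [(1, 6), (1, 7), (1, 8), (1, 4)],
    4: [(4, 6), (4, 8), (4, 7)],
    6: [(6, 8), (6, 7)],
    7: [(7, 8)],
}


def keep_smallest_difference(bag):
    """Return a one-element bag holding the tuple with the smallest b1 - b."""
    return [min(bag, key=lambda t: t[1] - t[0])]


# Apply the selection to every group, keeping the group key.
r5 = {group: keep_smallest_difference(bag) for group, bag in r4.items()}

for group, bag in r5.items():
    print(group, bag)
```

On the sample data this keeps (0,1), (1,4), (4,6), (6,7) and (7,8), which
matches the desired output listed above.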

        I tried doing it like this:
        R5 = foreach R4 {
                     R6 = filter R3 by MIN(b1-b);
                     generate group;
                }
        and also some other methods, but then I realized this was not the
proper way of doing it and I was stuck. I thought I might write a UDF to
achieve it, but it would be great if I could do it in Pig itself. Can anyone
help me out with this?

Thanks,
Dhaval Deshpande.

Re: Filtering on bag

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Mridul has it right, just do an order-by and limit inside a foreach.

This is not the most efficient way to do this if the inner bag is very
large, but it works.

If the bag is very large, you can try the UDF ExtremalTupleByNthField that
hc busy wrote for the piggybank; you can see the patch in
https://issues.apache.org/jira/browse/PIG-1386 . It's a fair bit better
because, instead of sorting the whole bag, it just scans it and keeps a heap
of the top N elements (with N=1 in your case, this is a single linear scan
instead of an O(n log n) sort). This patch is in trunk and in the 0.8
branch.
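
The difference between the two strategies can be sketched in plain Python.
This is only an illustration of the idea (full sort versus a bounded heap),
not the code of ExtremalTupleByNthField; the function names are hypothetical
and the ordering key is assumed to be b1 - b.

```python
import heapq


def diff(t):
    # Ordering key: difference between the second and first field.
    return t[1] - t[0]


def smallest_by_sort(bag, n=1):
    # Sorts the entire bag, O(m log m) for m tuples, then keeps the first n.
    return sorted(bag, key=diff)[:n]


def smallest_by_heap(bag, n=1):
    # Scans the bag once while tracking at most n candidates, O(m log n).
    # For n=1 this amounts to a single linear scan.
    return heapq.nsmallest(n, bag, key=diff)


bag = [(0, 8), (0, 1), (0, 6), (0, 7), (0, 4)]
print(smallest_by_sort(bag))
print(smallest_by_heap(bag))
```

Both functions return the same tuples; only the amount of work differs, which
starts to matter once a single bag holds a very large number of tuples.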

Don't know if that was just a typo or actual confusion, but in case you are
confused about what exactly happens when you group things you may want to
scan this blog post:
http://squarecog.wordpress.com/2010/05/11/group-operator-in-apache-pig/

-D

On Sun, Sep 5, 2010 at 3:50 PM, Mridul Muralidharan
<mr...@yahoo-inc.com> wrote:

>
> I did not follow your pig snippet ... it looks wrong (since only output is
> 'group').
>
>
> Could you do a "order by" and then a "limit" ?
> I cant remember offhand if "order by" works within nested foreach (dont
> have pig access right now to test, sorry).
>
>
> If it is supported, something like might be what you are looking for imo:
>
> R5 = foreach R4 {
>        tmp0 = ORDER R3 by b1-b ASC;
>        tmp1 = LIMIT tmp0 1;
>        generate group, tmp1;
> }
>
>
>
> If it is not supported, then a udf might be the way to go : either one
> which picks what you need explicitly.
> Or one which does 'order by' and then you use limit (if you can reuse the
> sorting udf as a primitive elsewhere too !).
>
>
> Regards,
> Mridul
>
>

Re: Filtering on bag

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
I did not follow your Pig snippet; it looks wrong, since the only output
is 'group'.


Could you do an "order by" and then a "limit"?
I can't remember offhand whether "order by" works within a nested foreach
(I don't have Pig access right now to test, sorry).


If it is supported, something like this might be what you are looking for,
imo:

R5 = foreach R4 {
        tmp0 = ORDER R3 by b1-b ASC;
        tmp1 = LIMIT tmp0 1;
        generate group, tmp1;
}



If it is not supported, then a UDF might be the way to go: either one
that explicitly picks what you need, or one that does the 'order by', after
which you apply limit (the sorting UDF could then be reused as a primitive
elsewhere too!).


Regards,
Mridul
