Posted to user@pig.apache.org by Jordi Deu-Pons <jo...@jordeu.net> on 2010/04/30 13:32:00 UTC

UDF with two Bag one per group and one 'static'

Hi,

 I've developed a UDF that receives two bags as input and outputs one bag.

 One of the bags is different in every group and the other is always the
same.

 Example code:

A = LOAD 'a' AS (group, value);
B = LOAD 'b';
G = GROUP A BY group;
R = FOREACH G GENERATE FLATTEN(my.udf(A,B));

This gives the error "Error during parsing. Invalid alias: B".
I can understand this error, but I cannot think of another
 way to do this.

 Do you know the best way to do this?

 Thanks

-- 
a10! i fins aviat.
J:-Deu

Re: UDF with two Bag one per group and one 'static'

Posted by hc busy <hc...@gmail.com>.

But we don't want to extend PigLatin to have #define... ?

On Fri, Apr 30, 2010 at 10:04 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> http://www.stringtemplate.org/

Re: UDF with two Bag one per group and one 'static'

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

http://www.stringtemplate.org/

On Fri, Apr 30, 2010 at 9:57 AM, hc busy <hc...@gmail.com> wrote:
> Is there a Java preprocessor?

Re: UDF with two Bag one per group and one 'static'

Posted by hc busy <hc...@gmail.com>.

Is there a Java preprocessor?

On Fri, Apr 30, 2010 at 9:54 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> I don't think there's a need to reinvent, or reimplement, the wheel here.
>
> You are just talking about templates. Try http://template-toolkit.org/
> (or any of the ruby / python variants on the theme).
>
> Or the ruby Oink DSL.
>
> -D

Re: UDF with two Bag one per group and one 'static'

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

I don't think there's a need to reinvent, or reimplement, the wheel here.

You are just talking about templates. Try http://template-toolkit.org/
(or any of the ruby / python variants on the theme).

Or the ruby Oink DSL.

-D

On Fri, Apr 30, 2010 at 9:45 AM, hc busy <hc...@gmail.com> wrote:
> Sometimes, I find it necessary to project before performing the group by.
> Because there isn't support for functions or #def's it's not possible to
> pass in which column to group by, except to project before grouping.
>
> A = LOAD 'a' AS (group, value);
> B = LOAD 'b';
> B2 = foreach B generate $5 as group, *;
> G = GROUP A BY group, B2 BY group;
> R = FOREACH G GENERATE FLATTEN(my.udf(A,B2));
>
> Wouldn't introducing #define in Pig speed this up? Adding a preprocessor,
> similar to parameter substitution, to support basic #define would be
> cool.
>
> #define JordiGroup(t1, t2, f1, f2){
>           G = group t1 by f1, t2 by f2;
>           FOREACH G GENERATE FLATTEN(my.udf(t1,t2));
>
> }
>
> ... and later on
>
> R = JordiGroup(A, B, group, $5);
>
> The result of the #define would be its last line. The implementation would
> need only a really simple parser to check that (), [] and {} match in blocks
> starting with '#define'. It would then perform substitution in the order the
> macros appear; no recursion would be allowed.

Re: UDF with two Bag one per group and one 'static'

Posted by hc busy <hc...@gmail.com>.

Sometimes, I find it necessary to project before performing the group by.
Because there isn't support for functions or #def's it's not possible to
pass in which column to group by, except to project before grouping.

A = LOAD 'a' AS (group, value);
B = LOAD 'b';
B2 = foreach B generate $5 as group, *;
G = GROUP A BY group, B2 BY group;
R = FOREACH G GENERATE FLATTEN(my.udf(A,B2));

Wouldn't introducing #define in Pig speed this up? Adding a preprocessor,
similar to parameter substitution, to support basic #define would be
cool.

#define JordiGroup(t1, t2, f1, f2){
           G = group t1 by f1, t2 by f2;
           FOREACH G GENERATE FLATTEN(my.udf(t1,t2));

}

... and later on

R = JordiGroup(A, B, group, $5);

The result of the #define would be its last line. The implementation would
need only a really simple parser to check that (), [] and {} match in blocks
starting with '#define'. It would then perform substitution in the order the
macros appear; no recursion would be allowed.
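
The proposal above could be prototyped outside Pig as a small text
preprocessor. What follows is only a rough sketch in Python, not anything
Pig provides: the function names (brackets_balanced, parse_macros, expand)
are made up for illustration, statements are split naively on ';' and
arguments on ',', and expanded text is never rescanned, which is how the
"no recursion" rule is enforced here.

```python
import re

# "#define Name(p1, p2){" opens a macro; "X = Name(a, b);" calls one.
DEFINE_RE = re.compile(r"#define\s+(\w+)\s*\(([^)]*)\)\s*\{")
CALL_RE = re.compile(r"^\s*(\w+)\s*=\s*(\w+)\s*\((.*)\)\s*;\s*$")


def brackets_balanced(text):
    """Return True if (), [] and {} are properly nested in text."""
    closers = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in text:
        if ch in "([{":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


def parse_macros(script):
    """Return ({name: (params, statements)}, script without #define blocks)."""
    macros, rest, pos = {}, [], 0
    while True:
        m = DEFINE_RE.search(script, pos)
        if m is None:
            rest.append(script[pos:])
            break
        rest.append(script[pos:m.start()])
        depth, i = 1, m.end()
        while i < len(script) and depth:  # find the brace closing the block
            depth += {"{": 1, "}": -1}.get(script[i], 0)
            i += 1
        body = script[m.end():i - 1]
        if depth or not brackets_balanced(body):
            raise ValueError("unbalanced brackets in #define " + m.group(1))
        params = [p.strip() for p in m.group(2).split(",") if p.strip()]
        statements = [s.strip() for s in body.split(";") if s.strip()]
        if not statements:
            raise ValueError("empty #define " + m.group(1))
        macros[m.group(1)] = (params, statements)
        pos = i
    return macros, "".join(rest)


def expand(script):
    """Expand macro calls once, in order; the last statement gets the alias."""
    macros, rest = parse_macros(script)
    out = []
    for line in rest.splitlines():
        m = CALL_RE.match(line)
        if m is None or m.group(2) not in macros:
            out.append(line)  # ordinary statement, left untouched
            continue
        target, name = m.group(1), m.group(2)
        params, statements = macros[name]
        args = [a.strip() for a in m.group(3).split(",")]
        if len(args) != len(params):
            raise ValueError("wrong number of arguments for " + name)
        mapping = dict(zip(params, args))
        if params:
            # Replace all parameters in one pass so arguments are not rescanned.
            pat = re.compile(r"\b(" + "|".join(map(re.escape, params)) + r")\b")
            statements = [pat.sub(lambda mo: mapping[mo.group(1)], s)
                          for s in statements]
        out.extend(s + ";" for s in statements[:-1])
        out.append(target + " = " + statements[-1] + ";")
    return "\n".join(out)


if __name__ == "__main__":
    print(expand("""#define JordiGroup(t1, t2, f1, f2){
    G = group t1 by f1, t2 by f2;
    FOREACH G GENERATE FLATTEN(my.udf(t1,t2));
}
R = JordiGroup(A, B, group, $5);
"""))
```

One limitation of this sketch: the intermediate alias G is copied literally,
so expanding the same macro twice in one script would redefine it.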

On Fri, Apr 30, 2010 at 8:51 AM, Alan Gates <ga...@yahoo-inc.com> wrote:

> You need to change your group to a cogroup so that both bags are in your
> data stream.  If you don't want to group bag b by the same keys as a (that
> is, you want all of b available for each group of a) then you can load b as
> a side file inside your udf.
>
> Alan.

Re: UDF with two Bag one per group and one 'static'

Posted by Jordi Deu-Pons <jo...@jordeu.net>.

Ok,

> then you can load b as a side file inside your udf.
I will try to implement this approach.

Maybe in the future it would be useful to allow a LOAD inside a FOREACH.

Thanks.

On Fri, Apr 30, 2010 at 5:51 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> You need to change your group to a cogroup so that both bags are in your
> data stream.  If you don't want to group bag b by the same keys as a (that
> is, you want all of b available for each group of a) then you can load b as
> a side file inside your udf.
>
> Alan.


-- 
a10! i fins aviat.
J:-Deu

Re: UDF with two Bag one per group and one 'static'

Posted by Alan Gates <ga...@yahoo-inc.com>.

You need to change your group to a cogroup so that both bags are in  
your data stream.  If you don't want to group bag b by the same keys  
as a (that is, you want all of b available for each group of a) then  
you can load b as a side file inside your udf.

Alan.
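
The side-file idea can be pictured as "read b once, keep it in memory, and
reuse it for every group of a". The sketch below is a plain Python stand-in
for that pattern, not Pig's UDF API (a real Pig UDF would normally be a Java
class, and on a cluster the file would have to be readable from every task).
The class name SideFileUdf, the tab-separated file layout, and the pairing
logic are all assumptions made for illustration; the thread does not say
what the actual UDF computes.

```python
import os
import tempfile


class SideFileUdf:
    """Toy stand-in for a UDF that gets one bag per group plus one fixed bag."""

    def __init__(self, side_path):
        self.side_path = side_path  # hypothetical location of relation 'b'
        self._side_rows = None      # filled on first use, then reused
        self.load_count = 0         # only here to show the file is read once

    def _load_side(self):
        """Read the side file on the first call and cache its rows."""
        if self._side_rows is None:
            with open(self.side_path) as handle:
                # Assumed layout: tab-separated fields, one tuple per line.
                self._side_rows = [tuple(line.rstrip("\n").split("\t"))
                                   for line in handle if line.strip()]
            self.load_count += 1
        return self._side_rows

    def __call__(self, group_bag):
        """Placeholder logic: pair each row of the group with each side row."""
        side = self._load_side()
        return [row + other for row in group_bag for other in side]


def demo():
    """Call the stand-in UDF for two groups against one temporary side file."""
    with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
        tmp.write("x\t1\ny\t2\n")
        path = tmp.name
    try:
        udf = SideFileUdf(path)
        first = udf([("g1", "a")])
        second = udf([("g2", "b"), ("g2", "c")])
        return first, second, udf.load_count
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

The point of the cache is that the cost of reading b is paid once per UDF
instance rather than once per group.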

On Apr 30, 2010, at 4:32 AM, Jordi Deu-Pons wrote:

> Hi,
>
> I've developed a UDF that receives two bags as input and outputs one bag.
>
> One of the bags is different in every group and the other is always the
> same.
>
> Example code:
>
> A = LOAD 'a' AS (group, value);
> B = LOAD 'b';
> G = GROUP A BY group;
> R = FOREACH G GENERATE FLATTEN(my.udf(A,B));
>
> This gives the error "Error during parsing. Invalid alias: B".
> I can understand this error, but I cannot think of another
> way to do this.
>
> Do you know the best way to do this?
>
> Thanks
>
> -- 
> a10! i fins aviat.
> J:-Deu