Posted to user@pig.apache.org by Mridul Muralidharan <mr...@yahoo-inc.com> on 2009/02/09 12:10:15 UTC

Pig 2.0 operators

Hi,

   I have the following questions after going through the types functional spec.


a) What does MATCHES on two bytearrays mean? The spec says it is supported
but gives no further comment.

b) Multiplication/division between a bag/tuple and a primitive: the spec says
it is not implemented, but what is the expected behavior once it is? Is the
operation applied to the individual fields recursively?

c) What does CONCAT of two bytearrays mean? Does it just combine both arrays
into a new, larger array through array copies? (I am assuming this is what
CONCAT of chararrays does.)

d) For the aggregate functions MIN and MAX, can we provide our own
comparator (UDF or otherwise) for chararrays to define the relative
ordering, for example using Collators, instead of always assuming
lexicographical ordering (which I assume is the default)?


e) In the argument construction part of the function section, is the semantic
change applicable only to arithmetic operations? Only to aggregate UDFs?
Or to all UDFs?

What happens in this case:

employee = LOAD 'employee' AS (name, salary, bonus_multiplier);
grouped = GROUP employee BY name;
total_compensation = FOREACH grouped {
   T1 = employee.salary;
   T2 = employee.bonus_multiplier;
   GENERATE group, myUDF(T1 * T2); -- error ?
}
Similarly, what about GENERATE group, myUDF(T1, T2) above?




Thanks,
Mridul

Re: Pig 2.0 operators

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Sure.
I am still going through the 50-odd UDFs and the Pig scripts we have to
see what is involved in porting them.
If there are no immediate suggestions/comments on the questions I raised, I
will send out a more comprehensive list that includes those as well later on.


Regards,
Mridul

Olga Natkovich wrote:
> It would be good to have one list with all the questions that
> documentation did not clarify for you. I am hoping it addressed more
> than just NULL issues. 
> 
> Olga 
> 
>> -----Original Message-----
>> From: Mridul Muralidharan [mailto:mridulm@yahoo-inc.com] 
>> Sent: Monday, February 09, 2009 1:48 PM
>> To: pig-user@hadoop.apache.org
>> Subject: Re: Pig 2.0 operators
>>
>>
>> All the questions below, and those in other mails, that got no
>> responses (from me or others).
>>
>> Thanks,
>> Mridul
>>
>> Olga Natkovich wrote:
>>> Could you please summarize the list of questions that you feel are not
>>> adequately covered in the document so we can address them.
>>>
>>> Thanks,
>>>
>>> Olga
>>>
>>>> -----Original Message-----
>>>> From: Mridul Muralidharan [mailto:mridulm@yahoo-inc.com]
>>>> Sent: Monday, February 09, 2009 12:23 PM
>>>> To: pig-user@hadoop.apache.org
>>>> Subject: Re: Pig 2.0 operators
>>>>
>>>>
>>>> Hi all,
>>>>
>>>> To answer some of my questions below for the general audience, based on
>>>> the doc Olga mentioned:
>>>> http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm
>>>> (someone should update the spec with this, it is way more informative!).
>>>> I could not find anything that explained the other questions, though.
>>>>
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>>
>>>> Mridul Muralidharan wrote:
>>>>> Hi,
>>>>>
>>>>>   I have the following questions after going through the types functional spec.
>>>>>
>>>>>
>>>>> a) What does MATCHES on two bytearrays mean? The spec says it is supported
>>>>> but gives no further comment.
>>>>
>>>> Though it is not explicitly specified, my feeling is that they get cast
>>>> to chararray.
>>>>
>>>>
>>>>> b) Multiplication/division between a bag/tuple and a primitive: the spec says
>>>>> it is not implemented, but what is the expected behavior once it is? Is the
>>>>> operation applied to the individual fields recursively?
>>>>>
>>>>> c) What does CONCAT of two bytearrays mean? Does it just combine both arrays
>>>>> into a new, larger array through array copies? (I am assuming this is what
>>>>> CONCAT of chararrays does.)
>>>>
>>>> It yields a new array with the concatenated contents of the two bytearrays.
>>>> IMO, use it with caution, since it is a crude concat on binary blobs.
>>>>
>>>>> d) For the aggregate functions MIN and MAX, can we provide our own
>>>>> comparator (UDF or otherwise) for chararrays to define the relative
>>>>> ordering, for example using Collators, instead of always assuming
>>>>> lexicographical ordering (which I assume is the default)?
>>>>>
>>>>> e) In the argument construction part of the function section, is the semantic
>>>>> change applicable only to arithmetic operations? Only to aggregate UDFs?
>>>>> Or to all UDFs?
>>>>>
>>>>> What happens in this case:
>>>>>
>>>>> employee = LOAD 'employee' AS (name, salary, bonus_multiplier);
>>>>> grouped = GROUP employee BY name;
>>>>> total_compensation = FOREACH grouped {
>>>>>   T1 = employee.salary;
>>>>>   T2 = employee.bonus_multiplier;
>>>>>   GENERATE group, myUDF(T1 * T2); -- error ?
>>>>> }
>>>>> Similarly, what about GENERATE group, myUDF(T1, T2) above?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Mridul
>>


RE: Pig 2.0 operators

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
It would be good to have one list with all the questions that
documentation did not clarify for you. I am hoping it addressed more
than just NULL issues. 

Olga 

> -----Original Message-----
> From: Mridul Muralidharan [mailto:mridulm@yahoo-inc.com] 
> Sent: Monday, February 09, 2009 1:48 PM
> To: pig-user@hadoop.apache.org
> Subject: Re: Pig 2.0 operators
> 
> 
> All the questions below, and those in other mails, that got no
> responses (from me or others).
> 
> Thanks,
> Mridul
> 
> Olga Natkovich wrote:
> > Could you please summarize the list of questions that you feel are not
> > adequately covered in the document so we can address them.
> > 
> > Thanks,
> > 
> > Olga
> > 
> >> -----Original Message-----
> >> From: Mridul Muralidharan [mailto:mridulm@yahoo-inc.com]
> >> Sent: Monday, February 09, 2009 12:23 PM
> >> To: pig-user@hadoop.apache.org
> >> Subject: Re: Pig 2.0 operators
> >>
> >>
> >> Hi all,
> >>
> >> To answer some of my questions below for the general audience, based on
> >> the doc Olga mentioned:
> >> http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm
> >> (someone should update the spec with this, it is way more informative!).
> >> I could not find anything that explained the other questions, though.
> >>
> >>
> >> Regards,
> >> Mridul
> >>
> >>
> >> Mridul Muralidharan wrote:
> >>> Hi,
> >>>
> >>>   I have the following questions after going through the types functional spec.
> >>>
> >>>
> >>> a) What does MATCHES on two bytearrays mean? The spec says it is supported
> >>> but gives no further comment.
> >>
> >> Though it is not explicitly specified, my feeling is that they get cast
> >> to chararray.
> >>
> >>
> >>> b) Multiplication/division between a bag/tuple and a primitive: the spec says
> >>> it is not implemented, but what is the expected behavior once it is? Is the
> >>> operation applied to the individual fields recursively?
> >>>
> >>> c) What does CONCAT of two bytearrays mean? Does it just combine both arrays
> >>> into a new, larger array through array copies? (I am assuming this is what
> >>> CONCAT of chararrays does.)
> >>
> >> It yields a new array with the concatenated contents of the two bytearrays.
> >> IMO, use it with caution, since it is a crude concat on binary blobs.
> >>
> >>> d) For the aggregate functions MIN and MAX, can we provide our own
> >>> comparator (UDF or otherwise) for chararrays to define the relative
> >>> ordering, for example using Collators, instead of always assuming
> >>> lexicographical ordering (which I assume is the default)?
> >>>
> >>> e) In the argument construction part of the function section, is the semantic
> >>> change applicable only to arithmetic operations? Only to aggregate UDFs?
> >>> Or to all UDFs?
> >>>
> >>> What happens in this case:
> >>>
> >>> employee = LOAD 'employee' AS (name, salary, bonus_multiplier);
> >>> grouped = GROUP employee BY name;
> >>> total_compensation = FOREACH grouped {
> >>>   T1 = employee.salary;
> >>>   T2 = employee.bonus_multiplier;
> >>>   GENERATE group, myUDF(T1 * T2); -- error ?
> >>> }
> >>> Similarly, what about GENERATE group, myUDF(T1, T2) above?
> >>>
> >>>
> >>>
> >>>
> >>> Thanks,
> >>> Mridul
> >>
> 
> 

Re: Pig 2.0 operators

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
All the questions below, and those in other mails, that got no responses
(from me or others).

Thanks,
Mridul

Olga Natkovich wrote:
> Could you please summarize the list of questions that you feel are not
> adequately covered in the document so we can address them.
> 
> Thanks,
> 
> Olga 
> 
>> -----Original Message-----
>> From: Mridul Muralidharan [mailto:mridulm@yahoo-inc.com] 
>> Sent: Monday, February 09, 2009 12:23 PM
>> To: pig-user@hadoop.apache.org
>> Subject: Re: Pig 2.0 operators
>>
>>
>> Hi all,
>>
>> To answer some of my questions below for the general audience, based on
>> the doc Olga mentioned:
>> http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm
>> (someone should update the spec with this, it is way more informative!).
>> I could not find anything that explained the other questions, though.
>>
>>
>> Regards,
>> Mridul
>>
>>
>> Mridul Muralidharan wrote:
>>> Hi,
>>>
>>>   I have the following questions after going through the types functional spec.
>>>
>>>
>>> a) What does MATCHES on two bytearrays mean? The spec says it is supported
>>> but gives no further comment.
>>
>> Though it is not explicitly specified, my feeling is that they get cast
>> to chararray.
>>
>>
>>> b) Multiplication/division between a bag/tuple and a primitive: the spec says
>>> it is not implemented, but what is the expected behavior once it is? Is the
>>> operation applied to the individual fields recursively?
>>>
>>> c) What does CONCAT of two bytearrays mean? Does it just combine both arrays
>>> into a new, larger array through array copies? (I am assuming this is what
>>> CONCAT of chararrays does.)
>>
>> It yields a new array with the concatenated contents of the two bytearrays.
>> IMO, use it with caution, since it is a crude concat on binary blobs.
>>
>>> d) For the aggregate functions MIN and MAX, can we provide our own
>>> comparator (UDF or otherwise) for chararrays to define the relative
>>> ordering, for example using Collators, instead of always assuming
>>> lexicographical ordering (which I assume is the default)?
>>>
>>> e) In the argument construction part of the function section, is the semantic
>>> change applicable only to arithmetic operations? Only to aggregate UDFs?
>>> Or to all UDFs?
>>>
>>> What happens in this case:
>>>
>>> employee = LOAD 'employee' AS (name, salary, bonus_multiplier);
>>> grouped = GROUP employee BY name;
>>> total_compensation = FOREACH grouped {
>>>   T1 = employee.salary;
>>>   T2 = employee.bonus_multiplier;
>>>   GENERATE group, myUDF(T1 * T2); -- error ?
>>> }
>>> Similarly, what about GENERATE group, myUDF(T1, T2) above?
>>>
>>>
>>>
>>>
>>> Thanks,
>>> Mridul
>>


RE: Pig 2.0 operators

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Could you please summarize the list of questions that you feel are not
adequately covered in the document so we can address them.

Thanks,

Olga 

> -----Original Message-----
> From: Mridul Muralidharan [mailto:mridulm@yahoo-inc.com] 
> Sent: Monday, February 09, 2009 12:23 PM
> To: pig-user@hadoop.apache.org
> Subject: Re: Pig 2.0 operators
> 
> 
> Hi all,
> 
> To answer some of my questions below for the general audience, based on
> the doc Olga mentioned:
> http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm
> (someone should update the spec with this, it is way more informative!).
> I could not find anything that explained the other questions, though.
> 
> 
> Regards,
> Mridul
> 
> 
> Mridul Muralidharan wrote:
> > Hi,
> > 
> >   I have the following questions after going through the types functional spec.
> > 
> > 
> > a) What does MATCHES on two bytearrays mean? The spec says it is supported
> > but gives no further comment.
> 
> Though it is not explicitly specified, my feeling is that they get cast
> to chararray.
> 
> 
> > b) Multiplication/division between a bag/tuple and a primitive: the spec says
> > it is not implemented, but what is the expected behavior once it is? Is the
> > operation applied to the individual fields recursively?
> > 
> > c) What does CONCAT of two bytearrays mean? Does it just combine both arrays
> > into a new, larger array through array copies? (I am assuming this is what
> > CONCAT of chararrays does.)
> 
> It yields a new array with the concatenated contents of the two bytearrays.
> IMO, use it with caution, since it is a crude concat on binary blobs.
> 
> > d) For the aggregate functions MIN and MAX, can we provide our own
> > comparator (UDF or otherwise) for chararrays to define the relative
> > ordering, for example using Collators, instead of always assuming
> > lexicographical ordering (which I assume is the default)?
> > 
> > 
> > e) In the argument construction part of the function section, is the semantic
> > change applicable only to arithmetic operations? Only to aggregate UDFs?
> > Or to all UDFs?
> > 
> > What happens in this case:
> > 
> > employee = LOAD 'employee' AS (name, salary, bonus_multiplier);
> > grouped = GROUP employee BY name;
> > total_compensation = FOREACH grouped {
> >   T1 = employee.salary;
> >   T2 = employee.bonus_multiplier;
> >   GENERATE group, myUDF(T1 * T2); -- error ?
> > }
> > Similarly, what about GENERATE group, myUDF(T1, T2) above?
> > 
> > 
> > 
> > 
> > Thanks,
> > Mridul
> 
> 

Re: Pig 2.0 operators

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Hi all,

To answer some of my questions below for the general audience, based on the
doc Olga mentioned:
http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm
(someone should update the spec with this, it is way more informative!).
I could not find anything that explained the other questions, though.


Regards,
Mridul


Mridul Muralidharan wrote:
> Hi,
> 
>   I have the following questions after going through the types functional spec.
> 
> 
> a) What does MATCHES on two bytearrays mean? The spec says it is supported
> but gives no further comment.


Though it is not explicitly specified, my feeling is that they get cast
to chararray.
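
A minimal sketch of that reading, in Python rather than Pig: both bytearrays are decoded to text, and the whole value must match the pattern. The function name, the UTF-8 default, and the whole-string match are assumptions for illustration only; the spec does not confirm any of them, and Python's regex dialect differs from Java's.

```python
import re

def matches_bytearrays(value: bytes, pattern: bytes, encoding: str = "utf-8") -> bool:
    """Hypothetical helper: treat both bytearrays as text, then regex-match.

    Assumption: the cast uses UTF-8 and the pattern must cover the whole value.
    """
    text = value.decode(encoding)      # bytearray -> chararray (assumed cast)
    regex = pattern.decode(encoding)   # the pattern is cast the same way
    return re.fullmatch(regex, text) is not None

print(matches_bytearrays(b"apache pig", b"apache.*"))
print(matches_bytearrays(b"apache pig", b"pig"))
```

If the bytes are not valid text in the assumed encoding, the decode step raises an error, which is one reason the cast deserves an explicit statement in the spec.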


> 
> b) Multiplication/division between a bag/tuple and a primitive: the spec says
> it is not implemented, but what is the expected behavior once it is? Is the
> operation applied to the individual fields recursively?
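
To make the "apply to individual fields recursively" reading concrete, here is a rough Python sketch with a list standing in for a bag and a Python tuple for a Pig tuple. It only illustrates the question; the spec says the operation is not implemented, and leaving non-numeric fields untouched is purely an assumption.

```python
from numbers import Number

def scale(value, factor):
    """Hypothetical recursive multiply over nested bags (lists) and tuples."""
    if isinstance(value, tuple):
        return tuple(scale(field, factor) for field in value)
    if isinstance(value, list):
        return [scale(item, factor) for item in value]
    # bool is a Number in Python, so exclude it explicitly.
    if isinstance(value, Number) and not isinstance(value, bool):
        return value * factor
    return value  # assumption: non-numeric fields pass through unchanged

bag = [(1, 2.5), (3, (4, "x"))]
print(scale(bag, 2))
```
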
> 
> c) What does CONCAT of two bytearrays mean? Does it just combine both arrays
> into a new, larger array through array copies? (I am assuming this is what
> CONCAT of chararrays does.)

It yields a new array with the concatenated contents of the two bytearrays.
IMO, use it with caution, since it is a crude concat on binary blobs.
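
A small Python sketch of why caution is warranted, assuming CONCAT really is a plain byte-level copy into a new array. The helper name and the sample values are made up for illustration.

```python
def concat_bytearrays(left: bytes, right: bytes) -> bytes:
    """Hypothetical byte-level concat: allocate a new array and copy both inputs."""
    result = bytearray(len(left) + len(right))
    result[:len(left)] = left
    result[len(left):] = right
    return bytes(result)

# The same character encoded two different ways.
a = "é".encode("utf-8")     # two bytes
b = "é".encode("latin-1")   # one byte
joined = concat_bytearrays(a, b)

try:
    joined.decode("utf-8")
except UnicodeDecodeError:
    print("joined blob is not valid UTF-8")
```

The concat itself succeeds, but the result mixes encodings, so a later cast to chararray can fail or produce garbage.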

> 
> d) For the aggregate functions MIN and MAX, can we provide our own
> comparator (UDF or otherwise) for chararrays to define the relative
> ordering, for example using Collators, instead of always assuming
> lexicographical ordering (which I assume is the default)?
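
To illustrate what a pluggable comparator would change, here is a Python sketch that compares plain lexicographic MIN/MAX with a caller-supplied ordering. The helper names are hypothetical, and the case-insensitive comparator is only a stand-in for a real Collator; a locale-aware function such as `locale.strcoll` could be passed in the same way.

```python
from functools import cmp_to_key

def case_insensitive_cmp(a: str, b: str) -> int:
    """Stand-in comparator: negative, zero, or positive, ignoring case."""
    x, y = a.casefold(), b.casefold()
    return (x > y) - (x < y)

def max_with(values, comparator):
    """MAX under a caller-supplied ordering (hypothetical helper)."""
    return max(values, key=cmp_to_key(comparator))

def min_with(values, comparator):
    """MIN under a caller-supplied ordering (hypothetical helper)."""
    return min(values, key=cmp_to_key(comparator))

names = ["apple", "Zebra", "banana"]
print(max(names), min(names))  # default code-point ordering
print(max_with(names, case_insensitive_cmp), min_with(names, case_insensitive_cmp))
```

With the default ordering, uppercase letters sort before lowercase ones, so the two lines pick different values; that difference is exactly what the question is about.
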
> 
> 
> e) In the argument construction part of the function section, is the semantic
> change applicable only to arithmetic operations? Only to aggregate UDFs?
> Or to all UDFs?
> 
> What happens in this case:
> 
> employee = LOAD 'employee' AS (name, salary, bonus_multiplier);
> grouped = GROUP employee BY name;
> total_compensation = FOREACH grouped {
>   T1 = employee.salary;
>   T2 = employee.bonus_multiplier;
>   GENERATE group, myUDF(T1 * T2); -- error ?
> }
> Similarly, what about GENERATE group, myUDF(T1, T2) above?
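
To show why the example is ambiguous, here is a Python simulation of the data shapes involved. It does not claim to reproduce what Pig does; the sample rows, the summing `my_udf`, and the choice to multiply per row before aggregating are all assumptions for illustration.

```python
from collections import defaultdict

# Made-up rows in the shape (name, salary, bonus_multiplier).
rows = [
    ("alice", 100.0, 1.5),
    ("alice", 50.0, 2.0),
    ("bob", 80.0, 1.0),
]

def group_by_name(rows):
    """Stand-in for GROUP employee BY name: name -> bag (list) of rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return dict(groups)

def my_udf(bag):
    """Hypothetical aggregate UDF: sums the single field of each tuple in a bag."""
    return sum(value for (value,) in bag)

def total_compensation(rows):
    result = {}
    for name, bag in group_by_name(rows).items():
        t1 = [(salary,) for _, salary, _ in bag]   # like employee.salary
        t2 = [(mult,) for _, _, mult in bag]       # like employee.bonus_multiplier
        # t1 and t2 are two separate bags, so "t1 * t2" has no obvious meaning.
        # Multiplying inside each row first avoids pairing up bag elements.
        products = [(salary * mult,) for _, salary, mult in bag]
        result[name] = my_udf(products)
    return result

print(total_compensation(rows))
```

A UDF called as myUDF(T1, T2) would receive the two bags separately and would have to decide for itself how to pair their elements, which is why the spec should say what is passed in each case.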
> 
> 
> 
> 
> Thanks,
> Mridul