Posted to user@pig.apache.org by Jonathan Coveney <jc...@gmail.com> on 2011/01/04 20:10:52 UTC

Taking advantage of structure when doing UDFs and whatnot?

I wasn't quite sure what to title this, but hopefully it'll make sense. I have
a couple of questions relating to a query that ultimately seeks to do this:

You have

1 10
1 12
1 15
1 16
2 1
2 2
2 3
2 6

You want your output to be the difference between the successive numbers in
the second column, i.e.

1 (10,0)
1 (12,2)
1 (15,3)
1 (16,1)
2 (1,0)
2 (2,1)
2 (3,1)
2 (6,3)

Obviously, I need to write a UDF to do this, but I have a couple of questions:

1) if we know for a fact that the rows for a given first column will ALWAYS
be on the same node, do we need to do anything to take advantage of that? My
assumption would be that the group operation would be smart enough to take
care of this, but I am not sure how it avoids checking to make sure that
other nodes don't have additional info (even if I can say for a fact that
they don't). Then again, given replication of data I guess if you do an
operation on the grouped data it might still try and distribute that over
the filesystem?

2) The number of values in the second column can potentially be large, and I
want this process to be quick, so what's the best way to implement it?
Naively I would say to group everything, then pass that bag to a UDF which
sorts, does the calculation, and then returns a new bag with the tuples.
This doesn't seem like it is taking advantage of a distributed
framework...would splitting it up into two UDFs, one which sorts the bag, and
then another which returns the tuples (and now that it's sorted, you could
distribute it better), be better?

I'm trying to avoid writing my own MR (as I never have before), but am not
averse to it if necessary. I am just not sure how to get Pig to do it as
efficiently as (I think) it can be done.

I appreciate your help!
Jon

Re: Taking advantage of structure when doing UDFs and whatnot?

Posted by Kris Coward <kr...@melon.org>.
On Tue, Jan 04, 2011 at 02:10:52PM -0500, Jonathan Coveney wrote:
> I wasn't quite sure what to title this, but hopefully it'll make sense. I have
> a couple of questions relating to a query that ultimately seeks to do this
> 
> You have
> 
> 1 10
> 1 12
> 1 15
> 1 16
> 2 1
> 2 2
> 2 3
> 2 6
> 
> You want your output to be the difference between the successive numbers in
> the second column, ie
> 
> 1 (10,0)
> 1 (12,2)
> 1 (15,3)
> 1 (16,1)
> 2 (1,0)
> 2 (2,1)
> 2 (3,1)
> 2 (6,3)
> 
> Obviously, I need to write a udf to do this, but I have a couple questions..

If you were to have some sort of row counter, then I suspect that you
could do something along the lines of

  relCopy = FOREACH relName GENERATE *;  -- Pig needs a second alias to join a relation with itself
  newRel = JOIN relName BY counter, relCopy BY counter - 1;
  diff = FOREACH newRel GENERATE relName::stuff AS [...], relCopy::thing - relName::thing AS difference;

if you really want to avoid writing an extra UDF. But in the absence of such a
counter, yeah, I think a UDF would be necessary.
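
For concreteness, a minimal sketch of that self-join (field names are
made up, and it assumes the input already carries a counter column that
is consecutive within each key; Pig 0.11 and later can generate one
with RANK). The first row of each key drops out, since it has no
predecessor:

  raw = LOAD 'input' AS (key:int, counter:long, val:long);
  raw2 = FOREACH raw GENERATE *;  -- Pig needs a distinct alias to self-join
  joined = JOIN raw BY (key, counter), raw2 BY (key, counter - 1);
  diffs = FOREACH joined GENERATE raw2::key, raw2::val, raw2::val - raw::val AS difference;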

Cheers,
Kris

-- 
Kris Coward					http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3

Re: Taking advantage of structure when doing UDFs and whatnot?

Posted by Alan Gates <ga...@yahoo-inc.com>.
On Jan 4, 2011, at 2:07 PM, Jonathan Coveney wrote:

> Thanks for the help Alan, I really appreciate it. Can you currently  
> extend
> interfaces in python UDF's? I am not super familiar with how jython  
> and
> python interact in that capacity.

No, we just introduced Python UDFs in 0.8.  We haven't yet added  
the ability for them to extend the Algebraic and Accumulator interfaces.

Alan.

>
> The internal sort in the foreach and the using 'collected' (assuming  
> I can
> get it to work :) should be big wins.
>
> 2011/1/4 Alan Gates <ga...@yahoo-inc.com>
>
>> Answers inline.
>>
>>
>> On Jan 4, 2011, at 11:10 AM, Jonathan Coveney wrote:
>>
>> I wasn't quite sure what to title this, but hopefully it'll make  
>> sense. I
>>> have
>>> a couple of questions relating to a query that ultimately seeks to  
>>> do this
>>>
>>> You have
>>>
>>> 1 10
>>> 1 12
>>> 1 15
>>> 1 16
>>> 2 1
>>> 2 2
>>> 2 3
>>> 2 6
>>>
>>> You want your output to be the difference between the successive  
>>> numbers
>>> in
>>> the second column, ie
>>>
>>> 1 (10,0)
>>> 1 (12,2)
>>> 1 (15,3)
>>> 1 (16,1)
>>> 2 (1,0)
>>> 2 (2,1)
>>> 2 (3,1)
>>> 2 (6,3)
>>>
>>> Obviously, I need to write a udf to do this, but I have a couple
>>> questions..
>>>
>>> 1) if we know for a fact that the rows for a given first column will
>>> ALWAYS
>>> be on the same node, do we need to do anything to take advantage  
>>> of that?
>>> My
>>> assumption would be that the group operation would be smart enough  
>>> to take
>>> care of this, but I am not sure how it avoids checking to make  
>>> sure that
>>> other nodes don't have additional info (even if I can say for a  
>>> fact that
>>> they don't). Then again, given replication of data I guess if you  
>>> do an
>>> operation on the grouped data it might still try and distribute  
>>> that over
>>> the filesystem?
>>>
>>
>> First, whether they are located in the same node does not matter.   
>> What
>> matters is whether they will all be in the same split when the maps  
>> are
>> started.  If they are stored in an HDFS file this usually means  
>> that they
>> are all in the same block.
>>
>> Group by cannot know a priori that all values of the key will be  
>> located in
>> the same split.  As of Pig 0.7 you can tell Pig this by saying "using
>> 'collected'" after the group by statement.  See
>> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#GROUP for exact
>> syntax and restrictions. This tells Pig to do the grouping in the  
>> map phase
>> since it does not need to do a shuffle and reduce to collect all  
>> the keys
>> together.
>>
>>
>>
>>> 2) The number of values in the second column can potentially be  
>>> large, and
>>> I
>>> want this process to be quick, so what's the best way to implement  
>>> it?
>>> Naively I would say to group everything, then pass that bag to a  
>>> UDF which
>>> sorts, does the calculation, and then returns a new bag with the  
>>> tuples.
>>> This doesn't seem like it is taking advantage of a distributed
>>> framework...would splitting it up into 2 UDF's, one which sorts  
>>> the bag,
>>> and
>>> then another which returns the tuples (and now that it's sorted,  
>>> you could
>>> distribute it better), be better?
>>>
>>
>> B = group A by firstfield;
>> C = foreach B {
>>       C1 = order A by secondfield;
>>       generate group, yourudf(C1);
>> }
>>
>> The order inside the foreach will order each collection by the second
>> field, so there's no need to write a UDF for that.  In fact Pig  
>> will take
>> advantage of the secondary sort in MR so that there isn't even a  
>> separate
>> sorting pass over the data.  yourudf should then implement the  
>> Accumulator
>> interface so that it will receive collections of records in batches  
>> that
>> will be sorted.
>>
>> Alan.
>>
>>
>>
>>> I'm trying to avoid writing my own MR (as I never have before),  
>>> but am not
>>> averse to it if necessary. I am just not sure of how to get pig to  
>>> do it
>>> as
>>> efficiently as (I think) it can be done.
>>>
>>> I appreciate your help!
>>> Jon
>>>
>>
>>


Re: Taking advantage of structure when doing UDFs and whatnot?

Posted by Jonathan Coveney <jc...@gmail.com>.
Thanks for the help Alan, I really appreciate it. Can you currently extend
interfaces in Python UDFs? I am not super familiar with how Jython and
Python interact in that capacity.

The internal sort in the foreach and the using 'collected' (assuming I can
get it to work :) should be big wins.

2011/1/4 Alan Gates <ga...@yahoo-inc.com>

> Answers inline.
>
>
> On Jan 4, 2011, at 11:10 AM, Jonathan Coveney wrote:
>
>  I wasn't quite sure what to title this, but hopefully it'll make sense. I
>> have
>> a couple of questions relating to a query that ultimately seeks to do this
>>
>> You have
>>
>> 1 10
>> 1 12
>> 1 15
>> 1 16
>> 2 1
>> 2 2
>> 2 3
>> 2 6
>>
>> You want your output to be the difference between the successive numbers
>> in
>> the second column, ie
>>
>> 1 (10,0)
>> 1 (12,2)
>> 1 (15,3)
>> 1 (16,1)
>> 2 (1,0)
>> 2 (2,1)
>> 2 (3,1)
>> 2 (6,3)
>>
>> Obviously, I need to write a udf to do this, but I have a couple
>> questions..
>>
>> 1) if we know for a fact that the rows for a given first column will
>> ALWAYS
>> be on the same node, do we need to do anything to take advantage of that?
>> My
>> assumption would be that the group operation would be smart enough to take
>> care of this, but I am not sure how it avoids checking to make sure that
>> other nodes don't have additional info (even if I can say for a fact that
>> they don't). Then again, given replication of data I guess if you do an
>> operation on the grouped data it might still try and distribute that over
>> the filesystem?
>>
>
> First, whether they are located in the same node does not matter.  What
> matters is whether they will all be in the same split when the maps are
> started.  If they are stored in an HDFS file this usually means that they
> are all in the same block.
>
> Group by cannot know a priori that all values of the key will be located in
> the same split.  As of Pig 0.7 you can tell Pig this by saying "using
> 'collected'" after the group by statement.  See
> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#GROUP for exact
> syntax and restrictions. This tells Pig to do the grouping in the map phase
> since it does not need to do a shuffle and reduce to collect all the keys
> together.
>
>
>
>> 2) The number of values in the second column can potentially be large, and
>> I
>> want this process to be quick, so what's the best way to implement it?
>> Naively I would say to group everything, then pass that bag to a UDF which
>> sorts, does the calculation, and then returns a new bag with the tuples.
>> This doesn't seem like it is taking advantage of a distributed
>> framework...would splitting it up into 2 UDF's, one which sorts the bag,
>> and
>> then another which returns the tuples (and now that it's sorted, you could
>> distribute it better), be better?
>>
>
> B = group A by firstfield;
> C = foreach B {
>        C1 = order A by secondfield;
>        generate group, yourudf(C1);
> }
>
> The order inside the foreach will order each collection by the second
> field, so there's no need to write a UDF for that.  In fact Pig will take
> advantage of the secondary sort in MR so that there isn't even a separate
> sorting pass over the data.  yourudf should then implement the Accumulator
> interface so that it will receive collections of records in batches that
> will be sorted.
>
> Alan.
>
>
>
>> I'm trying to avoid writing my own MR (as I never have before), but am not
>> averse to it if necessary. I am just not sure of how to get pig to do it
>> as
>> efficiently as (I think) it can be done.
>>
>> I appreciate your help!
>> Jon
>>
>
>

Re: Taking advantage of structure when doing UDFs and whatnot?

Posted by Alan Gates <ga...@yahoo-inc.com>.
Answers inline.

On Jan 4, 2011, at 11:10 AM, Jonathan Coveney wrote:

> I wasn't quite sure what to title this, but hopefully it'll make sense.  
> I have
> a couple of questions relating to a query that ultimately seeks to  
> do this
>
> You have
>
> 1 10
> 1 12
> 1 15
> 1 16
> 2 1
> 2 2
> 2 3
> 2 6
>
> You want your output to be the difference between the successive  
> numbers in
> the second column, ie
>
> 1 (10,0)
> 1 (12,2)
> 1 (15,3)
> 1 (16,1)
> 2 (1,0)
> 2 (2,1)
> 2 (3,1)
> 2 (6,3)
>
> Obviously, I need to write a udf to do this, but I have a couple  
> questions..
>
> 1) if we know for a fact that the rows for a given first column will  
> ALWAYS
> be on the same node, do we need to do anything to take advantage of  
> that? My
> assumption would be that the group operation would be smart enough  
> to take
> care of this, but I am not sure how it avoids checking to make sure  
> that
> other nodes don't have additional info (even if I can say for a fact  
> that
> they don't). Then again, given replication of data I guess if you do  
> an
> operation on the grouped data it might still try and distribute that  
> over
> the filesystem?

First, whether they are located on the same node does not matter.  What
matters is whether they will all be in the same split when the maps are
started.  If they are stored in an HDFS file this usually means that
they are all in the same block.

Group by cannot know a priori that all values of the key will be
located in the same split.  As of Pig 0.7 you can tell Pig this by
saying "using 'collected'" after the group by statement.  See
http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#GROUP for exact
syntax and restrictions.  This tells Pig to do the grouping in the map
phase since it does not need to do a shuffle and reduce to collect all
the keys together.
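
A minimal sketch of the syntax, with made-up alias and field names
(note the restriction in the linked docs that the loader must
implement CollectableLoadFunc):

A = load 'input' as (key:int, val:long);
B = group A by key using 'collected';  -- grouped map-side, no shuffle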

>
> 2) The number of values in the second column can potentially be  
> large, and I
> want this process to be quick, so what's the best way to implement it?
> Naively I would say to group everything, then pass that bag to a UDF  
> which
> sorts, does the calculation, and then returns a new bag with the  
> tuples.
> This doesn't seem like it is taking advantage of a distributed
> framework...would splitting it up into 2 UDF's, one which sorts the  
> bag, and
> then another which returns the tuples (and now that it's sorted, you  
> could
> distribute it better), be better?

B = group A by firstfield;
C = foreach B {
	-- sorts each group's bag by the second field
	C1 = order A by secondfield;
	generate group, yourudf(C1);
}

The order inside the foreach will order each collection by the second  
field, so there's no need to write a UDF for that.  In fact Pig will  
take advantage of the secondary sort in MR so that there isn't even a  
separate sorting pass over the data.  yourudf should then implement  
the Accumulator interface so that it will receive collections of  
records in batches that will be sorted.
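
A rough sketch of what such an Accumulator UDF might look like
(untested, and the class name is made up; it assumes it is invoked as
yourudf(C1), so each tuple in the bag carries both fields, sorted on
the second):

import java.io.IOException;

import org.apache.pig.Accumulator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SuccessiveDiffs extends EvalFunc<DataBag>
        implements Accumulator<DataBag> {
    private DataBag output = BagFactory.getInstance().newDefaultBag();
    private Long prev = null;  // last value seen, carried across batches

    // Pig hands the sorted bag to accumulate() in batches.
    public void accumulate(Tuple input) throws IOException {
        DataBag batch = (DataBag) input.get(0);
        for (Tuple t : batch) {
            // secondfield is at index 1 of each tuple in the bag
            long val = ((Number) t.get(1)).longValue();
            long diff = (prev == null) ? 0 : val - prev;
            Tuple out = TupleFactory.getInstance().newTuple(2);
            out.set(0, val);
            out.set(1, diff);
            output.add(out);
            prev = val;
        }
    }

    public DataBag getValue() {
        return output;
    }

    public void cleanup() {
        output = BagFactory.getInstance().newDefaultBag();
        prev = null;
    }

    // Fallback for when Pig runs the UDF in regular, non-accumulative mode.
    public DataBag exec(Tuple input) throws IOException {
        cleanup();
        accumulate(input);
        DataBag result = getValue();
        cleanup();
        return result;
    }
}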

Alan.

>
> I'm trying to avoid writing my own MR (as I never have before), but  
> am not
> averse to it if necessary. I am just not sure of how to get pig to  
> do it as
> efficiently as (I think) it can be done.
>
> I appreciate your help!
> Jon