Posted to user@pig.apache.org by Brian Stempin <bs...@coldlight.com> on 2012/10/05 17:45:40 UTC

Question about UDFs and tuple ordering

Hi,
I'm fairly new to writing UDFs and Pig in general.  I want to be able to write a UDF that can take advantage of MapReduce's sorting of data.  Specifically, I'm trying to conceive how I'd write a UDF to do a specialized join or a pivot. In both cases, sorting would be useful.  EvalFunc seems to give no guarantees about ordering of tuples that are passed in.

Is there any way to do such things as a UDF?

TIA for your help,
Brian Stempin
Machine Learning Engineer
ColdLight Solutions, LLC

________________________________
This e-mail is intended solely for the above-mentioned recipient and it may contain confidential or privileged information. If you have received it in error, please notify us immediately and delete the e-mail. You must not copy, distribute, disclose or take any action in reliance on it. In addition, the contents of an attachment to this e-mail may contain software viruses which could damage your own computer system. While ColdLight Solutions, LLC has taken every reasonable precaution to minimize this risk, we cannot accept liability for any damage which you sustain as a result of software viruses. You should perform your own virus checks before opening the attachment.

RE: Question about UDFs and tuple ordering

Posted by Brian Stempin <bs...@coldlight.com>.
Awesome -- I really appreciate that insight.  Is that recorded anywhere?  If not, then perhaps I'll spend some time writing about how these things are implemented in the wiki for when others come along with similar questions.

Thanks, Alan!



Re: Question about UDFs and tuple ordering

Posted by Alan Gates <ga...@hortonworks.com>.
Many operators, such as join and group by, are not implemented by a single physical operation.  Their code is also spread across packages, because each has a logical component and physical components.  The logical component of join is in org.apache.pig.newplan.logical.relational.LOJoin.java.  That gets translated to three physical operators: POLocalRearrange, POPackage, and POForEach.  All of the physical operators are in org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.
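
To make that three-stage shape concrete, here is a toy Python model of a reduce-side inner join. It is only a sketch of the idea, not Pig's code: the function names (local_rearrange, package, foreach_flatten, toy_join) are made up for illustration, and plain in-memory lists stand in for MapReduce's shuffle.

```python
from collections import defaultdict
from itertools import product


def local_rearrange(relation, index, key_pos):
    # Map side (the POLocalRearrange role): tag each tuple with its join key
    # and the index of the input it came from.
    return [(row[key_pos], index, row) for row in relation]


def package(tagged, num_inputs):
    # Reduce side (the POPackage role): for each key, collect the tuples
    # into one bag per input. The sort stands in for the shuffle.
    groups = defaultdict(lambda: [[] for _ in range(num_inputs)])
    for key, index, row in sorted(tagged, key=lambda t: (t[0], t[1])):
        groups[key][index].append(row)
    return groups


def foreach_flatten(groups):
    # The foreach role: flatten the cross product of the bags for each key.
    # An empty bag yields an empty product, which gives inner-join semantics.
    out = []
    for key in sorted(groups):
        for combo in product(*groups[key]):
            out.append(tuple(value for row in combo for value in row))
    return out


def toy_join(left, right, left_key=0, right_key=0):
    # Hypothetical driver that wires the three stages together.
    tagged = (local_rearrange(left, 0, left_key)
              + local_rearrange(right, 1, right_key))
    return foreach_flatten(package(tagged, 2))
```

Calling toy_join on two small lists of tuples returns one row per matching pair. Keys that appear in only one input produce nothing, because one of their bags is empty.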

Alan.

On Oct 5, 2012, at 11:01 AM, Brian Stempin wrote:

> Thanks Russell -- That's really useful.
> 
> Just for kicks and giggles:  Where would I look in the code base to see how the JOIN keyword is implemented?  I've found the built in functions, but not the keywords (JOIN, GROUP, etc).  Perhaps that would give me some hints.  Perhaps it'll show me that a UDF might not be the best option for my set of problems.
> 
> Thanks once again,
> Brian


RE: Question about UDFs and tuple ordering

Posted by Brian Stempin <bs...@coldlight.com>.
Thanks Russell -- That's really useful.

Just for kicks and giggles:  Where would I look in the code base to see how the JOIN keyword is implemented?  I've found the built in functions, but not the keywords (JOIN, GROUP, etc).  Perhaps that would give me some hints.  Perhaps it'll show me that a UDF might not be the best option for my set of problems.

Thanks once again,
Brian



Re: Question about UDFs and tuple ordering

Posted by Russell Jurney <ru...@gmail.com>.
You can write an EvalFunc UDF that depends on a sort, and there are
several in piggybank that do so. COR (the correlate UDF) is one
example. You call these UDFs on a relation after ordering it.

For example:

answers = foreach (group data by key)
{
  sorted = order data by value;
  generate my_udf(sorted.field1, sorted.field2);
};

If I remember correctly, you can in fact also do this:

sorted = order data by field;
answer = foreach sorted generate my_udf(sorted.field, sorted.other_field);

Although strictly speaking, Pig doesn't guarantee that a sort is
maintained outside of the nested {} block.
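
As a rough illustration of a UDF that leans on that ordering, here is the core logic of a pivot written as a plain Python function. This is a sketch under assumptions: pivot_sorted is a hypothetical name, the bag is assumed to arrive as a list of (name, value) tuples already sorted by name inside the nested block, and the sortedness check is an added safety net, not something Pig provides.

```python
def pivot_sorted(bag):
    """Pivot a bag of (name, value) tuples into one flat tuple of values.

    The output column positions are only meaningful if the bag is already
    sorted by name, so the function checks that instead of trusting it.
    """
    names = [row[0] for row in bag]
    # Fail loudly if the caller forgot the ORDER inside the nested block.
    if names != sorted(names):
        raise ValueError("bag is not sorted by name; order it before calling")
    return tuple(row[1] for row in bag)
```

The same logic could live in the exec method of a Java EvalFunc. The point is that the UDF itself never sorts; it relies on the ordering applied to the bag just before the call.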

I can't help on the JOIN, I don't know about that. But check Pig's
bloom filter: http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/Bloom.html

Russell Jurney twitter.com/rjurney


On Oct 5, 2012, at 11:46 AM, Brian Stempin <bs...@coldlight.com> wrote:

> Hi,
> I'm fairly new to writing UDFs and Pig in general.  I want to be able to write a UDF that can take advantage of MapReduce's sorting of data.  Specifically, I'm trying to conceive how I'd write a UDF to do a specialized join or a pivot. In both cases, sorting would be useful.  EvalFunc seems to give no guarantees about ordering of tuples that are passed in.
>
> Is there any way to do such things as a UDF?
>
> TIA for your help,
> Brian Stempin
> Machine Learning Engineer
> ColdLight Solutions, LLC