Posted to user@pig.apache.org by Miles Scruggs <mi...@digitalphotobox.net> on 2009/10/07 01:49:13 UTC

UDF processing

Hi,

I have some data that I'm trying to join, but since the join isn't a
straight match, it requires pushing the key into a UDF to find a
match.  I'm just wondering what the best way to do this is.  I have a
radix trie and a search function written to do the match based on the
key, but as the trie is about 12.5 MB I don't want it to be loaded
for each and every record, but rather once for the group, and then
just process every record.
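
For anyone following along, the kind of longest-prefix lookup described
here can be sketched in plain Java.  This is only an illustration (the
class and method names are made up, and a real radix trie would be far
more memory-efficient); a sorted map gives the same lookup semantics:

```java
import java.util.TreeMap;

// Minimal longest-prefix matcher. A real radix trie is more compact,
// but a map keyed by prefix gives the same lookup behaviour.
public class PrefixMatcher {
    private final TreeMap<String, String> entries = new TreeMap<>();

    public void put(String prefix, String value) {
        entries.put(prefix, value);
    }

    // Try candidate prefixes from longest to shortest until one is
    // present in the map; return null if nothing matches.
    public String longestMatch(String key) {
        for (int len = key.length(); len > 0; len--) {
            String value = entries.get(key.substring(0, len));
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}
```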

What I'm wondering is: since the code to load the B data and build the
trie is in the UDF, will that same process be run for each record, or
is Pig smart enough to make sure that only happens once?  Is there a
better way I should be doing this?

To complicate matters further, I would like the data that is loaded
into the trie to be a function of the record being passed.  It isn't
as bad as it seems, as multiple records will be grouped together, so
there would be 100k records between changes in the trie.  This latter
part I can definitely partition out into a separate job, but it seems
like no matter how I do it, the first issue is always going to be a
problem.

Please share any advice that would be helpful here.  It may be that
I'm pushing Pig to do something it was never designed to do, but since
my Java isn't so hot I'm really leaning toward Pig.

Cheers

Miles




Re: UDF processing

Posted by Nikhil Gupta <gu...@gmail.com>.
I faced a similar issue a month or two ago. We were thinking through the
design of the system but did not reach the actual implementation.

We had lots of "tokenization" knowledge data to be used by the UDF. We
finally settled that it would be better to put it all on a separate
server queried from within the UDF. [Yes, it breaks the distributed
architecture, but the data to be loaded in the UDF was too large, and
we already had such a server ready with all the data.]
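
To make the idea concrete, here is a rough sketch of what such a
client could look like in plain Java -- purely illustrative, with a
stub server standing in for the real one, and the /lookup endpoint and
all names being hypothetical:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch of the lookup-server idea: the heavy knowledge data lives
// behind a small HTTP service, and the UDF sends one request per key.
public class LookupClient {
    // Query the server for a key; the endpoint path is hypothetical.
    public static String lookup(String baseUrl, String key) throws IOException {
        HttpURLConnection conn = (HttpURLConnection)
            new URL(baseUrl + "/lookup?key=" + key).openConnection();
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Stand-in for the already-running server mentioned above.
    // Echoes a canned answer for whatever key it receives.
    public static HttpServer startStubServer(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/lookup", exchange -> {
            String query = exchange.getRequestURI().getQuery(); // "key=..."
            byte[] body = ("match-for-" + query.substring(4))
                .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        return server;
    }
}
```

One request per record is a lot of round trips, so in practice you'd
want batching or a local cache in front of it.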

Nikhil Gupta
Grad Student Stanford University
http://comimix.com - webcomics remixed!

On Tue, Oct 6, 2009 at 6:51 PM, Jeff Zhang <zj...@gmail.com> wrote:

Re: UDF processing

Posted by Jeff Zhang <zj...@gmail.com>.
Hi Miles,

I suggest you put the B data in HDFS and load it in your UDF.

Put the load code in exec() rather than in the constructor, because the
UDF will be instantiated several times.

Something like the following:


public String exec(Tuple input) throws IOException {
      if (!alreadyLoaded) {
           load();                // read the B data from HDFS once
           alreadyLoaded = true;  // don't forget to set the flag
      }
      ......
}
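
The guard above can be seen working in plain Java, with the Pig types
left out (the counter is only there to demonstrate that the expensive
load happens a single time, no matter how many times exec() runs):

```java
// Demonstrates the load-once guard from the snippet above, minus the
// Pig types. loadCount shows the expensive step runs exactly once.
public class LazyLoadingUdf {
    private boolean alreadyLoaded = false;
    public int loadCount = 0; // visible for the demonstration

    private void load() {
        // In a real UDF this would read the B data from HDFS
        // and build the trie.
        loadCount++;
    }

    public String exec(String input) {
        if (!alreadyLoaded) {
            load();
            alreadyLoaded = true;
        }
        return input; // real matching logic would go here
    }
}
```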



Best regards,
Jeff zhang


On Tue, Oct 6, 2009 at 5:42 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Hi Miles,
>
> Sounds like you are working on something interesting, would love to hear
> details!
>
> Since UDFs are just Java classes, you can do anything in them you can do in
> Java, including keeping your trie as a private variable in the UDF class,
> and in every exec() call, checking if the current trie is relevant -- if it
> is, you just reuse it, if not, you replace it with right one.
>
> -Dmitriy
>
> On Tue, Oct 6, 2009 at 7:49 PM, Miles Scruggs <miles@digitalphotobox.net
> >wrote:
>
> > Hi,
> >
> > I have some data that I'm trying to join, but since the join isn't a
> > straight match it requires pushing the key into a UDF to find a match.
>  I'm
> > just wondering what the best way to do this is.  I have a radix trie and
> > search function written to do the match based on the key, but as the trie
> is
> > about 12.5megs I don't want it to be loaded each and every record, but
> > rather once for the group and then just process every record.
> >
> > What I'm wondering is since the code to load the B data and build the
> trie
> > is in the UDF will that same process be ran each record or is pig smart
> > enough to make sure that only happens once?  Is there a better way I
> should
> > be doing this?
> >
> > To complicate matters further, I would like the data that is loaded into
> > the trie to be a function of the record being passed.  Now it isn't as
> bad
> > as it seems as multiple records will be grouped together so there would
> be
> > 100k records between changes in the trie.  This latter part I can
> definitely
> > partition out into a separate job, but it seems like no matter how I do
> it,
> > the first issue is always going to be a problem.
> >
> > Please share any advise that would be helpful here.  It maybe that I'm
> > pushing pig to be something it was never designed to do, but since my
> java
> > isn't so hot I'm really leaning toward pig.
> >
> > Cheers
> >
> > Miles
> >
> >
> >
> >
>

Re: UDF processing

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Hi Miles,

Sounds like you are working on something interesting, would love to hear
details!

Since UDFs are just Java classes, you can do anything in them that you
can do in Java, including keeping your trie as a private variable in
the UDF class and, in every exec() call, checking whether the current
trie is relevant -- if it is, you just reuse it; if not, you replace it
with the right one.
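
In code, that check-and-swap could look roughly like this (a plain-Java
sketch with made-up names; a HashMap stands in for the trie, and
buildTrieFor() is a hypothetical placeholder for loading a group's B
data):

```java
import java.util.HashMap;
import java.util.Map;

// Keep the current trie as a field, tagged with the group it was
// built for, and rebuild only when a record from a new group arrives.
public class GroupCachingMatcher {
    private String currentGroup = null;
    private Map<String, String> currentTrie = null;
    public int rebuildCount = 0; // visible for the demonstration

    // Hypothetical builder; a real one would load the group's B data.
    private Map<String, String> buildTrieFor(String group) {
        rebuildCount++;
        Map<String, String> trie = new HashMap<>();
        trie.put(group + "-key", group + "-value");
        return trie;
    }

    public String match(String group, String key) {
        if (!group.equals(currentGroup)) { // new group: swap tries
            currentTrie = buildTrieFor(group);
            currentGroup = group;
        }
        return currentTrie.get(key);
    }
}
```

With 100k records between group changes, the rebuild cost is amortized
over the whole group.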

-Dmitriy

On Tue, Oct 6, 2009 at 7:49 PM, Miles Scruggs <mi...@digitalphotobox.net> wrote: