Posted to user@pig.apache.org by Joe Ciaramitaro <jo...@gmail.com> on 2010/11/02 18:19:00 UTC

Finding records with a given prefix

Hi all,

I have two data files: one contains a number of records, and the other contains a number of prefixes.

A = load 'data' AS (id, name);
B = load 'prefixes' AS (prefix);

I'd like to pull the records in A whose name begins with one of the prefixes in B.

The prefixes are of varying lengths.

I've been scouring the documentation, but haven't figured out what the best approach could be.

Thanks for any help,

Joe

Re: Finding records with a given prefix

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
You don't really have to mess with the JobConf -- you can just have your
UDF initialized with the prefix file location.
So your UDF would have:

private String prefixPath;
// needed by Pig
public MyUDF() {}

// use this constructor
public MyUDF(String path) {
  this.prefixPath = path;
}

// in exec(), check whether the prefix file has been loaded; if not, load it

Then in pig you would say:

DEFINE MyUDFInstance org.myorg.MyUDF("/this/is/my/prefix/file");

-- load data...

processed_data = foreach data generate MyUDFInstance(some_field);
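
For illustration, here is a minimal sketch of the "load on first call" logic described above. It is plain Java with no Pig classes so that it stands alone; the class name LazyPrefixMatcher, the one-prefix-per-line file format, and reading from the local filesystem are all assumptions. In an actual UDF the same logic would presumably sit inside exec() of a class extending Pig's EvalFunc or FilterFunc, and the file would be read from HDFS rather than with java.nio.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig: keeps the prefix file path given at
// construction time and loads the file on the first call to matches().
class LazyPrefixMatcher {
    private final String prefixPath;
    private List<String> prefixes; // stays null until the first call to matches()

    LazyPrefixMatcher(String prefixPath) {
        this.prefixPath = prefixPath;
    }

    // Reads one prefix per line, skipping blank lines (assumed file format).
    private void loadPrefixes() {
        List<String> loaded = new ArrayList<>();
        try {
            for (String line : Files.readAllLines(Paths.get(prefixPath), StandardCharsets.UTF_8)) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    loaded.add(trimmed);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        prefixes = loaded;
    }

    // Returns true if name starts with any of the loaded prefixes.
    boolean matches(String name) {
        if (name == null) {
            return false;
        }
        if (prefixes == null) {
            loadPrefixes(); // happens once per instance
        }
        for (String prefix : prefixes) {
            if (name.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

Because the list is cached in the instance, the file is read once per UDF instance rather than once per record. With many prefixes a linear scan per record may get slow, in which case a sorted structure or a trie could replace the list.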


On Tue, Nov 2, 2010 at 11:27 AM, Joe Ciaramitaro <
joe.ciaramitaro.adsafe@gmail.com> wrote:

> Thanks for the quick response.. I have some follow ups though :)  --
>
> Not quite as bad(computationally expensive) as a regular expression, just
> something that would allow me to check String.startWith... but same basic
> idea
>
> Prefixes is small enough to fit into memory, but it's not clear to me how
> to make that happen.
>
> I see that the UDF has access to the JobConf, so I could pass in a
> configuration that resolves to the hdfs path of the prefixes file.  The Pig
> UDF manual shows how to receive the configurations, but I'm not sure how to
> SET them on the client side.
>
> -Joe

Re: Finding records with a given prefix

Posted by Joe Ciaramitaro <jo...@gmail.com>.
Thanks for the quick response. I have some follow-ups though :)

Not quite as bad (computationally expensive) as a regular expression, just something that would allow me to check String.startsWith... but same basic idea.

Prefixes is small enough to fit into memory, but it's not clear to me how to make that happen.

I see that the UDF has access to the JobConf, so I could pass in a configuration that resolves to the HDFS path of the prefixes file.  The Pig UDF manual shows how to receive the configurations, but I'm not sure how to SET them on the client side.

-Joe

On Nov 2, 2010, at 1:27 PM, Alan Gates wrote:

> Basically you want to join on a regular expression, correct?  Unfortunately Map Reduce (and thus Pig) is spectacularly bad at non-equijoins.  Is 'prefixes' small enough to fit in memory?  If so, you could write a UDF that loaded it into memory and did the comparison.  This way the join would be done in the map phase.
> 
> Alan.


Re: Finding records with a given prefix

Posted by Alan Gates <ga...@yahoo-inc.com>.
Basically you want to join on a regular expression, correct?
Unfortunately Map Reduce (and thus Pig) is spectacularly bad at
non-equijoins.  Is 'prefixes' small enough to fit in memory?  If so, you
could write a UDF that loaded it into memory and did the comparison.
This way the join would be done in the map phase.
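
To make the idea concrete, here is a rough sketch of what that in-memory comparison amounts to for each record: the small side (the prefixes) sits in a list, and every record from the large side is checked against it, with no shuffle or reduce step. It is plain Java, not Pig API; the class name PrefixJoin and the use of String arrays for (id, name) records are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the per-record work a map task would do.
class PrefixJoin {
    // records: {id, name} pairs (the large side).
    // prefixes: the small side, already held in memory.
    // Emits {id, name, prefix} for every prefix that name starts with.
    static List<String[]> join(List<String[]> records, List<String> prefixes) {
        List<String[]> out = new ArrayList<>();
        for (String[] rec : records) {
            String name = rec[1];
            if (name == null) {
                continue; // a null name cannot match any prefix
            }
            for (String prefix : prefixes) {
                if (name.startsWith(prefix)) {
                    out.add(new String[] {rec[0], name, prefix});
                }
            }
        }
        return out;
    }
}
```

Note that a record is emitted once per matching prefix, as a join would do. If only the matching records are wanted, the inner loop can stop at the first match.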

Alan.

On Nov 2, 2010, at 10:19 AM, Joe Ciaramitaro wrote:

> Hi all,
>
> I have 2 data files.  One which contains a number of records, and  
> one which contains a number of prefixes.
>
> A = load 'data' AS (id, name)
> B = load 'prefixes' AS (prefix)
>
> I'd like to pull records in A whose name begins with prefix
>
> The prefixes are of varying lengths
>
> I've been scouring the documentation, but haven't figured out what  
> the best approach could be.
>
> Thanks for any help,
>
> Joe