You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by Arun A K <ar...@gmail.com> on 2010/12/02 19:53:23 UTC

Find variants of a term in relation A from a field in relation B

Hello

I have this problem to solve using Pig.

*Input*
1. Relation A which has only one field of type chararray. Sample of A
follows:
*abc*
*xyz gh*
*zzz yy*
*red*

Approximate numbers of rows in A = 10,000

2. Relation B which has only one field of type chararray. Sample of B
follows:
*red car*
*red ferrari*
*abc*
*abcd*
*xyz ghis*

Approximate numbers of rows in B = 1 billion

*Problem to be solved* I need to find all case-insensitive variants of each
term in relation A existing in relation B. For example: Term 'red' from A
would have variants 'red car' and 'red ferrari' in B.

I was able to get variants of one term in A from B using matches operator
i.e. matches '.*red.*' How to go about creating a complete solution for this
problem? Should I use a UDF or go for native Map Reduce? Am a bit confused
on how to proceed on this. I would really appreciate any help on this.

Thanks much.

Regards
Arun A K

Re: Find variants of a term in relation A from a field in relation B

Posted by Arun A K <ar...@gmail.com>.

Thanks much Daniel for the response. The solution looks good though I wonder
it might miss terms in A which contain spaces because STRSPLIT would break
each term in B before doing the equi join.

Regards
Arun A K
Graduate Student
Department of Computer Science
Indiana University, Bloomington


On Thu, Dec 2, 2010 at 11:17 AM, Daniel Dai <ji...@yahoo-inc.com> wrote:

> Can you convert it into a equal join problem? That's the case mapreduce can
> handle efficiently. Not sure if it address your problem but provide a sample
> script.
>
> a = load 'A' as (a0:chararray);
> b = foreach a generate LOWER(a0) as b0;
>
> c = load 'B' as (c0:chararray);
> d = foreach c generate LOWER(c0) as d0;
> e = foreach d generate d0, flatten(STRSPLIT(d0)) as e0;
>
> f = join b by b0, e by e0;
> g = foreach f generate d0, b0;
> dump g;
>
> Daniel
>
>
> Arun A K wrote:
>
>> Hello
>>
>> I have this problem to solve using Pig.
>>
>> *Input*
>> 1. Relation A which has only one field of type chararray. Sample of A
>> follows:
>> *abc*
>> *xyz gh*
>> *zzz yy*
>> *red*
>>
>> Approximate numbers of rows in A = 10,000
>>
>> 2. Relation B which has only one field of type chararray. Sample of B
>> follows:
>> *red car*
>> *red ferrari*
>> *abc*
>> *abcd*
>> *xyz ghis*
>>
>> Approximate numbers of rows in B = 1 billion
>>
>> *Problem to be solved* I need to find all case-insensitive variants of
>> each
>> term in relation A existing in relation B. For example: Term 'red' from A
>> would have variants 'red car' and 'red ferrari' in B.
>>
>> I was able to get variants of one term in A from B using matches operator
>> i.e. matches '.*red.*' How to go about creating a complete solution for
>> this
>> problem? Should I use a UDF or go for native Map Reduce? Am a bit confused
>> on how to proceed on this. I would really appreciate any help on this.
>>
>> Thanks much.
>>
>> Regards
>> Arun A K
>>
>>
>
>

Re: Find variants of a term in relation A from a field in relation B

Posted by Daniel Dai <ji...@yahoo-inc.com>.

Can you convert it into a equal join problem? That's the case mapreduce 
can handle efficiently. Not sure if it address your problem but provide 
a sample script.

a = load 'A' as (a0:chararray);
b = foreach a generate LOWER(a0) as b0;

c = load 'B' as (c0:chararray);
d = foreach c generate LOWER(c0) as d0;
e = foreach d generate d0, flatten(STRSPLIT(d0)) as e0;

f = join b by b0, e by e0;
g = foreach f generate d0, b0;
dump g;

Daniel

Arun A K wrote:
> Hello
>
> I have this problem to solve using Pig.
>
> *Input*
> 1. Relation A which has only one field of type chararray. Sample of A
> follows:
> *abc*
> *xyz gh*
> *zzz yy*
> *red*
>
> Approximate numbers of rows in A = 10,000
>
> 2. Relation B which has only one field of type chararray. Sample of B
> follows:
> *red car*
> *red ferrari*
> *abc*
> *abcd*
> *xyz ghis*
>
> Approximate numbers of rows in B = 1 billion
>
> *Problem to be solved* I need to find all case-insensitive variants of each
> term in relation A existing in relation B. For example: Term 'red' from A
> would have variants 'red car' and 'red ferrari' in B.
>
> I was able to get variants of one term in A from B using matches operator
> i.e. matches '.*red.*' How to go about creating a complete solution for this
> problem? Should I use a UDF or go for native Map Reduce? Am a bit confused
> on how to proceed on this. I would really appreciate any help on this.
>
> Thanks much.
>
> Regards
> Arun A K
>

Re: Find variants of a term in relation A from a field in relation B

Posted by Arun A K <ar...@gmail.com>.

Thanks Zach for introducing me to Aho Corasick algorithm. Let me see if a
UDF can help here.

Regards
Arun A K
Graduate Student
Department of Computer Science
Indiana University, Bloomington


On Thu, Dec 2, 2010 at 11:01 AM, Zach Bailey <za...@dataclip.com>wrote:

>
>  I would recommend writing a UDF based on the Aho Corasick algorithm [1]
> which is seeded with relation a and then applied to relation b. This should
> let you do a single-pass search on each "row" in B, reducing your
> algorithmic complexity from O(n^2) to O(n).
>
>
> I'd love to know if anyone else has a better way of doing this, because
> it's a problem I'm also working on :)
>
> -Zach
>
>
>
> [1]
> http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
>
> On Thursday, December 2, 2010 at 1:53 PM, Arun A K wrote:
>
> > Hello
> >
> > I have this problem to solve using Pig.
> >
> > *Input*
> > 1. Relation A which has only one field of type chararray. Sample of A
> > follows:
> > *abc*
> > *xyz gh*
> > *zzz yy*
> > *red*
> >
> > Approximate numbers of rows in A = 10,000
> >
> > 2. Relation B which has only one field of type chararray. Sample of B
> > follows:
> > *red car*
> > *red ferrari*
> > *abc*
> > *abcd*
> > *xyz ghis*
> >
> > Approximate numbers of rows in B = 1 billion
> >
> > *Problem to be solved* I need to find all case-insensitive variants of
> each
> > term in relation A existing in relation B. For example: Term 'red' from A
> > would have variants 'red car' and 'red ferrari' in B.
> >
> > I was able to get variants of one term in A from B using matches operator
> > i.e. matches '.*red.*' How to go about creating a complete solution for
> this
> > problem? Should I use a UDF or go for native Map Reduce? Am a bit
> confused
> > on how to proceed on this. I would really appreciate any help on this.
> >
> > Thanks much.
> >
> > Regards
> > Arun A K
> >
> >
> >
> >
>
>
>
>

Re: Find variants of a term in relation A from a field in relation B

Posted by Zach Bailey <za...@dataclip.com>.

 I would recommend writing a UDF based on the Aho Corasick algorithm [1] which is seeded with relation a and then applied to relation b. This should let you do a single-pass search on each "row" in B, reducing your algorithmic complexity from O(n^2) to O(n).


I'd love to know if anyone else has a better way of doing this, because it's a problem I'm also working on :)

-Zach



[1] http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm

On Thursday, December 2, 2010 at 1:53 PM, Arun A K wrote:

> Hello
> 
> I have this problem to solve using Pig.
> 
> *Input*
> 1. Relation A which has only one field of type chararray. Sample of A
> follows:
> *abc*
> *xyz gh*
> *zzz yy*
> *red*
> 
> Approximate numbers of rows in A = 10,000
> 
> 2. Relation B which has only one field of type chararray. Sample of B
> follows:
> *red car*
> *red ferrari*
> *abc*
> *abcd*
> *xyz ghis*
> 
> Approximate numbers of rows in B = 1 billion
> 
> *Problem to be solved* I need to find all case-insensitive variants of each
> term in relation A existing in relation B. For example: Term 'red' from A
> would have variants 'red car' and 'red ferrari' in B.
> 
> I was able to get variants of one term in A from B using matches operator
> i.e. matches '.*red.*' How to go about creating a complete solution for this
> problem? Should I use a UDF or go for native Map Reduce? Am a bit confused
> on how to proceed on this. I would really appreciate any help on this.
> 
> Thanks much.
> 
> Regards
> Arun A K
> 
> 
> 
>