Posted to user@pig.apache.org by Zach Murphy <mu...@gmail.com> on 2009/07/21 18:28:27 UTC

Matching across bags

Hello,

I'm a new Pig user, and I'm working on a project using Pig over the summer.
I've searched the archives and couldn't find an answer to my question, so I
figured I'd ask here.

I've loaded in two relations and I want to match a field in relation A to a
field in relation B.  So, I want to see for each A.f1 if it matches any of
the words in B.f3.

The only way I saw to do this was to CROSS A and B and then test each
resulting pair.  However, due to system restrictions, the CROSS becomes too
large and the job times out.

Any ideas on how to go about doing this quicker and with less space would be
greatly appreciated!

Thanks,
Zach

Re: Matching across bags

Posted by Dmitriy Ryaboy <dv...@cloudera.com>.

Sure, there are a couple of ways to do that.
One is to replace TOKENIZE with a UDF that produces all the
combinations you are interested in (assuming it's a limited subset of
all possible phrases -- one and two-word combinations, and so on).

Another is to take your first relation (the one with phrases), and
break it into an index, as well. Join the two indexes, compute
intersections.
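
A minimal sketch of the second idea, written in plain Python rather than Pig so the logic is easy to check. The sample records, the phrases, and the helper names (candidates, contains_phrase) are all hypothetical, and whitespace splitting is only a rough stand-in for TOKENIZE. Intersecting the per-word record sets yields records that contain every word of the phrase, not necessarily next to each other, so the sketch adds an adjacency check over that much smaller candidate set.

```python
from collections import defaultdict

# Hypothetical sample data: (rec_id, string) records and phrases to look up.
text = [(1, "the quick brown fox"), (2, "brown and quick"), (3, "the lazy dog")]
phrases = ["quick brown", "lazy dog", "brown dog"]

# Inverted index: word -> set of ids of the records containing it.
index = defaultdict(set)
for rec_id, string in text:
    for word in string.split():  # rough stand-in for TOKENIZE
        index[word].add(rec_id)


def candidates(phrase):
    """Records that contain every word of the phrase, in any order."""
    postings = [index.get(w, set()) for w in phrase.split()]
    return set.intersection(*postings) if postings else set()


def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))


strings = dict(text)
result = {}
for phrase in phrases:
    cand = candidates(phrase)
    exact = sorted(r for r in cand
                   if contains_phrase(strings[r].split(), phrase.split()))
    result[phrase] = (sorted(cand), exact)

for phrase, (cand, exact) in result.items():
    print(phrase, "candidates:", cand, "exact:", exact)
```

In Pig terms, the intersection step would correspond to joining the phrase-side index to the text-side index on the word and keeping only the record ids matched by every word of a phrase; how best to express that depends on the data.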

As a side note, if what you are doing is less data flow and more text
search, you might want to consider something like Katta (
http://katta.sourceforge.net/ ).

-Dmitriy

On Tue, Jul 21, 2009 at 11:51 AM, Zach Murphy<mu...@gmail.com> wrote:
> Thanks Dmitriy,
>
> That worked well.  I didn't even think of doing it that way.  The only
> problem I might have is if I check for phrases instead of words later.  Can
> I expand on this to use phrases?
>
> Zach

Re: Matching across bags

Posted by Zach Murphy <mu...@gmail.com>.

Thanks Dmitriy,

That worked well.  I didn't even think of doing it that way.  The only
problem I might have is if I check for phrases instead of words later.  Can
I expand on this to use phrases?

Zach

On Tue, Jul 21, 2009 at 1:07 PM, Dmitriy Ryaboy <dv...@cloudera.com> wrote:

> Zach -- this might be overkill, but how about using Pig to construct
> an inverted index on your second relation, something along these
> lines:
>
> words = FOREACH text GENERATE rec_id, FLATTEN( TOKENIZE(string) );
> word_groups = GROUP words BY $1;
> index = FOREACH word_groups {
> recs = DISTINCT $1.$0;
> GENERATE $0 as word, recs;
> };
>
> You can now join to the index on the word, and get a list of
> corresponding record ids.
>
> -Dmitriy

Re: Matching across bags

Posted by Dmitriy Ryaboy <dv...@cloudera.com>.

Zach -- this might be overkill, but how about using Pig to construct
an inverted index on your second relation, something along these
lines:

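-- One (rec_id, word) tuple per token in each record's string.
words = FOREACH text GENERATE rec_id, FLATTEN( TOKENIZE(string) );
-- Group those tuples by the word ($1).
word_groups = GROUP words BY $1;
-- For each word, keep the distinct ids of the records that contain it.
index = FOREACH word_groups {
    recs = DISTINCT $1.$0;
    GENERATE $0 as word, recs;
};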

You can now join to the index on the word, and get a list of
corresponding record ids.
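
To make the data flow concrete, here is a small Python model of the same steps (tokenize, group by word, keep distinct record ids, then join). It is only a sketch: the sample records, the query words, and the function names build_index and join_on_word are hypothetical, and splitting on whitespace simplifies what TOKENIZE does.

```python
from collections import defaultdict


def build_index(text):
    """Map each word to the set of ids of the records whose string has it."""
    index = defaultdict(set)
    for rec_id, string in text:
        for word in string.split():  # simplified stand-in for TOKENIZE
            index[word].add(rec_id)  # a set plays the role of DISTINCT
    return dict(index)


def join_on_word(queries, index):
    """Inner join on the word: keep only query words found in the index."""
    return {w: sorted(index[w]) for w in queries if w in index}


# Hypothetical sample data.
text = [(1, "the quick brown fox"), (2, "the lazy dog"), (3, "quick quick dog")]
queries = ["quick", "dog", "cat"]

index = build_index(text)
matches = join_on_word(queries, index)
print(matches)  # {'quick': [1, 3], 'dog': [2, 3]}
```

The point of the index is that each query word is matched with one keyed lookup, so the work grows with the number of tokens plus the number of query words, not with the number of (word, record) pairs.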

-Dmitriy

On Tue, Jul 21, 2009 at 9:52 AM, Zach Murphy<mu...@gmail.com> wrote:
> Thanks for the quick reply Chad.  But, I should have been more clear.
>
> The field in one bag is actually a chunk of text and the field in the other
> bag is a word.  I want to see if the word is contained in the text.  So the
> values aren't actually the same.  I believe that this means I can't JOIN or
> COGROUP because the fields aren't exactly the same.  I have written a udf
> that checks if the word is in the text.  But because the two fields aren't
> in the same bag I don't think I can use X = FILTER bag BY IsWordInText(word,
> text).
>
> I suppose that I could write a udf for a modified JOIN that joins if the
> word is in the text, but if there's an easier way I would rather do that.
>
> Zach

Re: Matching across bags

Posted by Zach Murphy <mu...@gmail.com>.

Thanks for the quick reply, Chad, but I should have been clearer.

The field in one bag is actually a chunk of text, and the field in the other
bag is a single word.  I want to see if the word is contained in the text, so
the values aren't actually equal.  I believe this means I can't JOIN or
COGROUP, because both match on exact equality.  I have written a UDF that
checks if the word is in the text, but because the two fields aren't in the
same bag, I don't think I can use X = FILTER bag BY IsWordInText(word,
text).

I suppose that I could write a udf for a modified JOIN that joins if the
word is in the text, but if there's an easier way I would rather do that.
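
For reference, a join on an arbitrary predicate like this amounts to testing every (word, text) pair, which is the same work as a CROSS followed by a FILTER. The Python sketch below illustrates that cost on hypothetical sample data; is_word_in_text is an assumed stand-in for the IsWordInText UDF, taken here to mean a whole-word match.

```python
def is_word_in_text(word, text):
    """Assumed behaviour of the IsWordInText UDF: whole-word match."""
    return word in text.split()


def cross_then_filter(words, texts):
    """Test every (word, record) pair, counting how many pairs are examined."""
    pairs_examined = 0
    matches = []
    for word in words:
        for rec_id, string in texts:
            pairs_examined += 1
            if is_word_in_text(word, string):
                matches.append((word, rec_id))
    return matches, pairs_examined


# Hypothetical sample data.
words = ["quick", "dog", "cat"]
texts = [(1, "the quick brown fox"), (2, "the lazy dog"), (3, "quick quick dog")]

matches, pairs_examined = cross_then_filter(words, texts)
print(matches)
print(pairs_examined)  # len(words) * len(texts)
```

The number of pairs examined is always the product of the two input sizes, which is why this approach stops scaling long before either input is large on its own.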

Zach

On Tue, Jul 21, 2009 at 12:34 PM, Naber, Chad <CN...@edmunds.com> wrote:

> Zach,
>
> I would just join the two relations:
>
> http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#JOIN
>
> If you need to perform an outer join, use COGROUP instead:
>
> http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#COGROUP
>
> Good luck.
> Chad

RE: Matching across bags

Posted by "Naber, Chad" <CN...@edmunds.com>.

Zach,

I would just join the two relations:

http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#JOIN

If you need to perform an outer join, use COGROUP instead:

http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#COGROUP

Good luck.
Chad

-----Original Message-----
From: Zach Murphy [mailto:murphy.zach@gmail.com] 
Sent: Tuesday, July 21, 2009 9:28 AM
To: pig-user@hadoop.apache.org
Subject: Matching across bags

Hello,

I'm a new pig user, and I'm working on a project using pig over the summer.
I've read over the archives and couldn't find any answer to my question so I
figured I'd ask myself.

I've loaded in two relations and I want to match a field in relation A to a
field in relation B.  So, I want to see for each A.f1 if it matches any of
the words in B.f3.

The only way I saw to do this was to CROSS A and B and then try to match
each one.  However due to system restrictions, the CROSS becomes too large
and times out.

Any ideas on how to go about doing this quicker and with less space would be
greatly appreciated!

Thanks,
Zach