Posted to user@pig.apache.org by Mike Hugo <mi...@piragua.com> on 2011/07/12 21:45:42 UTC

Advice on algorithm for joining data in bags

I'm trying to join together several different sources of synonyms using Pig.
For example:

A = LOAD '/tmp/synonyms.txt' USING PigStorage() AS (id:chararray,
label:chararray);
DUMP A;
(12,synonym1)
(12,alternative_name)
(45,synonym1 full name and description)
(45,synonym1)
(45,synonym1_expanded)
(78,alternative_name)
(67,synonym1)

I've managed to group things together by the label...

C = GROUP A BY label;
DUMP C;
(synonym1,{(12,synonym1),(45,synonym1),(67,synonym1)})
(alternative_name,{(12,alternative_name),(78,alternative_name)})
(synonym1_expanded,{(45,synonym1_expanded)})
(synonym1 full name and description,{(45,synonym1 full name and description)})

And then reduce each bag to just its IDs:

D = FOREACH C GENERATE $0, $1.id;
DUMP D;
(synonym1,{(12),(45),(67)})
(alternative_name,{(12),(78)})
(synonym1_expanded,{(45)})
(synonym1 full name and description,{(45)})


If you look closely at the data, it turns out that this example test data
set is really all the same - the synonyms all overlap.  The final output I'd
like to get to is something like this (the arbitrary_id could be anything, I
really just need a set of the overlapping IDs):

(arbitrary_id, {12, 45, 67, 78})

How can I join on the bag of IDs in 'D' to find other labels that have at
least one of the same IDs?  Or am I approaching this the wrong way?

Thanks,

Mike
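
One way to picture the "labels that share at least one ID" step from the question above is the following minimal sketch. It is plain Python rather than Pig, and the function name `labels_sharing_an_id` is hypothetical; the rows are the (id, label) pairs implied by the D output. It inverts the data to id -> labels and emits every pair of labels seen under the same id, which roughly corresponds to a self-join on id.

```python
from collections import defaultdict
from itertools import combinations

# Sample (id, label) rows, matching the D output in the post.
rows = [
    ("12", "synonym1"),
    ("12", "alternative_name"),
    ("45", "synonym1 full name and description"),
    ("45", "synonym1"),
    ("45", "synonym1_expanded"),
    ("78", "alternative_name"),
    ("67", "synonym1"),
]


def labels_sharing_an_id(rows):
    """Return the set of label pairs that have at least one id in common."""
    labels_by_id = defaultdict(set)
    for id_, label in rows:
        labels_by_id[id_].add(label)

    pairs = set()
    for labels in labels_by_id.values():
        # Sorting makes each pair canonical, so (a, b) and (b, a) collapse.
        for a, b in combinations(sorted(labels), 2):
            pairs.add((a, b))
    return pairs


overlapping = labels_sharing_an_id(rows)
```

Pairwise overlap is only the first hop. Labels that are connected through a chain of shared ids still need a transitive merge on top of this.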

Re: Advice on algorithm for joining data in bags

Posted by Mike Hugo <mi...@piragua.com>.
Great, thanks John!  I think I'm down the right path then.

To answer your final question about the alternative name - basically you can
consider each id as a distinct datasource of synonyms.  I'm trying to join
them all together in a single repository.  Looking at the example again,

12 synonym1
12 alternative_name
45 synonym1 full name and description
45 synonym1
45 synonym1_expanded
78 alternative_name
67 synonym1
34 synonym2
34 synonym2_expanded
56 synonym2
89 synonym2_expanded

12 has two "labels" - synonym1 and alternative_name.  synonym1 is found in
45, 12, and 67, so we now know 45, 12, and 67 are the same thing.
alternative_name is found in 12 and 78, so we now know that 12 and 78 are
the same thing.  12 is found in both the first set (45, 12, and 67) and the
second set (12, 78), so we now know those two sets are the same thing,
resulting in the desired output of (12, 45, 67, 78).  The same logic can be
applied to the next set of data:  synonym2 is found in 34 and 56, so they
are the same thing.  synonym2_expanded is found in 34 and 89, so they are
the same thing.  34 is found in both sets, so the final output for that
chunk of data is (34, 56, 89).

Thanks for the help, I'll keep playing around with this and take a look at
building a UDF.

Mike
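
The merging logic described in this message amounts to finding connected components, where two ids are linked whenever they share a label. A minimal sketch using a union-find (disjoint-set) structure follows. It is plain Python, not Pig and not a Pig UDF, and the names `find` and `merge_ids` are hypothetical; it only illustrates the algorithm a UDF or an external step might implement.

```python
def find(parent, x):
    """Return the representative of x's set, compressing the path on the way."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def merge_ids(rows):
    """Group ids that are linked, directly or transitively, by a shared label."""
    parent = {}
    first_id_for_label = {}
    for id_, label in rows:
        parent.setdefault(id_, id_)
        if label in first_id_for_label:
            # This label was seen before: union the two ids' sets.
            a = find(parent, id_)
            b = find(parent, first_id_for_label[label])
            if a != b:
                parent[a] = b
        else:
            first_id_for_label[label] = id_

    groups = {}
    for id_ in parent:
        groups.setdefault(find(parent, id_), set()).add(id_)
    return sorted(sorted(group) for group in groups.values())


rows = [
    ("12", "synonym1"), ("12", "alternative_name"),
    ("45", "synonym1 full name and description"), ("45", "synonym1"),
    ("45", "synonym1_expanded"), ("78", "alternative_name"),
    ("67", "synonym1"), ("34", "synonym2"), ("34", "synonym2_expanded"),
    ("56", "synonym2"), ("89", "synonym2_expanded"),
]
merged = merge_ids(rows)
```

This version keeps everything in memory on one machine, so it only fits data sets small enough for that. The representative id of each group is arbitrary, which matches the "arbitrary_id" in the desired output.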

On Wed, Jul 13, 2011 at 11:01 AM, Jonathan Coveney <jc...@gmail.com> wrote:

> I would group on the label column, and then just take the distinct values in
> the id column. You may need to make a UDF or just do some processing to turn
> synonym2_expanded into synonym2, but it sounds like that's what you want to
> do. I guess I'm not sure how alternative_name works into this?

Re: Advice on algorithm for joining data in bags

Posted by Jonathan Coveney <jc...@gmail.com>.
I would group on the label column, and then just take the distinct values in
the id column. You may need to make a UDF or just do some processing to turn
synonym2_expanded into synonym2, but it sounds like that's what you want to
do. I guess I'm not sure how alternative_name works into this?
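
The "group on the label column, then take the distinct ids" step can be sketched as follows. This is plain Python for illustration, and the function name `distinct_ids_by_label` is hypothetical. It yields one set of ids per label; sets that overlap across different labels are not merged by this step alone.

```python
from collections import defaultdict


def distinct_ids_by_label(rows):
    """Map each label to the sorted list of distinct ids that carry it."""
    ids_by_label = defaultdict(set)
    for id_, label in rows:
        ids_by_label[label].add(id_)
    return {label: sorted(ids) for label, ids in ids_by_label.items()}


rows = [
    ("12", "synonym1"), ("12", "alternative_name"),
    ("45", "synonym1 full name and description"), ("45", "synonym1"),
    ("45", "synonym1_expanded"), ("78", "alternative_name"),
    ("67", "synonym1"), ("34", "synonym2"), ("34", "synonym2_expanded"),
    ("56", "synonym2"), ("89", "synonym2_expanded"),
]
per_label = distinct_ids_by_label(rows)
```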

2011/7/13 Mike Hugo <mi...@piragua.com>

> Thanks so much for the input John!  That's not quite what I'm looking for -
> I realize now that my example is not fully complete.  There may be different
> sets of synonyms in the input file.  For example:
>
> 12 synonym1
> 12 alternative_name
> 45 synonym1 full name and description
> 45 synonym1
> 45 synonym1_expanded
> 78 alternative_name
> 67 synonym1
> 34 synonym2
> 34 synonym2_expanded
> 56 synonym2
> 89 synonym2_expanded
>
> Then the desired output would be:
>
> (arbitrary_id_1, {12, 45, 67, 78})
> (arbitrary_id_2, {34, 56, 89})
>
> (34 has a synonym that matches 56, and 34 has a synonym that matches 89,
> therefore the set of IDs for synonym2 is 34, 56, 89)
>
> The arbitrary ID could be a row label, but it doesn't really matter, what
> I'm really interested in is the bag of ids.
>
> Mike

Re: Advice on algorithm for joining data in bags

Posted by Mike Hugo <mi...@piragua.com>.
Thanks so much for the input John!  That's not quite what I'm looking for -
I realize now that my example is not fully complete.  There may be different
sets of synonyms in the input file.  For example:

12 synonym1
12 alternative_name
45 synonym1 full name and description
45 synonym1
45 synonym1_expanded
78 alternative_name
67 synonym1
34 synonym2
34 synonym2_expanded
56 synonym2
89 synonym2_expanded

Then the desired output would be:

(arbitrary_id_1, {12, 45, 67, 78})
(arbitrary_id_2, {34, 56, 89})

(34 has a synonym that matches 56, and 34 has a synonym that matches 89,
therefore the set of IDs for synonym2 is 34, 56, 89)

The arbitrary ID could be a row label, but it doesn't really matter, what
I'm really interested in is the bag of ids.

Mike
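
For data too large for a single in-memory pass, one commonly used alternative is to repeat a "take the smallest id seen under each label, then push it back to every id" step until nothing changes. The sketch below is plain Python that simulates those rounds, and the name `propagate_min_ids` is hypothetical. It assumes each round could be expressed as a group-and-join pass, which is not verified against any particular Pig version. The smallest id in each set ends up serving as the arbitrary id.

```python
def propagate_min_ids(rows):
    """Give every id the smallest id reachable through chains of shared labels."""
    component = {id_: id_ for id_, _ in rows}
    changed = True
    while changed:
        changed = False
        # Pass 1: smallest component id currently seen under each label.
        best_for_label = {}
        for id_, label in rows:
            current = component[id_]
            if label not in best_for_label or current < best_for_label[label]:
                best_for_label[label] = current
        # Pass 2: each id adopts the smallest value offered by its labels.
        for id_, label in rows:
            if best_for_label[label] < component[id_]:
                component[id_] = best_for_label[label]
                changed = True

    groups = {}
    for id_, comp in component.items():
        groups.setdefault(comp, set()).add(id_)
    return {comp: sorted(ids) for comp, ids in groups.items()}


rows = [
    ("12", "synonym1"), ("12", "alternative_name"),
    ("45", "synonym1 full name and description"), ("45", "synonym1"),
    ("45", "synonym1_expanded"), ("78", "alternative_name"),
    ("67", "synonym1"), ("34", "synonym2"), ("34", "synonym2_expanded"),
    ("56", "synonym2"), ("89", "synonym2_expanded"),
]
result = propagate_min_ids(rows)
```

The ids are compared as strings here because the LOAD schema declares them as chararray. Any consistent ordering works, since the representative only has to be stable, not meaningful.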

On Wed, Jul 13, 2011 at 10:13 AM, John Conwell <jo...@iamjohn.me> wrote:

> If I understand you correctly, what you want in the end is a bag with all
> distinct ids from the original dataset, regardless of the row label.  The
> following will get you that (if that's what you're looking for).  Note that
> in the LOAD statement, I specified a comma as the delimiter.
>
> -- Load (id, label) pairs; fields are comma-delimited.
> a = LOAD 'synonyms.txt' USING PigStorage(',') AS (id:chararray,
> label:chararray);
>
> -- Keep only the id column.
> b = FOREACH a GENERATE id;
>
> -- Group by id so that each distinct id appears once.
> c = GROUP b BY id;
>
> d = FOREACH c GENERATE group;
>
> -- Collect every distinct id into a single bag.
> e = GROUP d ALL;
>
> DUMP e;
>
> (all,{(12),(45),(67),(78)})

Re: Advice on algorithm for joining data in bags

Posted by John Conwell <jo...@iamjohn.me>.
If I understand you correctly, what you want in the end is a bag with all
distinct ids from the original dataset, regardless of the row label.  The
following will get you that (if that's what you're looking for).  Note that in
the LOAD statement, I specified a comma as the delimiter.

-- Load (id, label) pairs; fields are comma-delimited.
a = LOAD 'synonyms.txt' USING PigStorage(',') AS (id:chararray,
label:chararray);

-- Keep only the id column.
b = FOREACH a GENERATE id;

-- Group by id so that each distinct id appears once.
c = GROUP b BY id;

d = FOREACH c GENERATE group;

-- Collect every distinct id into a single bag.
e = GROUP d ALL;

DUMP e;

(all,{(12),(45),(67),(78)})







-- 

Thanks,
John C