Posted to user@pig.apache.org by Kris Coward <kr...@melon.org> on 2011/01/07 18:20:55 UTC

Joining inner and outer bags

Hi,

I've got an outer bag/relation consisting of a bunch of user information,
one of the pieces of which is an inner bag of possible events for that
user, and the value of those events, should they occur. Outside the bag,
there are also a few data concerning whether specific events have
already occurred.

In another relation, I have the assortment of events grouped with the
probability that any of them will occur.

I'd like to generate expected values for each user, but know that I
can't JOIN within a FOREACH block (or do a nested FOREACH). For a UDF,
I vaguely recall some sort of constraint on nesting inner bags that
would interfere with my ability to bundle the possible events bag with
the actual events data into a single object that could be passed to a
UDF that extends EvalFunc.

Am I misremembering something? Is there some other sort of clever
trickery that I might be able to use to generate expected values if I'm
not? And if I am, is there something less hackish than a GROUP on a
unique tuple element that I could use to load the desired values into a
bag or tuple (or just plain pass the entire tuple to a UDF)?

Thanks,
Kris

-- 
Kris Coward					http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3

Re: Joining inner and outer bags

Posted by Kris Coward <kr...@melon.org>.
On Fri, Jan 07, 2011 at 10:44:03AM -0800, Thejas M Nair wrote:
> On 1/7/11 9:20 AM, "Kris Coward" <kr...@melon.org> wrote:
> > I've got an outer bag/relation consisting of a bunch of user information,
> > one of the pieces of which is an inner bag of possible events for that
> > user, and the value of those events, should they occur. Outside the bag,
> > there are also a few data concerning whether specific events have
> > already occurred.
> > 
> > In another relation, I have the assortment of events grouped with the
> > probability that any of them will occur.
> > 
> > I'd like to generate expected values for each user, but know that I
> > can't JOIN within a FOREACH block (or do a nested FOREACH). For a UDF,
> > I vaguely recall some sort of constraint on nesting inner bags that
> > would interfere with my ability to bundle the possible events bag with
> > the actual events data into a single object that could be passed to a
> > UDF that extends EvalFunc.
> I can't think of any limitations that would prevent you from writing such a
> UDF.
> You can pass the bag of events to the UDF, and have the UDF append the
> probability information to tuples in the bag and return the new bag. I am
> assuming that the event probability relation is small enough to be stored in
> memory.
> 
> > Am I misremembering something? Is there some other sort of clever
> > trickery that I might be able to use to generate expected values if I'm
> > not? And if I am, is there something less hackish than a GROUP on a
> > unique tuple element that I could use to load the desired values into a
> > bag or tuple (or just plain pass the entire tuple to a UDF)?
> 
> Is this the alternative solution you are trying to avoid? Do a (foreach)
> flatten on the events bag of the first relation, do a join (using 'replicated'
> if the second relation is small enough), and then do a group-by on user (id).
> This will not involve writing a UDF, but it will have an additional reduce
> phase for the group-by. If you use a UDF that appends the information, it
> will be a map-only job.

I'm not trying to avoid that solution at all. FLATTENing the events bag
and then reGROUPing it after the JOIN seems like it's probably the solution I
was looking for (the bag had been ORDERed before, and some information
was present in the ordering, but I can separate that information out so
that it survives FLATTEN).

Thanks,
Kris

-- 
Kris Coward					http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3

Re: Joining inner and outer bags

Posted by Thejas M Nair <te...@yahoo-inc.com>.




On 1/7/11 9:20 AM, "Kris Coward" <kr...@melon.org> wrote:

> Hi,
> 
> I've got an outer bag/relation consisting of a bunch of user information,
> one of the pieces of which is an inner bag of possible events for that
> user, and the value of those events, should they occur. Outside the bag,
> there are also a few data concerning whether specific events have
> already occurred.
> 
> In another relation, I have the assortment of events grouped with the
> probability that any of them will occur.
> 
> I'd like to generate expected values for each user, but know that I
> can't JOIN within a FOREACH block (or do a nested FOREACH). For a UDF,
> I vaguely recall some sort of constraint on nesting inner bags that
> would interfere with my ability to bundle the possible events bag with
> the actual events data into a single object that could be passed to a
> UDF that extends EvalFunc.
I can't think of any limitations that would prevent you from writing such a
UDF.
You can pass the bag of events to the UDF, and have the UDF append the
probability information to tuples in the bag and return the new bag. I am
assuming that the event probability relation is small enough to be stored in
memory.
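
As a rough illustration of the UDF idea above, here is a minimal sketch of
the per-user logic in plain Python rather than as an actual Pig EvalFunc.
The names (EVENT_PROBS, append_probabilities, expected_value, the event
names) and the (event, value) tuple layout are assumptions made up for the
example, not anything from this thread. The in-memory dict stands in for
the small event-probability relation.

```python
# Hypothetical in-memory lookup standing in for the small
# event-probability relation (event name -> probability).
EVENT_PROBS = {"click": 0.5, "purchase": 0.25}


def append_probabilities(events, probs=EVENT_PROBS):
    """Return a new bag with the probability appended to each tuple.

    `events` models the inner bag as a list of (event, value) tuples.
    Events with no known probability are dropped, like an inner join.
    """
    return [
        (event, value, probs[event])
        for event, value in events
        if event in probs
    ]


def expected_value(events, probs=EVENT_PROBS):
    """Sum value * probability over the events in one user's bag."""
    return sum(value * p for _, value, p in append_probabilities(events, probs))


if __name__ == "__main__":
    bag = [("click", 2.0), ("purchase", 40.0)]
    print(append_probabilities(bag))
    print(expected_value(bag))
```

A real UDF would also need to load the probability relation once per task
rather than hard-coding it; that part is left out of the sketch.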


> 
> Am I misremembering something? Is there some other sort of clever
> trickery that I might be able to use to generate expected values if I'm
> not? And if I am, is there something less hackish than a GROUP on a
> unique tuple element that I could use to load the desired values into a
> bag or tuple (or just plain pass the entire tuple to a UDF)?

Is this the alternative solution you are trying to avoid? Do a (foreach)
flatten on the events bag of the first relation, do a join (using 'replicated'
if the second relation is small enough), and then do a group-by on user (id).
This will not involve writing a UDF, but it will have an additional reduce
phase for the group-by. If you use a UDF that appends the information, it
will be a map-only job.

-Thejas
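
The flatten, join, and group-by steps described above can be sketched as a
small dataflow. This is plain Python that mimics the three steps on toy
data, not Pig Latin, and the sample records, field layout, and function
name are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical sample data; the field layout is an assumption.
# users: (user_id, bag of (event, value)); probs: (event, probability).
users = [
    ("u1", [("click", 2.0), ("purchase", 40.0)]),
    ("u2", [("click", 4.0)]),
]
probs = [("click", 0.5), ("purchase", 0.25)]


def expected_values(users, probs):
    # FLATTEN: one row per (user, event, value).
    flat = [(uid, event, value) for uid, bag in users for event, value in bag]

    # Replicated JOIN: the small relation is held in memory as a dict.
    prob_by_event = dict(probs)
    joined = [
        (uid, value * prob_by_event[event])
        for uid, event, value in flat
        if event in prob_by_event
    ]

    # GROUP BY user id, then SUM the weighted values.
    totals = defaultdict(float)
    for uid, weighted in joined:
        totals[uid] += weighted
    return dict(totals)


if __name__ == "__main__":
    print(expected_values(users, probs))
```

In Pig terms the three steps would correspond to a FOREACH with FLATTEN, a
JOIN using 'replicated', and a GROUP by user id followed by SUM. Any
ordering information in the original bag is lost at the flatten step unless
it is carried along as an explicit field.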