You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Marek Miglinski <mm...@seven.com> on 2011/09/12 16:19:20 UTC

JOINing two inputs

Hi,

I have a serious task to finish, hope somebody will help me... I have two inputs with data:

record1:
epoch,
game_id,
user_id,
other data

record2:
epoch,
game_id,
user_id,
other data

Now I need to JOIN record1 with record2 BY game_id, oper_id, user_id, epoch. BUT! epoch in record2 must be FIRST found data and it should be < than epoch in record1.

recordJoined = JOIN record1 BY (game_id, user_id), record2 BY (game_id, user_id); + add something like... CLOSEST(WHERE record1::epoch < record2::epoch);

So for example:

record1:
epoch::50
game_id::434
user_id::990

record2:
epoch::67
game_id::434
user_id::990
param1::pop

record2:
epoch::43
game_id::434
user_id::990
param1::wow

record2:
epoch::42
game_id::434
user_id::990
param1::slow

record2:
epoch::23
game_id::434
user_id::990
param1::fast


The result should be - record1.epoch::50, record1.game_id::434, record1.user_id::990, record2.epoch::43, record2.param1::wow and ...

Is it possible to accomplish through PIG? Using JOIN or using FOREACH?



Sincerely,
Marek M.



Re: JOINing two inputs

Posted by Xiaomeng Wan <sh...@gmail.com>.
It really depends on what your 'first' means (first by what?)
if you just want a single row, you can group the filter results like this:

a = group filtered by (id1, id2);
b = foreach a { x = limit filtered 1; generate FLATTEN(x) as (...); }

if "first" implies some kind of order, you can change "x=limit
filtered 1;"  to "y = order filtered by xxx; x = limit y 1; "

Shawn

On Mon, Sep 12, 2011 at 4:23 PM, Marek Miglinski <mm...@seven.com> wrote:
> Any clues anyone? I still didn't get anything myself, thinking...
>
>
>
> Sincerely,
> Marek M.
>
> -----Original Message-----
> From: yonghu [mailto:yongyong313@gmail.com]
> Sent: Monday, September 12, 2011 10:21 PM
> To: user@pig.apache.org
> Subject: Re: JOINing two inputs
>
> Sorry, I didn't understand you right. I didn't think just use Pig operator can finish this problem. You can first use cogroup operator to group the two inputs together. Then apply a UDF to each tuple.
>
> On Mon, Sep 12, 2011 at 5:35 PM, Marek Miglinski <mm...@seven.com>wrote:
>
>> Thanks for fast reply ;)
>>
>> Ok, I've done this:
>> recordJoined = JOIN record1 BY (game_id, user_id), record2 BY
>> (game_id, user_id);
>>
>> Now I have:
>> record1.epoch::50, record1.game_id::434, record1.user_id::990,
>> record2.epoch::67, record2.param1::pop record1.epoch::50,
>> record1.game_id::434, record1.user_id::990, record2.epoch::43,
>> record2.param1::wow record1.epoch::50, record1.game_id::434,
>> record1.user_id::990, record2.epoch::42, record2.param1::slow
>> record1.epoch::50, record1.game_id::434, record1.user_id::990,
>> record2.epoch::23, record2.param1::fast (Other data)
>> record1.epoch::67, record1.game_id::564, record1.user_id::889,
>> record2.epoch::44, record2.param1::pop ...
>>
>> Now what?
>> I can do this:
>> recordFiltered = FILTER recordJoined BY record1::epoch >=
>> record2::epoch;
>>
>> It will give me:
>> record1.epoch::50, record1.game_id::434, record1.user_id::990,
>> record2.epoch::43, record2.param1::wow record1.epoch::50,
>> record1.game_id::434, record1.user_id::990, record2.epoch::42,
>> record2.param1::slow record1.epoch::50, record1.game_id::434,
>> record1.user_id::990, record2.epoch::23, record2.param1::fast (Other
>> data) record1.epoch::67, record1.game_id::564, record1.user_id::889,
>> record2.epoch::44, record2.param1::pop ...
>>
>> Still not what I want, I need:
>> record1.epoch::50, record1.game_id::434, record1.user_id::990,
>> record2.epoch::43, record2.param1::wow (Other data) record1.epoch::67,
>> record1.game_id::564, record1.user_id::889, record2.epoch::44,
>> record2.param1::pop ...
>>
>>
>>
>> Sincerely,
>> Marek M.
>>
>> ________________________________________
>> From: yonghu [yongyong313@gmail.com]
>> Sent: Monday, September 12, 2011 5:49 PM
>> To: user@pig.apache.org
>> Subject: Re: JOINing two inputs
>>
>> I think you can first use join and then for each tuple using filter.
>>
>> On Mon, Sep 12, 2011 at 4:19 PM, Marek Miglinski <mmiglinski@seven.com
>> >wrote:
>>
>> > Hi,
>> >
>> > I have a serious task to finish, hope somebody will help me... I
>> > have two inputs with data:
>> >
>> > record1:
>> > epoch,
>> > game_id,
>> > user_id,
>> > other data
>> >
>> > record2:
>> > epoch,
>> > game_id,
>> > user_id,
>> > other data
>> >
>> > Now I need to JOIN record1 with record2 BY game_id, oper_id,
>> > user_id, epoch. BUT! epoch in record2 must be FIRST found data and
>> > it should be < than epoch in record1.
>> >
>> > recordJoined = JOIN record1 BY (game_id, user_id), record2 BY
>> > (game_id, user_id); + add something like... CLOSEST(WHERE
>> > record1::epoch < record2::epoch);
>> >
>> > So for example:
>> >
>> > record1:
>> > epoch::50
>> > game_id::434
>> > user_id::990
>> >
>> > record2:
>> > epoch::67
>> > game_id::434
>> > user_id::990
>> > param1::pop
>> >
>> > record2:
>> > epoch::43
>> > game_id::434
>> > user_id::990
>> > param1::wow
>> >
>> > record2:
>> > epoch::42
>> > game_id::434
>> > user_id::990
>> > param1::slow
>> >
>> > record2:
>> > epoch::23
>> > game_id::434
>> > user_id::990
>> > param1::fast
>> >
>> >
>> > The result should be - record1.epoch::50, record1.game_id::434,
>> > record1.user_id::990, record2.epoch::43, record2.param1::wow and ...
>> >
>> > Is it possible to accomplish through PIG? Using JOIN or using FOREACH?
>> >
>> >
>> >
>> > Sincerely,
>> > Marek M.
>> >
>> >
>> >
>>
>

RE: JOINing two inputs

Posted by Marek Miglinski <mm...@seven.com>.
Any clues anyone? I still didn't get anything myself, thinking... 



Sincerely,
Marek M.

-----Original Message-----
From: yonghu [mailto:yongyong313@gmail.com] 
Sent: Monday, September 12, 2011 10:21 PM
To: user@pig.apache.org
Subject: Re: JOINing two inputs

Sorry, I didn't understand you right. I didn't think just use Pig operator can finish this problem. You can first use cogroup operator to group the two inputs together. Then apply a UDF to each tuple.

On Mon, Sep 12, 2011 at 5:35 PM, Marek Miglinski <mm...@seven.com>wrote:

> Thanks for fast reply ;)
>
> Ok, I've done this:
> recordJoined = JOIN record1 BY (game_id, user_id), record2 BY 
> (game_id, user_id);
>
> Now I have:
> record1.epoch::50, record1.game_id::434, record1.user_id::990, 
> record2.epoch::67, record2.param1::pop record1.epoch::50, 
> record1.game_id::434, record1.user_id::990, record2.epoch::43, 
> record2.param1::wow record1.epoch::50, record1.game_id::434, 
> record1.user_id::990, record2.epoch::42, record2.param1::slow 
> record1.epoch::50, record1.game_id::434, record1.user_id::990, 
> record2.epoch::23, record2.param1::fast (Other data) 
> record1.epoch::67, record1.game_id::564, record1.user_id::889, 
> record2.epoch::44, record2.param1::pop ...
>
> Now what?
> I can do this:
> recordFiltered = FILTER recordJoined BY record1::epoch >= 
> record2::epoch;
>
> It will give me:
> record1.epoch::50, record1.game_id::434, record1.user_id::990, 
> record2.epoch::43, record2.param1::wow record1.epoch::50, 
> record1.game_id::434, record1.user_id::990, record2.epoch::42, 
> record2.param1::slow record1.epoch::50, record1.game_id::434, 
> record1.user_id::990, record2.epoch::23, record2.param1::fast (Other 
> data) record1.epoch::67, record1.game_id::564, record1.user_id::889, 
> record2.epoch::44, record2.param1::pop ...
>
> Still not what I want, I need:
> record1.epoch::50, record1.game_id::434, record1.user_id::990, 
> record2.epoch::43, record2.param1::wow (Other data) record1.epoch::67, 
> record1.game_id::564, record1.user_id::889, record2.epoch::44, 
> record2.param1::pop ...
>
>
>
> Sincerely,
> Marek M.
>
> ________________________________________
> From: yonghu [yongyong313@gmail.com]
> Sent: Monday, September 12, 2011 5:49 PM
> To: user@pig.apache.org
> Subject: Re: JOINing two inputs
>
> I think you can first use join and then for each tuple using filter.
>
> On Mon, Sep 12, 2011 at 4:19 PM, Marek Miglinski <mmiglinski@seven.com
> >wrote:
>
> > Hi,
> >
> > I have a serious task to finish, hope somebody will help me... I 
> > have two inputs with data:
> >
> > record1:
> > epoch,
> > game_id,
> > user_id,
> > other data
> >
> > record2:
> > epoch,
> > game_id,
> > user_id,
> > other data
> >
> > Now I need to JOIN record1 with record2 BY game_id, oper_id, 
> > user_id, epoch. BUT! epoch in record2 must be FIRST found data and 
> > it should be < than epoch in record1.
> >
> > recordJoined = JOIN record1 BY (game_id, user_id), record2 BY 
> > (game_id, user_id); + add something like... CLOSEST(WHERE 
> > record1::epoch < record2::epoch);
> >
> > So for example:
> >
> > record1:
> > epoch::50
> > game_id::434
> > user_id::990
> >
> > record2:
> > epoch::67
> > game_id::434
> > user_id::990
> > param1::pop
> >
> > record2:
> > epoch::43
> > game_id::434
> > user_id::990
> > param1::wow
> >
> > record2:
> > epoch::42
> > game_id::434
> > user_id::990
> > param1::slow
> >
> > record2:
> > epoch::23
> > game_id::434
> > user_id::990
> > param1::fast
> >
> >
> > The result should be - record1.epoch::50, record1.game_id::434, 
> > record1.user_id::990, record2.epoch::43, record2.param1::wow and ...
> >
> > Is it possible to accomplish through PIG? Using JOIN or using FOREACH?
> >
> >
> >
> > Sincerely,
> > Marek M.
> >
> >
> >
>

Re: JOINing two inputs

Posted by yonghu <yo...@gmail.com>.
Sorry, I didn't understand you right. I didn't think just use Pig operator
can finish this problem. You can first use cogroup operator to group the two
inputs together. Then apply a UDF to each tuple.

On Mon, Sep 12, 2011 at 5:35 PM, Marek Miglinski <mm...@seven.com>wrote:

> Thanks for fast reply ;)
>
> Ok, I've done this:
> recordJoined = JOIN record1 BY (game_id, user_id), record2 BY (game_id,
> user_id);
>
> Now I have:
> record1.epoch::50, record1.game_id::434, record1.user_id::990,
> record2.epoch::67, record2.param1::pop
> record1.epoch::50, record1.game_id::434, record1.user_id::990,
> record2.epoch::43, record2.param1::wow
> record1.epoch::50, record1.game_id::434, record1.user_id::990,
> record2.epoch::42, record2.param1::slow
> record1.epoch::50, record1.game_id::434, record1.user_id::990,
> record2.epoch::23, record2.param1::fast
> (Other data) record1.epoch::67, record1.game_id::564, record1.user_id::889,
> record2.epoch::44, record2.param1::pop
> ...
>
> Now what?
> I can do this:
> recordFiltered = FILTER recordJoined BY record1::epoch >= record2::epoch;
>
> It will give me:
> record1.epoch::50, record1.game_id::434, record1.user_id::990,
> record2.epoch::43, record2.param1::wow
> record1.epoch::50, record1.game_id::434, record1.user_id::990,
> record2.epoch::42, record2.param1::slow
> record1.epoch::50, record1.game_id::434, record1.user_id::990,
> record2.epoch::23, record2.param1::fast
> (Other data) record1.epoch::67, record1.game_id::564, record1.user_id::889,
> record2.epoch::44, record2.param1::pop
> ...
>
> Still not what I want, I need:
> record1.epoch::50, record1.game_id::434, record1.user_id::990,
> record2.epoch::43, record2.param1::wow
> (Other data) record1.epoch::67, record1.game_id::564, record1.user_id::889,
> record2.epoch::44, record2.param1::pop
> ...
>
>
>
> Sincerely,
> Marek M.
>
> ________________________________________
> From: yonghu [yongyong313@gmail.com]
> Sent: Monday, September 12, 2011 5:49 PM
> To: user@pig.apache.org
> Subject: Re: JOINing two inputs
>
> I think you can first use join and then for each tuple using filter.
>
> On Mon, Sep 12, 2011 at 4:19 PM, Marek Miglinski <mmiglinski@seven.com
> >wrote:
>
> > Hi,
> >
> > I have a serious task to finish, hope somebody will help me... I have two
> > inputs with data:
> >
> > record1:
> > epoch,
> > game_id,
> > user_id,
> > other data
> >
> > record2:
> > epoch,
> > game_id,
> > user_id,
> > other data
> >
> > Now I need to JOIN record1 with record2 BY game_id, oper_id, user_id,
> > epoch. BUT! epoch in record2 must be FIRST found data and it should be <
> > than epoch in record1.
> >
> > recordJoined = JOIN record1 BY (game_id, user_id), record2 BY (game_id,
> > user_id); + add something like... CLOSEST(WHERE record1::epoch <
> > record2::epoch);
> >
> > So for example:
> >
> > record1:
> > epoch::50
> > game_id::434
> > user_id::990
> >
> > record2:
> > epoch::67
> > game_id::434
> > user_id::990
> > param1::pop
> >
> > record2:
> > epoch::43
> > game_id::434
> > user_id::990
> > param1::wow
> >
> > record2:
> > epoch::42
> > game_id::434
> > user_id::990
> > param1::slow
> >
> > record2:
> > epoch::23
> > game_id::434
> > user_id::990
> > param1::fast
> >
> >
> > The result should be - record1.epoch::50, record1.game_id::434,
> > record1.user_id::990, record2.epoch::43, record2.param1::wow and ...
> >
> > Is it possible to accomplish through PIG? Using JOIN or using FOREACH?
> >
> >
> >
> > Sincerely,
> > Marek M.
> >
> >
> >
>

RE: JOINing two inputs

Posted by Marek Miglinski <mm...@seven.com>.
Thanks for fast reply ;)

Ok, I've done this:
recordJoined = JOIN record1 BY (game_id, user_id), record2 BY (game_id, user_id);

Now I have:
record1.epoch::50, record1.game_id::434, record1.user_id::990, record2.epoch::67, record2.param1::pop
record1.epoch::50, record1.game_id::434, record1.user_id::990, record2.epoch::43, record2.param1::wow
record1.epoch::50, record1.game_id::434, record1.user_id::990, record2.epoch::42, record2.param1::slow
record1.epoch::50, record1.game_id::434, record1.user_id::990, record2.epoch::23, record2.param1::fast
(Other data) record1.epoch::67, record1.game_id::564, record1.user_id::889, record2.epoch::44, record2.param1::pop
...

Now what?
I can do this:
recordFiltered = FILTER recordJoined BY record1::epoch >= record2::epoch;

It will give me:
record1.epoch::50, record1.game_id::434, record1.user_id::990, record2.epoch::43, record2.param1::wow
record1.epoch::50, record1.game_id::434, record1.user_id::990, record2.epoch::42, record2.param1::slow
record1.epoch::50, record1.game_id::434, record1.user_id::990, record2.epoch::23, record2.param1::fast
(Other data) record1.epoch::67, record1.game_id::564, record1.user_id::889, record2.epoch::44, record2.param1::pop
...

Still not what I want, I need:
record1.epoch::50, record1.game_id::434, record1.user_id::990, record2.epoch::43, record2.param1::wow
(Other data) record1.epoch::67, record1.game_id::564, record1.user_id::889, record2.epoch::44, record2.param1::pop
...



Sincerely,
Marek M.

________________________________________
From: yonghu [yongyong313@gmail.com]
Sent: Monday, September 12, 2011 5:49 PM
To: user@pig.apache.org
Subject: Re: JOINing two inputs

I think you can first use join and then for each tuple using filter.

On Mon, Sep 12, 2011 at 4:19 PM, Marek Miglinski <mm...@seven.com>wrote:

> Hi,
>
> I have a serious task to finish, hope somebody will help me... I have two
> inputs with data:
>
> record1:
> epoch,
> game_id,
> user_id,
> other data
>
> record2:
> epoch,
> game_id,
> user_id,
> other data
>
> Now I need to JOIN record1 with record2 BY game_id, oper_id, user_id,
> epoch. BUT! epoch in record2 must be FIRST found data and it should be <
> than epoch in record1.
>
> recordJoined = JOIN record1 BY (game_id, user_id), record2 BY (game_id,
> user_id); + add something like... CLOSEST(WHERE record1::epoch <
> record2::epoch);
>
> So for example:
>
> record1:
> epoch::50
> game_id::434
> user_id::990
>
> record2:
> epoch::67
> game_id::434
> user_id::990
> param1::pop
>
> record2:
> epoch::43
> game_id::434
> user_id::990
> param1::wow
>
> record2:
> epoch::42
> game_id::434
> user_id::990
> param1::slow
>
> record2:
> epoch::23
> game_id::434
> user_id::990
> param1::fast
>
>
> The result should be - record1.epoch::50, record1.game_id::434,
> record1.user_id::990, record2.epoch::43, record2.param1::wow and ...
>
> Is it possible to accomplish through PIG? Using JOIN or using FOREACH?
>
>
>
> Sincerely,
> Marek M.
>
>
>

Re: JOINing two inputs

Posted by yonghu <yo...@gmail.com>.
I think you can first use join and then for each tuple using filter.

On Mon, Sep 12, 2011 at 4:19 PM, Marek Miglinski <mm...@seven.com>wrote:

> Hi,
>
> I have a serious task to finish, hope somebody will help me... I have two
> inputs with data:
>
> record1:
> epoch,
> game_id,
> user_id,
> other data
>
> record2:
> epoch,
> game_id,
> user_id,
> other data
>
> Now I need to JOIN record1 with record2 BY game_id, oper_id, user_id,
> epoch. BUT! epoch in record2 must be FIRST found data and it should be <
> than epoch in record1.
>
> recordJoined = JOIN record1 BY (game_id, user_id), record2 BY (game_id,
> user_id); + add something like... CLOSEST(WHERE record1::epoch <
> record2::epoch);
>
> So for example:
>
> record1:
> epoch::50
> game_id::434
> user_id::990
>
> record2:
> epoch::67
> game_id::434
> user_id::990
> param1::pop
>
> record2:
> epoch::43
> game_id::434
> user_id::990
> param1::wow
>
> record2:
> epoch::42
> game_id::434
> user_id::990
> param1::slow
>
> record2:
> epoch::23
> game_id::434
> user_id::990
> param1::fast
>
>
> The result should be - record1.epoch::50, record1.game_id::434,
> record1.user_id::990, record2.epoch::43, record2.param1::wow and ...
>
> Is it possible to accomplish through PIG? Using JOIN or using FOREACH?
>
>
>
> Sincerely,
> Marek M.
>
>
>