You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by sonia gehlot <so...@gmail.com> on 2012/07/05 21:21:37 UTC

Join with greater/less then condition

Hi Guys,

I want to join 2 tables in hive on couple of columns and out them one
condition is timestamp of one column is greater then the other one. In SQL
I could have written in this way:

table_a a Join table_b b
on a.user_id = b.user_id
and a.title_id = b.title_id
and a.timestamp > b.timestamp

How to write last condition in Pig? *a.timestamp > b.timestamp*

Thanks,
Sonia

Re: Join with greater/less then condition

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Replace the filter with a foreach / generate (replace the right side
with nulls of > condition is not satisfied)

On Thu, Jul 5, 2012 at 2:28 PM, sonia gehlot <so...@gmail.com> wrote:
> Actually I wanted to do left outer join, so not sure if filter will work in
> this case.
>
>
> On Thu, Jul 5, 2012 at 12:43 PM, Alan Gates <ga...@hortonworks.com> wrote:
>
>> Pig can only do equi-joins.  Theta joins are hard in MapReduce.  So the
>> way to do this is do the equi-join and then filter afterwards.  This will
>> not create significant additional cost since the join results will be
>> filtered before being materialized to disk.
>>
>> C = Join table_a on user_id, title_id, table_b on user_id, title_id;
>> D = filter C by table_a::timestamp > table_b::timestamp;
>>
>> Alan.
>>
>> On Jul 5, 2012, at 12:21 PM, sonia gehlot wrote:
>>
>> > Hi Guys,
>> >
>> > I want to join 2 tables in hive on couple of columns and out them one
>> > condition is timestamp of one column is greater then the other one. In
>> SQL
>> > I could have written in this way:
>> >
>> > table_a a Join table_b b
>> > on a.user_id = b.user_id
>> > and a.title_id = b.title_id
>> > and a.timestamp > b.timestamp
>> >
>> > How to write last condition in Pig? *a.timestamp > b.timestamp*
>> >
>> > Thanks,
>> > Sonia
>>
>>

Re: Join with greater/less then condition

Posted by sonia gehlot <so...@gmail.com>.
Actually I wanted to do left outer join, so not sure if filter will work in
this case.


On Thu, Jul 5, 2012 at 12:43 PM, Alan Gates <ga...@hortonworks.com> wrote:

> Pig can only do equi-joins.  Theta joins are hard in MapReduce.  So the
> way to do this is do the equi-join and then filter afterwards.  This will
> not create significant additional cost since the join results will be
> filtered before being materialized to disk.
>
> C = Join table_a on user_id, title_id, table_b on user_id, title_id;
> D = filter C by table_a::timestamp > table_b::timestamp;
>
> Alan.
>
> On Jul 5, 2012, at 12:21 PM, sonia gehlot wrote:
>
> > Hi Guys,
> >
> > I want to join 2 tables in hive on couple of columns and out them one
> > condition is timestamp of one column is greater then the other one. In
> SQL
> > I could have written in this way:
> >
> > table_a a Join table_b b
> > on a.user_id = b.user_id
> > and a.title_id = b.title_id
> > and a.timestamp > b.timestamp
> >
> > How to write last condition in Pig? *a.timestamp > b.timestamp*
> >
> > Thanks,
> > Sonia
>
>

Re: Join with greater/less then condition

Posted by Alan Gates <ga...@hortonworks.com>.
Pig can only do equi-joins.  Theta joins are hard in MapReduce.  So the way to do this is do the equi-join and then filter afterwards.  This will not create significant additional cost since the join results will be filtered before being materialized to disk.

C = Join table_a on user_id, title_id, table_b on user_id, title_id;
D = filter C by table_a::timestamp > table_b::timestamp;

Alan.

On Jul 5, 2012, at 12:21 PM, sonia gehlot wrote:

> Hi Guys,
> 
> I want to join 2 tables in hive on couple of columns and out them one
> condition is timestamp of one column is greater then the other one. In SQL
> I could have written in this way:
> 
> table_a a Join table_b b
> on a.user_id = b.user_id
> and a.title_id = b.title_id
> and a.timestamp > b.timestamp
> 
> How to write last condition in Pig? *a.timestamp > b.timestamp*
> 
> Thanks,
> Sonia