You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by Alan Gates <ga...@yahoo-inc.com> on 2009/04/08 17:50:19 UTC

Re: how to join two tables with a condition

First, as an aside, this email really should be on pig-user rather  
than pig-dev, as it's a usage question, not a development question.   
So I've pushed it onto that list and replied to you directly in case  
you're not on that list.

If I understand correctly you want to do a non-equijoin on the data.    
That can be done as follows:

Table1 = LOAD 'Table1' AS (userid, ipaddress, date);
Table2 = LOAD 'Table2' AS (startip, endip);
Crossed = CROSS Table1, Table2;
Joined = FILTER Crossed BY ipaddress > startip & ipaddress < endip;

Note that this will not be very efficient because it has to do a  
complete cross product of all tuples.  Pig does not support non- 
equijoins currently.  In general, non-equijoins are hard to do  
efficiently in map reduce because it's hard to get all of the  
appropriate keys together in the same reducers.

So, if you're going to do this on very large data, it will be very slow.

Alan.

On Apr 7, 2009, at 8:01 AM, venkata ramanaiah anneboina wrote:

>>
>> Hi
>> i want some operation on pig;
>> I have two tables of data
>> Table1 contains userid,ipaddress, date
>> Table2 contains startip,endip
>>
>> i want the data for the fallowing query
>>  Table1.ipaddress>table2.startip &
>> table1.ipaddress<table2.endip
>>
>> how to write join or cogroup in pig using piglatin scripts
>>
>> can any one help in this
>>
>>
>> thanks
>> ramana
>>

Re: how to join two tables with a condition

Posted by Chris Olston <ol...@yahoo-inc.com>.

There are a bunch of papers on doing "band joins" efficiently, which I
believe is what we're seeing an instance of. And like most join techniques
they are probably easy to parallelize.

If this turns out to be a frequent usage pattern for pig applications we can
consider doing something special for them. If it's rare, I'm afraid users
will have to make do with cross-product. (The good news is that Pig uses an
NxM parallel cross-product implementation, so with enough machines/threads
the data should become sufficiently chopped so that each one only needs to
deal with a small amount of data. Not sure how the current Pig code chooses
N and M though; I don't think we ever did much tuning on this.)

-Chris


On 4/8/09 3:16 PM, "Yiping Han" <yh...@yahoo-inc.com> wrote:

> Sorry, I misunderstood "one map". Yes, the keys are required in one map (or
> one shard).
> 
> As long as it is a map-side join, all tables join on the same key, and the
> filtering is per row basis. It should be easy to implement.
> 
> 
> --Yiping
> 
> On 4/8/09 3:11 PM, "Yiping Han" <yh...@yahoo-inc.com> wrote:
> 
>> It does not require all the keys are in one map, it could be in multiple
>> maps.
>> But it has to be the key that is used to partition the shards.
>> 
>> I agree this does not solve all WHERE clauses. But it does work for many
>> cases, like filtering people at a certain age, limit only to urls that has a
>> certain number of inlinks, etc.
>> 
>> As long as such filtering reduce a significant portion of the rows, it should
>> benefit us quite a lot on performance.
>> 
>> 
>> --Yiping
>> 
>> 
>> On 4/8/09 2:57 PM, "Alan Gates" <ga...@yahoo-inc.com> wrote:
>> 
>>> The issue here is that the odds that all of the keys you want are in
>>> one map are very low, unless you use only one map for the file.  Since
>>> you aren't doing equijoin, even if your files are partitioned the same
>>> way this won't guarantee that all the keys you need to do the join are
>>> in the same map.
>>> 
>>> Alan.
>>> 
>>> On Apr 8, 2009, at 11:09 AM, Yiping Han wrote:
>>> 
>>>> If such a join could be a map-side join, this can be done efficiently.
>>>> 
>>>> Actually I was discussing with Nathan from Hadoop Table about this
>>>> yesterday. For those of us who will go for Hadoop Table, such an early
>>>> filtering should be pushed down to Table layer.
>>>> 
>>>> In general, for an query like:
>>>> 
>>>> SELECT f1, f2, ..., fn FROM T1, T2, ..., Tn WHERE T1.f1 = T2.f2 and
>>>> T3.f3 =
>>>> T4.f4
>>>> 
>>>> As in map-side joint, we should join each row for those tables that
>>>> has an
>>>> filter function first (here T1 with T2 and T3 wth T4) and filter,
>>>> then join
>>>> with the rest tables. (of course projection before filtering for
>>>> each table)
>>>> 
>>>> 
>>>> 
>>>> --Yiping
>>>> 
>>>> 
>>>> On 4/8/09 8:50 AM, "Alan Gates" <ga...@yahoo-inc.com> wrote:
>>>> 
>>>>> First, as an aside, this email really should be on pig-user rather
>>>>> than pig-dev, as it's a usage question, not a development question.
>>>>> So I've pushed it onto that list and replied to you directly in case
>>>>> you're not on that list.
>>>>> 
>>>>> If I understand correctly you want to do a non-equijoin on the data.
>>>>> That can be done as follows:
>>>>> 
>>>>> Table1 = LOAD 'Table1' AS (userid, ipaddress, date);
>>>>> Table2 = LOAD 'Table2' AS (startip, endip);
>>>>> Crossed = CROSS Table1, Table2;
>>>>> Joined = FILTER Crossed BY ipaddress > startip & ipaddress < endip;
>>>>> 
>>>>> Note that this will not be very efficient because it has to do a
>>>>> complete cross product of all tuples.  Pig does not support non-
>>>>> equijoins currently.  In general, non-equijoins are hard to do
>>>>> efficiently in map reduce because it's hard to get all of the
>>>>> appropriate keys together in the same reducers.
>>>>> 
>>>>> So, if you're going to do this on very large data, it will be very
>>>>> slow.
>>>>> 
>>>>> Alan.
>>>>> 
>>>>> On Apr 7, 2009, at 8:01 AM, venkata ramanaiah anneboina wrote:
>>>>> 
>>>>>>> 
>>>>>>> Hi
>>>>>>> i want some operation on pig;
>>>>>>> I have two tables of data
>>>>>>> Table1 contains userid,ipaddress, date
>>>>>>> Table2 contains startip,endip
>>>>>>> 
>>>>>>> i want the data for the fallowing query
>>>>>>> Table1.ipaddress>table2.startip &
>>>>>>> table1.ipaddress<table2.endip
>>>>>>> 
>>>>>>> how to write join or cogroup in pig using piglatin scripts
>>>>>>> 
>>>>>>> can any one help in this
>>>>>>> 
>>>>>>> 
>>>>>>> thanks
>>>>>>> ramana
>>>>>>> 
>>>>> 
>>>> 
>>>> ----------------------
>>>> Yiping Han
>>>> 2MC 8127
>>>> 2811 Mission College Blvd.,
>>>> Santa Clara, CA 95054
>>>> (408)349-4403
>>>> yhan@yahoo-inc.com
>>>> 
>>> 
>> 
>> ----------------------
>> Yiping Han
>> 2MC 8127
>> 2811 Mission College Blvd.,
>> Santa Clara, CA 95054
>> (408)349-4403
>> yhan@yahoo-inc.com
> 
> ----------------------
> Yiping Han
> 2MC 8127
> 2811 Mission College Blvd.,
> Santa Clara, CA 95054
> (408)349-4403
> yhan@yahoo-inc.com
> 

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research

Re: how to join two tables with a condition

Posted by Yiping Han <yh...@yahoo-inc.com>.

Sorry, I misunderstood "one map". Yes, the keys are required in one map (or
one shard).

As long as it is a map-side join, all tables join on the same key, and the
filtering is per row basis. It should be easy to implement.


--Yiping

On 4/8/09 3:11 PM, "Yiping Han" <yh...@yahoo-inc.com> wrote:

> It does not require all the keys are in one map, it could be in multiple maps.
> But it has to be the key that is used to partition the shards.
> 
> I agree this does not solve all WHERE clauses. But it does work for many
> cases, like filtering people at a certain age, limit only to urls that has a
> certain number of inlinks, etc.
> 
> As long as such filtering reduce a significant portion of the rows, it should
> benefit us quite a lot on performance.
> 
> 
> --Yiping
> 
> 
> On 4/8/09 2:57 PM, "Alan Gates" <ga...@yahoo-inc.com> wrote:
> 
>> The issue here is that the odds that all of the keys you want are in
>> one map are very low, unless you use only one map for the file.  Since
>> you aren't doing equijoin, even if your files are partitioned the same
>> way this won't guarantee that all the keys you need to do the join are
>> in the same map.
>> 
>> Alan.
>> 
>> On Apr 8, 2009, at 11:09 AM, Yiping Han wrote:
>> 
>>> If such a join could be a map-side join, this can be done efficiently.
>>> 
>>> Actually I was discussing with Nathan from Hadoop Table about this
>>> yesterday. For those of us who will go for Hadoop Table, such an early
>>> filtering should be pushed down to Table layer.
>>> 
>>> In general, for an query like:
>>> 
>>> SELECT f1, f2, ..., fn FROM T1, T2, ..., Tn WHERE T1.f1 = T2.f2 and
>>> T3.f3 =
>>> T4.f4
>>> 
>>> As in map-side joint, we should join each row for those tables that
>>> has an
>>> filter function first (here T1 with T2 and T3 wth T4) and filter,
>>> then join
>>> with the rest tables. (of course projection before filtering for
>>> each table)
>>> 
>>> 
>>> 
>>> --Yiping
>>> 
>>> 
>>> On 4/8/09 8:50 AM, "Alan Gates" <ga...@yahoo-inc.com> wrote:
>>> 
>>>> First, as an aside, this email really should be on pig-user rather
>>>> than pig-dev, as it's a usage question, not a development question.
>>>> So I've pushed it onto that list and replied to you directly in case
>>>> you're not on that list.
>>>> 
>>>> If I understand correctly you want to do a non-equijoin on the data.
>>>> That can be done as follows:
>>>> 
>>>> Table1 = LOAD 'Table1' AS (userid, ipaddress, date);
>>>> Table2 = LOAD 'Table2' AS (startip, endip);
>>>> Crossed = CROSS Table1, Table2;
>>>> Joined = FILTER Crossed BY ipaddress > startip & ipaddress < endip;
>>>> 
>>>> Note that this will not be very efficient because it has to do a
>>>> complete cross product of all tuples.  Pig does not support non-
>>>> equijoins currently.  In general, non-equijoins are hard to do
>>>> efficiently in map reduce because it's hard to get all of the
>>>> appropriate keys together in the same reducers.
>>>> 
>>>> So, if you're going to do this on very large data, it will be very
>>>> slow.
>>>> 
>>>> Alan.
>>>> 
>>>> On Apr 7, 2009, at 8:01 AM, venkata ramanaiah anneboina wrote:
>>>> 
>>>>>> 
>>>>>> Hi
>>>>>> i want some operation on pig;
>>>>>> I have two tables of data
>>>>>> Table1 contains userid,ipaddress, date
>>>>>> Table2 contains startip,endip
>>>>>> 
>>>>>> i want the data for the fallowing query
>>>>>> Table1.ipaddress>table2.startip &
>>>>>> table1.ipaddress<table2.endip
>>>>>> 
>>>>>> how to write join or cogroup in pig using piglatin scripts
>>>>>> 
>>>>>> can any one help in this
>>>>>> 
>>>>>> 
>>>>>> thanks
>>>>>> ramana
>>>>>> 
>>>> 
>>> 
>>> ----------------------
>>> Yiping Han
>>> 2MC 8127
>>> 2811 Mission College Blvd.,
>>> Santa Clara, CA 95054
>>> (408)349-4403
>>> yhan@yahoo-inc.com
>>> 
>> 
> 
> ----------------------
> Yiping Han
> 2MC 8127
> 2811 Mission College Blvd.,
> Santa Clara, CA 95054
> (408)349-4403
> yhan@yahoo-inc.com

----------------------
Yiping Han
2MC 8127
2811 Mission College Blvd.,
Santa Clara, CA 95054
(408)349-4403
yhan@yahoo-inc.com

Re: how to join two tables with a condition

Posted by Yiping Han <yh...@yahoo-inc.com>.

It does not require all the keys are in one map, it could be in multiple
maps. But it has to be the key that is used to partition the shards.

I agree this does not solve all WHERE clauses. But it does work for many
cases, like filtering people at a certain age, limit only to urls that has a
certain number of inlinks, etc.

As long as such filtering reduce a significant portion of the rows, it
should benefit us quite a lot on performance.


--Yiping


On 4/8/09 2:57 PM, "Alan Gates" <ga...@yahoo-inc.com> wrote:

> The issue here is that the odds that all of the keys you want are in
> one map are very low, unless you use only one map for the file.  Since
> you aren't doing equijoin, even if your files are partitioned the same
> way this won't guarantee that all the keys you need to do the join are
> in the same map.
> 
> Alan.
> 
> On Apr 8, 2009, at 11:09 AM, Yiping Han wrote:
> 
>> If such a join could be a map-side join, this can be done efficiently.
>> 
>> Actually I was discussing with Nathan from Hadoop Table about this
>> yesterday. For those of us who will go for Hadoop Table, such an early
>> filtering should be pushed down to Table layer.
>> 
>> In general, for an query like:
>> 
>> SELECT f1, f2, ..., fn FROM T1, T2, ..., Tn WHERE T1.f1 = T2.f2 and
>> T3.f3 =
>> T4.f4
>> 
>> As in map-side joint, we should join each row for those tables that
>> has an
>> filter function first (here T1 with T2 and T3 wth T4) and filter,
>> then join
>> with the rest tables. (of course projection before filtering for
>> each table)
>> 
>> 
>> 
>> --Yiping
>> 
>> 
>> On 4/8/09 8:50 AM, "Alan Gates" <ga...@yahoo-inc.com> wrote:
>> 
>>> First, as an aside, this email really should be on pig-user rather
>>> than pig-dev, as it's a usage question, not a development question.
>>> So I've pushed it onto that list and replied to you directly in case
>>> you're not on that list.
>>> 
>>> If I understand correctly you want to do a non-equijoin on the data.
>>> That can be done as follows:
>>> 
>>> Table1 = LOAD 'Table1' AS (userid, ipaddress, date);
>>> Table2 = LOAD 'Table2' AS (startip, endip);
>>> Crossed = CROSS Table1, Table2;
>>> Joined = FILTER Crossed BY ipaddress > startip & ipaddress < endip;
>>> 
>>> Note that this will not be very efficient because it has to do a
>>> complete cross product of all tuples.  Pig does not support non-
>>> equijoins currently.  In general, non-equijoins are hard to do
>>> efficiently in map reduce because it's hard to get all of the
>>> appropriate keys together in the same reducers.
>>> 
>>> So, if you're going to do this on very large data, it will be very
>>> slow.
>>> 
>>> Alan.
>>> 
>>> On Apr 7, 2009, at 8:01 AM, venkata ramanaiah anneboina wrote:
>>> 
>>>>> 
>>>>> Hi
>>>>> i want some operation on pig;
>>>>> I have two tables of data
>>>>> Table1 contains userid,ipaddress, date
>>>>> Table2 contains startip,endip
>>>>> 
>>>>> i want the data for the fallowing query
>>>>> Table1.ipaddress>table2.startip &
>>>>> table1.ipaddress<table2.endip
>>>>> 
>>>>> how to write join or cogroup in pig using piglatin scripts
>>>>> 
>>>>> can any one help in this
>>>>> 
>>>>> 
>>>>> thanks
>>>>> ramana
>>>>> 
>>> 
>> 
>> ----------------------
>> Yiping Han
>> 2MC 8127
>> 2811 Mission College Blvd.,
>> Santa Clara, CA 95054
>> (408)349-4403
>> yhan@yahoo-inc.com
>> 
> 

----------------------
Yiping Han
2MC 8127
2811 Mission College Blvd.,
Santa Clara, CA 95054
(408)349-4403
yhan@yahoo-inc.com

Re: how to join two tables with a condition

Posted by Alan Gates <ga...@yahoo-inc.com>.

The issue here is that the odds that all of the keys you want are in  
one map are very low, unless you use only one map for the file.  Since  
you aren't doing equijoin, even if your files are partitioned the same  
way this won't guarantee that all the keys you need to do the join are  
in the same map.

Alan.

On Apr 8, 2009, at 11:09 AM, Yiping Han wrote:

> If such a join could be a map-side join, this can be done efficiently.
>
> Actually I was discussing with Nathan from Hadoop Table about this
> yesterday. For those of us who will go for Hadoop Table, such an early
> filtering should be pushed down to Table layer.
>
> In general, for an query like:
>
> SELECT f1, f2, ..., fn FROM T1, T2, ..., Tn WHERE T1.f1 = T2.f2 and  
> T3.f3 =
> T4.f4
>
> As in map-side joint, we should join each row for those tables that  
> has an
> filter function first (here T1 with T2 and T3 wth T4) and filter,  
> then join
> with the rest tables. (of course projection before filtering for  
> each table)
>
>
>
> --Yiping
>
>
> On 4/8/09 8:50 AM, "Alan Gates" <ga...@yahoo-inc.com> wrote:
>
>> First, as an aside, this email really should be on pig-user rather
>> than pig-dev, as it's a usage question, not a development question.
>> So I've pushed it onto that list and replied to you directly in case
>> you're not on that list.
>>
>> If I understand correctly you want to do a non-equijoin on the data.
>> That can be done as follows:
>>
>> Table1 = LOAD 'Table1' AS (userid, ipaddress, date);
>> Table2 = LOAD 'Table2' AS (startip, endip);
>> Crossed = CROSS Table1, Table2;
>> Joined = FILTER Crossed BY ipaddress > startip & ipaddress < endip;
>>
>> Note that this will not be very efficient because it has to do a
>> complete cross product of all tuples.  Pig does not support non-
>> equijoins currently.  In general, non-equijoins are hard to do
>> efficiently in map reduce because it's hard to get all of the
>> appropriate keys together in the same reducers.
>>
>> So, if you're going to do this on very large data, it will be very  
>> slow.
>>
>> Alan.
>>
>> On Apr 7, 2009, at 8:01 AM, venkata ramanaiah anneboina wrote:
>>
>>>>
>>>> Hi
>>>> i want some operation on pig;
>>>> I have two tables of data
>>>> Table1 contains userid,ipaddress, date
>>>> Table2 contains startip,endip
>>>>
>>>> i want the data for the fallowing query
>>>> Table1.ipaddress>table2.startip &
>>>> table1.ipaddress<table2.endip
>>>>
>>>> how to write join or cogroup in pig using piglatin scripts
>>>>
>>>> can any one help in this
>>>>
>>>>
>>>> thanks
>>>> ramana
>>>>
>>
>
> ----------------------
> Yiping Han
> 2MC 8127
> 2811 Mission College Blvd.,
> Santa Clara, CA 95054
> (408)349-4403
> yhan@yahoo-inc.com
>

Re: how to join two tables with a condition

Posted by Yiping Han <yh...@yahoo-inc.com>.

If such a join could be a map-side join, this can be done efficiently.

Actually I was discussing with Nathan from Hadoop Table about this
yesterday. For those of us who will go for Hadoop Table, such an early
filtering should be pushed down to Table layer.

In general, for an query like:

SELECT f1, f2, ..., fn FROM T1, T2, ..., Tn WHERE T1.f1 = T2.f2 and T3.f3 =
T4.f4

As in map-side joint, we should join each row for those tables that has an
filter function first (here T1 with T2 and T3 wth T4) and filter, then join
with the rest tables. (of course projection before filtering for each table)
  


--Yiping


On 4/8/09 8:50 AM, "Alan Gates" <ga...@yahoo-inc.com> wrote:

> First, as an aside, this email really should be on pig-user rather
> than pig-dev, as it's a usage question, not a development question.
> So I've pushed it onto that list and replied to you directly in case
> you're not on that list.
> 
> If I understand correctly you want to do a non-equijoin on the data.
> That can be done as follows:
> 
> Table1 = LOAD 'Table1' AS (userid, ipaddress, date);
> Table2 = LOAD 'Table2' AS (startip, endip);
> Crossed = CROSS Table1, Table2;
> Joined = FILTER Crossed BY ipaddress > startip & ipaddress < endip;
> 
> Note that this will not be very efficient because it has to do a
> complete cross product of all tuples.  Pig does not support non-
> equijoins currently.  In general, non-equijoins are hard to do
> efficiently in map reduce because it's hard to get all of the
> appropriate keys together in the same reducers.
> 
> So, if you're going to do this on very large data, it will be very slow.
> 
> Alan.
> 
> On Apr 7, 2009, at 8:01 AM, venkata ramanaiah anneboina wrote:
> 
>>> 
>>> Hi
>>> i want some operation on pig;
>>> I have two tables of data
>>> Table1 contains userid,ipaddress, date
>>> Table2 contains startip,endip
>>> 
>>> i want the data for the fallowing query
>>>  Table1.ipaddress>table2.startip &
>>> table1.ipaddress<table2.endip
>>> 
>>> how to write join or cogroup in pig using piglatin scripts
>>> 
>>> can any one help in this
>>> 
>>> 
>>> thanks
>>> ramana
>>> 
> 

----------------------
Yiping Han
2MC 8127
2811 Mission College Blvd.,
Santa Clara, CA 95054
(408)349-4403
yhan@yahoo-inc.com