You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Jason Venner <ja...@attributor.com> on 2008/07/01 06:24:19 UTC

MapSide Join and left outer or right outer joins?

It only seems like full outer or full inner joins are supported. I was 
hoping to just do a left outer join.

Is this supported or planned?

On the flip side doing the Outer Join is about 8x faster than doing a 
map/reduce over our dataset.

Thanks
-- 
Jason Venner
Attributor - Program the Web <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested

Re: MapSide Join and left outer or right outer joins?

Posted by Jason Venner <ja...@attributor.com>.
Ahh yes, it is absolutely critical that the partitioning in all of the 
sets be the same :)
We are currently assuming that:
1) Default Partitioner
2) Same Reduce Count
3) Text keys
will guarantee that, but as I mentioned above, we are assuming ;)

Testing is just starting

Chris Douglas wrote:
> Sorry, I meant splits (partitions of input data). If you have n 
> datasets and m splits per dataset, m_i must contain the same keys for 
> all n. So if you're joining two datasets A and B sharing a key k, if 
> split i from A contains any instances of k, (a) split i from A must 
> contain all instances of k from A and (b) split i from B must contain 
> all instances of k from B. -C
>
> On Jul 3, 2008, at 12:06 PM, Jason Venner wrote:
>
>> We are using the default partitioner. I am just about to start 
>> verifying my result as it took quite a while to work my way through 
>> the in-obvious issues of hand writing MapFiles, thinks like the key 
>> and value class are extracted from the jobconf, output key/value.
>>
>> Question: I looked at the HashPartitioner (which we are using) and a 
>> key's partition is simply based on the key.hashCode() % 
>> conf.getNumReduces().
>> How will I get equal keys going to different partitions - clearly I 
>> have an understanding gap.
>>
>> Thanks!
>>
>>
>> Chris Douglas wrote:
>>> Forgive me if you already know this, but the correctness of the 
>>> map-side join is very sensitive to partitioning; if your input in 
>>> sorted but equal keys go to different partitions, your results may 
>>> be incorrect. Is your input such that the default partitioning is 
>>> sufficient? Have you verified the correctness of your results? -C
>>>
>>> On Jul 2, 2008, at 9:55 PM, Jason Venner wrote:
>>>
>>>> For the data joins, I let the framework do it - which means one 
>>>> partition per split - so I have to chose my partition count 
>>>> carefully to fill the machines.
>>>>
>>>> I had an error in my initial outer join mapper, the join map code 
>>>> now runs about 40x faster than the old brute force read it all 
>>>> shuffle & sort.
>>>>
>>>> Chris Douglas wrote:
>>>>> Hi Jason-
>>>>>
>>>>>> It only seems like full outer or full inner joins are supported. 
>>>>>> I was hoping to just do a left outer join.
>>>>>>
>>>>>> Is this supported or planned?
>>>>>
>>>>>
>>>>> The full inner/outer joins are examples, really. You can define 
>>>>> your own operations by extending 
>>>>> o.a.h.mapred.join.JoinRecordReader or 
>>>>> o.a.h.mapred.join.MultiFilterRecordReader and registering your new 
>>>>> identifier with the parser by defining a property 
>>>>> "mapred.join.define.<ident>" as your class.
>>>>>
>>>>> For a left outer join, JoinRecordReader is the correct base. 
>>>>> InnerJoinRecordReader and OuterJoinRecordReader should make its 
>>>>> use clear.
>>>>>
>>>>>> On the flip side doing the Outer Join is about 8x faster than 
>>>>>> doing a map/reduce over our dataset.
>>>>>
>>>>> Cool! Out of curiosity, how are you managing your splits? -C
>>>
>> -- 
>> Jason Venner
>> Attributor - Program the Web <http://www.attributor.com/>
>> Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
>> interested
>
-- 
Jason Venner
Attributor - Program the Web <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested

Re: MapSide Join and left outer or right outer joins?

Posted by Chris Douglas <ch...@yahoo-inc.com>.
Sorry, I meant splits (partitions of input data). If you have n  
datasets and m splits per dataset, m_i must contain the same keys for  
all n. So if you're joining two datasets A and B sharing a key k, if  
split i from A contains any instances of k, (a) split i from A must  
contain all instances of k from A and (b) split i from B must contain  
all instances of k from B. -C

On Jul 3, 2008, at 12:06 PM, Jason Venner wrote:

> We are using the default partitioner. I am just about to start  
> verifying my result as it took quite a while to work my way through  
> the in-obvious issues of hand writing MapFiles, thinks like the key  
> and value class are extracted from the jobconf, output key/value.
>
> Question: I looked at the HashPartitioner (which we are using) and a  
> key's partition is simply based on the key.hashCode() %  
> conf.getNumReduces().
> How will I get equal keys going to different partitions - clearly I  
> have an understanding gap.
>
> Thanks!
>
>
> Chris Douglas wrote:
>> Forgive me if you already know this, but the correctness of the map- 
>> side join is very sensitive to partitioning; if your input in  
>> sorted but equal keys go to different partitions, your results may  
>> be incorrect. Is your input such that the default partitioning is  
>> sufficient? Have you verified the correctness of your results? -C
>>
>> On Jul 2, 2008, at 9:55 PM, Jason Venner wrote:
>>
>>> For the data joins, I let the framework do it - which means one  
>>> partition per split - so I have to chose my partition count  
>>> carefully to fill the machines.
>>>
>>> I had an error in my initial outer join mapper, the join map code  
>>> now runs about 40x faster than the old brute force read it all  
>>> shuffle & sort.
>>>
>>> Chris Douglas wrote:
>>>> Hi Jason-
>>>>
>>>>> It only seems like full outer or full inner joins are supported.  
>>>>> I was hoping to just do a left outer join.
>>>>>
>>>>> Is this supported or planned?
>>>>
>>>>
>>>> The full inner/outer joins are examples, really. You can define  
>>>> your own operations by extending  
>>>> o.a.h.mapred.join.JoinRecordReader or  
>>>> o.a.h.mapred.join.MultiFilterRecordReader and registering your  
>>>> new identifier with the parser by defining a property  
>>>> "mapred.join.define.<ident>" as your class.
>>>>
>>>> For a left outer join, JoinRecordReader is the correct base.  
>>>> InnerJoinRecordReader and OuterJoinRecordReader should make its  
>>>> use clear.
>>>>
>>>>> On the flip side doing the Outer Join is about 8x faster than  
>>>>> doing a map/reduce over our dataset.
>>>>
>>>> Cool! Out of curiosity, how are you managing your splits? -C
>>
> -- 
> Jason Venner
> Attributor - Program the Web <http://www.attributor.com/>
> Attributor is hiring Hadoop Wranglers and coding wizards, contact if  
> interested


Re: MapSide Join and left outer or right outer joins?

Posted by Jason Venner <ja...@attributor.com>.
We are using the default partitioner. I am just about to start verifying 
my result as it took quite a while to work my way through the in-obvious 
issues of hand writing MapFiles, thinks like the key and value class are 
extracted from the jobconf, output key/value.

Question: I looked at the HashPartitioner (which we are using) and a 
key's partition is simply based on the key.hashCode() % 
conf.getNumReduces().
 How will I get equal keys going to different partitions - clearly I 
have an understanding gap.

Thanks!


Chris Douglas wrote:
> Forgive me if you already know this, but the correctness of the 
> map-side join is very sensitive to partitioning; if your input in 
> sorted but equal keys go to different partitions, your results may be 
> incorrect. Is your input such that the default partitioning is 
> sufficient? Have you verified the correctness of your results? -C
>
> On Jul 2, 2008, at 9:55 PM, Jason Venner wrote:
>
>> For the data joins, I let the framework do it - which means one 
>> partition per split - so I have to chose my partition count carefully 
>> to fill the machines.
>>
>> I had an error in my initial outer join mapper, the join map code now 
>> runs about 40x faster than the old brute force read it all shuffle & 
>> sort.
>>
>> Chris Douglas wrote:
>>> Hi Jason-
>>>
>>>> It only seems like full outer or full inner joins are supported. I 
>>>> was hoping to just do a left outer join.
>>>>
>>>> Is this supported or planned?
>>>
>>>
>>> The full inner/outer joins are examples, really. You can define your 
>>> own operations by extending o.a.h.mapred.join.JoinRecordReader or 
>>> o.a.h.mapred.join.MultiFilterRecordReader and registering your new 
>>> identifier with the parser by defining a property 
>>> "mapred.join.define.<ident>" as your class.
>>>
>>> For a left outer join, JoinRecordReader is the correct base. 
>>> InnerJoinRecordReader and OuterJoinRecordReader should make its use 
>>> clear.
>>>
>>>> On the flip side doing the Outer Join is about 8x faster than doing 
>>>> a map/reduce over our dataset.
>>>
>>> Cool! Out of curiosity, how are you managing your splits? -C
>
-- 
Jason Venner
Attributor - Program the Web <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested

Re: MapSide Join and left outer or right outer joins?

Posted by Chris Douglas <ch...@yahoo-inc.com>.
Forgive me if you already know this, but the correctness of the map- 
side join is very sensitive to partitioning; if your input in sorted  
but equal keys go to different partitions, your results may be  
incorrect. Is your input such that the default partitioning is  
sufficient? Have you verified the correctness of your results? -C

On Jul 2, 2008, at 9:55 PM, Jason Venner wrote:

> For the data joins, I let the framework do it - which means one  
> partition per split - so I have to chose my partition count  
> carefully to fill the machines.
>
> I had an error in my initial outer join mapper, the join map code  
> now runs about 40x faster than the old brute force read it all  
> shuffle & sort.
>
> Chris Douglas wrote:
>> Hi Jason-
>>
>>> It only seems like full outer or full inner joins are supported. I  
>>> was hoping to just do a left outer join.
>>>
>>> Is this supported or planned?
>>
>>
>> The full inner/outer joins are examples, really. You can define  
>> your own operations by extending o.a.h.mapred.join.JoinRecordReader  
>> or o.a.h.mapred.join.MultiFilterRecordReader and registering your  
>> new identifier with the parser by defining a property  
>> "mapred.join.define.<ident>" as your class.
>>
>> For a left outer join, JoinRecordReader is the correct base.  
>> InnerJoinRecordReader and OuterJoinRecordReader should make its use  
>> clear.
>>
>>> On the flip side doing the Outer Join is about 8x faster than  
>>> doing a map/reduce over our dataset.
>>
>> Cool! Out of curiosity, how are you managing your splits? -C


Re: MapSide Join and left outer or right outer joins?

Posted by Jason Venner <ja...@attributor.com>.
For the data joins, I let the framework do it - which means one 
partition per split - so I have to chose my partition count carefully to 
fill the machines.

I had an error in my initial outer join mapper, the join map code now 
runs about 40x faster than the old brute force read it all shuffle & sort.

Chris Douglas wrote:
> Hi Jason-
>
>> It only seems like full outer or full inner joins are supported. I 
>> was hoping to just do a left outer join.
>>
>> Is this supported or planned?
>
>
> The full inner/outer joins are examples, really. You can define your 
> own operations by extending o.a.h.mapred.join.JoinRecordReader or 
> o.a.h.mapred.join.MultiFilterRecordReader and registering your new 
> identifier with the parser by defining a property 
> "mapred.join.define.<ident>" as your class.
>
> For a left outer join, JoinRecordReader is the correct base. 
> InnerJoinRecordReader and OuterJoinRecordReader should make its use 
> clear.
>
>> On the flip side doing the Outer Join is about 8x faster than doing a 
>> map/reduce over our dataset.
>
> Cool! Out of curiosity, how are you managing your splits? -C

Re: MapSide Join and left outer or right outer joins?

Posted by Chris Douglas <ch...@yahoo-inc.com>.
Hi Jason-

> It only seems like full outer or full inner joins are supported. I  
> was hoping to just do a left outer join.
>
> Is this supported or planned?


The full inner/outer joins are examples, really. You can define your  
own operations by extending o.a.h.mapred.join.JoinRecordReader or  
o.a.h.mapred.join.MultiFilterRecordReader and registering your new  
identifier with the parser by defining a property  
"mapred.join.define.<ident>" as your class.

For a left outer join, JoinRecordReader is the correct base.  
InnerJoinRecordReader and OuterJoinRecordReader should make its use  
clear.

> On the flip side doing the Outer Join is about 8x faster than doing  
> a map/reduce over our dataset.

Cool! Out of curiosity, how are you managing your splits? -C