Posted to user@pig.apache.org by Vincent Barat <vi...@gmail.com> on 2012/01/27 14:15:49 UTC

replicated join vs regular ?

Hi folks,

I use replicated joins, and recently I ran into an issue: my
rightmost relation seems to have become too big and, even though I
don't get any "Java heap space" error, the time the maps take to
finish becomes exponentially long (I cannot figure out exactly why).

Removing "replicated" fixes the issue, but it raises several questions.

In Alan's book, "Figure 8.1. Choosing a Join Implementation" says
that replicated joins should NOT BE USED for outer joins.

Nevertheless, it seems to work in the following case, and it is faster
than a regular join. Why?

sessions = JOIN sessions BY locid LEFT, locations BY locid USING 'replicated';

(not all sessions have a location in this case)

Thanks for your advice.





Re: replicated join vs regular ?

Posted by Vincent Barat <vi...@gmail.com>.
Thanks, Alan, for this clarification.

If I understand correctly now, what I am trying to do is correct (given
that locations is the "small" input):

sessions = JOIN sessions BY locid LEFT, locations BY locid USING 'replicated';


I misunderstood the diagram in your (very good) book, since for me a
LEFT JOIN is an OUTER join.


On 31/01/12 at 00:01, Alan Gates wrote:
> Yeah, maybe I should have said "right or outer join".  What I wanted to make clear is that if you want to identify non-matches in the large (fragment, or left side) you can still use fragment-replicate join.  If you want to identify non-matches in the small (replicate, or right side) you cannot.
>
> Alan.


Re: replicated join vs regular ?

Posted by Alan Gates <ga...@hortonworks.com>.
Yeah, maybe I should have said "right or outer join".  What I wanted to make clear is that if you want to identify non-matches in the large (fragment, or left side) you can still use fragment-replicate join.  If you want to identify non-matches in the small (replicate, or right side) you cannot.

Alan.
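
To make the distinction above concrete, here is a minimal Python sketch of what a single map task does in a fragment-replicate join. It is only an illustration under stated assumptions, not Pig's actual implementation: the function name `replicated_left_join` and the sample rows (including the `sid` and `city` fields) are made up for this example.

```python
from collections import defaultdict


def replicated_left_join(fragment, small, key):
    """Join one fragment of the large input against the whole small input.

    Hypothetical helper: models a single map task, nothing more.
    """
    # Build an in-memory lookup table from the small (replicated) input.
    lookup = defaultdict(list)
    for row in small:
        lookup[row[key]].append(row)

    # Stream the fragment; every row can be resolved locally.
    for row in fragment:
        matches = lookup.get(row[key])
        if matches:
            for match in matches:
                yield (row, match)
        else:
            # Non-match on the large (left) side: this mapper holds the
            # complete small input, so it can decide this on its own.
            yield (row, None)


# Made-up sample data: not every session has a location.
sessions = [{"sid": 1, "locid": "a"}, {"sid": 2, "locid": "z"}]
locations = [{"locid": "a", "city": "Paris"}, {"locid": "b", "city": "Rennes"}]

joined = list(replicated_left_join(sessions, locations, "locid"))
```

Each session row is emitted exactly once per match, or once with an empty right side, and no information from other map tasks is needed. That is why non-matches on the fragment side are safe to report.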

On Jan 30, 2012, at 6:09 AM, Vincent Barat wrote:

> I understand your point and it makes sense.
> 
> The figure in Alan's book says that if you "outer join on the small input" you should not use a replicated join.
> 
> Maybe this sentence is not clear enough :)


Re: replicated join vs regular ?

Posted by Vincent Barat <vb...@ubikod.com>.
I understand your point and it makes sense.

The figure in Alan's book says that if you "outer join on the small
input" you should not use a replicated join.

Maybe this sentence is not clear enough :)


On 28/01/12 at 00:21, Alex Rovner wrote:
> From what I understand, replicated should not be used with a full outer join, since full outer means that records from both tables will be in the output regardless of whether they have a match in the other table. In your case you only care about sessions, which is a left join and not a full outer join.
>
> The reason for that is how Pig and Hadoop execute the join: the "small" table is loaded into each mapper, so its rows are not meant to appear on their own in the output.
>
> Alex

-- 

*Vincent BARAT, UBIKOD, CTO*




Re: replicated join vs regular ?

Posted by Alex Rovner <al...@gmail.com>.
From what I understand, replicated should not be used with a full outer join, since full outer means that records from both tables will be in the output regardless of whether they have a match in the other table. In your case you only care about sessions, which is a left join and not a full outer join.

The reason for that is how Pig and Hadoop execute the join: the "small" table is loaded into each mapper, so its rows are not meant to appear on their own in the output.

Alex
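
The reasoning above can be sketched in Python with two simulated mappers. This is a hedged illustration only: the helper `unmatched_small_keys`, the two fragments, and the sample keys are assumptions made for the example, not anything taken from Pig.

```python
def unmatched_small_keys(fragment, small, key):
    """Small-side keys with no match in the given rows of the large input.

    Hypothetical helper: called on one fragment, it gives one mapper's
    purely local view of which small-side rows look unmatched.
    """
    seen = {row[key] for row in fragment}
    return {row[key] for row in small if row[key] not in seen}


# Made-up data: the small input is copied whole to every mapper,
# while each mapper sees only its own fragment of the large input.
locations = [{"locid": "a"}, {"locid": "b"}, {"locid": "c"}]
mapper_1 = [{"sid": 1, "locid": "a"}]
mapper_2 = [{"sid": 2, "locid": "b"}]

local_1 = unmatched_small_keys(mapper_1, locations, "locid")
local_2 = unmatched_small_keys(mapper_2, locations, "locid")

# The correct answer needs the whole large input at once.
true_unmatched = unmatched_small_keys(mapper_1 + mapper_2, locations, "locid")
```

Here the first mapper would report "b" and "c" as unmatched and the second would report "a" and "c", while only "c" is truly unmatched. Deciding that requires a global view of the large input, which a map-only join does not have.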
